AWS Lake Formation relies on the interaction of several components to create and manage your data lake. The following are some important terms that you will encounter in this guide.
The data lake is your persistent data that is stored in Amazon S3 and managed by Lake Formation using a Data Catalog. A data lake typically stores the following:
- Structured and unstructured data
- Raw data and transformed data
For an Amazon S3 path to be within a data lake, it must be registered with Lake Formation.
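As a minimal sketch of registering a path, the following assumes the AWS SDK for Python (Boto3) and its Lake Formation `register_resource` operation. The helper names `s3_location_arn` and `register_location`, and the bucket and prefix in the usage note, are hypothetical. The client is passed in as an argument, so the sketch makes no assumptions about credentials or Region.

```python
def s3_location_arn(bucket: str, prefix: str = "") -> str:
    """Build the Amazon S3 ARN for a bucket, or for a prefix inside it."""
    prefix = prefix.strip("/")
    if prefix:
        return f"arn:aws:s3:::{bucket}/{prefix}"
    return f"arn:aws:s3:::{bucket}"


def register_location(lakeformation, bucket: str, prefix: str = "") -> str:
    """Register an Amazon S3 location with Lake Formation and return its ARN.

    `lakeformation` is assumed to be a Boto3 Lake Formation client,
    for example boto3.client("lakeformation").
    """
    arn = s3_location_arn(bucket, prefix)
    # Assumption: the service-linked role is acceptable for this location.
    # Pass RoleArn instead of UseServiceLinkedRole to use a custom role.
    lakeformation.register_resource(ResourceArn=arn, UseServiceLinkedRole=True)
    return arn
```

For example, `register_location(boto3.client("lakeformation"), "example-bucket", "raw/sales")` would register the hypothetical path `s3://example-bucket/raw/sales`.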
Lake Formation provides secure and granular access to data through a grant/revoke permissions model that augments AWS Identity and Access Management (IAM) policies.
Analysts and data scientists can use the full portfolio of AWS analytics and machine learning services, such as Amazon Athena, to access the data. The configured Lake Formation security policies help ensure that users can access only the data that they are authorized to access.
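The grant/revoke model can be illustrated with a short sketch. It assumes the Boto3 Lake Formation client and its `grant_permissions` and `revoke_permissions` operations. The helper names, and any principal ARN, database name, or table name you pass in, are hypothetical.

```python
def table_select_request(principal_arn: str, database: str, table: str) -> dict:
    """Build the request that describes SELECT on one Data Catalog table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }


def grant_select(lakeformation, principal_arn: str, database: str, table: str) -> None:
    """Grant SELECT on a table to an IAM user or role."""
    lakeformation.grant_permissions(
        **table_select_request(principal_arn, database, table)
    )


def revoke_select(lakeformation, principal_arn: str, database: str, table: str) -> None:
    """Revoke a previously granted SELECT permission; the request mirrors the grant."""
    lakeformation.revoke_permissions(
        **table_select_request(principal_arn, database, table)
    )
```

Building the request in one place keeps the grant and the revoke symmetric, so a permission can be withdrawn with exactly the principal and resource it was granted on.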
A blueprint is a data management template that enables you to easily ingest data into a data lake. Lake Formation provides several blueprints, each for a predefined source type, such as a relational database or AWS CloudTrail logs. From a blueprint, you can create a workflow. Workflows consist of AWS Glue crawlers, jobs, and triggers that are generated to orchestrate the loading and update of data. Blueprints take the data source, data target, and schedule as input to configure the workflow.
A workflow is a container for a set of related jobs, crawlers, and triggers in AWS Glue. In Lake Formation, you create a workflow from a blueprint. A workflow encapsulates a complex multi-job extract, transform, and load (ETL) activity that AWS Glue can execute and track as a single entity. You can visually track the status of the different nodes in the workflow on the AWS Management Console, making it easier to monitor progress and troubleshoot. You can also share parameters across entities in the workflow.
When you define a workflow, you select the blueprint upon which it is based. You can then run workflows on demand or on a schedule. Each workflow manages the execution and monitoring of all its defined jobs and crawlers.
When a workflow has completed, the user who ran the workflow is granted Lake Formation data permissions on the Data Catalog tables that the workflow creates. Workflows that you create in Lake Formation are visible in the AWS Glue console.
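Because workflows created in Lake Formation are AWS Glue workflows, running one on demand can be sketched with the Boto3 AWS Glue client. The sketch assumes the `start_workflow_run` and `get_workflow_run` operations. The function name `run_workflow`, the polling interval, and any workflow name you pass in are hypothetical.

```python
import time

# Workflow run statuses after which the run no longer changes state.
TERMINAL_STATUSES = {"COMPLETED", "STOPPED", "ERROR"}


def run_workflow(glue, name: str, poll_seconds: float = 30.0, sleep=time.sleep) -> str:
    """Start a workflow run and poll until it reaches a terminal status.

    `glue` is assumed to be a Boto3 AWS Glue client, for example
    boto3.client("glue"). `sleep` is injectable so that the polling loop
    can be exercised without waiting.
    """
    run_id = glue.start_workflow_run(Name=name)["RunId"]
    while True:
        run = glue.get_workflow_run(Name=name, RunId=run_id)["Run"]
        if run["Status"] in TERMINAL_STATUSES:
            return run["Status"]
        sleep(poll_seconds)
```

The function returns the final status string, so a caller can check for `"COMPLETED"` before querying the tables that the workflow creates.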
The Data Catalog is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore. It provides a uniform repository where disparate systems can store and find metadata to track data in data silos, and then use that metadata to query and transform the data. Lake Formation uses the AWS Glue Data Catalog to store metadata about data lakes, data sources, transforms, and targets.
Metadata about data sources and targets is in the form of databases and tables. Tables store schema information, location information, and more. Databases are collections of tables. Lake Formation provides a hierarchy of permissions to control access to databases and tables in the Data Catalog.
Each AWS account has one Data Catalog per AWS Region.
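To make the database and table structure concrete, the following sketch lists the tables in one Data Catalog database together with the storage locations they point to. It assumes the Boto3 AWS Glue client and its `get_tables` operation. The function name `list_table_locations` is hypothetical.

```python
def list_table_locations(glue, database: str) -> dict:
    """Map each table name in a database to its storage location.

    `glue` is assumed to be a Boto3 AWS Glue client. Tables that have no
    storage descriptor are mapped to None.
    """
    locations = {}
    kwargs = {"DatabaseName": database}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page["TableList"]:
            descriptor = table.get("StorageDescriptor", {})
            locations[table["Name"]] = descriptor.get("Location")
        token = page.get("NextToken")
        if not token:
            return locations
        # Request the next page of results.
        kwargs["NextToken"] = token
```

Because each account has one Data Catalog per Region, the result depends on the Region that the client is configured for.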
Underlying data refers to the source data or data within the data lakes that Data Catalog tables point to.
A principal is an AWS Identity and Access Management (IAM) user or role that does work in Lake Formation.
Data Lake Administrator
A user who can register Amazon S3 locations, access the Data Catalog, create databases, create and run workflows, grant Lake Formation permissions to other users, and view AWS CloudTrail logs. This user has fewer IAM permissions than the IAM administrator, but enough to administer the data lake. This user cannot add other data lake administrators.
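As a hedged sketch of how a principal might be designated as a data lake administrator, the following assumes the Boto3 Lake Formation client and its `get_data_lake_settings` and `put_data_lake_settings` operations. It also assumes that the caller is an identity, such as an IAM administrator, whose IAM policy allows those operations. The function name `add_data_lake_admin` is hypothetical.

```python
def add_data_lake_admin(lakeformation, principal_arn: str) -> bool:
    """Add a principal to the data lake administrators list.

    Returns True if the settings were updated, or False if the principal
    was already an administrator.
    """
    # Read the current settings first, because put_data_lake_settings
    # replaces the whole settings document rather than merging into it.
    settings = lakeformation.get_data_lake_settings()["DataLakeSettings"]
    admins = settings.setdefault("DataLakeAdmins", [])
    entry = {"DataLakePrincipalIdentifier": principal_arn}
    if entry in admins:
        return False
    admins.append(entry)
    lakeformation.put_data_lake_settings(DataLakeSettings=settings)
    return True
```

Reading the settings before writing them preserves the existing administrators and any other settings that are already configured.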