Blueprints enable data ingestion from common sources using automated workflows. At a high level, Lake Formation provides two types of blueprints:
- Database blueprints: These blueprints help ingest data from MySQL, PostgreSQL, Oracle, and SQL Server databases into your data lake. You can ingest the data either as a bulk-load snapshot or incrementally, loading new data over time.
- Log file blueprints: These blueprints ingest data in popular log file formats from AWS CloudTrail, Elastic Load Balancer, and Application Load Balancer logs.
In this exercise, we will use the Database snapshot blueprint to ingest the entire TPC database into your data lake.
- Click Blueprints in the left navigation panel, then click the Use blueprint button.
- Select Database snapshot as the blueprint type.
- For the AWS Glue Database connection name, choose TPCGlueConnector, which was created through CloudFormation to access the TPC database running on RDS.
- For the Source data path, enter "tpc/". Leave the Exclude pattern options at their defaults.
- In the Import target section, choose tpc as the target database. For the Target storage location, choose the S3 path that you used in the Data Lake Locations section. For Data format, choose Parquet as the format in which the data is written.
- Move to Import options and enter tpc-workflow as the workflow name. Choose LF-GlueServiceRole for the IAM role and enter "dl" as the Table prefix. Leave the rest of the fields at their defaults.
- Choose Create. Wait for the blueprint status to change from Creating to the message Successfully created workflow: tpc-workflow.
- You selected the IAM role LF-GlueServiceRole when creating the blueprint. This role will create tables under the tpc database created in the previous step. By default, the role does not have permission to create resources under the tpc database. Follow these steps to grant the required access to the IAM role:
- In the navigation pane, under Databases, select tpc. Choose Grant from the Actions drop-down list.
- In the Grant window, under Principals, select the IAM role LF-GlueServiceRole from the drop-down list. Move to the Policy tags or catalog resources section and select Named data catalog resources. Select tpc as the database and proceed to the next section, Database permissions. Check the Create Table and Super options, then click the Grant button.
- Go back to Blueprints using the navigation pane, select the newly created workflow tpc-workflow, and start it by choosing Start from the Actions drop-down list.
- It will take several minutes to ingest the TPC database into your data lake. During this time, the Last run status column will reflect the different phases of the ingestion process. For this exercise, the Discovering phase takes about 4 minutes and the Importing phase about 20 minutes.
- Move to the next chapter once the workflow tpc-workflow has completed successfully.
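
The permission grant and the workflow start described above can also be scripted. The following is a minimal sketch, not part of the workshop itself. It assumes boto3 is installed with credentials and a region configured, that the console's Super permission corresponds to `ALL` in the Lake Formation API, and that the blueprint's workflow is visible in AWS Glue under the same name, tpc-workflow. The helper names `build_grant_request`, `run_workflow`, and `main`, as well as the account ID in the role ARN, are hypothetical.

```python
import time

# Glue workflow run states after which no further progress is expected.
TERMINAL_STATES = {"COMPLETED", "STOPPED", "ERROR"}


def build_grant_request(role_arn, database="tpc"):
    """Build keyword arguments for Lake Formation's grant_permissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database}},
        # Assumption: the console's "Super" option maps to ALL in the API.
        "Permissions": ["CREATE_TABLE", "ALL"],
    }


def run_workflow(glue, name="tpc-workflow", poll_seconds=60, sleep=time.sleep):
    """Start a Glue workflow run and poll until it reaches a terminal state."""
    run_id = glue.start_workflow_run(Name=name)["RunId"]
    while True:
        run = glue.get_workflow_run(Name=name, RunId=run_id)["Run"]
        if run["Status"] in TERMINAL_STATES:
            return run["Status"]
        sleep(poll_seconds)


def main():
    """Grant the role access to tpc, then run the workflow (needs AWS access)."""
    import boto3  # imported here so the helpers above work without boto3

    # Hypothetical account ID; substitute the ARN of your LF-GlueServiceRole.
    role_arn = "arn:aws:iam::111122223333:role/LF-GlueServiceRole"
    lakeformation = boto3.client("lakeformation")
    lakeformation.grant_permissions(**build_grant_request(role_arn))
    print(run_workflow(boto3.client("glue")))
```

Calling `main()` is expected to print `COMPLETED` when the ingestion succeeds. The Glue client is passed into `run_workflow` as a parameter so the polling logic can be exercised with a stub before running it against a real account.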