Blueprints enable data ingestion from common sources using automated workflows. At a high level, Lake Formation provides two types of blueprints:
  • Database blueprints: These blueprints help ingest data from MySQL, PostgreSQL, Oracle, and SQL Server databases into your data lake. You can ingest data either as a one-time bulk load snapshot or as an incremental load of new data over time.
  • Log file blueprints: These blueprints ingest data in popular log file formats, such as AWS CloudTrail, Elastic Load Balancer, and Application Load Balancer logs.
In this exercise, we will use the Database snapshot blueprint to ingest the entire TPC database into your data lake.
  1. Click the Blueprints option in the left navigation panel, and then click the Use blueprint button.
  2. Select Database snapshot as the blueprint type.
  3. For the AWS Glue Database connection name, choose TPCGlueConnector, which was created through CloudFormation to access the TPC database running on RDS.
  4. For the Source data path, enter "tpc/". Leave the Exclude pattern options at their defaults.
  5. Under the Import target section, choose tpc as the target database. For the Target storage location, choose the S3 path that you used in the Data Lake Locations section. For Data format, choose Parquet as the format in which the data is written.
  6. Move to the Import options section and enter tpc-workflow as the workflow name. Choose LF-GlueServiceRole for the IAM role and enter "dl" as the Table prefix. Leave the rest of the fields at their defaults.
  7. Choose Create. Wait for the blueprint status to change from Creating to the message Successfully created workflow: tpc-workflow.
  8. When you created the blueprint, you selected the IAM role LF-GlueServiceRole. This role will create tables under the tpc database created in the previous step. By default, the role does not have permission to create resources under the tpc database. Follow these steps to grant the required access to the IAM role LF-GlueServiceRole.
    1. In the navigation pane, under Databases, select tpc. Choose Grant from the Actions drop-down list.
    2. In the Grant window, under Principals, select the IAM role LF-GlueServiceRole from the drop-down list. Move to the Policy tags or catalog resources section and select Named data catalog resources. Select tpc as the database and proceed to the next section, Database permissions. Check the Create Table and Super options, and then click the Grant button.
  9. Go back to Blueprints using the navigation pane, select the newly created workflow tpc-workflow, and start it by selecting the Start option from the Actions drop-down list.
  10. Ingesting the TPC database into your data lake takes some time. While the workflow runs, the Last run status column reflects the current phase of the ingestion process. For this exercise, the Discovering phase takes about 4 minutes and the Importing phase takes about 20 minutes.
  11. Move to the next chapter once the workflow tpc-workflow has completed successfully.
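The permission grant in step 8 and the workflow start in steps 9 and 10 can also be scripted. The following is a minimal sketch rather than part of the original exercise. It assumes the boto3 `lakeformation` and `glue` clients, assumes that the console's Super option corresponds to the `ALL` permission in the API, and assumes that the blueprint's workflow can be started through AWS Glue under the same name, tpc-workflow. The helper names (`build_grant_request`, `grant_database_access`, `run_workflow_and_wait`) are hypothetical and chosen for illustration.

```python
import time

# Terminal states that Glue reports for a workflow run.
TERMINAL_STATUSES = {"COMPLETED", "STOPPED", "ERROR"}


def build_grant_request(role_arn, database_name="tpc"):
    """Build the arguments for lakeformation.grant_permissions (step 8).

    Assumption: the console's "Create Table" and "Super" checkboxes map to
    the CREATE_TABLE and ALL permissions in the API.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database_name}},
        "Permissions": ["CREATE_TABLE", "ALL"],
    }


def grant_database_access(lakeformation, role_arn, database_name="tpc"):
    """Grant the role access to the database using a Lake Formation client."""
    lakeformation.grant_permissions(**build_grant_request(role_arn, database_name))


def run_workflow_and_wait(glue, workflow_name="tpc-workflow",
                          poll_seconds=60, sleep=time.sleep):
    """Start the workflow (step 9) and poll until it stops running (step 10).

    Assumption: the blueprint's workflow is visible to Glue under the same name.
    Returns the final run status string.
    """
    run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]
    while True:
        run = glue.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
        if run["Status"] in TERMINAL_STATUSES:
            return run["Status"]
        sleep(poll_seconds)
```

With AWS credentials configured, a call might look like `grant_database_access(boto3.client("lakeformation"), role_arn)` followed by `run_workflow_and_wait(boto3.client("glue"))`, where `role_arn` is the full ARN of LF-GlueServiceRole in your account. It is still worth confirming the result in the Last run status column of the console before moving on.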