Incremental Blueprints

Blueprints enable data ingestion from common sources using automated workflows. At high level, Lake Formation provides two type of blueprints:
  • Database blueprints: This blueprints help ingest data from MySQL, PostgreSQL, Oracle, and SQL server databases to your data lake. You can ingest either as bulk load snapshot, or incrementally load new data over time.
  • Log file blueprints: Ingest data from popular log file formats from AWS CloudTrail, Elastic Load Balancer, and Application Load Balancer logs.

Please make sure to finish the following chapter from the Intermediate Labs before proceeding.

In this exercise, we will use incremental database as blueprint and will ingest incremental data from TPC database to your data lake.
  1. Click on the Blueprints option from the left navigation panel and then click on Use blueprint button.
  2. Select Incremental database as the blueprint type.
  3. Under Import source section, for Database connection name, choose TPCGlueConnector which is created through CloudFormation to access the TPC database running on RDS.
  4. For the Source data path, enter "tpc/customer". Leave Exclude pattern options as default.
  5. Under Incremental data section, for the Table name enter customer , Bookmark keys enter c_customer_sk , bookmark order , choose Ascending .
  6. Under Import target section, choose tpc as the target database. For the Target storage location, choose the S3 path which you used in the Data Lake Locations section. For Data format, choose Parquet as the format in which the data is written.
  7. Now move to Import options, enter a workflow name tpc-increment. Choose LF-GlueServiceRole for the IAM role and enter "il" as the Table prefix. Leave the rest of the fields as default.
  8. Choose Create. Wait for the status of the blueprints to go from Creating to Successfully created... message.
  9. Now select the newly created workflow tpc-increment and start the workflow by selecting Start option from the Actions drop-down.
  10. It will take few minutes to ingest the TPC database to your data lake. During this phase, the Last run status column will reflect different phases of ingestion process. For this exercise, Discovering phase will take around ~4 minutes, Importing phase will take ~20 minutes.
  11. when the blueprint tpc-increment is completed successfully. Move to Athena console to query the data
  12. Query tpc database customer table from athena console and record the results.
  13. Insert new records in RDS MySQL tpc database customer table using the script.
            INSERT INTO tpc.customer (c_salutation,c_customer_sk,c_first_name,c_last_name) VALUES("Dr.",29999935,"Jill","Thomas");
            INSERT INTO tpc.customer (c_salutation,c_customer_sk,c_first_name,c_last_name) VALUES("Dr.",29999936,"Jill","Thomas");        
  14. Now move back to Lake Formation workflow console and start the workflow tpc-increment again by selecting Start option from the Actions drop-down.
  15. Once workflow is successfully completed move back to Athena console, query the tpc database customer table. You will see new records got added into table.