Incremental Blueprints

Blueprints enable data ingestion from common sources using automated workflows. At high level, Lake Formation provides two type of blueprints:
  • Database blueprints: This blueprints help ingest data from MySQL, PostgreSQL, Oracle, and SQL server databases to your data lake. You can ingest either as bulk load snapshot, or incrementally load new data over time.
  • Log file blueprints: Ingest data from popular log file formats from AWS CloudTrail, Elastic Load Balancer, and Application Load Balancer logs.

Please make sure to finish the following chapter from the Intermediate Labs before proceeding.

In this exercise, we will use incremental database as blueprint and will ingest incremental data from TPC database to your data lake.
  1. Click on the Blueprints option from the left navigation panel and then click on Use blueprint button.
  2. Select Incremental database as the blueprint type.
  3. Under Import source section, for Database connection name, choose TPCGlueConnector which is created through CloudFormation to access the TPC database running on RDS.
  4. For the Source data path, enter "tpc/customer". Leave Exclude pattern options as default.
  5. Under Incremental data section, for the Table name enter customer , Bookmark keys enter c_customer_sk , bookmark order , choose Ascending .
  6. Under Import target section, choose tpc as the target database. For the Target storage location, choose the S3 path which you used in the Data Lake Locations section. For Data format, choose Parquet as the format in which the data is written.
  7. Now move to Import options, enter a workflow name tpc-increment. Choose LF-GlueServiceRole for the IAM role and enter "il" as the Table prefix. Leave the rest of the fields as default.
  8. Choose Create. Wait for the status of the blueprints to go from Creating to Successfully created... message.
  9. Now select the newly created workflow tpc-increment and start the workflow by selecting Start option from the Actions drop-down.
  10. It will take few minutes to ingest the TPC database to your data lake. During this phase, the Last run status column will reflect different phases of ingestion process. For this exercise, Discovering phase will take around ~4 minutes, Importing phase will take ~20 minutes.
  11. when the blueprint tpc-increment is completed successfully. Move to Athena console to query the data
  12. Query tpc database customer table from athena console and record the results.
  13. Insert new records in RDS MySQL tpc database customer table using the script.
            INSERT INTO tpc.customer (c_salutation,c_customer_sk,c_first_name,c_last_name) VALUES("Dr.",29999935,"Jill","Thomas");
            INSERT INTO tpc.customer (c_salutation,c_customer_sk,c_first_name,c_last_name) VALUES("Dr.",29999936,"Jill","Thomas");        
    Follow the underline instructions to connect to mysql db:
    1. From the top AWS Web Console menu, select Services. In the search bar, begin typing Cloud9 and select Cloud9 to open the service console.
    2. On the AWS Cloud9 screen, select Create environment.
    3. On the Name environment screen, in the Name field type MyCloud9Env and select Next step.
    4. On the Configure setting screen, leave the environment type as Create a new instance for environment (EC2) and the Instance type as t2.micro.
    5. mySQL db do not have publicly accessible endpoints, your mySQL db can only be accessed from within the same VPC.
    6. To place MyCloud9Env in the same VPC as your mySQL db, scroll down the Configure setting screen and expand the Network settings (advanced) section. From the Network (VPC) drop down, select mySQL db VPC which you recorded.
    7. Select Next step. On the Review page, select Create environment. After your new Cloud9 environment has been created, proceed to the next step.

    In this step, you will enable network access from your Cloud9 environment to your mySQL db. To accomplish this task, you will add the security group assigned to MyCloud9Env as a traffic source for the security group assigned to MyClusterName. A security group is a virtual firewall that controls network traffic in your network.
    1. To make navigation easier, load the AWS Web Console in another browser tab by clicking here. In the new browser window, on the top menu bar, select Services then type VPC in the search bar and select VPC from the list.
    2. On the VPC Dashboard page, in the left navigation select Security groups. In the Group Name column, find the security group that begins with aws-cloud9-MyCloud9Env. Note the Group ID of this security group. As an example, in the screen shot to the right, you would note the security group that ends with 7431. Your Group ID will be different than is pictured in this screen shot.
    3. In the list of security groups, select the security group that begins with RDS-Launch-Wizard. Then select the Inbound Rules tab. Then select Edit.
    4. In the Inbound Rules tab, select Add another rule. In the Type column, select MySQL/Aurora (3306) from the drop down list. Then click into the Source column field and a drop-down list will appear. Select the security Group ID that you noted in step 4b. Then select Save. As an example, in the screen shot to the right, from the list you would select the security group that ends 7431. Your specific Group ID will be different than is pictured in this screen shot.
    In this step, from your Cloud9 environment you will access your Aurora Serverless DB cluster:
    1. Switch back to your MyCloud9Env browser window. In the bash terminal tab in MyCloud9Env, type in the following command. Substitute your Master username and database endpoint for the values in the command and press Enter. "mysql --user=[your Master username] --password -h [your database endpoint]". When prompted, enter your Master password and press Enter. You should now be connected to the MyClusterName Aurora Serverless DB cluster!

      For example: "mysql --user=tpcadmin --password -h tpc-database.cpoo1s5wvin9.us-east-1.rds.amazonaws.com"
  14. Now move back to Lake Formation workflow console and start the workflow tpc-increment again by selecting Start option from the Actions drop-down.
  15. Once workflow is successfully completed move back to Athena console, query the tpc database customer table. You will see new records got added into table.