Crawling JDBC

Let's set up a JDBC crawler that uses the connection you just created to extract the schema from the TPC database.
  1. Click on the Crawlers option on the left and then click on the Add crawler button.
  2. Enter tpc-crawler as the Crawler name and click Next.
  3. Select Data stores as the Crawler source type.
  4. In this section, you select the data store type (S3, JDBC, or DynamoDB). Select JDBC as the data store and mysql-connector as the connection. You can crawl an entire schema/database, a specific table, or all tables that share a prefix. For this exercise, we want to crawl just a single table, so enter tpc/income_band in the Include path section. The crawler will scan only the income_band table in the tpc schema.
  5. You are going to crawl only one data store, so select No when asked whether to add another data store and click Next.
  6. Select the IAM role LF-GlueServiceRole from the dropdown list and click Next.
  7. You can schedule a crawler in several ways, but in this exercise you run it on demand. Select Run on demand and click Next.
  8. In this step, you decide where to store the crawler's output. We want to update the database created in this exercise. Select glue-demo from the database list and enter jdbc_ as the table prefix. Leave the rest of the options at their defaults and click Next.
  9. Verify all crawler information on the screen and click Finish to create the crawler.
  10. Once the crawler is created, the console asks whether you want to run it now. Click the Run it now link. Alternatively, you can select the crawler and run it from the Action menu.
  11. It may take up to 2 minutes for the crawler to finish crawling the income_band table. You should see a success message with the number of tables the crawler discovered.
  12. Now go back to Databases -> Tables. You should see the table jdbc_tpc_income_band. Click on it.
  13. The table page displays the table metadata, the inferred schema, and its properties. If the table had partitions or multiple versions, they would be displayed here. By looking at the table properties, you can review the schema and learn more about the source.
  14. Proceed to the next exercise, where you will run a crawler again, but this time it will crawl a file in an S3 location to discover its schema.
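
The console steps above can also be approximated in code. The following is a minimal sketch, assuming the boto3 library is installed and AWS credentials with Glue permissions are configured; it is not part of the original exercise. The helper names build_crawler_request, expected_table_name and create_and_run are hypothetical, and the table naming rule is inferred only from the jdbc_tpc_income_band result shown in this walkthrough.

```python
import time


def build_crawler_request(name="tpc-crawler",
                          role="LF-GlueServiceRole",
                          database="glue-demo",
                          connection="mysql-connector",
                          include_path="tpc/income_band",
                          table_prefix="jdbc_"):
    """Build keyword arguments for Glue's create_crawler call.

    The defaults mirror the values entered in the console steps.
    No Schedule key is set, which corresponds to "Run on demand".
    """
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "TablePrefix": table_prefix,
        "Targets": {
            "JdbcTargets": [
                {"ConnectionName": connection, "Path": include_path}
            ]
        },
    }


def expected_table_name(include_path, table_prefix):
    """Guess the catalog table name (assumption based on this exercise:
    prefix + schema + "_" + table, e.g. jdbc_tpc_income_band)."""
    return table_prefix + include_path.replace("/", "_")


def create_and_run(request, poll_seconds=15):
    """Create the crawler, start it, and wait until it is idle again.

    Requires boto3 and valid AWS credentials; imported lazily so the
    pure helpers above can be used without either.
    """
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
    while True:
        time.sleep(poll_seconds)
        crawler = glue.get_crawler(Name=request["Name"])["Crawler"]
        # State returns to READY once the run has finished.
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status")
```

Calling create_and_run(build_crawler_request()) would perform the equivalent of steps 1 through 11 against your account, and expected_table_name("tpc/income_band", "jdbc_") gives the name to look for under Databases -> Tables.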