Crawling S3

In this exercise, you will create one more crawler, but this time the crawler will discover the schema of a file stored in S3.
  1. Click Crawlers in the left navigation pane, then click the Add crawler button.
  2. Enter nyctaxi-crawler as the Crawler name and click Next.
  3. Select Data stores as the Crawler source type.
  4. Select S3 as the data store and provide the input path that contains the tripdata.csv file (s3://lf-workshop-<account-id>/glue/nyctaxi).
  5. You are going to crawl only one data store, so select No when asked to add another data store and click Next.
  6. Select the IAM role LF-GlueServiceRole from the dropdown list. This IAM role must have access to the S3 path you want to crawl.
  7. As in the last exercise, you will run this crawler on demand. Select Run on demand and click Next.
  8. You are going to write this crawler's output to the same database, glue-demo. Enter s3_ as the table prefix, leave the rest of the options at their defaults, and click Next.
  9. Verify all crawler information on the screen and click Finish to create the crawler.
  10. Click the Run it now link. Alternatively, you can select the crawler and run it from the Action menu.
  11. It may take up to 2 minutes for the crawler to finish crawling the tripdata.csv file. You should see a success message with the count of tables discovered by this crawler.
  12. Now go back to Databases -> Tables; you should see the table s3_nyctaxi. Click on it.
  13. Clicking the table displays its metadata, inferred schema, and properties. If the table had partitions or multiple versions, they would be displayed here as well. By looking at the table properties, you can explore the schema and learn more about the source.
  14. By running these two crawlers, you should now understand how crawlers work and how you can use a Glue crawler to discover schemas coming from different sources.
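The console steps above can also be performed programmatically through the Glue CreateCrawler API. Below is a minimal sketch of the request payload, assuming the boto3 SDK and the same names used in this exercise (nyctaxi-crawler, LF-GlueServiceRole, glue-demo); the actual API calls are left commented out so the snippet only builds the request:

```python
# Request payload for glue.create_crawler, mirroring the console choices above.
# All names (crawler, role, database, S3 path) are the ones used in this exercise.
crawler_config = {
    "Name": "nyctaxi-crawler",
    "Role": "LF-GlueServiceRole",   # IAM role with access to the S3 path
    "DatabaseName": "glue-demo",    # same database as the last exercise
    "TablePrefix": "s3_",           # discovered tables get this prefix
    "Targets": {
        "S3Targets": [
            # <account-id> is a placeholder; substitute your own account ID
            {"Path": "s3://lf-workshop-<account-id>/glue/nyctaxi"}
        ]
    },
    # No "Schedule" key: the crawler runs on demand, as in the walkthrough
}

# With AWS credentials configured, the crawler could then be created and run:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_config)
# glue.start_crawler(Name=crawler_config["Name"])
```

The on-demand behavior comes simply from omitting the Schedule parameter; adding a cron expression there would turn it into a scheduled crawler.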
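The table metadata you inspected in the console can likewise be read via the Glue GetTable API. The sketch below uses a hypothetical, abbreviated response dict (the column names and types are illustrative, not the real tripdata.csv schema) to show how the inferred schema could be extracted:

```python
# Sketch: extracting the inferred schema from a Glue GetTable response.
# `response` is a hypothetical, abbreviated example of what
# boto3's glue.get_table(DatabaseName="glue-demo", Name="s3_nyctaxi")
# might return; real responses contain many more fields.
response = {
    "Table": {
        "Name": "s3_nyctaxi",
        "StorageDescriptor": {
            "Columns": [
                # Illustrative columns only, not the actual discovered schema
                {"Name": "vendor_id", "Type": "bigint"},
                {"Name": "pickup_datetime", "Type": "string"},
                {"Name": "trip_distance", "Type": "double"},
            ],
            "Location": "s3://lf-workshop-<account-id>/glue/nyctaxi/",
        },
        "PartitionKeys": [],  # empty: this table has no partitions
    }
}

table = response["Table"]
schema = {col["Name"]: col["Type"] for col in table["StorageDescriptor"]["Columns"]}
print(schema)
# {'vendor_id': 'bigint', 'pickup_datetime': 'string', 'trip_distance': 'double'}
```

PartitionKeys is where partition columns would appear if the crawled S3 path had a partitioned layout, matching what the console shows on the table detail page.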