Spark Jobs

An Apache Spark ETL job consists of the business logic that performs ETL work in AWS Glue. You can monitor job runs to understand runtime metrics such as success, duration, and start time. The output of a job is your transformed data, written to a location that you specify.
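
The runtime metrics mentioned above (success, duration, start time) can also be read programmatically. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured for the live call. The helper names `summarize_job_runs` and `fetch_job_runs` and the sample record are hypothetical and not part of the workshop. The field names follow the shape of the Glue `get_job_runs` response.

```python
from datetime import datetime, timezone


def summarize_job_runs(job_runs):
    """Reduce Glue job-run records to success, duration, and start time."""
    summary = []
    for run in job_runs:
        summary.append({
            "run_id": run["Id"],
            "succeeded": run["JobRunState"] == "SUCCEEDED",
            # ExecutionTime is reported in seconds; default to 0 if absent.
            "duration_seconds": run.get("ExecutionTime", 0),
            "started_on": run["StartedOn"].isoformat(),
        })
    return summary


def fetch_job_runs(job_name):
    """Fetch run records from Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the summary logic runs without AWS access
    return boto3.client("glue").get_job_runs(JobName=job_name)["JobRuns"]


# Hand-written sample record for illustration only, not real job output.
sample_runs = [{
    "Id": "jr_example",
    "JobRunState": "SUCCEEDED",
    "ExecutionTime": 240,
    "StartedOn": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
}]
print(summarize_job_runs(sample_runs))
```

Against a real account, you would pass the result of `fetch_job_runs("nyctaxi-csv-to-parquet")` to `summarize_job_runs` once the job from this exercise has run at least once.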

In the previous step, you crawled a .csv file in S3 and discovered the schema of the NYC taxi data. In this exercise, you will create a Spark job in Glue that reads that source and writes it to S3 in Parquet format.
  1. Click Jobs in the left navigation pane, then click the Add job button.
  2. Name the job nyctaxi-csv-to-parquet and select LF-GlueServiceRole from the IAM role drop-down list. Because you are creating a Spark ETL job, leave the remaining fields at their defaults and move to the fields for the script location and temporary directory. The CloudFormation template already created the required bucket. Enter s3://lf-workshop-<account-id>/glue/scripts as the script location and s3://lf-workshop-<account-id>/glue/temp as the temporary directory, then proceed to the next window.
  3. In this section, you define your source. Because you are going to transform the NYC taxi data to Parquet, select s3_nyctaxi as the source.
  4. Here you choose how to transform your source into a different dataset. You can also find matching records using ML transforms. Select Change schema and proceed to the next step.
  5. You have multiple options for the target: you can update an existing catalog table or create a new target. For this exercise, you are going to write to S3 in Parquet format. Enter the values accordingly, using the same bucket that was created for this Glue exercise.
  6. Glue automatically detects the mapping between source and target. You can change the mapping to suit your requirements, e.g. rename, drop, add, or merge columns. For this exercise, leave the mappings as they are and click Save job and edit script.
  7. Glue generates the script based on your configuration. You have full control over the script: you can add business logic, modify the source or target, or change the mappings. Once you are satisfied with the script, run it by clicking the Run job button.
  8. Clicking Run job opens a configuration window. Here, you provide additional configuration for the job. You can pass tags, change monitoring options, provide a library path, or configure the worker type and resource size. For this exercise, leave all options at their defaults and click the Run job button.
  9. When the job is running, you will notice that the Run job button is grayed out with a circular-spinner icon.
  10. Exit the script window. You can monitor additional job metrics by going back to the Glue job console and selecting the individual job. The Metrics tab displays job metrics, and the History tab shows job history.
  11. The Run status column shows the job status. This Spark job may take from 2 to 12 minutes. Once the job completes successfully, the status changes to Succeeded.
  12. If the job finished successfully, it should have created Parquet output in the target location selected in Step 5. Open the S3 console and check the target location. It should be populated with Parquet files.
  13. You have now seen how Glue auto-generates a PySpark script for your ETL job. It can also generate Scala code if you follow the same steps. In the next exercise, you will learn how to run Python-based jobs using the Glue Python shell.
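
The console choices from step 2 can also be expressed as a job definition for the Glue API. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured for the live call. The helper names `build_job_definition` and `create_and_start`, the script file name, and the account ID are hypothetical placeholders. The job name, role, bucket layout, and temporary directory come from the steps above.

```python
def build_job_definition(account_id,
                         job_name="nyctaxi-csv-to-parquet",
                         role="LF-GlueServiceRole"):
    """Return keyword arguments mirroring the console choices from step 2."""
    bucket = f"s3://lf-workshop-{account_id}"
    return {
        "Name": job_name,
        "Role": role,
        "Command": {
            "Name": "glueetl",  # Glue's command name for Spark ETL jobs
            # Hypothetical script file name under the scripts prefix.
            "ScriptLocation": f"{bucket}/glue/scripts/{job_name}.py",
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            "--TempDir": f"{bucket}/glue/temp",
            "--job-language": "python",
        },
    }


def create_and_start(definition):
    """Create the job and start one run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the definition can be built offline
    glue = boto3.client("glue")
    glue.create_job(**definition)
    return glue.start_job_run(JobName=definition["Name"])["JobRunId"]


# Placeholder account ID for illustration only; substitute your own.
definition = build_job_definition("123456789012")
print(definition["Command"]["ScriptLocation"])
print(definition["DefaultArguments"]["--TempDir"])
```

Unlike the console wizard, the API does not generate the script for you, so a script must already exist at the `ScriptLocation` path before `create_and_start(definition)` can run the job successfully.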