You can also use a Python shell job to run Python scripts as a shell in AWS Glue. With a Python shell job, you can run scripts that are compatible with Python 2.7 or Python 3.6. The environment for running a Python shell job supports libraries such as: Boto3, collections, CSV, gzip, multiprocessing, NumPy, pandas, pickle, PyGreSQL, re, SciPy, sklearn, xml.etree.ElementTree, zipfile.
In the previous exercise, you used PySpark to transform NYC Taxi .csv data to Parquet. In this exercise, you will use pandas in a Python shell job to convert the same input data to JSON.
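To make the conversion concrete, here is a minimal sketch of a CSV-to-JSON step with pandas. It is not the workshop's script, which is not reproduced here. The function name `csv_to_json_lines` and the sample column names are hypothetical, and the sketch assumes one JSON object per row (JSON Lines) is an acceptable output shape.

```python
import io

import pandas as pd


def csv_to_json_lines(csv_text: str) -> str:
    """Convert CSV text to JSON Lines: one JSON object per input row."""
    # Parse the CSV into a DataFrame; the header row supplies the keys.
    df = pd.read_csv(io.StringIO(csv_text))
    # orient="records" with lines=True emits one JSON object per line.
    return df.to_json(orient="records", lines=True)


if __name__ == "__main__":
    # Hypothetical sample rows; the real NYC Taxi columns may differ.
    sample = "vendor_id,passenger_count,fare_amount\n1,2,7.5\n2,1,12.0\n"
    print(csv_to_json_lines(sample))
```

In a real job the input would probably come from S3 rather than from an in-memory string, but the pandas calls would stay the same.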
- Click the Jobs option on the left, and then click the Add job button.
- Name your Python Shell job nyctaxi-csv-to-json, and select LF-GlueServiceRole from the IAM role drop-down list. Select Python Shell as the job type. The CloudFormation template has already placed a Python script in your workshop bucket and lists its location in the CloudFormation output. Enter the Python script path as shown below:
- Leave the rest of the fields as they are and click Next. The next section shows the connection options. You can use an existing Glue Connection in your script, but this exercise does not need one. Click Save job and edit script.
- The existing Python script opens in the Glue console. The script takes one input parameter, the name of the bucket. The key for the parameter is --bucket.
- Click Run job and expand the second toggle, which contains the job parameters.
- In the Job parameters section, enter --bucket as the key and your bucket name (from the CloudFormation output) as the value, then click Run job.
- You can exit from the script window and check the job status by selecting the job from the list.
- A successful job will create a JSON file in the path where the input is stored. Go to your S3 console to verify the output file.
- You can download the JSON file and verify the content by opening it up in a text editor.
- Having run the last two exercises, you should now understand Glue ETL and how to create, configure, and run different types of Glue jobs in a serverless manner.
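The console steps above can also be sketched in code. The example below shows, under stated assumptions, how a script might read the `--bucket` parameter and how the same job could be started and checked through the Boto3 Glue client (`start_job_run` and `get_job_run`). The helper names `parse_bucket`, `start_job`, and `job_state` are hypothetical. `argparse` is used here only so the sketch runs locally; inside Glue, scripts commonly read parameters with `awsglue.utils.getResolvedOptions` instead.

```python
import argparse


def parse_bucket(argv: list) -> str:
    """Read the --bucket job parameter from an argument list."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--bucket", required=True)
    # parse_known_args ignores any extra arguments that may also be passed.
    args, _unknown = parser.parse_known_args(argv)
    return args.bucket


def start_job(glue_client, job_name: str, bucket: str) -> str:
    """Start a Glue job run with --bucket as a job parameter; return the run ID."""
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--bucket": bucket},  # same key/value pair entered in the console
    )
    return response["JobRunId"]


def job_state(glue_client, job_name: str, run_id: str) -> str:
    """Return the current state of a job run, for example RUNNING or SUCCEEDED."""
    response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
    return response["JobRun"]["JobRunState"]
```

With valid AWS credentials, `start_job(boto3.client("glue"), "nyctaxi-csv-to-json", bucket_name)` would start the job created in this exercise, where `bucket_name` stands for the bucket from your CloudFormation output. Passing the client in as an argument keeps the helpers easy to exercise without an AWS account.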