You can use an AWS Glue development endpoint to iteratively develop and test your extract, transform, and load (ETL) scripts. You can create, edit, and delete development endpoints using the AWS Glue console or API. When you create a development endpoint, you provide configuration values to provision the development environment. These values tell AWS Glue how to set up the network so that you can access the endpoint securely and the endpoint can access your data stores.
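As a rough illustration of the API route, the sketch below assembles the configuration values and passes them to the boto3 Glue client's `create_dev_endpoint` call, then polls `get_dev_endpoint` until the endpoint reports READY. The role ARN, subnet ID, security group ID, region, and node count are placeholders, not values from this exercise, and the helper function names are made up for this example.

```python
import time


def build_dev_endpoint_request(name, role_arn, subnet_id, security_group_ids,
                               number_of_nodes=5):
    """Assemble keyword arguments for the Glue CreateDevEndpoint API call."""
    return {
        "EndpointName": name,
        "RoleArn": role_arn,
        # Network settings: these let the endpoint reach your data stores.
        "SubnetId": subnet_id,
        "SecurityGroupIds": list(security_group_ids),
        "NumberOfNodes": number_of_nodes,  # placeholder capacity value
    }


def create_dev_endpoint(request, region_name="us-east-1"):
    """Send the request to AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers work without boto3 installed

    glue = boto3.client("glue", region_name=region_name)
    glue.create_dev_endpoint(**request)
    return glue


def wait_until_ready(glue, name, poll_seconds=30, max_polls=40):
    """Poll the endpoint status until it becomes READY."""
    for _ in range(max_polls):
        status = glue.get_dev_endpoint(EndpointName=name)["DevEndpoint"]["Status"]
        if status == "READY":
            return status
        if status == "FAILED":  # assumed failure status; adjust if needed
            raise RuntimeError(f"Endpoint {name} failed to provision")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {name} was not READY after {max_polls} polls")


if __name__ == "__main__":
    # Hypothetical identifiers; replace them with values from your account.
    request = build_dev_endpoint_request(
        name="glue-dev",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        subnet_id="subnet-0example",
        security_group_ids=["sg-0example"],
    )
    print(request)
    # Uncomment to create the endpoint for real:
    # glue = create_dev_endpoint(request)
    # wait_until_ready(glue, request["EndpointName"])
```

The request is built separately from the API call so that you can inspect or log the configuration before anything is provisioned.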
You can then create a notebook that connects to the endpoint, and use your notebook to author and test your ETL script. When you're satisfied with the results of your development process, you can create an ETL job that runs your script. With this process, you can add functions and debug your scripts interactively.
In this exercise, you will learn how to create a Glue development endpoint and how to run and debug scripts on it.
- Go to the Glue console, click Dev endpoints in the left navigation pane, and then click Add endpoint.
- Name your Glue endpoint glue-dev, select the same IAM role that you used in the other exercises, and click Next.
- In this section, you provide the network configuration for the Glue endpoint. You already provided a network configuration when you created the Glue connection, so you will reuse it here. Glue reads the configuration from the connection and applies it to your endpoint. Select mysql-connector, then click Next to proceed.
- This section is optional. If you want to connect to the notebook from your laptop or a local notebook client, you provide a public key here, and that key grants you access to the remote notebook instance. Leave this option empty and click Next.
- Review all your configuration and click Finish to create your Glue development endpoint.
- Creating the Glue endpoint may take 4-8 minutes. You can follow its status on the Glue endpoint console.
- Once it's completed, the status will change to READY.
- Now that the development endpoint is ready, you need an interface and an interpreter to communicate with it and run your code. You have two options: (a) a Zeppelin notebook or (b) a SageMaker notebook. For this exercise, you will use a SageMaker notebook. Click the development endpoint and select Create SageMaker notebook from the Action dropdown.
- Configure your notebook: give it a name and create a new IAM role for it.
- Glue will automatically pick up the rest of the configuration from your Glue endpoint. Leave the rest of the options unchanged and click on Create notebook.
- It will take a few minutes to start the SageMaker notebook. Once it's ready, the status will change from Starting to Running. Then select the notebook instance from the list and click Open notebook.
- The default SageMaker notebook comes with a few Glue Examples. In this exercise, you will create a new notebook to explore the data that you crawled in the previous exercise.
- Click New in the right corner and select Sparkmagic (PySpark). A new notebook opens; give it a name.
- In the new notebook, you will create two sections: section (1) initializes the Spark and Glue contexts, and section (2) queries the NYC taxi data using the existing catalog.
- You can download the above sample script from the following attachment section.
- By running a few PySpark blocks, you have learned how to use a Glue endpoint to develop your Glue ETL scripts and debug them interactively.
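The two notebook sections described in the steps above might look roughly like the sketch below. It is not the workshop's attached script. It assumes the code runs in the Sparkmagic (PySpark) kernel, where the `pyspark` and `awsglue` modules are available, and the database and table names in the usage comments are hypothetical; substitute the names your crawler created in the Data Catalog.

```python
def init_glue_context():
    """Section 1: initialize the Spark and Glue contexts."""
    # These modules are provided by the Glue environment behind the notebook.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    return glue_context, glue_context.spark_session


def load_catalog_table(glue_context, database, table_name):
    """Section 2: read a crawled table from the Glue Data Catalog."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
    )


def explore(dynamic_frame, rows=5):
    """Print the schema, the record count, and a few sample rows."""
    dynamic_frame.printSchema()
    print("Record count:", dynamic_frame.count())
    dynamic_frame.toDF().show(rows)  # convert to a Spark DataFrame to display


# Typical use in notebook cells (names below are placeholders):
# glue_context, spark = init_glue_context()
# taxi = load_catalog_table(glue_context, "example_taxi_db", "example_trips")
# explore(taxi)
```

Running each call in its own cell keeps the contexts alive between cells, so you can change the query and rerun it without reinitializing Spark.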