feat: add documentation for glue examples
Showing 6 changed files with 193 additions and 0 deletions.
## Describe your changes

---

## Issue ticket number and link

---

## Checklist before requesting a review

- [ ] I have performed a self-review of my code
- [ ] If it is a core feature, I have added thorough tests
- [ ] New feature / Breaking change
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes

---
## Writing PySpark Scripts in AWS Glue

Until now, we have been writing PySpark scripts that run on a local machine. In this section, we will learn how to write PySpark scripts that can be executed on AWS Glue. AWS Glue is a fully managed ETL service that makes it easy to move data between data stores and to clean and transform it. Glue lets you write ETL scripts in Python or Scala and execute them on a fully managed Spark environment.

- Pricing: You pay only for the resources used while your jobs are running. Based on how many DPUs you allocate to your job, AWS Glue allocates the corresponding number of Spark executors and cores. You can choose between 2 and 100 DPUs; the default is 10 DPUs. The cost per DPU-hour is `$0.44`, so always check the number of DPUs allocated to your job before running it (one way to check this with `boto3` is sketched after this list).
- AWS Glue is serverless. You do not need to provision any Spark clusters or manage any Spark infrastructure. AWS Glue automatically provisions and scales the resources required to run your job.

- AWS Glue is fully managed. You do not need to worry about patching, upgrading, or maintaining servers. You just select the version of Glue you want to use and AWS takes care of the rest. The current version of Glue is `4.0`.
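If you want to confirm the capacity assigned to an existing job before you run it, one option is to look it up with `boto3`. This is a minimal sketch, not part of the original examples; the job name `sample` is only an illustration:

```python
import boto3

glue = boto3.client("glue")

# Fetch the job definition and inspect its capacity settings
job_def = glue.get_job(JobName="sample")["Job"]
print("MaxCapacity (DPUs):", job_def.get("MaxCapacity"))
print("Worker settings:", job_def.get("WorkerType"), job_def.get("NumberOfWorkers"))
```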
## Modules

A Glue script imports a few classes from the `awsglue` package, plus `SparkContext` from PySpark. Let's take a look at them:

**SparkContext:** The `SparkContext` is responsible for managing the Spark cluster and its resources.

**GlueContext:** The `GlueContext` class provides an interface for interacting with AWS Glue services using Spark.

**Job:** An instance of the `Job` class is created and initialized using the Glue context. This job represents the data transformation task performed by the script. You need to initialize the job using the `init()` method, passing in the name of the job, and commit it using the `commit()` method when the job is complete.
You will write all the transformation logic and PySpark-related code between the `init()` and `commit()` calls of the job. See the example below.
```python
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Spark context setup
spark_context = SparkContext()

# Glue context setup
glue_context = GlueContext(spark_context)

# Spark session setup
spark_session = glue_context.spark_session

# Initialize the Glue job
job = Job(glue_context)
job.init("job-name")

# ... PySpark script goes here ...

job.commit()
```
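In practice, the job name is usually not hard-coded. When Glue launches the script it passes a `--JOB_NAME` argument, which you can resolve with `getResolvedOptions`. A small sketch of that pattern, building on the block above:

```python
import sys
from awsglue.utils import getResolvedOptions

# Resolve the arguments that Glue passes to the script
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

job = Job(glue_context)
job.init(args["JOB_NAME"], args)
```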
## Creating a Glue DynamicFrame

In this example, we will create a Glue DynamicFrame using the Data Catalog as the source. We will then convert the DynamicFrame to a PySpark DataFrame and perform some transformations on it.

## Glue DynamicFrame vs Spark DataFrame

A Glue DynamicFrame is similar to a Spark DataFrame, except that each record is self-describing, so no schema is required initially. This makes it easier to deal with semi-structured data, such as JSON. DynamicFrames are designed to provide maximum flexibility when dealing with messy data that may lack a declared schema.

Spark DataFrames, on the other hand, are designed to provide a typed view of structured data, which is the most common use case for Spark. Spark DataFrames are also more performant than DynamicFrames. Hence, you would typically convert a DynamicFrame to a Spark DataFrame as soon as possible in your ETL script.
```python
data_catalog = glue_context.create_dynamic_frame_from_catalog(
    database="data-engg-starter",        # Database name in the Data Catalog
    table_name="survey_results_public",  # Table name in the Data Catalog
    transformation_ctx="data_catalog",   # Transformation context
)
```
Notice the `create_dynamic_frame_from_catalog()` method, which creates a DynamicFrame from a Data Catalog table. You may be wondering what the Data Catalog is.

When working locally, we can use a CSV file as the source. In production, however, we could be dealing with huge datasets that are not a single file. In such cases, we can use a Data Catalog as the source. A Data Catalog is a central repository of metadata about your data. Think of it as a database of your source data, with the difference that it does not contain the actual data; rather, it contains metadata about your data, such as table definitions, schemas, and other control information. You can use the Data Catalog as a source for your ETL jobs.

So, if you have 10 million records across 100 CSV files, you can create a Data Catalog table with the schema of the CSV files and then use this table as the source for your ETL jobs. This way, you don't have to worry about the location or the schema of the individual CSV files; you just need to know the name of the table in the Data Catalog.
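One common way to populate the Data Catalog with such a table is to run a Glue crawler over the S3 location of the files. A minimal sketch using `boto3`; the crawler name, IAM role, and S3 path here are assumptions, not values from this repository:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans the CSV files and writes table metadata to the Data Catalog
glue.create_crawler(
    Name="survey-results-crawler",                            # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",    # hypothetical IAM role
    DatabaseName="data-engg-starter",
    Targets={"S3Targets": [{"Path": "s3://your-bucket/survey-results/"}]},  # hypothetical path
)

# Run the crawler; when it finishes, the table appears in the Data Catalog
glue.start_crawler(Name="survey-results-crawler")
```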
Follow the onboarding document to learn more about the Data Catalog.
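As mentioned above, you will usually convert the DynamicFrame to a Spark DataFrame early in the script. A minimal sketch of that round trip, using the `data_catalog` frame from the example above (the filter is only illustrative; the column value is an assumption):

```python
from awsglue.dynamicframe import DynamicFrame

# Convert the DynamicFrame to a Spark DataFrame for regular PySpark transformations
df = data_catalog.toDF()
df = df.filter(df["country"] == "India")  # illustrative transformation only

# Convert back to a DynamicFrame before using Glue writers
transformed = DynamicFrame.fromDF(df, glue_context, "transformed")
```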
## Apply Mappings to a DynamicFrame

In this example, we will apply mappings to a Glue DynamicFrame. This is useful when you want to change the schema of a DynamicFrame, for example to rename a column or to change a column's data type, such as converting a string column to a date or timestamp. This is a common need when reading data from a source where dates or timestamps are stored as strings.

In many scenarios, you will notice that Glue crawlers do not infer the correct data type for a column. For example, if a column contains a date or timestamp, the crawler may infer its data type as string because it does not know the format of the value. In that case, you need to apply mappings to the DynamicFrame to change the column's data type.
## Implementation

The AWS Glue `ApplyMapping` transform (from `awsglue.transforms`) provides an `apply()` method for applying mappings to a DynamicFrame. It takes the following parameters:

- frame: The DynamicFrame to apply the mappings to.
- mappings: A list of tuples that contain the mapping information. Each tuple contains the following fields:
  - `source`: The name of the source column.
  - `sourceType`: The data type of the source column.
  - `target`: The name of the target column.
  - `targetType`: The data type of the target column.

  ```python
  [
      ("source", "sourceType", "target", "targetType"),
      ("source", "sourceType", "target", "targetType"),
      ("source", "sourceType", "target", "targetType"),
  ]
  ```
- transformation_ctx: A unique string that is used to identify the transformation.
```python
df = ApplyMapping.apply(
    frame=data_catalog,
    mappings=[
        ("country", "string", "country", "string"),  # keep the name and type unchanged
        ("dob", "string", "dob", "timestamp"),       # cast the string column to a timestamp
        ("age", "string", "age", "int"),             # cast the string column to an int
    ],
    transformation_ctx="df",
)
```
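To verify that the new types were applied, you can print the schema of the resulting frame. This is just a quick sanity check, not part of the original example:

```python
# 'dob' should now be a timestamp and 'age' an int
df.printSchema()
```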
## Writing DynamicFrames to a Target

This example demonstrates how to write a DynamicFrame to a target. Here, we will write the DynamicFrame to an Amazon S3 bucket using the `write_dynamic_frame.from_options()` method of the `GlueContext`.

### Implementation

The AWS Glue `GlueContext` class provides a writer for persisting a DynamicFrame to a target. Its `from_options()` method takes the following parameters:

- frame: The DynamicFrame to write to the target.
- connection_type: The type of connection to use when writing the DynamicFrame, for example:
  - `s3`: Write the DynamicFrame to an Amazon S3 bucket.
  - `dynamodb`: Write the DynamicFrame to an Amazon DynamoDB table.
  - A JDBC database type such as `mysql`, `postgresql`, or `redshift`: Write the DynamicFrame to that database.
- connection_options: A dictionary that contains the connection options. The options depend on the connection type you are using. For an Amazon S3 connection, for example, you provide:
  - `path`: The path to the Amazon S3 bucket.
  - `partitionKeys`: A list of partition keys (optional).
- format: The output format, for example `parquet`.

Note that writing to an AWS Glue Data Catalog table uses `write_dynamic_frame.from_catalog()` rather than `from_options()`.
```python
glue_context.write_dynamic_frame.from_options(
    frame=raw_data,
    connection_type="s3",
    connection_options={"path": "s3://sample-glue-wednesday/data/"},
    format="parquet",
)
```
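If you want the output partitioned by one or more columns, you can add `partitionKeys` to the connection options. A minimal variation of the example above, partitioning by a `country` column (assuming that column exists in `raw_data`):

```python
glue_context.write_dynamic_frame.from_options(
    frame=raw_data,
    connection_type="s3",
    connection_options={
        "path": "s3://sample-glue-wednesday/data/",
        "partitionKeys": ["country"],  # one S3 prefix per country value
    },
    format="parquet",
)
```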
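Putting the previous examples together, the complete script added in this commit looks like this: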
```python
from awsglue.job import Job
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

# Spark context setup
sc = SparkContext()

# Glue context setup
glue_context = GlueContext(sc)

# Spark session setup
spark_session = glue_context.spark_session

# Initialize the Glue job
job = Job(glue_context)
job.init("sample")

# Read the source table from the Data Catalog as a DynamicFrame
data_catalog = glue_context.create_dynamic_frame_from_catalog(
    database="data-engg-starter",
    table_name="survey_results_public",
    transformation_ctx="data_catalog",
    additional_options={},
)

# Select and rename the columns we need
raw_data = ApplyMapping.apply(
    frame=data_catalog,
    mappings=[
        ("country", "string", "Country", "string"),
        ("op_sys_personal_use", "string", "OpSysPersonal use", "string"),
    ],
    transformation_ctx="raw_data",
)

# Write the result to an S3 bucket as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=raw_data,
    connection_type="s3",
    connection_options={"path": "s3://sample-glue-wednesday/data/"},
    format="parquet",
)

job.commit()
```