feat: add documentation for glue examples
idipanshu committed Aug 7, 2023
1 parent e4ff40e commit 1e8cae6
Showing 6 changed files with 193 additions and 0 deletions.
17 changes: 17 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,17 @@
## Describe your changes

---

## Issue ticket number and link

---

## Checklist before requesting a review

- [ ] I have performed a self-review of my code
- [ ] If it is a core feature, I have added thorough tests
- [ ] New feature / Breaking change
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes

---
40 changes: 40 additions & 0 deletions examples/07_pyspark_in_glue_jobs/README.md
@@ -0,0 +1,40 @@
## Writing PySpark Scripts in AWS Glue

Until now, we have been writing PySpark scripts that run on a local machine. In this section, we will learn how to write PySpark scripts that run on AWS Glue. AWS Glue is a fully managed ETL service that makes it easy to move data between data stores and to clean and transform it. Glue lets you write ETL scripts in Python or Scala and execute them on a fully managed Spark environment.

- Pricing: You pay only for the resources used while your jobs are running. AWS Glue allocates Spark executors and cores based on the number of DPUs (Data Processing Units) you assign to the job. You can choose between 2 and 100 DPUs; the default is 10. The cost per DPU-hour is `$0.44`, so a job that runs for 30 minutes on the default 10 DPUs costs 10 × 0.5 × $0.44 = $2.20. Always check the number of DPUs allocated to your job before running it.

- AWS Glue is serverless. You do not need to provision any Spark clusters or manage any Spark infrastructure. AWS Glue will automatically provision and scale the resources required to run your job.

- AWS Glue is fully managed. You do not need to worry about patching, upgrading, or maintaining any servers. You just need to select the version of Glue you want to use and AWS will take care of the rest. The current version of Glue is `4.0`.


## Modules

There are a few classes imported from the `awsglue` package (along with `SparkContext` from PySpark). Let's take a look at them:

**SparkContext:** The `SparkContext` is responsible for managing the Spark cluster and its resources.

**GlueContext:** The `GlueContext` class provides an interface for interacting with AWS Glue services using Spark.

**Job:** An instance of the `Job` class is created and initialized using the Glue context. This job will represent the data transformation task performed by the script.
You need to initialize the job using the `init()` method and pass in the name of the job. When the job is complete, you need to commit it using the `commit()` method.

You will write all the transformation logic and PySpark-related code between the `init()` and `commit()` calls of the job. See the example below.


```python
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Spark context setup
spark_context = SparkContext()

# Glue context setup
glue_context = GlueContext(spark_context)

# Spark session setup
spark_session = glue_context.spark_session

# Initialize glue job
job = Job(glue_context)
job.init("job-name")

# ... PySpark transformation logic goes here ...

job.commit()
```
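
The job name is hard-coded above for simplicity. In a deployed Glue job, the name is usually resolved from the job arguments instead; a minimal sketch using `getResolvedOptions` from `awsglue.utils` (assuming the `glue_context` set up above, with `JOB_NAME` being the argument Glue passes to every job):

```python
import sys

from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Resolve the job name (and any other expected arguments) from the
# arguments AWS Glue passes to the script at run time
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

job = Job(glue_context)
job.init(args["JOB_NAME"], args)
```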
26 changes: 26 additions & 0 deletions examples/08_glue_dynamic_frame/README.md
@@ -0,0 +1,26 @@
## Creating a Glue DynamicFrame

In this example, we will create a Glue DynamicFrame using the Data Catalog as the source. We will then convert the DynamicFrame to a PySpark DataFrame and perform some transformations on it.

## Glue Dynamic Frame vs Spark DataFrame

A Glue dynamic frame is similar to a Spark DataFrame, except that each record is self-describing, so no schema is required initially. This makes it easier to deal with semi-structured data, such as JSON data. DynamicFrames are designed to provide maximum flexibility when dealing with messy data that may lack a declared schema.

On the other hand, Spark DataFrames are designed to provide a typed view of structured data, and structured data is the most common use case for Spark. Spark DataFrames are also more performant than DynamicFrames. Hence, you would typically convert a DynamicFrame to a Spark DataFrame as soon as possible in your ETL script.


```python
data_catalog = glue_context.create_dynamic_frame_from_catalog(
    database="data-engg-starter",  # Database name in Data Catalog
    table_name="survey_results_public",  # Table name in Data Catalog
    transformation_ctx="data_catalog",  # Transformation context
)
```

Notice the `create_dynamic_frame_from_catalog()` method, which creates a DynamicFrame from a Data Catalog table. You may wonder what a Data Catalog is.

When working locally, we can use a CSV file as the source. In production, however, we are often dealing with huge datasets that do not fit in a single file. In such cases, we can use a Data Catalog as the source. A Data Catalog is a central repository of metadata about your data. Think of it as a database of your source data, with the difference that it does not contain the actual data; rather, it holds metadata such as table definitions, schemas, and other control information. You can use the Data Catalog as a source for your ETL jobs.

So, if you have 10 million records spread across 100 CSV files, you can create a Data Catalog table with the schema of those files and use that table as the source for your ETL jobs. This way, you don't have to worry about the location or schema of the individual CSV files; you only need to know the name of the table in the Data Catalog.

Follow the onboarding document to learn more about Data Catalog.
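
As noted above, you would typically convert the DynamicFrame to a Spark DataFrame as early as possible in the script. A minimal sketch, assuming the `data_catalog` frame and `glue_context` from the example above (the `country` column and filter value are used purely for illustration):

```python
from awsglue.dynamicframe import DynamicFrame

# Convert the DynamicFrame to a Spark DataFrame for regular PySpark transformations
df = data_catalog.toDF()
india_df = df.filter(df["country"] == "India")

# Convert back to a DynamicFrame before using Glue-specific writers
india_dynamic_frame = DynamicFrame.fromDF(india_df, glue_context, "india_dynamic_frame")
```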
37 changes: 37 additions & 0 deletions examples/09_apply_mappings/README.md
@@ -0,0 +1,37 @@
## Apply Mappings to a DynamicFrame

In this example, we will apply mappings to a Glue DynamicFrame. This is useful when you want to change the schema of a DynamicFrame, for example to rename a column or to change its data type, such as converting a string column to a date or timestamp. This is a common need when you are reading data from a source that stores dates or timestamps as strings.

In many scenarios, you will notice that Glue Crawlers will not infer the correct data type for a column. For example, if you have a column that contains a date or timestamp, the crawler will infer the data type as string. This is because the crawler does not know the format of the date or timestamp. In this case, you will need to apply mappings to the DynamicFrame to change the data type of the column.

## Implementation

The `ApplyMapping` class from the `awsglue.transforms` module provides an `apply()` method for applying mappings to a DynamicFrame. This method takes the following parameters:

- frame: The DynamicFrame to apply the mappings to.
- mappings: A list of tuples that contains the mapping information. Each tuple contains the following information:
  - `source`: The name of the source column.
  - `sourceType`: The data type of the source column.
  - `target`: The name of the target column.
  - `targetType`: The data type of the target column.

  ```python
  [
      ("source", "sourceType", "target", "targetType"),
      ("source", "sourceType", "target", "targetType"),
      ("source", "sourceType", "target", "targetType"),
  ]
  ```
- transformation_ctx: A unique string that is used to identify the transformation.


```python
df = ApplyMapping.apply(
    frame=data_catalog,
    mappings=[
        ("country", "string", "country", "string"),
        ("dob", "string", "dob", "timestamp"),
        ("age", "string", "age", "int"),
    ],
    transformation_ctx="df",
)
```
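
To confirm that the mappings were applied, you can inspect the resulting schema. A quick check, assuming the `df` frame produced above:

```python
# Print the schema of the mapped DynamicFrame
df.printSchema()

# Or convert to a Spark DataFrame and preview a few rows
df.toDF().show(5)
```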
29 changes: 29 additions & 0 deletions examples/10_write_to_target/README.md
@@ -0,0 +1,29 @@
## Writing DynamicFrames to a Target

This example demonstrates how to write a DynamicFrame to a target. Here, we will write the DynamicFrame to an Amazon S3 bucket using the `write_dynamic_frame.from_options()` method of the `GlueContext` class.

### Implementation

The AWS Glue `GlueContext` class provides the `write_dynamic_frame.from_options()` method to write a DynamicFrame to a target. This method takes the following parameters:

- frame: The DynamicFrame to write to the target.
- connection_type: The type of connection to use to write the DynamicFrame to the target. Common values include:
  - `s3`: Write the DynamicFrame to an Amazon S3 bucket.
  - `dynamodb`: Write the DynamicFrame to an Amazon DynamoDB table.
  - JDBC types such as `mysql`, `postgresql`, `redshift`, `oracle`, and `sqlserver`: Write the DynamicFrame to a relational database.
  - `custom.*` and `marketplace.*`: Write the DynamicFrame through a custom or AWS Marketplace connector.

- connection_options: A dictionary that contains the connection options. The options depend on the connection type you are using. For an Amazon S3 connection, you provide the following options:

  - `path`: The path to the Amazon S3 bucket.
  - `partitionKeys`: A list of columns to partition the output by (optional; see the partitioned-write sketch after the example below).

- format: The output file format, such as `parquet`, `json`, or `csv`.

```python
glue_context.write_dynamic_frame.from_options(
    frame=raw_data,
    connection_type="s3",
    connection_options={"path": "s3://sample-glue-wednesday/data/"},
    format="parquet",
)
```
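
The `partitionKeys` option mentioned above is not used in this example. A minimal sketch of a partitioned write, assuming `raw_data` contains a `Country` column (as produced by the mapping in `main.py` below):

```python
# Write the DynamicFrame to S3, creating one sub-folder per Country value
glue_context.write_dynamic_frame.from_options(
    frame=raw_data,
    connection_type="s3",
    connection_options={
        "path": "s3://sample-glue-wednesday/data/",
        "partitionKeys": ["Country"],
    },
    format="parquet",
)
```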
44 changes: 44 additions & 0 deletions examples/10_write_to_target/main.py
@@ -0,0 +1,44 @@
from awsglue.job import Job
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext


# Spark context setup
sc = SparkContext()

# Glue context setup
glue_context = GlueContext(sc)

# Spark session setup
spark_session = glue_context.spark_session

# Initialize glue job
job = Job(glue_context)
job.init("sample")

data_catalog = glue_context.create_dynamic_frame_from_catalog(
    database="data-engg-starter",
    table_name="survey_results_public",
    transformation_ctx="data_catalog",
    additional_options={},
)

raw_data = ApplyMapping.apply(
    frame=data_catalog,
    mappings=[
        ("country", "string", "Country", "string"),
        ("op_sys_personal_use", "string", "OpSysPersonal use", "string"),
    ],
    transformation_ctx="raw_data",
)

# Write to an S3 bucket
glue_context.write_dynamic_frame.from_options(
    frame=raw_data,
    connection_type="s3",
    connection_options={"path": "s3://sample-glue-wednesday/data/"},
    format="parquet",
)

job.commit()
