feat(glue): Job construct #12506

Merged
merged 67 commits into from
Sep 8, 2021
Changes from 62 commits
c158a05
feat(aws-glue): add Job construct (#12443)
humanzz Jan 10, 2021
9ea48b6
support job's event rules and rule-based metrics
humanzz Jan 14, 2021
638129c
add metric helper method
humanzz Jan 14, 2021
b99dbce
add JobSpecialArgumentNames for glue special parameters
humanzz Jan 15, 2021
06b4d55
rebase to use Connection and SecurityConfiguration
humanzz Feb 17, 2021
6bcdad8
address some comments
humanzz Jul 27, 2021
e0eb941
Merge branch 'master' into glue-job
humanzz Jul 27, 2021
b2d866f
improve docs
humanzz Jul 27, 2021
29ff157
rename JobCommandName constants and JobCommand methods
humanzz Jul 28, 2021
c615c8e
Merge branch 'master' into glue-job
humanzz Jul 29, 2021
c31d7e4
drop unnecessry toString() methods
humanzz Jul 30, 2021
f05afa7
indicate PythonVersion.TWO is the default for JobCommand
humanzz Jul 30, 2021
5a46dc5
make metricRule and buildJobArn protected
humanzz Jul 30, 2021
f59d688
address more comments
humanzz Jul 30, 2021
fba208d
drop jobRunId from metric()'s arguments
humanzz Aug 2, 2021
6032807
change how event.Rules caching is done
humanzz Aug 2, 2021
8b74200
Merge branch 'master' into glue-job
humanzz Aug 2, 2021
98c19df
drop JobSpecialArgumentNames
humanzz Aug 2, 2021
933ebcd
introduce JobExecutable and refactor accordingly
humanzz Aug 11, 2021
6b3d31e
refactor JobExecutable and add more tests
humanzz Aug 12, 2021
2f02798
add enableProfilingMetrics to JobProps
humanzz Aug 12, 2021
52ae192
Merge branch 'master' into glue-job
humanzz Aug 12, 2021
fe08089
add @aws-cdk/assert-internal to package.json after merge
humanzz Aug 12, 2021
55b5ee4
add sparkUI optional prop to JobProps
humanzz Aug 12, 2021
3235ae7
add continuousLogging optional prop to JobProps
humanzz Aug 13, 2021
ee74a8a
Merge branch 'master' into glue-job
humanzz Aug 13, 2021
1367813
Merge branch 'master' into glue-job
humanzz Aug 20, 2021
c3c9e80
Merge branch 'master' into glue-job
humanzz Aug 24, 2021
0ff9d01
add GlueVersion.V3_0
humanzz Aug 24, 2021
cd27ef7
Merge branch 'master' into glue-job
humanzz Aug 24, 2021
4353f22
Merge branch 'master' into glue-job
humanzz Aug 30, 2021
fb7e376
address smaller comments
humanzz Aug 30, 2021
818be70
Merge branch 'master' into glue-job
humanzz Aug 30, 2021
7072a6f
address metric comments
humanzz Aug 31, 2021
e860f4d
take 1 at glue.Code (not fulyl tested)
humanzz Aug 31, 2021
1068229
test glue.Code
humanzz Aug 31, 2021
bf896fa
Merge branch 'master' into glue-job
humanzz Aug 31, 2021
9f5f85a
address some comments
humanzz Aug 31, 2021
0f587c9
fix build issues from previous round of comments
humanzz Aug 31, 2021
87dee59
address comments
humanzz Aug 31, 2021
0ded0f2
refactor JobExecutableProps
humanzz Aug 31, 2021
82c1d98
drop @aws-cdk/aws-s3-assets from devDependencies
humanzz Aug 31, 2021
c81e736
restore docs about individual files support
humanzz Aug 31, 2021
8ce0fe8
apply suggestions from comments
humanzz Sep 1, 2021
2ed79e1
add optional role to JobAttributes
humanzz Sep 1, 2021
1eb9c20
drop @aws-cdk/assert-internal in favour of @aws-cdk/assertions
humanzz Sep 1, 2021
16e3265
Merge branch 'master' into glue-job
humanzz Sep 1, 2021
7df0c1f
update README
humanzz Sep 1, 2021
8884ad7
increase test coverage to 100% for the new files
humanzz Sep 1, 2021
13b03e8
increase test coverage to 100%
humanzz Sep 1, 2021
2b5f47e
Merge branch 'master' into glue-job
humanzz Sep 1, 2021
bc82d60
tweak tests
humanzz Sep 3, 2021
550b919
Merge branch 'master' into glue-job
humanzz Sep 3, 2021
70b3e24
tweak tests #2
humanzz Sep 3, 2021
a500cf2
Merge branch 'master' into glue-job
humanzz Sep 3, 2021
32ba2ae
remove role from IJob
BenChaimberg Sep 8, 2021
babd3ec
address some comments
humanzz Sep 8, 2021
7379066
Merge branch 'master' into glue-job
humanzz Sep 8, 2021
b62a868
simplify job.test.ts
humanzz Sep 8, 2021
80c3f15
simplify testing success/failure/timeout rules and metrics
humanzz Sep 8, 2021
094929c
better handling for extraPythonFiles with non-Python jobs
humanzz Sep 8, 2021
f01c0be
update integ.job.ts
humanzz Sep 8, 2021
cd2d2ee
fix issues identified trying to run jobs from integ tests
humanzz Sep 8, 2021
ea32eab
update integ test verification documentation
humanzz Sep 8, 2021
98cc575
update Code.bind signature and PythonShell supported glue versions
humanzz Sep 8, 2021
0328615
narrow the permissions granted by S3Code
humanzz Sep 8, 2021
6bb18eb
Merge branch 'master' into glue-job
humanzz Sep 8, 2021
82 changes: 72 additions & 10 deletions packages/@aws-cdk/aws-glue/README.md
@@ -23,6 +23,69 @@

This module is part of the [AWS Cloud Development Kit](https://github.com/aws/aws-cdk) project.

## Job

A `Job` encapsulates a script that connects to data sources, processes them, and then writes output to a data target.

There are 3 types of jobs supported by AWS Glue: Spark ETL, Spark Streaming, and Python Shell jobs.

The `glue.JobExecutable` allows you to specify the type of job, the language to use and the code assets required by the job.

`glue.Code` allows you to refer to the different code assets required by the job, either from an existing S3 location or from a local file path.
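As a rough illustration of what a job executable bundles together (job type, language, and code location), the sketch below models the three job types with stand-in types. `ExecutableSketch`, `JobKind`, `COMMAND_NAMES` and `describeExecutable` are hypothetical names made up for this example and are not part of `@aws-cdk/aws-glue`; the command strings are the names AWS Glue uses for the three job commands.

```ts
// Stand-in types for illustration only; these are NOT the module's real classes.
type JobLanguage = 'scala' | 'python';
type JobKind = 'etl' | 'streaming' | 'pythonshell';

interface S3LocationSketch {
  bucketName: string;
  objectKey: string;
}

interface ExecutableSketch {
  kind: JobKind;
  language: JobLanguage;
  script: S3LocationSketch;
  className?: string; // only meaningful for Scala jobs
}

// Glue job command names for the three job types.
const COMMAND_NAMES: Record<JobKind, string> = {
  etl: 'glueetl',
  streaming: 'gluestreaming',
  pythonshell: 'pythonshell',
};

// Validate the combination and render a one-line summary of the executable.
function describeExecutable(e: ExecutableSketch): string {
  if (e.kind === 'pythonshell' && e.language !== 'python') {
    throw new Error('Python Shell jobs only run Python scripts');
  }
  if (e.language === 'scala' && !e.className) {
    throw new Error('Scala jobs need a main class name');
  }
  const location = `s3://${e.script.bucketName}/${e.script.objectKey}`;
  return `${COMMAND_NAMES[e.kind]} (${e.language}) ${location}`;
}

const summary = describeExecutable({
  kind: 'etl',
  language: 'scala',
  script: { bucketName: 'my-bucket', objectKey: 'src/com/example/HelloWorld.scala' },
  className: 'com.example.HelloWorld',
});
console.log(summary);
```

The real `glue.JobExecutable` factory methods shown in the sections below play the role of `describeExecutable`'s input: each one fixes the job type and language, so only valid combinations can be expressed.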

### Spark Jobs

These jobs run in an Apache Spark environment managed by AWS Glue.

#### ETL Jobs

An ETL job processes data in batches using Apache Spark.

```ts
new glue.Job(stack, 'ScalaSparkEtlJob', {
  executable: glue.JobExecutable.scalaEtl({
    glueVersion: glue.GlueVersion.V2_0,
    script: glue.Code.fromBucket(bucket, 'src/com/example/HelloWorld.scala'),
    className: 'com.example.HelloWorld',
    extraJars: [glue.Code.fromBucket(bucket, 'jars/HelloWorld.jar')],
  }),
  description: 'an example Scala ETL job',
});
```

#### Streaming Jobs

A Streaming job is similar to an ETL job, except that it performs ETL on data streams. It uses the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs.

```ts
new glue.Job(stack, 'PythonSparkStreamingJob', {
  executable: glue.JobExecutable.pythonStreaming({
    glueVersion: glue.GlueVersion.V2_0,
    pythonVersion: glue.PythonVersion.THREE,
    script: glue.Code.fromAsset(path.join(__dirname, 'job-script/hello_world.py')),
  }),
  description: 'an example Python Streaming job',
});
```
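`glue.Code.fromAsset` expects the path of a single file: the `AssetCode` constructor added in this change throws when the path is a directory. The following is a minimal standalone sketch of that check using only Node's `fs` module; `assertIsFile` is a hypothetical helper name, not an export of the module.

```ts
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Hypothetical helper mirroring the directory check in AssetCode's constructor.
function assertIsFile(codePath: string): void {
  if (fs.lstatSync(codePath).isDirectory()) {
    throw new Error(`Code path ${codePath} is a directory. Only files are supported`);
  }
}

// Usage: a script file passes, its parent directory is rejected.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'glue-code-'));
const scriptFile = path.join(tmpDir, 'hello_world.py');
fs.writeFileSync(scriptFile, 'print("hello world")\n');

assertIsFile(scriptFile); // does not throw

let directoryRejected = false;
try {
  assertIsFile(tmpDir);
} catch (e) {
  directoryRejected = true;
}
console.log(`directory rejected: ${directoryRejected}`);
```

Note that `lstatSync` also throws if the path does not exist, so a mistyped script path fails at construction time rather than at deploy time.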

### Python Shell Jobs

A Python shell job runs Python scripts as a shell and supports a Python version that depends on the AWS Glue version you are using.
This can be used to schedule and run tasks that don't require an Apache Spark environment.

```ts
new glue.Job(stack, 'PythonShellJob', {
  executable: glue.JobExecutable.pythonShell({
    glueVersion: glue.GlueVersion.V2_0,
    pythonVersion: glue.PythonVersion.THREE,
    script: glue.Code.fromBucket(bucket, 'script.py'),
  }),
  description: 'an example Python Shell job',
});
```

See [documentation](https://docs.aws.amazon.com/glue/latest/dg/add-job.html) for more information on adding jobs in Glue.

## Connection

A `Connection` allows Glue jobs, crawlers and development endpoints to access certain types of data stores. For example, to create a network connection to connect to a data source within a VPC:
@@ -41,16 +104,6 @@ If you need to use a connection type that doesn't exist as a static member on `C

See [Adding a Connection to Your Data Store](https://docs.aws.amazon.com/glue/latest/dg/populate-add-connection.html) and [Connection Structure](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-connections.html#aws-glue-api-catalog-connections-Connection) documentation for more information on the supported data stores and their configurations.

## Database

A `Database` is a logical grouping of `Tables` in the Glue Catalog.

```ts
new glue.Database(stack, 'MyDatabase', {
  databaseName: 'my_database'
});
```

## SecurityConfiguration

A `SecurityConfiguration` is a set of security properties that can be used by AWS Glue to encrypt data at rest.
@@ -84,6 +137,15 @@ new glue.SecurityConfiguration(stack, 'MySecurityConfiguration', {

See [documentation](https://docs.aws.amazon.com/glue/latest/dg/encryption-security-configuration.html) for more info for Glue encrypting data written by Crawlers, Jobs, and Development Endpoints.

## Database

A `Database` is a logical grouping of `Tables` in the Glue Catalog.

```ts
new glue.Database(stack, 'MyDatabase', {
  databaseName: 'my_database'
});
```

## Table

110 changes: 110 additions & 0 deletions packages/@aws-cdk/aws-glue/lib/code.ts
@@ -0,0 +1,110 @@
import * as crypto from 'crypto';
import * as fs from 'fs';
import * as s3 from '@aws-cdk/aws-s3';
import * as s3assets from '@aws-cdk/aws-s3-assets';
import * as cdk from '@aws-cdk/core';
import * as constructs from 'constructs';

/**
 * Represents a Glue Job's Code assets (an asset can be a script, a jar, a python file or any other file).
 */
export abstract class Code {

  /**
   * Job code as an S3 object.
   * @param bucket The S3 bucket
   * @param key The object key
   */
  public static fromBucket(bucket: s3.IBucket, key: string): S3Code {
    return new S3Code(bucket, key);
  }

  /**
   * Job code from a local disk path.
   *
   * @param path code file (not a directory).
   * @param options asset options passed through to the underlying S3 asset.
   */
  public static fromAsset(path: string, options?: s3assets.AssetOptions): AssetCode {
    return new AssetCode(path, options);
  }

  /**
   * Called when the Job is initialized to allow this object to bind.
   */
  public abstract bind(scope: constructs.Construct): CodeConfig;
}

/**
 * Glue job Code from an S3 bucket.
 */
export class S3Code extends Code {
  constructor(private readonly bucket: s3.IBucket, private readonly key: string) {
    super();
  }

  public bind(_scope: constructs.Construct): CodeConfig {
    return {
      s3Location: {
        bucketName: this.bucket.bucketName,
        objectKey: this.key,
      },
    };
  }
}

/**
 * Job Code from a local file.
 */
export class AssetCode extends Code {
  private asset?: s3assets.Asset;

  /**
   * @param path The path to the Code file.
   */
  constructor(private readonly path: string, private readonly options: s3assets.AssetOptions = { }) {
    super();

    if (fs.lstatSync(this.path).isDirectory()) {
      throw new Error(`Code path ${this.path} is a directory. Only files are supported`);
    }
  }

  public bind(scope: constructs.Construct): CodeConfig {
    // If the same AssetCode is used multiple times, retain only the first instantiation.
    if (!this.asset) {
      this.asset = new s3assets.Asset(scope, `Code${this.hashcode(this.path)}`, {
        path: this.path,
        ...this.options,
      });
    } else if (cdk.Stack.of(this.asset) !== cdk.Stack.of(scope)) {
      throw new Error(`Asset is already associated with another stack '${cdk.Stack.of(this.asset).stackName}'. ` +
        'Create a new Code instance for every stack.');
    }

    return {
      s3Location: {
        bucketName: this.asset.s3BucketName,
        objectKey: this.asset.s3ObjectKey,
      },
    };
  }

  /**
   * Hash a string (used only to derive a stable construct id from the path, not for security)
   */
  private hashcode(s: string): string {
    const hash = crypto.createHash('md5');
    hash.update(s);
    return hash.digest('hex');
  }
}

/**
 * Result of binding `Code` into a `Job`.
 */
export interface CodeConfig {
  /**
   * The location of the code in S3.
   */
  readonly s3Location: s3.Location;
}
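
The bind-once behaviour of `AssetCode` above can be sketched without any CDK dependency. In the example below, `ScopeSketch` and `AssetCodeSketch` are hypothetical stand-ins (a plain object with a stack name replaces `constructs.Construct` and `cdk.Stack`); only the MD5-derived id and the "one stack per instance" rule are taken from the code in this diff.

```ts
import * as crypto from 'crypto';

// Same hashing approach as AssetCode.hashcode: MD5 of the input, hex-encoded.
function md5Hex(s: string): string {
  return crypto.createHash('md5').update(s).digest('hex');
}

// Hypothetical stand-in for a construct scope: a stack name plus registered child ids.
interface ScopeSketch {
  stackName: string;
  children: Map<string, string>;
}

class AssetCodeSketch {
  private assetId?: string;
  private boundStack?: string;

  constructor(private readonly filePath: string) {}

  // Registers the asset on first bind; later binds reuse it or reject a different stack.
  public bind(scope: ScopeSketch): string {
    if (this.assetId === undefined) {
      this.assetId = `Code${md5Hex(this.filePath)}`;
      this.boundStack = scope.stackName;
      scope.children.set(this.assetId, this.filePath);
    } else if (this.boundStack !== scope.stackName) {
      throw new Error(`Asset is already associated with another stack '${this.boundStack}'. ` +
        'Create a new Code instance for every stack.');
    }
    return this.assetId;
  }
}

// Usage: binding twice in the same stack registers a single child.
const stackA: ScopeSketch = { stackName: 'StackA', children: new Map() };
const stackB: ScopeSketch = { stackName: 'StackB', children: new Map() };
const code = new AssetCodeSketch('job-script/hello_world.py');

const firstId = code.bind(stackA);
const secondId = code.bind(stackA);
console.log(firstId === secondId, stackA.children.size);
```

Deriving the id from a hash of the path keeps it deterministic across synths, so reusing one `Code` instance for several jobs in a stack does not create duplicate assets.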
3 changes: 3 additions & 0 deletions packages/@aws-cdk/aws-glue/lib/index.ts
@@ -4,6 +4,9 @@ export * from './glue.generated';
export * from './connection';
export * from './data-format';
export * from './database';
export * from './job';
export * from './job-executable';
export * from './code';
export * from './schema';
export * from './security-configuration';
export * from './table';