diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..f4da28d --- /dev/null +++ b/.gitignore @@ -0,0 +1,4 @@ +.vscode +build +__pycache__ +*.ipynb_checkpoints diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 0000000..3c2c231 --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,16 @@ +default_language_version: + python: python3.7 + +repos: + - repo: https://github.com/pre-commit/pre-commit-hooks + rev: v2.3.0 + hooks: + - id: trailing-whitespace + - repo: local + hooks: + - id: lint + name: lint + always_run: true + entry: scripts/lint.sh + language: system + types: [python] diff --git a/LICENSE b/LICENSE index 1bb4f21..6aa0c45 100644 --- a/LICENSE +++ b/LICENSE @@ -12,4 +12,3 @@ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. - diff --git a/README.md b/README.md index 6be3441..d6b3a35 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,4 @@ # Amazon SageMaker Safe Deployment Pipeline - ## Introduction This is a sample solution to build a safe deployment pipeline for Amazon SageMaker. This example could be useful for any organization looking to operationalize machine learning with native AWS development tools such as AWS CodePipeline, AWS CodeBuild and AWS CodeDeploy. @@ -32,48 +31,52 @@ In the following diagram, you can view the continuous delivery stages of AWS Cod The following is the list of steps required to get up and running with this sample. -### Prepare an AWS Account +### Requirements + +* Create your AWS account at [http://aws.amazon.com](http://aws.amazon.com) by following the instructions on the site. 
+* A Studio user account, see [onboard to Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html) +### Enable Amazon SageMaker Studio Project -Create your AWS account at [http://aws.amazon.com](http://aws.amazon.com) by following the instructions on the site. +1. From AWS console navigate to Amazon SageMaker Studio and click on your studio user name (do **not** Open Studio now) and copy the name of execution role as shown below (similar to `AmazonSageMaker-ExecutionRole-20210112T085906`) -### *Optionally* fork this GitHub Repository and create an Access Token - -1. [Fork](https://github.com/aws-samples/sagemaker-safe-deployment-pipeline/fork) a copy of this repository into your own GitHub account by clicking the **Fork** in the upper right-hand corner. -2. Follow the steps in the [GitHub documentation](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line) to create a new (OAuth 2) token with the following scopes (permissions): `admin:repo_hook` and `repo`. If you already have a token with these permissions, you can use that. You can find a list of all your personal access tokens in [https://github.com/settings/tokens](https://github.com/settings/tokens). -3. Copy the access token to your clipboard. For security reasons, after you navigate off the page, you will not be able to see the token again. If you have lost your token, you can [regenerate](https://docs.aws.amazon.com/codepipeline/latest/userguide/GitHub-authentication.html#GitHub-rotate-personal-token-CLI) your token. +

+[image: role]

-### Launch the AWS CloudFormation Stack +2. Click on the launch button below to set up the stack -Click on the **Launch Stack** button below to launch the CloudFormation Stack to set up the SageMaker safe deployment pipeline. +

+ +

-[![Launch CFN stack](https://s3.amazonaws.com/cloudformation-examples/cloudformation-launch-stack.png)](https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/quickcreate?templateUrl=https%3A%2F%2Famazon-sagemaker-safe-deployment-pipeline.s3.amazonaws.com%2Fsfn%2Fpipeline.yml&stackName=nyctaxi&param_GitHubBranch=master&param_GitHubRepo=amazon-sagemaker-safe-deployment-pipeline&param_GitHubUser=aws-samples&param_ModelName=nyctaxi&param_NotebookInstanceType=ml.t3.medium) +and paste the role name copied in step 1 as the value of the parameter `SageMakerStudioRoleName` as shown below and click **Create Stack** -Provide a stack name eg **sagemaker-safe-deployment-pipeline** and specify the parameters. +

+[image: role]

+*Alternatively*, one can use the provided `scripts/build.sh` (which requires the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) installed with appropriate IAM permissions) as follows: +``` +# bash scripts/build.sh S3_BUCKET_NAME STACK_NAME REGION STUDIO_ROLE_NAME +# REGION should match your default AWS CLI region +# STUDIO_ROLE_NAME is copied from step 1. Example: +bash scripts/build.sh example-studio example-pipeline us-east-1 AmazonSageMaker-ExecutionRole-20210112T085906 +``` -Parameters | Description ------------ | ----------- -Model Name | A unique name for this model (must be less than 15 characters long). -S3 Bucket for Dataset | The bucket containing the dataset (defaults to [nyc-tlc](https://registry.opendata.aws/nyc-tlc-trip-records-pds/)) -Notebook Instance Type | The [Amazon SageMaker instance type](https://aws.amazon.com/sagemaker/pricing/instance-types/). Default is ml.t3.medium. -GitHub Repository | The name (not URL) of the GitHub repository to pull from. -GitHub Branch | The name (not URL) of the GitHub repository’s branch to use. -GitHub Username | GitHub Username for this repository. Update this if you have forked the repository. -GitHub Access Token | The optional Secret OAuthToken with access to your GitHub repository. -Email Address | The optional Email address to notify on successful or failed deployments. +3. From the AWS console, navigate to `cloudformation` and wait until the stack `STACK_NAME` is ready +4. Go to your SageMaker Studio and **Open Studio** (and possibly refresh your browser if you're already in Studio) and, from the left-hand side panel, click on the inverted triangle. As with the screenshot below, under `Projects -> Create project -> Organization templates`, you should be able to see the added **SageMaker Safe Deployment Pipeline**. Click on the template name and **Select project template** -![code-pipeline](docs/stack-parameters.png) +

+[image: role]

-You can launch the same stack using the AWS CLI. Here's an example: +5. Choose a name for the project; you can leave the rest of the fields at their default values (optionally entering your own email for SNS notifications), then click on **Create project** +6. Once the project is created, you are given the option to clone it locally from AWS CodeCommit with a single click. Click clone to go directly to the project +7. Navigate to the code base and go to `notebook/mlops.ipynb` +8. Choose a kernel from the prompt such as `Python 3 (Data Science)` +9. Assign your project name to the placeholder `PROJECT_NAME` in the first code cell of the `mlops.ipynb` notebook +10. Now you are ready to go through the rest of the cells in `notebook/mlops.ipynb` -` - aws cloudformation create-stack --stack-name sagemaker-safe-deployment \ - --template-body file://pipeline.yml \ - --capabilities CAPABILITY_IAM \ - --parameters \ - ParameterKey=ModelName,ParameterValue=mymodelname \ - ParameterKey=GitHubUser,ParameterValue=youremailaddress@example.com \ - ParameterKey=GitHubToken,ParameterValue=YOURGITHUBTOKEN12345ab1234234 -` ### Start, Test and Approve the Deployment @@ -81,15 +84,11 @@ Once the deployment is complete, there will be a new AWS CodePipeline created, w ![code-pipeline](docs/data-source-before.png) -Launch the newly created SageMaker Notebook in your [AWS console](https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/), navigate to the `notebook` directory and opening the notebook by clicking on the `mlops.ipynb` link. - -![code-pipeline](docs/sagemaker-notebook.png) - Once the notebook is running, you will be guided through a series of steps starting with downloading the [New York City Taxi](https://registry.opendata.aws/nyc-tlc-trip-records-pds/) dataset, uploading this to an Amazon SageMaker S3 bucket along with the data source meta data to trigger a new build in the AWS CodePipeline.
![code-pipeline](docs/datasource-after.png) -Once your pipeline is kicked off it will run model training and deploy a development SageMaker Endpoint. +Once your pipeline is kicked off it will run model training and deploy a development SageMaker Endpoint. There is a manual approval step which you can action directly within the SageMaker Notebook to promote this to production, send some traffic to the live endpoint and create a REST API. @@ -138,15 +137,20 @@ This project is written in Python, and design to be customized for your own mode │   ├── buildspec.yml │   ├── dashboard.json │   ├── requirements.txt -│   └── run.py +│   └── run_pipeline.py ├── notebook -│   ├── canary.js │   ├── dashboard.json +| ├── workflow.ipynb │   └── mlops.ipynb -└── pipeline.yml +├── scripts +| ├── build.sh +| ├── lint.sh +| └── set_kernelspec.py +├── pipeline.yml +└── studio.yml ``` -Edit the `get_training_params` method in the `model/run.py` script that is run as part of the AWS CodeBuild step to add your own estimator or model definition. +Edit the `get_training_params` method in the `model/run_pipeline.py` script that is run as part of the AWS CodeBuild step to add your own estimator or model definition. Extend the AWS Lambda hooks in `api/pre_traffic_hook.py` and `api/post_traffic_hook.py` to add your own validation or inference against the deployed Amazon SageMaker endpoints. You can also edit the `api/app.py` lambda to add any enrichment or transformation to the request/response payload. @@ -158,8 +162,7 @@ This section outlines cost considerations for running the SageMaker Safe Deploym - **CodeCommit** – $1/month if you didn't opt to use your own GitHub repository. - **CodeDeploy** – No cost with AWS Lambda. - **CodePipeline** – CodePipeline costs $1 per active pipeline* per month. Pipelines are free for the first 30 days after creation. More can be found at [AWS CodePipeline Pricing](https://aws.amazon.com/codepipeline/pricing/). 
-- **CloudWatch** - This template includes a [Canary](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html), 1 dashboard and 4 alarms (2 for deployment, 1 for model drift and 1 for canary) which costs less than $10 per month. - - Canaries cost $0.0012 per run, or $5/month if they run every 10 minutes. +- **CloudWatch** - This template includes 1 dashboard and 3 alarms (2 for deployment and 1 for model drift) which costs less than $10 per month. - Dashboards cost $3/month. - Alarm metrics cost $0.10 per alarm. - **CloudTrail** - Low cost, $0.10 per 100,000 data events to enable [S3 CloudWatch Event](https://docs.aws.amazon.com/codepipeline/latest/userguide/create-cloudtrail-S3-source-console.html). For more information, see [AWS CloudTrail Pricing](https://aws.amazon.com/cloudtrail/pricing/) @@ -170,7 +173,7 @@ This section outlines cost considerations for running the SageMaker Safe Deploym - The `ml.t3.medium` instance *notebook* costs $0.0582 an hour. - The `ml.m4.xlarge` instance for the *training* job costs $0.28 an hour. - The `ml.m5.xlarge` instance for the *monitoring* baseline costs $0.269 an hour. - - The `ml.t2.medium` instance for the dev *hosting* endpoint costs $0.065 an hour. + - The `ml.t2.medium` instance for the dev *hosting* endpoint costs $0.065 an hour. - The two `ml.m5.large` instances for production *hosting* endpoint costs 2 x $0.134 per hour. - The `ml.m5.xlarge` instance for the hourly scheduled *monitoring* job costs $0.269 an hour. - **S3** – Prices will vary depending on the size of the model/artifacts stored. The first 50 TB each month will cost only $0.023 per GB stored. For more information, see [Amazon S3 Pricing](https://aws.amazon.com/s3/pricing/). @@ -193,4 +196,3 @@ See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more inform ## License This library is licensed under the MIT-0 License. See the LICENSE file. 
- diff --git a/api/app.py b/api/app.py index 86b6c34..c499b17 100644 --- a/api/app.py +++ b/api/app.py @@ -50,8 +50,6 @@ def lambda_handler(event, context): "body": predictions, } except ClientError as e: - logger.error( - "Unexpected sagemaker error: {}".format(e.response["Error"]["Message"]) - ) + logger.error("Unexpected sagemaker error: {}".format(e.response["Error"]["Message"])) logger.error(e) return {"statusCode": 500, "message": "Unexpected sagemaker error"} diff --git a/api/pre_traffic_hook.py b/api/pre_traffic_hook.py index e6fded5..03780fa 100644 --- a/api/pre_traffic_hook.py +++ b/api/pre_traffic_hook.py @@ -29,16 +29,9 @@ def lambda_handler(event, context): else: # Validate that endpoint config has data capture enabled endpoint_config_name = response["EndpointConfigName"] - response = sm.describe_endpoint_config( - EndpointConfigName=endpoint_config_name - ) - if ( - "DataCaptureConfig" in response - and response["DataCaptureConfig"]["EnableCapture"] - ): - logger.info( - "data capture enabled for endpoint config %s", endpoint_config_name - ) + response = sm.describe_endpoint_config(EndpointConfigName=endpoint_config_name) + if "DataCaptureConfig" in response and response["DataCaptureConfig"]["EnableCapture"]: + logger.info("data capture enabled for endpoint config %s", endpoint_config_name) else: error_message = "SageMaker data capture not enabled for endpoint config" # TODO: Invoke endpoint if don't have canary / live traffic diff --git a/assets/deploy-model-dev.yml b/assets/deploy-model-dev.yml index e73eafd..3b86b03 100644 --- a/assets/deploy-model-dev.yml +++ b/assets/deploy-model-dev.yml @@ -23,10 +23,10 @@ Resources: Model: Type: "AWS::SageMaker::Model" Properties: - ModelName: !Sub mlops-${ModelName}-dev-${TrainJobId} + ModelName: !Sub ${ModelName}-dev-${TrainJobId} PrimaryContainer: Image: !Ref ImageRepoUri - ModelDataUrl: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/mlops-${ModelName}-${TrainJobId}/output/model.tar.gz + 
ModelDataUrl: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/${ModelName}-${TrainJobId}/output/model.tar.gz ExecutionRoleArn: !Ref DeployRoleArn EndpointConfig: @@ -38,11 +38,11 @@ Resources: InstanceType: ml.t2.medium ModelName: !GetAtt Model.ModelName VariantName: !Sub ${ModelVariant}-${ModelName} - EndpointConfigName: !Sub mlops-${ModelName}-dec-${TrainJobId} + EndpointConfigName: !Sub ${ModelName}-dec-${TrainJobId} KmsKeyId: !Ref KmsKeyId Endpoint: Type: "AWS::SageMaker::Endpoint" Properties: - EndpointName: !Sub mlops-${ModelName}-dev-${TrainJobId} + EndpointName: !Sub ${ModelName}-dev-${TrainJobId} EndpointConfigName: !GetAtt EndpointConfig.EndpointConfigName diff --git a/assets/deploy-model-prd.yml b/assets/deploy-model-prd.yml index 73f7303..42aee98 100644 --- a/assets/deploy-model-prd.yml +++ b/assets/deploy-model-prd.yml @@ -81,10 +81,10 @@ Resources: Model: Type: "AWS::SageMaker::Model" Properties: - ModelName: !Sub mlops-${ModelName}-prd-${TrainJobId} + ModelName: !Sub ${ModelName}-prd-${TrainJobId} PrimaryContainer: Image: !Ref ImageRepoUri - ModelDataUrl: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/mlops-${ModelName}-${TrainJobId}/output/model.tar.gz + ModelDataUrl: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/${ModelName}-${TrainJobId}/output/model.tar.gz ExecutionRoleArn: !Ref DeployRoleArn EndpointConfig: @@ -109,19 +109,19 @@ Resources: EnableCapture: True InitialSamplingPercentage: 100 KmsKeyId: !Ref KmsKeyId - EndpointConfigName: !Sub mlops-${ModelName}-pec-${TrainJobId} + EndpointConfigName: !Sub ${ModelName}-pec-${TrainJobId} KmsKeyId: !Ref KmsKeyId Endpoint: Type: "AWS::SageMaker::Endpoint" Properties: - EndpointName: !Sub mlops-${ModelName}-prd-${TrainJobId} + EndpointName: !Sub ${ModelName}-prd-${TrainJobId} EndpointConfigName: !GetAtt EndpointConfig.EndpointConfigName ApiFunction: Type: AWS::Serverless::Function Properties: - FunctionName: !Sub mlops-${ModelName}-api + 
FunctionName: !Sub ${ModelName}-api CodeUri: ../api Handler: app.lambda_handler Runtime: python3.7 @@ -177,7 +177,7 @@ Resources: Effect: Allow Action: - sagemaker:InvokeEndpoint - Resource: "arn:aws:sagemaker:*:*:endpoint/mlops-*" + Resource: "arn:aws:sagemaker:*:*:endpoint/*" - Sid: AlowSNS Effect: Allow Action: @@ -205,7 +205,7 @@ Resources: - sagemaker:DescribeEndpointConfig - sagemaker:InvokeEndpoint Resource: - - "arn:aws:sagemaker:*:*:*/mlops-*" + - "arn:aws:sagemaker:*:*:*/*" - Sid: AllowCodeDeploy Effect: Allow Action: @@ -261,9 +261,9 @@ Resources: MonitoringJobDefinition: BaselineConfig: ConstraintsResource: - S3Uri: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/monitoring/baseline/mlops-${ModelName}-pbl-${TrainJobId}/constraints.json + S3Uri: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/monitoring/baseline/${ModelName}-pbl-${TrainJobId}/constraints.json StatisticsResource: - S3Uri: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/monitoring/baseline/mlops-${ModelName}-pbl-${TrainJobId}/statistics.json + S3Uri: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/monitoring/baseline/${ModelName}-pbl-${TrainJobId}/statistics.json MonitoringAppSpecification: ImageUri: !FindInMap [ModelAnalyzerMap, !Ref "AWS::Region", "ImageUri"] @@ -287,19 +287,19 @@ Resources: MaxRuntimeInSeconds: 1800 ScheduleConfig: ScheduleExpression: "cron(0 * ? 
* * *)" - MonitoringScheduleName: !Sub mlops-${ModelName}-pms + MonitoringScheduleName: !Sub ${ModelName}-pms SagemakerScheduleAlarm: Type: "AWS::CloudWatch::Alarm" Properties: - AlarmName: !Sub mlops-${ModelName}-metric-gt-threshold + AlarmName: !Sub ${ModelName}-metric-gt-threshold AlarmDescription: Schedule Metric > Threshold ComparisonOperator: GreaterThanThreshold Dimensions: - Name: Endpoint Value: !GetAtt Endpoint.EndpointName - Name: MonitoringSchedule - Value: !Sub mlops-${ModelName}-pms + Value: !Sub ${ModelName}-pms EvaluationPeriods: 1 DatapointsToAlarm: 1 MetricName: !Ref ScheduleMetricName @@ -311,7 +311,7 @@ Resources: AliasErrorMetricGreaterThanZeroAlarm: Type: "AWS::CloudWatch::Alarm" Properties: - AlarmName: !Sub mlops-${ModelName}-alias-gt-zero + AlarmName: !Sub ${ModelName}-alias-gt-zero AlarmDescription: Lambda Function Error > 0 ComparisonOperator: GreaterThanThreshold Dimensions: @@ -329,7 +329,7 @@ Resources: LatestVersionErrorMetricGreaterThanZeroAlarm: Type: "AWS::CloudWatch::Alarm" Properties: - AlarmName: !Sub mlops-${ModelName}-version-gt-zero + AlarmName: !Sub ${ModelName}-version-gt-zero AlarmDescription: Lambda Function Error > 0 ComparisonOperator: GreaterThanThreshold Dimensions: @@ -351,7 +351,7 @@ Resources: Properties: MaxCapacity: 10 MinCapacity: 2 - ResourceId: !Sub endpoint/mlops-${ModelName}-prd-${TrainJobId}/variant/${ModelVariant}-${ModelName} + ResourceId: !Sub endpoint/${ModelName}-prd-${TrainJobId}/variant/${ModelVariant}-${ModelName} RoleARN: !Sub arn:aws:iam::${AWS::AccountId}:role/MLOps ScalableDimension: sagemaker:variant:DesiredInstanceCount ServiceNamespace: sagemaker @@ -362,7 +362,7 @@ Resources: Properties: PolicyName: SageMakerVariantInvocationsPerInstance PolicyType: TargetTrackingScaling - ResourceId: !Sub endpoint/mlops-${ModelName}-prd-${TrainJobId}/variant/${ModelVariant}-${ModelName} + ResourceId: !Sub endpoint/${ModelName}-prd-${TrainJobId}/variant/${ModelVariant}-${ModelName} ScalableDimension: 
sagemaker:variant:DesiredInstanceCount ServiceNamespace: sagemaker TargetTrackingScalingPolicyConfiguration: diff --git a/assets/suggest-baseline.yml b/assets/suggest-baseline.yml index 20fc986..e70787c 100644 --- a/assets/suggest-baseline.yml +++ b/assets/suggest-baseline.yml @@ -1,5 +1,11 @@ Description: Suggest baseline for training job Parameters: + ProjectPrefix: + Type: String + Description: | + Makes resource privileges for services using this template scoped-limited. + Changing the default must be done with care + Default: PROJECT_PREFIX ModelName: Type: String Description: Name of the model @@ -20,10 +26,10 @@ Resources: SagemakerSuggestBaseline: Type: Custom::SuggestBaseline Properties: - ServiceToken: !Sub arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:sagemaker-cfn-suggest-baseline - ProcessingJobName: !Sub mlops-${ModelName}-pbl-${TrainJobId} + ServiceToken: !Sub arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${ProjectPrefix}-sagemaker-cfn-suggest-baseline + ProcessingJobName: !Sub ${ProjectPrefix}-${ModelName}-pbl-${TrainJobId} BaselineInputUri: !Ref BaselineInputUri - BaselineResultsUri: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/monitoring/baseline/mlops-${ModelName}-pbl-${TrainJobId} + BaselineResultsUri: !Sub s3://sagemaker-${AWS::Region}-${AWS::AccountId}/${ModelName}/monitoring/baseline/${ProjectPrefix}-${ModelName}-pbl-${TrainJobId} KmsKeyId: !Ref KmsKeyId PassRoleArn: !Ref MLOpsRoleArn ExperimentName: !Ref ModelName diff --git a/assets/training-job.yml b/assets/training-job.yml index b782a1f..e207d0f 100644 --- a/assets/training-job.yml +++ b/assets/training-job.yml @@ -1,5 +1,11 @@ Description: Wait for a training job Parameters: + ProjectPrefix: + Type: String + Description: | + Makes resource privileges for services using this template scoped-limited. 
+ Changing the default must be done with care + Default: PROJECT_PREFIX ModelName: Type: String Description: Name of the model @@ -17,8 +23,8 @@ Resources: SagemakerTrainingJob: Type: Custom::TrainingJob Properties: - ServiceToken: !Sub "arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:sagemaker-cfn-training-job" - TrainingJobName: !Sub mlops-${ModelName}-${TrainJobId} + ServiceToken: !Sub "arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${ProjectPrefix}-sagemaker-cfn-training-job" + TrainingJobName: !Sub ${ProjectPrefix}-${ModelName}-${TrainJobId} TrainingJobRequest: !Ref TrainJobRequest ExperimentName: !Ref ModelName TrialName: !Ref TrainJobId diff --git a/custom_resource/sagemaker_add_transform_header.py b/custom_resource/sagemaker_add_transform_header.py index 8cd58df..d1e22d9 100644 --- a/custom_resource/sagemaker_add_transform_header.py +++ b/custom_resource/sagemaker_add_transform_header.py @@ -37,4 +37,3 @@ def lambda_handler(event, context): return new_obj.put( Body=body.encode("utf-8"), ContentType="text/csv", Metadata={"header": "true"} ) - diff --git a/custom_resource/sagemaker_create_experiment.py b/custom_resource/sagemaker_create_experiment.py index af6c2ab..3d9e3d9 100644 --- a/custom_resource/sagemaker_create_experiment.py +++ b/custom_resource/sagemaker_create_experiment.py @@ -46,5 +46,8 @@ def lambda_handler(event, context): raise error return { "statusCode": 200, - "results": {"ExperimentCreated": experiment_created, "TrialCreated": trial_created,}, + "results": { + "ExperimentCreated": experiment_created, + "TrialCreated": trial_created, + }, } diff --git a/custom_resource/sagemaker_query_drift.py b/custom_resource/sagemaker_query_drift.py index 6e33af4..ec6884d 100644 --- a/custom_resource/sagemaker_query_drift.py +++ b/custom_resource/sagemaker_query_drift.py @@ -23,7 +23,8 @@ def get_processing_job(processing_job_name): def get_s3_results_json(result_bucket, result_path, filename): s3_object = s3_client.get_object( - 
Bucket=result_bucket, Key=os.path.join(result_path.lstrip("/"), filename), + Bucket=result_bucket, + Key=os.path.join(result_path.lstrip("/"), filename), ) return json.loads(s3_object["Body"].read()) @@ -81,4 +82,3 @@ def lambda_handler(event, context): message = "Failed to read processing status!" print(e) return {"statusCode": 500, "error": message} - diff --git a/custom_resource/sagemaker_suggest_baseline.py b/custom_resource/sagemaker_suggest_baseline.py index e2d0efe..40fc9a3 100644 --- a/custom_resource/sagemaker_suggest_baseline.py +++ b/custom_resource/sagemaker_suggest_baseline.py @@ -53,7 +53,7 @@ def poll_create(event, context): @helper.poll_delete def poll_delete(event, context): """ - Return true if the resource has been stopped. + Return true if the resource has been stopped. """ processing_job_name = get_processing_job_name(event) logger.info("Polling for stopped processing job: %s", processing_job_name) @@ -64,9 +64,7 @@ def poll_delete(event, context): def get_model_monitor_container_uri(region): - container_uri_format = ( - "{0}.dkr.ecr.{1}.amazonaws.com/sagemaker-model-monitor-analyzer" - ) + container_uri_format = "{0}.dkr.ecr.{1}.amazonaws.com/sagemaker-model-monitor-analyzer" regions_to_accounts = { "eu-north-1": "895015795356", @@ -113,9 +111,7 @@ def is_processing_job_ready(processing_job_name): ) else: raise Exception( - "Processing Job ({}) has unexpected status: {}".format( - processing_job_name, status - ) + "Processing Job ({}) has unexpected status: {}".format(processing_job_name, status) ) return is_ready @@ -140,9 +136,7 @@ def create_processing_job(event): def stop_processing_job(processing_job_name): try: - processing_job = sm.describe_processing_job( - ProcessingJobName=processing_job_name - ) + processing_job = sm.describe_processing_job(ProcessingJobName=processing_job_name) status = processing_job["ProcessingJobStatus"] if status == "InProgress": logger.info("Stopping InProgress processing job: %s", processing_job_name) @@ 
-165,8 +159,7 @@ def stop_processing_job(processing_job_name): class DatasetFormat(object): - """Represents a Dataset Format that is used when calling a DefaultModelMonitor. - """ + """Represents a Dataset Format that is used when calling a DefaultModelMonitor.""" @staticmethod def csv(header=True, output_columns_position="START"): @@ -203,9 +196,7 @@ def sagemaker_capture_json(): dict: JSON string containing DatasetFormat to be used by DefaultModelMonitor. """ return { - "sagemaker_capture_json": { - "captureIndexNames": ["endpointInput", "endpointOutput"] - } + "sagemaker_capture_json": {"captureIndexNames": ["endpointInput", "endpointOutput"]} } @@ -256,22 +247,16 @@ def get_processing_request(event, dataset_format=DatasetFormat.csv()): } }, "StoppingCondition": { - "MaxRuntimeInSeconds": int( - props.get("MaxRuntimeInSeconds", 1800) - ) # 30 minutes + "MaxRuntimeInSeconds": int(props.get("MaxRuntimeInSeconds", 1800)) # 30 minutes }, "AppSpecification": { - "ImageUri": props.get( - "ImageURI", get_model_monitor_container_uri(helper._region) - ), + "ImageUri": props.get("ImageURI", get_model_monitor_container_uri(helper._region)), }, "Environment": { "dataset_format": json.dumps(dataset_format), "dataset_source": "/opt/ml/processing/input/baseline_dataset_input", "output_path": "/opt/ml/processing/output", - "publish_cloudwatch_metrics": props.get( - "PublishCloudwatchMetrics", "Disabled" - ), + "publish_cloudwatch_metrics": props.get("PublishCloudwatchMetrics", "Disabled"), }, "RoleArn": props["PassRoleArn"], } @@ -279,9 +264,7 @@ def get_processing_request(event, dataset_format=DatasetFormat.csv()): # Add the KmsKeyId to monitoring outputs and cluster volume if provided if props.get("KmsKeyId") is not None: request["ProcessingOutputConfig"]["KmsKeyId"] = props["KmsKeyId"] - request["ProcessingResources"]["ClusterConfig"]["VolumeKmsKeyId"] = props[ - "KmsKeyId" - ] + request["ProcessingResources"]["ClusterConfig"]["VolumeKmsKeyId"] = props["KmsKeyId"] # Add 
experiment tracking request["ExperimentConfig"] = { @@ -295,9 +278,7 @@ def get_processing_request(event, dataset_format=DatasetFormat.csv()): if props.get("RecordPreprocessorSourceUri"): env = request["Environment"] fn = get_file_name(props["RecordPreprocessorSourceUri"]) - env["record_preprocessor_script"] = ( - "/opt/ml/processing/code/postprocessing/" + fn - ) + env["record_preprocessor_script"] = "/opt/ml/processing/code/postprocessing/" + fn request["ProcessingInputs"].append( { "InputName": "pre_processor_script", @@ -315,9 +296,7 @@ def get_processing_request(event, dataset_format=DatasetFormat.csv()): if props.get("PostAnalyticsProcessorSourceUri"): env = request["Environment"] fn = get_file_name(props["PostAnalyticsProcessorSourceUri"]) - env["post_analytics_processor_script"] = ( - "/opt/ml/processing/code/postprocessing/" + fn - ) + env["post_analytics_processor_script"] = "/opt/ml/processing/code/postprocessing/" + fn request["ProcessingInputs"].append( { "InputName": "post_processor_script", @@ -339,9 +318,7 @@ def get_processing_request(event, dataset_format=DatasetFormat.csv()): # Add baseline constraints logger.debug("Update with constraints: %s", data["BaselineConstraintsUri"]) env = request["Environment"] - env[ - "baseline_constraints" - ] = "/opt/ml/processing/baseline/constraints/constraints.json" + env["baseline_constraints"] = "/opt/ml/processing/baseline/constraints/constraints.json" request["ProcessingInputs"].append( { "InputName": "constraints", diff --git a/custom_resource/sagemaker_training_job.py b/custom_resource/sagemaker_training_job.py index 9fd8c04..bd4b0ed 100644 --- a/custom_resource/sagemaker_training_job.py +++ b/custom_resource/sagemaker_training_job.py @@ -53,7 +53,7 @@ def poll_create(event, context): @helper.poll_delete def poll_delete(event, context): """ - Return true if the resource has been stopped. + Return true if the resource has been stopped. 
""" training_job_name = get_training_job_name(event) logger.info("Polling for stopped training job: %s", training_job_name) @@ -87,9 +87,7 @@ def is_training_job_ready(training_job_name): ) else: raise Exception( - "Training job ({}) has unexpected status: {}".format( - training_job_name, status - ) + "Training job ({}) has unexpected status: {}".format(training_job_name, status) ) return is_ready diff --git a/docs/cloudwatch-dashboard.png b/docs/cloudwatch-dashboard.png index 84ebe7a..ef17cb3 100644 Binary files a/docs/cloudwatch-dashboard.png and b/docs/cloudwatch-dashboard.png differ diff --git a/docs/sagemaker-notebook.png b/docs/sagemaker-notebook.png deleted file mode 100644 index dd79e0d..0000000 Binary files a/docs/sagemaker-notebook.png and /dev/null differ diff --git a/docs/studio-cft.png b/docs/studio-cft.png new file mode 100644 index 0000000..fa4511d Binary files /dev/null and b/docs/studio-cft.png differ diff --git a/docs/studio-execution-role.png b/docs/studio-execution-role.png new file mode 100644 index 0000000..81026e0 Binary files /dev/null and b/docs/studio-execution-role.png differ diff --git a/docs/studio-sagemaker-project-template.png b/docs/studio-sagemaker-project-template.png new file mode 100644 index 0000000..191ffdd Binary files /dev/null and b/docs/studio-sagemaker-project-template.png differ diff --git a/model/buildspec.yml b/model/buildspec.yml index f5a6462..5d6a565 100644 --- a/model/buildspec.yml +++ b/model/buildspec.yml @@ -6,7 +6,7 @@ phases: python: 3.7 commands: - echo "Installing requirements" - - pip install -U boto3 awscli # Upgrade boto3 and awscli + - pip install --upgrade --force-reinstall boto3 awscli # Upgrade boto3 and awscli - pip install -r $CODEBUILD_SRC_DIR/model/requirements.txt - pip install crhelper -t $CODEBUILD_SRC_DIR/custom_resource # Install custom resource helper into the CFN directory @@ -22,7 +22,27 @@ phases: - echo Build started on `date` - echo Run the workflow script - cd $CODEBUILD_SRC_DIR - - 
python model/run.py --git-branch=$GIT_BRANCH --codebuild-id=$CODEBUILD_BUILD_ID --pipeline-name=$PIPELINE_NAME --model-name=$MODEL_NAME --deploy-role=$DEPLOY_ROLE_ARN --sagemaker-role=$SAGEMAKER_ROLE_ARN --sagemaker-bucket=$SAGEMAKER_BUCKET --data-dir=$CODEBUILD_SRC_DIR_DataSourceOutput --output-dir=$CODEBUILD_SRC_DIR/assets --kms-key-id=$KMS_KEY_ID --workflow-role-arn=$WORKFLOW_ROLE_ARN --notification-arn=$NOTIFICATION_ARN # --ecr-dir=$CODEBUILD_SRC_DIR_EcrSourceOutput + - export PYTHONUNBUFFERED=TRUE + - export SAGEMAKER_PROJECT_NAME_ID="${SAGEMAKER_PROJECT_NAME}-${SAGEMAKER_PROJECT_ID}" + - export PREFIXED_PIPELINE_NAME="${PREFIX}-${PIPELINE_NAME}" + - export PREFIXED_MODEL_NAME="${PREFIX}-${MODEL_NAME}" + - | # TODO: split pipeline def from exec + python model/run_pipeline.py \ + --role-arn=$SAGEMAKER_ROLE_ARN \ + --tags "[{\"Key\":\"sagemaker:project-name\", \"Value\":\"${SAGEMAKER_PROJECT_NAME}\"}, {\"Key\":\"sagemaker:project-id\", \"Value\":\"${SAGEMAKER_PROJECT_ID}\"}]" \ + --git-branch=$GIT_BRANCH \ + --codebuild-id=$CODEBUILD_BUILD_ID \ + --pipeline-name=$PIPELINE_NAME \ + --model-name=$PREFIXED_MODEL_NAME \ + --deploy-role=$DEPLOY_ROLE_ARN \ + --sagemaker-role=$SAGEMAKER_ROLE_ARN \ + --sagemaker-bucket=$SAGEMAKER_BUCKET \ + --data-dir=$CODEBUILD_SRC_DIR_DataSourceOutput \ + --output-dir=$CODEBUILD_SRC_DIR/assets \ + --kms-key-id=$KMS_KEY_ID \ + --workflow-role-arn=$WORKFLOW_ROLE_ARN \ + --notification-arn=$NOTIFICATION_ARN \ + --sagemaker-project-id=$SAGEMAKER_PROJECT_ID - echo Set unique commit in api to ensure re-deploy - echo $CODEBUILD_RESOLVED_SOURCE_VERSION > api/commit.txt - echo $CODEBUILD_BUILD_ID >> api/commit.txt # Add build ID when commit doesn't change diff --git a/model/dashboard.json b/model/dashboard.json index b897a48..3c379b1 100644 --- a/model/dashboard.json +++ b/model/dashboard.json @@ -214,4 +214,4 @@ } } ] -} \ No newline at end of file +} diff --git a/model/requirements.txt b/model/requirements.txt index 09dc519..5252677 100644 
--- a/model/requirements.txt +++ b/model/requirements.txt @@ -1,2 +1,2 @@ sagemaker>=2.1.0<3 -stepfunctions==2.0.0 \ No newline at end of file +stepfunctions==2.0.0 diff --git a/model/run.py b/model/run_pipeline.py similarity index 89% rename from model/run.py rename to model/run_pipeline.py index 6d59679..8185f07 100644 --- a/model/run.py +++ b/model/run_pipeline.py @@ -22,7 +22,10 @@ def create_experiment_step(create_experiment_function_name): "Create Experiment", parameters={ "FunctionName": create_experiment_function_name, - "Payload": {"ExperimentName.$": "$.ExperimentName", "TrialName.$": "$.TrialName",}, + "Payload": { + "ExperimentName.$": "$.ExperimentName", + "TrialName.$": "$.TrialName", + }, }, result_path="$.CreateTrialResults", ) @@ -236,7 +239,7 @@ def create_graph(create_experiment_step, baseline_step, training_step): return steps.states.Chain([create_experiment_step, sagemaker_jobs]) -def get_dev_config(model_name, job_id, role, image_uri, kms_key_id): +def get_dev_config(model_name, job_id, role, image_uri, kms_key_id, sagemaker_project_id): return { "Parameters": { "ImageRepoUri": image_uri, @@ -246,21 +249,27 @@ def get_dev_config(model_name, job_id, role, image_uri, kms_key_id): "ModelVariant": "dev", "KmsKeyId": kms_key_id, }, - "Tags": {"mlops:model-name": model_name, "mlops:stage": "dev"}, + "Tags": { + "mlops:model-name": model_name, + "mlops:stage": "dev", + "SageMakerProjectId": sagemaker_project_id, + }, } -def get_prd_config(model_name, job_id, role, image_uri, kms_key_id, notification_arn): - dev_config = get_dev_config(model_name, job_id, role, image_uri, kms_key_id) +def get_prd_config( + model_name, job_id, role, image_uri, kms_key_id, notification_arn, sagemaker_project_id +): + dev_config = get_dev_config( + model_name, job_id, role, image_uri, kms_key_id, sagemaker_project_id + ) prod_params = { "ModelVariant": "prd", "ScheduleMetricName": "feature_baseline_drift_total_amount", "ScheduleMetricThreshold": str("0.20"), 
"NotificationArn": notification_arn, } - prod_tags = { - "mlops:stage": "prd", - } + prod_tags = {"mlops:stage": "prd", "SageMakerProjectId": sagemaker_project_id} return { "Parameters": dict(dev_config["Parameters"], **prod_params), "Tags": dict(dev_config["Tags"], **prod_tags), @@ -305,6 +314,8 @@ def main( kms_key_id, workflow_role_arn, notification_arn, + sagemaker_project_id, + tags, ): # Define the function names create_experiment_function_name = "mlops-create-experiment" @@ -341,7 +352,7 @@ def main( # Set the output Data output_data = { "ModelOutputUri": "s3://{}/{}".format(sagemaker_bucket, model_name), - "BaselineOutputUri": f"s3://{sagemaker_bucket}/{model_name}/monitoring/baseline/mlops-{model_name}-pbl-{job_id}", + "BaselineOutputUri": f"s3://{sagemaker_bucket}/{model_name}/monitoring/baseline/{model_name}-pbl-{job_id}", } print("model output uri: {}".format(output_data["ModelOutputUri"])) @@ -384,7 +395,7 @@ def main( # Create the workflow as the model name workflow = Workflow(model_name, workflow_definition, workflow_role_arn) - print("Creating workflow: {}".format(model_name)) + print("Creating workflow: {0}-{1}".format(model_name, sagemaker_project_id)) # Create output directory if not os.path.exists(output_dir): @@ -401,30 +412,52 @@ def main( # Write the workflow inputs to file with open(os.path.join(output_dir, "workflow-input.json"), "w") as f: workflow_inputs = { - "ExperimentName": "mlops-{}".format(model_name), - "TrialName": "mlops-{}-{}".format(model_name, job_id), + "ExperimentName": "{}".format(model_name), + "TrialName": "{}-{}".format(model_name, job_id), "GitBranch": git_branch, "GitCommitHash": git_commit_id, "DataVersionId": data_verison_id, - "BaselineJobName": "mlops-{}-pbl-{}".format(model_name, job_id), + "BaselineJobName": "{}-pbl-{}".format(model_name, job_id), "BaselineOutputUri": output_data["BaselineOutputUri"], - "TrainingJobName": "mlops-{}-{}".format(model_name, job_id), + "TrainingJobName": "{}-{}".format(model_name, 
job_id), } json.dump(workflow_inputs, f) # Write the dev & prod params for CFN with open(os.path.join(output_dir, "deploy-model-dev.json"), "w") as f: - config = get_dev_config(model_name, job_id, deploy_role, image_uri, kms_key_id) + config = get_dev_config( + model_name, job_id, deploy_role, image_uri, kms_key_id, sagemaker_project_id + ) json.dump(config, f) with open(os.path.join(output_dir, "deploy-model-prd.json"), "w") as f: config = get_prd_config( - model_name, job_id, deploy_role, image_uri, kms_key_id, notification_arn + model_name, + job_id, + deploy_role, + image_uri, + kms_key_id, + notification_arn, + sagemaker_project_id, ) json.dump(config, f) if __name__ == "__main__": parser = argparse.ArgumentParser(description="Load parameters") + parser.add_argument( + "-role-arn", + "--role-arn", + dest="sagemaker_role", + type=str, + help="The role arn for the pipeline service execution role.", + ) + parser.add_argument( + "-tags", + "--tags", + dest="tags", + default=None, + help="""List of dict strings of '[{"Key": "string", "Value": "string"}, ..]'""", + ) parser.add_argument("--codebuild-id", required=True) parser.add_argument("--data-dir", required=True) parser.add_argument("--output-dir", required=True) @@ -438,6 +471,7 @@ def main( parser.add_argument("--git-branch", required=True) parser.add_argument("--workflow-role-arn", required=True) parser.add_argument("--notification-arn", required=True) + parser.add_argument("--sagemaker-project-id", required=True) args = vars(parser.parse_args()) print("args: {}".format(args)) main(**args) diff --git a/notebook/canary.js b/notebook/canary.js deleted file mode 100644 index 96afdb7..0000000 --- a/notebook/canary.js +++ /dev/null @@ -1,37 +0,0 @@ - -var synthetics = require('Synthetics'); -const log = require('SyntheticsLogger'); -const https = require('https'); - -const apiCanaryBlueprint = async function (hostname, path, postData) { - const verifyRequest = async function (requestOption) { - return new 
Promise((resolve, reject) => { - log.info("Making request with options: " + JSON.stringify(requestOption)); - const req = https.request(requestOption); - req.on('response', (res) => { - log.info('Status Code:',res.statusCode); - log.info('Response Headers:',JSON.stringify(res.headers)); - if (res.statusCode !== 200) { - reject("Failed: " + requestOption.path); - } - res.on('data', (d) => { - log.info("Response: " + d); - }); - res.on('end', resolve); - }); - req.on('error', reject); - req.write(postData); - req.end(); - }); - } - - const headers = {"Content-Type":"text/csv"} - headers['User-Agent'] = [synthetics.getCanaryUserAgentString(), headers['User-Agent']].join(' '); - const requestOptions = {"hostname":hostname,"method":"POST","path":path,"port":443} - requestOptions['headers'] = headers; - await verifyRequest(requestOptions); -}; - -exports.handler = async () => { - return await apiCanaryBlueprint("${hostname}", "${path}", "${data}"); -}; diff --git a/notebook/dashboard.json b/notebook/dashboard.json index 4cd0d29..96e2ef8 100644 --- a/notebook/dashboard.json +++ b/notebook/dashboard.json @@ -343,23 +343,6 @@ "view": "timeSeries", "stacked": false } - }, - { - "type": "metric", - "x": 16, - "y": 20, - "width": 8, - "height": 7, - "properties": { - "title": "CloudWatch Synthetics Alarm", - "annotations": { - "alarms": [ - "arn:aws:cloudwatch:${region}:${account_id}:alarm:mlops-${model_name}-synth-lt-threshold" - ] - }, - "view": "timeSeries", - "stacked": false - } } ] } diff --git a/notebook/mlops.ipynb b/notebook/mlops.ipynb index 6bd4be8..472eade 100644 --- a/notebook/mlops.ipynb +++ b/notebook/mlops.ipynb @@ -1,1905 +1 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Safe MLOps Deployment Pipeline\n", - "\n", - "\n", - "## Overview\n", - "\n", - "In this notebook you will step through an MLOps pipeline to build, train, deploy and monitor an XGBoost regression model for predicting the expected taxi fare using the New 
York City Taxi [dataset](https://registry.opendata.aws/nyc-tlc-trip-records-pds/)⇗. This safe pipeline features a [canary deployment](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/canary-deployment.html)⇗ strategy with rollback on error. You will learn how to trigger and monitor the pipeline, inspect the training workflow, use model monitor to set up alerts, and create a canary deployment.\n", - "\n", - "
\n", - " Note: This notebook assumes prior familiarity with the basics of training ML models on Amazon SageMaker. Data preparation and visualization, although present, will be kept to a minimum. If you are not familiar with the basic concepts and features of SageMaker, we recommend reading the SageMaker documentation⇗ and completing the workshops and samples in AWS SageMaker Examples GitHub⇗ and AWS Samples GitHub⇗. \n", - "
\n", - "\n", - "### Contents\n", - "\n", - "This notebook has the following key sections:\n", - "\n", - "1. [Data Prep](#Data-Prep)\n", - "2. [Build](#Build)\n", - "3. [Train Model](#Train-Model)\n", - "4. [Deploy Dev](#Deploy-Dev)\n", - "5. [Deploy Prod](#Deploy-Prod)\n", - "6. [Monitor](#Monitor)\n", - "6. [Cleanup](#Cleanup)\n", - "\n", - "### Architecture\n", - "\n", - "The architecture diagram below shows the entire MLOps pipeline at a high level.\n", - "\n", - "Use the CloudFormation template provided in this repository (`pipeline.yml`) to build the demo in your own AWS account. If you are currently viewing this notebook from SageMaker in your AWS account, then you have already completed this step. CloudFormation deploys several resources:\n", - " \n", - "1. A customer-managed encryption key in in Amazon KMS for encrypting data and artifacts.\n", - "1. A secret in Amazon Secrets Manager to securely store your GitHub Access Token.\n", - "1. Several AWS IAM roles so CloudFormation, SageMaker, and other AWS services can perform actions in your AWS account, following the principle of [least privilege](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#grant-least-privilege)⇗.\n", - "1. A messaging service in Amazon SNS to notify you when CodeDeploy has successfully deployed the API, and to receive alerts for retraining and drift detection (signing up for these notifications is optional).\n", - "1. Two Amazon CloudWatch event rules: one which schedules the pipeline to run every month, and one which triggers the pipeline to run when SageMaker Model Monitor detects certain metrics.\n", - "1. An Amazon SageMaker Jupyter notebook with this workshop content pre-loaded.\n", - "1. An Amazon S3 bucket for storing model artifacts.\n", - "1. An AWS CodePipeline instance with several pre-defined stages. \n", - "\n", - "Take a moment to look at all of these resources now deployed in your account. 
\n", - "\n", - "![MLOps pipeline architecture](../docs/mlops-architecture.png)\n", - "\n", - "In this notebook, you will work through the CodePipeline instance created by the CloudFormation template. It has several stages:\n", - "\n", - "1. **Source** - The pipeline is already configured with two sources. If you upload a new dataset to a specific location in the S3 data bucket, this will trigger the pipeline to run. The Git source can be GitHub, or CodeCommit if you don’t supply your access token. If you commit new code to your repository, this will trigger the pipeline to run. \n", - "1. **Build** - In this stage, CodeBuild configured by the build specification `model/buildspec.yml` will execute `model/run.py` to generate AWS CloudFormation templates for creating the AWS Step Function (including AWS Lambda custom resources), and deployment templates used in the following stages based on the data sets and hyperparameters specified for this pipeline run. You will take a closer look at these files later in this notebook. \n", - "1. **Train** The Step Functions workflow created in the Build stage is run in this stage. The workflow creates a baseline for the model monitor using a SageMaker processing job, and trains an XGBoost model on the taxi ride dataset using a SageMaker training job.\n", - "1. **Deploy Dev** In this stage, a CloudFormation template created in the build stage (from `assets/deploy-model-dev.yml`) deploys a dev endpoint. This will allow you to run tests on the model and decide if the model is of sufficient quality to deploy into production.\n", - "1. **Deploy Production** The final stage of the pipeline is the only stage which does not run automatically as soon as the previous stage is complete. It waits for a user to manually approve the model which was previously deployed to dev. 
As soon as the model is approved, a CloudFormation template (packaged from `assets/deploy-model-prod.yml` to include the Lambda functions saved and uploaded as ZIP files in S3) deploys the production endpoint. It configures autoscaling and enables data capture. It creates a model monitoring schedule and sets CloudWatch alarms for certain metrics. It also sets up an AWS CodeDeploy instance which deploys a set of AWS Lambda functions and an Amazon API Gateway to sit in front of the SageMaker endpoint. This stage can make use of canary deployment to safely switch from an old model to a new model. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Import the latest sagemaker and boto3 SDKs.\n", - "import sys\n", - "!{sys.executable} -m pip install --upgrade pip\n", - "!{sys.executable} -m pip install -qU awscli boto3 \"sagemaker>=2.1.0<3\" tqdm\n", - "!{sys.executable} -m pip install -qU \"stepfunctions==2.0.0\"\n", - "!{sys.executable} -m pip show sagemaker stepfunctions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Restart your SageMaker kernel then continue with this notebook." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Data Prep\n", - " \n", - "In this section of the notebook, you will download the publicly available New York Taxi dataset in preparation for uploading it to S3.\n", - "\n", - "### Download Dataset\n", - "\n", - "First, download a sample of the New York City Taxi [dataset](https://registry.opendata.aws/nyc-tlc-trip-records-pds/)⇗ to this notebook instance. This dataset contains information on trips taken by taxis and for-hire vehicles in New York City, including pick-up and drop-off times and locations, fares, distance traveled, and more. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!aws s3 cp 's3://nyc-tlc/trip data/green_tripdata_2018-02.csv' 'nyc-tlc.csv'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now load the dataset into a pandas data frame, taking care to parse the dates correctly." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "parse_dates= ['lpep_dropoff_datetime', 'lpep_pickup_datetime']\n", - "trip_df = pd.read_csv('nyc-tlc.csv', parse_dates=parse_dates)\n", - "\n", - "trip_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Data manipulation\n", - "\n", - "Instead of the raw date and time features for pick-up and drop-off, let's use these features to calculate the total time of the trip in minutes, which will be easier to work with for our model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "trip_df['duration_minutes'] = (trip_df['lpep_dropoff_datetime'] - trip_df['lpep_pickup_datetime']).dt.seconds/60" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The dataset contains a lot of columns we don't need, so let's select a sample of columns for our machine learning model. Keep only `total_amount` (fare), `duration_minutes`, `passenger_count`, and `trip_distance`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "cols = ['total_amount', 'duration_minutes', 'passenger_count', 'trip_distance']\n", - "data_df = trip_df[cols]\n", - "print(data_df.shape)\n", - "data_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Generate some quick statistics for the dataset to understand the quality." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_df.describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The table above shows some clear outliers, e.g. -400 or 2626 as fare, or 0 passengers. There are many intelligent methods for identifying and removing outliers, but data cleaning is not the focus of this notebook, so just remove the outliers by setting some min and max values which seem more reasonable. Removing the outliers results in a final dataset of 754,671 rows." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_df = data_df[(data_df.total_amount > 0) & (data_df.total_amount < 200) & \n", - " (data_df.duration_minutes > 0) & (data_df.duration_minutes < 120) & \n", - " (data_df.trip_distance > 0) & (data_df.trip_distance < 121) & \n", - " (data_df.passenger_count > 0)].dropna()\n", - "print(data_df.shape)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Data visualization\n", - "\n", - "Since this notebook will build a regression model for the taxi data, it's a good idea to check if there is any correlation between the variables in our data. Use scatter plots on a sample of the data to compare trip distance with duration in minutes, and total amount (fare) with duration in minutes." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import seaborn as sns \n", - "\n", - "sample_df = data_df.sample(1000)\n", - "sns.scatterplot(data=sample_df, x='duration_minutes', y='trip_distance')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sns.scatterplot(data=sample_df, x='duration_minutes', y='total_amount')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "These scatter plots look fine and show at least some correlation between our variables. \n", - "\n", - "### Data splitting and saving\n", - "\n", - "We are now ready to split the dataset into train, validation, and test sets. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.model_selection import train_test_split\n", - "train_df, val_df = train_test_split(data_df, test_size=0.20, random_state=42)\n", - "val_df, test_df = train_test_split(val_df, test_size=0.05, random_state=42)\n", - "\n", - "# Reset the index for our test dataframe\n", - "test_df.reset_index(inplace=True, drop=True)\n", - "\n", - "print('Size of\\n train: {},\\n val: {},\\n test: {} '.format(train_df.shape[0], val_df.shape[0], test_df.shape[0]))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Save the train, validation, and test files as CSV locally on this notebook instance. Notice that you save the train file twice - once as the training data file and once as the baseline data file. The baseline data file will be used by [SageMaker Model Monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html)⇗ to detect data drift. Data drift occurs when the statistical nature of the data that your model receives while in production drifts away from the nature of the baseline data it was trained on, which means the model begins to lose accuracy in its predictions." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train_cols = ['total_amount', 'duration_minutes','passenger_count','trip_distance']\n", - "train_df.to_csv('train.csv', index=False, header=False)\n", - "val_df.to_csv('validation.csv', index=False, header=False)\n", - "test_df.to_csv('test.csv', index=False, header=False)\n", - "\n", - "# Save test and baseline with headers\n", - "train_df.to_csv('baseline.csv', index=False, header=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now upload these CSV files to your default SageMaker S3 bucket. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import sagemaker\n", - "\n", - "# Get the session and default bucket\n", - "session = sagemaker.session.Session()\n", - "bucket = session.default_bucket()\n", - "\n", - "# Specify data prefix and version\n", - "prefix = 'nyc-tlc/v1'\n", - "\n", - "s3_train_uri = session.upload_data('train.csv', bucket, prefix + '/data/training')\n", - "s3_val_uri = session.upload_data('validation.csv', bucket, prefix + '/data/validation')\n", - "s3_test_uri = session.upload_data('test.csv', bucket, prefix + '/data/test')\n", - "s3_baseline_uri = session.upload_data('baseline.csv', bucket, prefix + '/data/baseline')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You will use the datasets which you have prepared and saved in this section to trigger the pipeline to train and deploy a model in the next section." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Build\n", - "\n", - "If you navigate to the CodePipeline instance created for this workshop, you will notice that the Source stage is initially in a `Failed` state. 
This happens because the dataset, which is one of the sources that can trigger the pipeline, has not yet been uploaded to the S3 location expected by the pipeline.\n", - "\n", - "![Failed code pipeline](../docs/pipeline_failed.png)\n", - "\n", - "### Trigger Build\n", - "\n", - "In this section, you will start a model build and deployment pipeline by packaging up the datasets you prepared in the previous section and uploading these to the S3 source location which triggers the CodePipeline instance created for this workshop. \n", - "\n", - "First, import some libraries and load some environment variables which you will need. These environment variables have been set through a [lifecycle configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html)⇗ script attached to this notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import boto3\n", - "from botocore.exceptions import ClientError\n", - "import os\n", - "import time\n", - "\n", - "region = boto3.Session().region_name\n", - "artifact_bucket = os.environ['ARTIFACT_BUCKET']\n", - "pipeline_name = os.environ['PIPELINE_NAME']\n", - "model_name = os.environ['MODEL_NAME']\n", - "workflow_pipeline_arn = os.environ['WORKFLOW_PIPELINE_ARN']\n", - "\n", - "print('region: {}'.format(region))\n", - "print('artifact bucket: {}'.format(artifact_bucket))\n", - "print('pipeline: {}'.format(pipeline_name))\n", - "print('model name: {}'.format(model_name))\n", - "print('workflow: {}'.format(workflow_pipeline_arn))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "From the AWS CodePipeline [documentation](https://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-simple-s3.html)⇗:\n", - "\n", - "> When Amazon S3 is the source provider for your pipeline, you may zip your source file or files into a single .zip and upload the .zip to your source bucket. 
You may also upload a single unzipped file; however, downstream actions that expect a .zip file will fail.\n", - "\n", - "To train a model, you need multiple datasets (train, validation, and test) along with a file specifying the hyperparameters. In this example, you will create one JSON file which contains the S3 dataset locations and one JSON file which contains the hyperparameter values. Then you compress both files into a zip package to be used as input for the pipeline run. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from io import BytesIO\n", - "import zipfile\n", - "import json\n", - "\n", - "input_data = {\n", - " 'TrainingUri': s3_train_uri,\n", - " 'ValidationUri': s3_val_uri,\n", - " 'TestUri': s3_test_uri,\n", - " 'BaselineUri': s3_baseline_uri\n", - "}\n", - "\n", - "hyperparameters = {\n", - " 'num_round': 50\n", - "}\n", - "\n", - "zip_buffer = BytesIO()\n", - "with zipfile.ZipFile(zip_buffer, 'a') as zf:\n", - " zf.writestr('inputData.json', json.dumps(input_data))\n", - " zf.writestr('hyperparameters.json', json.dumps(hyperparameters))\n", - "zip_buffer.seek(0)\n", - "\n", - "data_source_key = '{}/data-source.zip'.format(pipeline_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now upload the zip package to your artifact S3 bucket - this action will trigger the pipeline to train and deploy a model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "s3 = boto3.client('s3')\n", - "s3.put_object(Bucket=artifact_bucket, Key=data_source_key, Body=bytearray(zip_buffer.read()))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Click the link below to open the AWS console at the Code Pipeline if you don't have it open in another tab.\n", - "\n", - "
\n", - " Tip: You may need to wait a minute to see the DataSource stage turn green. The page will refresh automatically.\n", - "
\n", - "\n", - "![Source Green](../docs/datasource-after.png)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.core.display import HTML\n", - "\n", - "HTML('Code Pipeline'.format(region, pipeline_name))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Inspect Build Logs\n", - "\n", - "Once the build stage is running, you will see the AWS CodeBuild job turn blue with a status of **In progress**.\n", - "\n", - "![Failed code pipeline](../docs/codebuild-inprogress.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can click on the **Details** link displayed in the CodePipeline UI or click the link below to jump directly to the CodeBuild logs.\n", - "\n", - "
\n", - " Tip: You may need to wait a few seconds for the pipeline to transition into the active (blue) state and for the build to start.\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "codepipeline = boto3.client('codepipeline')\n", - "\n", - "def get_pipeline_stage(pipeline_name, stage_name):\n", - " response = codepipeline.get_pipeline_state(name=pipeline_name)\n", - " for stage in response['stageStates']:\n", - " if stage['stageName'] == stage_name:\n", - " return stage\n", - "\n", - "# Get last execution id\n", - "build_stage = get_pipeline_stage(pipeline_name, 'Build') \n", - "if not 'latestExecution' in build_stage:\n", - " raise(Exception('Please wait. Build not started'))\n", - "\n", - "build_url = build_stage['actionStates'][0]['latestExecution']['externalExecutionUrl']\n", - "\n", - "# Out a link to the code build logs\n", - "HTML('Code Build Logs'.format(build_url))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The AWS CodeBuild process is responsible for creating a number of AWS CloudFormation templates which we will explore in more detail in the next section. Two of these templates are used to set up the **Train** step by creating the AWS Step Functions worklow and the custom AWS Lambda functions used within this workflow." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Train Model\n", - "\n", - "### Inspect Training Job\n", - "\n", - "Wait until the pipeline has started running the Train step (see screenshot) before continuing with the next cells in this notebook. \n", - "\n", - "![Training in progress](../docs/train-in-progress.png)\n", - "\n", - "When the pipeline has started running the train step, you can click on the **Details** link displayed in the CodePipeline UI (see screenshot above) to view the Step Functions workflow which is running the training job. \n", - "\n", - "Alternatively, you can click on the Workflow link from the cell output below once it's available." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from stepfunctions.workflow import Workflow\n", - "while True:\n", - " try:\n", - " workflow = Workflow.attach(workflow_pipeline_arn)\n", - " break\n", - " except ClientError as e:\n", - " print(e.response[\"Error\"][\"Message\"])\n", - " time.sleep(10)\n", - "\n", - "workflow" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Or simply run the cell below to display the Step Functions workflow, and re-run it after a few minutes to see the progress." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "executions = workflow.list_executions()\n", - "if not executions:\n", - " raise(Exception('Please wait. Training not started'))\n", - " \n", - "executions[0].render_progress()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Review Build Script\n", - "\n", - "While you wait for the training job to complete, let's take a look at the `run.py` code which was used by the AWS CodeBuild process.\n", - "\n", - "This script takes all of the input parameters, including the dataset locations and hyperparameters which you saved to JSON files earlier in this notebook, and uses them to generate the templates which the pipeline needs to run the training job. It *does not* create the actual Step Functions instance - it only generates the templates which define the Step Functions workflow, as well as the CloudFormation input templates which CodePipeline uses to instantiate the Step Functions instance.\n", - "\n", - "Step-by-step, the script does the following:\n", - "\n", - "1. It collects all the input parameters it needs to generate the templates. This includes information about the environment container needed to run the training job, the input and output data locations, IAM roles needed by various components, encryption keys, and more. 
It then sets up some basic parameters like the AWS region and the function names.\n", - "1. If the input parameters specify an environment container stored in ECR, it fetches that container. Otherwise, it fetches the URI of the AWS managed environment container needed for the training job.\n", - "1. It reads the input data JSON file which you generated earlier in this notebook (and which was included in the zip source for the pipeline), thereby fetching the locations of the train, validation, and baseline data files. Then it formats more parameters which will be needed later in the script, including version IDs and output data locations.\n", - "1. It reads the hyperparameter JSON file which you generated earlier in this notebook.\n", - "1. It defines the Step Functions workflow, starting with the input schema, followed by each step of the workflow (i.e. Create Experiment, Baseline Job, Training Job), and finally combines those steps into a workflow graph. \n", - "1. The workflow graph is saved to file, along with a file containing all of the input parameters saved according to the schema defined in the workflow.\n", - "1. It saves parameters to file which will be used by CloudFormation to instantiate the Step Functions workflow." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pygmentize ../model/run.py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Customize Workflow (Optional)\n", - "\n", - "If you are interested in customising the workflow used in the Build Script, store the `input_data` to be used within the local [workflow.ipynb](workflow.ipynb) notebook. The workflow notebook can be used to experiment with the Step Functions workflow and training job definitions for your model." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "%store input_data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Training Analytics\n", - "\n", - "Once the training and baseline jobs are complete (they turn green in the Step Functions workflow, which takes around 5 minutes), you can inspect the experiment metrics. The code below will display all experiments in a table. Note that the baseline processing job won't have RMSE metrics - it calculates metrics based on the training data, but does not train a machine learning model. \n", - "\n", - "You will [explore the baseline](#Explore-Baseline) results later in this notebook. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker import analytics\n", - "experiment_name = 'mlops-{}'.format(model_name)\n", - "model_analytics = analytics.ExperimentAnalytics(experiment_name=experiment_name)\n", - "analytics_df = model_analytics.dataframe()\n", - "\n", - "if (analytics_df.shape[0] == 0):\n", - " raise(Exception('Please wait. No training or baseline jobs'))\n", - "\n", - "pd.set_option('display.max_colwidth', 100) # Increase column width to show full component name\n", - "cols = ['TrialComponentName', 'DisplayName', 'SageMaker.InstanceType', \n", - " 'train:rmse - Last', 'validation:rmse - Last'] # return the last rmse for training and validation\n", - "analytics_df[analytics_df.columns.intersection(cols)].head(2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Deploy Dev\n", - "\n", - "### Test Dev Deployment\n", - "\n", - "When the pipeline has finished training a model, it automatically moves to the next step, where the model is deployed as a SageMaker Endpoint. 
This endpoint is part of your dev deployment, therefore, in this section, you will run some tests on the endpoint to decide if you want to deploy this model into production.\n", - "\n", - "First, run the cell below to fetch the name of the SageMaker Endpoint." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "codepipeline = boto3.client('codepipeline')\n", - "\n", - "deploy_dev = get_pipeline_stage(pipeline_name, 'DeployDev')\n", - "if not 'latestExecution' in deploy_dev:\n", - " raise(Exception('Please wait. Deploy dev not started'))\n", - " \n", - "execution_id = deploy_dev['latestExecution']['pipelineExecutionId']\n", - "dev_endpoint_name = 'mlops-{}-dev-{}'.format(model_name, execution_id)\n", - "\n", - "print('endpoint name: {}'.format(dev_endpoint_name))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you moved through the previous section very quickly, you will need to wait until the dev endpoint has been successfully deployed and the pipeline is waiting for approval to deploy to production (see screenshot). It can take up to 10 minutes for SageMaker to create an endpoint.\n", - "\n", - "![Deploying dev endpoint in code pipeline](../docs/dev-deploy-ready.png)\n", - "\n", - "Alternatively, run the code below to check the status of your endpoint. Wait until the status of the endpoint is 'InService'." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sm = boto3.client('sagemaker')\n", - "\n", - "while True:\n", - " try:\n", - " response = sm.describe_endpoint(EndpointName=dev_endpoint_name)\n", - " print(\"Endpoint status: {}\".format(response['EndpointStatus']))\n", - " if response['EndpointStatus'] == 'InService':\n", - " break\n", - " except ClientError as e:\n", - " print(e.response[\"Error\"][\"Message\"])\n", - " time.sleep(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that your endpoint is ready, let's write some code to run the test data (which you split off from the dataset and saved to file at the start of this notebook) through the endpoint for inference. The code below supports both v1 and v2 of the SageMaker SDK, but we recommend using v2 of the SDK in all of your future projects." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "from tqdm import tqdm\n", - "\n", - "try:\n", - " # Support SageMaker v2 SDK: https://sagemaker.readthedocs.io/en/stable/v2.html\n", - " from sagemaker.predictor import Predictor\n", - " from sagemaker.serializers import CSVSerializer\n", - " def get_predictor(endpoint_name):\n", - " xgb_predictor = Predictor(endpoint_name)\n", - " xgb_predictor.serializer = CSVSerializer()\n", - " return xgb_predictor\n", - "except ImportError:\n", - " # Fallback to SageMaker v1.70 SDK\n", - " from sagemaker.predictor import RealTimePredictor, csv_serializer\n", - " def get_predictor(endpoint_name):\n", - " xgb_predictor = RealTimePredictor(endpoint_name)\n", - " xgb_predictor.content_type = 'text/csv'\n", - " xgb_predictor.serializer = csv_serializer\n", - " return xgb_predictor\n", - "\n", - "def predict(predictor, data, rows=500):\n", - " # Split into batches of roughly `rows` records; always use at least one batch\n", - " split_array = np.array_split(data, max(1, round(data.shape[0] / float(rows))))\n", - " predictions = ''\n", - " for array in 
tqdm(split_array):\n", - " predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])\n", - " return np.fromstring(predictions[1:], sep=',')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now use the `predict` function, which was defined in the code above, to run the test data through the endpoint and generate the predictions." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dev_predictor = get_predictor(dev_endpoint_name)\n", - "predictions = predict(dev_predictor, test_df[test_df.columns[1:]].values)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, load the predictions into a data frame, and join it with your test data. Then, calculate absolute error as the difference between the actual taxi fare and the predicted taxi fare. Display the results in a table, sorted by the highest absolute error values." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pred_df = pd.DataFrame({'total_amount_predictions': predictions })\n", - "pred_df = test_df.join(pred_df) # Join on all\n", - "pred_df['error'] = abs(pred_df['total_amount']-pred_df['total_amount_predictions'])\n", - "\n", - "pred_df.sort_values('error', ascending=False).head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "From this table, we note that some short trip distances have large errors because the low predicted fare does not match the high actual fare. This could be the result of a generous tip which we haven't included in this dataset.\n", - "\n", - "You can also analyze the results by plotting the absolute error to visualize outliers. In this graph, we see that most of the outliers are cases where the model predicted a much lower fare than the actual fare. There are only a few outliers where the model predicted a higher fare than the actual fare." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sns.scatterplot(data=pred_df, x='total_amount_predictions', y='total_amount', hue='error')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you want one overall measure of quality for the model, you can calculate the root mean square error (RMSE) for the predicted fares compared to the actual fares. Compare this to the [results calculated on the validation set](#validation-results) at the end of the 'Inspect Training Job' section." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from math import sqrt\n", - "from sklearn.metrics import mean_squared_error\n", - "\n", - "def rmse(pred_df):\n", - " return sqrt(mean_squared_error(pred_df['total_amount'], pred_df['total_amount_predictions']))\n", - "\n", - "print('RMSE: {}'.format(rmse(pred_df)))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Deploy Prod\n", - "\n", - "### Approve Deployment to Production\n", - "\n", - "If you are happy with the results of the model, you can go ahead and approve the model to be deployed into production. You can do so by clicking the **Review** button in the CodePipeline UI, leaving a comment to explain why you approve this model, and clicking on **Approve**. \n", - "\n", - "Alternatively, you can create a Jupyter widget which (when enabled) allows you to comment and approve the model directly from this notebook. Run the cell below to see this in action." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import ipywidgets as widgets\n", - "\n", - "def on_click(obj):\n", - " result = { 'summary': approval_text.value, 'status': obj.description }\n", - " response = codepipeline.put_approval_result(\n", - " pipelineName=pipeline_name,\n", - " stageName='DeployDev',\n", - " actionName='ApproveDeploy',\n", - " result=result,\n", - " token=approval_action['token']\n", - " )\n", - " button_box.close()\n", - " print(result)\n", - " \n", - "# Create the widget if we are ready for approval\n", - "deploy_dev = get_pipeline_stage(pipeline_name, 'DeployDev')\n", - "if not 'latestExecution' in deploy_dev['actionStates'][-1]:\n", - " raise(Exception('Please wait. Deploy dev not complete'))\n", - "\n", - "approval_action = deploy_dev['actionStates'][-1]['latestExecution']\n", - "if approval_action['status'] == 'Succeeded':\n", - " print('Dev approved: {}'.format(approval_action['summary']))\n", - "elif 'token' in approval_action:\n", - " approval_text = widgets.Text(placeholder='Optional approval message') \n", - " approve_btn = widgets.Button(description=\"Approved\", button_style='success', icon='check')\n", - " reject_btn = widgets.Button(description=\"Rejected\", button_style='danger', icon='close')\n", - " approve_btn.on_click(on_click)\n", - " reject_btn.on_click(on_click)\n", - " button_box = widgets.HBox([approval_text, approve_btn, reject_btn])\n", - " display(button_box)\n", - "else:\n", - " raise(Exception('Please wait. No dev approval'))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Test Production Deployment\n", - "\n", - "Within about a minute after approving the model deployment, you should see the pipeline start on the final step: deploying your model into production. 
In this section, you will check the deployment status and test the production endpoint after it has been deployed.\n", - "\n", - "![Deploy production endpoint in code pipeline](../docs/deploy-production.png)\n", - "\n", - "This step of the pipeline uses CloudFormation to deploy a number of resources on your behalf. In particular, it creates:\n", - "\n", - "1. A production-ready SageMaker Endpoint for your model, with [data capture](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-capture.html)⇗ (used by SageMaker Model Monitor) and [autoscaling](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html)⇗ enabled.\n", - "1. A [model monitoring schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-scheduling.html)⇗ which outputs the results to CloudWatch metrics, along with a [CloudWatch Alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html)⇗ which will notify you when a violation occurs. \n", - "1. A CodeDeploy instance which creates a simple app by deploying API Gateway, three Lambda functions, and an alarm to notify of the success or failure of this deployment. The code for the Lambda functions can be found in `api/app.py`, `api/pre_traffic_hook.py`, and `api/post_traffic_hook.py`. These functions update the endpoint to enable data capture, format and submit incoming traffic to the SageMaker endpoint, and capture the data logs.\n", - "\n", - "![Components of production deployment](../docs/cloud-formation.png)\n", - "\n", - "Let's check how the deployment is progressing. Use the code below to fetch the execution ID of the deployment step. Then generate a table which lists the resources created by the CloudFormation stack and their creation status. You can re-run the cell after a few minutes to see how the steps are progressing."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "deploy_prd = get_pipeline_stage(pipeline_name, 'DeployPrd')\n", - "if not 'latestExecution' in deploy_prd or not 'latestExecution' in deploy_prd['actionStates'][0]:\n", - " raise(Exception('Please wait. Deploy prd not started'))\n", - " \n", - "execution_id = deploy_prd['latestExecution']['pipelineExecutionId']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from datetime import datetime, timedelta\n", - "from dateutil.tz import tzlocal\n", - "\n", - "def get_event_dataframe(events):\n", - " stack_cols = ['LogicalResourceId', 'ResourceStatus', 'ResourceStatusReason', 'Timestamp']\n", - " stack_event_df = pd.DataFrame(events)[stack_cols].fillna('')\n", - " stack_event_df['TimeAgo'] = (datetime.now(tzlocal())-stack_event_df['Timestamp'])\n", - " return stack_event_df.drop('Timestamp', axis=1)\n", - "\n", - "cfn = boto3.client('cloudformation')\n", - "\n", - "stack_name = '{}-deploy-prd'.format(pipeline_name)\n", - "print('stack name: {}'.format(stack_name))\n", - "\n", - "# Get latest stack events\n", - "while True:\n", - " try:\n", - " response = cfn.describe_stack_events(StackName=stack_name)\n", - " break\n", - " except ClientError as e:\n", - " print(e.response[\"Error\"][\"Message\"])\n", - " time.sleep(10)\n", - " \n", - "get_event_dataframe(response['StackEvents']).head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The resource of most interest to us is the endpoint. This takes on average 10 minutes to deploy. In the meantime, you can take a look at the Python code used for the application. \n", - "\n", - "The `app.py` file is the main entry point that invokes the Amazon SageMaker endpoint. It returns the results along with a custom header identifying the endpoint that was invoked."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pygmentize ../api/app.py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `pre_traffic_hook.py` Lambda function is invoked prior to deployment and confirms the endpoint has data capture enabled." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pygmentize ../api/pre_traffic_hook.py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `post_traffic_hook.py` Lambda function is invoked to perform any final checks, in this case to verify that we have received log data from data capture." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pygmentize ../api/post_traffic_hook.py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use the code below to fetch the name of the endpoint, then run a loop to wait for the endpoint to be fully deployed. You need the status to be 'InService'."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "prd_endpoint_name='mlops-{}-prd-{}'.format(model_name, execution_id)\n", - "print('prod endpoint: {}'.format(prd_endpoint_name))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sm = boto3.client('sagemaker')\n", - "\n", - "while True:\n", - " try:\n", - " response = sm.describe_endpoint(EndpointName=prd_endpoint_name)\n", - " print(\"Endpoint status: {}\".format(response['EndpointStatus']))\n", - " # Wait until the endpoint is in service with data capture enabled\n", - " if response['EndpointStatus'] == 'InService' \\\n", - " and 'DataCaptureConfig' in response \\\n", - " and response['DataCaptureConfig']['EnableCapture']:\n", - " break\n", - " except ClientError as e:\n", - " print(e.response[\"Error\"][\"Message\"])\n", - " time.sleep(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When the endpoint status is 'InService', you can continue. Earlier in this notebook, you created some code to send data to the dev endpoint. Reuse this code now to send a sample of the test data to the production endpoint. Since data capture is enabled on this endpoint, you want to send single records at a time, so the model monitor can map these records to the baseline. \n", - "\n", - "You will [inspect the model monitor](#Inspect-Model-Monitor) later in this notebook. For now, just check if you can send data to the endpoint and receive predictions in return." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "prd_predictor = get_predictor(prd_endpoint_name)\n", - "sample_values = test_df[test_df.columns[1:]].sample(100).values\n", - "predictions = predict(prd_predictor, sample_values, rows=1)\n", - "predictions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Test REST API\n", - "\n", - "Although you already tested the SageMaker endpoint in the previous section, it is also a good idea to test the application created with API Gateway. \n", - "\n", - "![Traffic shift between endpoints](../docs/lambda-deploy-create.png)\n", - "\n", - "Follow the link below to open the Lambda Deployment where you can see the in-progress and completed deployments. You can also click to expand the **SAM template** to see the packaged CloudFormation template used in the deployment." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "HTML('Lambda Deployment'.format(region, model_name))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Run the code below to confirm that the endpoint is in service. It will complete once the REST API is available." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def get_stack_status(stack_name):\n", - " response = cfn.describe_stacks(StackName=stack_name)\n", - " if response['Stacks']:\n", - " stack = response['Stacks'][0]\n", - " outputs = None\n", - " if 'Outputs' in stack:\n", - " outputs = dict([(o['OutputKey'], o['OutputValue']) for o in stack['Outputs']])\n", - " return stack['StackStatus'], outputs \n", - "\n", - "outputs = None\n", - "while True:\n", - " try:\n", - " status, outputs = get_stack_status(stack_name)\n", - " response = sm.describe_endpoint(EndpointName=prd_endpoint_name)\n", - " print(\"Endpoint status: {}\".format(response['EndpointStatus']))\n", - " if outputs:\n", - " break\n", - " elif status.endswith('FAILED'):\n", - " raise(Exception('Stack status: {}'.format(status)))\n", - " except ClientError as e:\n", - " print(e.response[\"Error\"][\"Message\"])\n", - " time.sleep(10)\n", - "\n", - "if outputs:\n", - " print('deployment application: {}'.format(outputs['DeploymentApplication']))\n", - " print('rest api: {}'.format(outputs['RestApi']))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you are performing an update on your production deployment as a result of running [Trigger Retraining](#Trigger-Retraining) you will then be able to expand the Lambda Deployment tab to reveal the resources. Click on the **ApiFunctionAliaslive** link to see the Lambda Deployment in progress. \n", - "\n", - "![Traffic shift between endpoints](../docs/lambda-deploy-update.png)\n", - "\n", - "This page will be updated to list the deployment events. It also has a link to the Deployment Application which you can access in the output of the next cell." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "HTML('CodeDeploy application'.format(region, outputs['DeploymentApplication']))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "CodeDeploy will perform a canary deployment and send 10% of the traffic to the new endpoint over a 5-minute period.\n", - "\n", - "![Traffic shift between endpoints](../docs/code-deploy.gif)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can invoke the REST API and inspect the headers being returned to see which endpoint we are hitting. You will occasionally see the cell below show a different endpoint that settles to the new version once the stack is complete. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false - }, - "outputs": [], - "source": [ - "%%time\n", - "\n", - "from urllib import request\n", - "\n", - "headers = {\"Content-type\": \"text/csv\"}\n", - "payload = test_df[test_df.columns[1:]].head(1).to_csv(header=False, index=False).encode('utf-8')\n", - "rest_api = outputs['RestApi']\n", - "\n", - "while True:\n", - " try:\n", - " resp = request.urlopen(request.Request(rest_api, data=payload, headers=headers))\n", - " print(\"Response code: %d: endpoint: %s\" % (resp.getcode(), resp.getheader('x-sagemaker-endpoint')))\n", - " status, outputs = get_stack_status(stack_name) \n", - " if status.endswith('COMPLETE'):\n", - " print('Deployment complete\\n')\n", - " break\n", - " elif status.endswith('FAILED'):\n", - " raise(Exception('Stack status: {}'.format(status)))\n", - " except ClientError as e:\n", - " print(e.response[\"Error\"][\"Message\"])\n", - " time.sleep(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Monitor\n", - "\n", - "### Inspect Model Monitor\n", - "\n", - "When you prepared the datasets for model training at the start of this notebook, you saved a baseline 
dataset (a copy of the train dataset). Then, when you approved the model for deployment into production, the pipeline set up a SageMaker Endpoint with data capture enabled and a model monitoring schedule. In this section, you will take a closer look at the model monitor results.\n", - "\n", - "To start off, fetch the latest production deployment execution ID." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "deploy_prd = get_pipeline_stage(pipeline_name, 'DeployPrd')\n", - "if not 'latestExecution' in deploy_prd:\n", - " raise(Exception('Please wait. Deploy prod not complete'))\n", - " \n", - "execution_id = deploy_prd['latestExecution']['pipelineExecutionId']" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Under the hood, SageMaker model monitor runs as SageMaker processing jobs. Use the execution ID to fetch the names of the processing job and the schedule." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "processing_job_name='mlops-{}-pbl-{}'.format(model_name, execution_id)\n", - "schedule_name='mlops-{}-pms'.format(model_name)\n", - "\n", - "print('processing job name: {}'.format(processing_job_name))\n", - "print('schedule name: {}'.format(schedule_name))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Explore Baseline\n", - "\n", - "Now fetch the baseline results from the processing job. This cell will throw an exception if the processing job is not complete - if that happens, just wait several minutes and try again. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "import sagemaker\n", - "from sagemaker.model_monitor import BaseliningJob, MonitoringExecution\n", - "from sagemaker.s3 import S3Downloader\n", - "\n", - "sagemaker_session = sagemaker.Session()\n", - "baseline_job = BaseliningJob.from_processing_name(sagemaker_session, processing_job_name)\n", - "status = baseline_job.describe()['ProcessingJobStatus']\n", - "if status != 'Completed':\n", - " raise(Exception('Please wait. Processing job not complete, status: {}'.format(status)))\n", - " \n", - "baseline_results_uri = baseline_job.outputs[0].destination" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "SageMaker model monitor generates two types of files. Take a look at the statistics file first. It calculates various statistics for each feature of the dataset, including the mean, standard deviation, minimum value, maximum value, and more. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import json\n", - "\n", - "baseline_statistics = baseline_job.baseline_statistics().body_dict\n", - "schema_df = pd.json_normalize(baseline_statistics[\"features\"])\n", - "schema_df[[\"name\", \"numerical_statistics.mean\", \"numerical_statistics.std_dev\",\n", - " \"numerical_statistics.min\", \"numerical_statistics.max\"]].head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now look at the suggested [constraints files](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-byoc-constraints.html)⇗. As the name implies, these are constraints which SageMaker model monitor recommends. If the live data which is sent to your production SageMaker Endpoint violates these constraints, this indicates data drift, and model monitor can raise an alert to trigger retraining. 
Of course, you can set different constraints based on the statistics which you viewed previously." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "baseline_constraints = baseline_job.suggested_constraints().body_dict\n", - "constraints_df = pd.json_normalize(baseline_constraints[\"features\"])\n", - "constraints_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### View data capture\n", - "\n", - "When the \"Deploy Production\" stage of the MLOps pipeline deploys a SageMaker endpoint, it also enables data capture. This means the incoming requests to the endpoint, as well as the results from the ML model, are stored in an S3 location. Model monitor can analyze this data and compare it to the baseline to ensure that no constraints are violated. \n", - "\n", - "Use the code below to check how many files have been created by the data capture, and view the latest file in detail. Note, data capture relies on data being sent to the production endpoint. If you don't see any files yet, wait several minutes and try again." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "bucket = sagemaker_session.default_bucket()\n", - "data_capture_logs_uri = 's3://{}/{}/datacapture/{}'.format(bucket, model_name, prd_endpoint_name)\n", - "\n", - "capture_files = S3Downloader.list(data_capture_logs_uri)\n", - "print('Found {} files'.format(len(capture_files)))\n", - "\n", - "if capture_files:\n", - " # Get the first line of the most recent file \n", - " event = json.loads(S3Downloader.read_file(capture_files[-1]).split('\\n')[0])\n", - " print('\\nLast file:\\n{}'.format(json.dumps(event, indent=2)))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### View monitoring schedule\n", - "\n", - "There are some useful functions for plotting and rendering distribution statistics or constraint violations provided in a `utils` file in the [SageMaker Examples GitHub](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker_model_monitor/visualization)⇗. Grab a copy of this code to use in this notebook. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!wget -O utils.py --quiet https://raw.githubusercontent.com/awslabs/amazon-sagemaker-examples/master/sagemaker_model_monitor/visualization/utils.py\n", - "import utils as mu" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [minimum scheduled run time](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-scheduling.html)⇗ for model monitor is one hour, which means you will need to wait at least an hour to see any results. Use the code below to check the schedule status and list the next run. If you are completing this notebook as part of a workshop, your host will have activities which you can complete while you wait. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sm = boto3.client('sagemaker')\n", - "\n", - "response = sm.describe_monitoring_schedule(MonitoringScheduleName=schedule_name)\n", - "print('Schedule Status: {}'.format(response['MonitoringScheduleStatus']))\n", - "\n", - "now = datetime.now(tzlocal())\n", - "next_hour = (now+timedelta(hours=1)).replace(minute=0)\n", - "scheduled_diff = (next_hour-now).seconds//60\n", - "print('Next schedule in {} minutes'.format(scheduled_diff))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "While you wait, you can take a look at the CloudFormation template which is used as a base for the CloudFormation template built by CodeDeploy to deploy the production application. \n", - "\n", - "Alternatively, you can jump ahead to [Trigger Retraining](#Trigger-Retraining), which will kick off another run of the code pipeline while you wait." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!cat ../assets/deploy-model-prd.yml" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A couple of minutes after the model monitoring schedule has run, you can use the code below to fetch the latest schedule status. A completed schedule run may have found violations. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "processing_job_arn = None\n", - "\n", - "while processing_job_arn == None:\n", - " try:\n", - " response = sm.list_monitoring_executions(MonitoringScheduleName=schedule_name)\n", - " except ClientError as e:\n", - " print(e.response[\"Error\"][\"Message\"])\n", - " for mon in response['MonitoringExecutionSummaries']:\n", - " status = mon['MonitoringExecutionStatus']\n", - " now = datetime.now(tzlocal())\n", - " created_diff = (now-mon['CreationTime']).seconds//60\n", - " print('Schedule status: {}, Created: {} minutes ago'.format(status, created_diff))\n", - " if status in ['Completed', 'CompletedWithViolations']:\n", - " processing_job_arn = mon['ProcessingJobArn']\n", - " break\n", - " if status == 'InProgress':\n", - " break\n", - " else:\n", - " raise(Exception('Please wait. No Schedules executing'))\n", - " time.sleep(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### View monitoring results\n", - "\n", - "Once the model monitoring schedule has had a chance to run at least once, you can take a look at the results. First, load the monitoring execution results from the latest scheduled run." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "if processing_job_arn:\n", - " execution = MonitoringExecution.from_processing_arn(sagemaker_session=sagemaker.Session(),\n", - " processing_job_arn=processing_job_arn)\n", - " exec_inputs = {inp['InputName']: inp for inp in execution.describe()['ProcessingInputs']}\n", - " exec_results_uri = execution.output.destination\n", - "\n", - " print('Monitoring Execution results: {}'.format(exec_results_uri))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Take a look at the files which have been saved in the S3 output location. 
If violations were found, you should see a constraint violations file in addition to the statistics and constraints file which you viewed before." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!aws s3 ls $exec_results_uri/" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, fetch the monitoring statistics and violations. Then use the utils code to visualize the results in a table. It will highlight any baseline drift found by the model monitor. Drift can happen for categorical features (for inferred string styles) or for numerical features (e.g. total fare amount)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Get the baseline and monitoring statistics & violations\n", - "baseline_statistics = baseline_job.baseline_statistics().body_dict\n", - "execution_statistics = execution.statistics().body_dict\n", - "violations = execution.constraint_violations().body_dict['violations']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "mu.show_violation_df(baseline_statistics=baseline_statistics, \n", - " latest_statistics=execution_statistics, \n", - " violations=violations)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Trigger Retraining\n", - "\n", - "The CodePipeline instance is configured with [CloudWatch Events](https://docs.aws.amazon.com/codepipeline/latest/userguide/create-cloudtrail-S3-source.html)⇗ to start the pipeline for retraining when the drift detection triggers specific metric alarms.\n", - "\n", - "You can simulate drift by putting a metric value above the threshold of `0.2` directly into CloudWatch. This will trigger the alarm, and start the code pipeline.\n", - "\n", - "
\n", - " Tip: This alarm is configured only for the latest production endpoint, so re-training will only occur if you are putting metrics against the latest endpoint.\n", - "
\n", - "\n", - "![Metric graph in CloudWatch](../docs/cloudwatch-alarm.png)\n", - "\n", - "Run the code below to trigger the metric alarm. The cell output will be a link to CloudWatch, where you can see the alarm (similar to the screenshot above), and a link to CodePipeline which you will see run again. Note that it can take a couple of minutes for everything to trigger." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from datetime import datetime\n", - "import random\n", - "\n", - "cloudwatch = boto3.client('cloudwatch')\n", - "\n", - "# Define the metric name and threshold\n", - "metric_name = 'feature_baseline_drift_total_amount'\n", - "metric_threshold = 0.2\n", - "\n", - "# Put a new metric to trigger an alaram\n", - "def put_drift_metric(value):\n", - " print('Putting metric: {}'.format(value))\n", - " response = cloudwatch.put_metric_data(\n", - " Namespace='aws/sagemaker/Endpoints/data-metrics',\n", - " MetricData=[\n", - " {\n", - " 'MetricName': metric_name,\n", - " 'Dimensions': [\n", - " {\n", - " 'Name': 'MonitoringSchedule',\n", - " 'Value': schedule_name\n", - " },\n", - " {\n", - " 'Name': 'Endpoint',\n", - " 'Value': prd_endpoint_name\n", - " },\n", - " ],\n", - " 'Timestamp': datetime.now(),\n", - " 'Value': value,\n", - " 'Unit': 'None'\n", - " },\n", - " ]\n", - " )\n", - " \n", - "def get_drift_stats():\n", - " response = cloudwatch.get_metric_statistics(\n", - " Namespace='aws/sagemaker/Endpoints/data-metrics',\n", - " MetricName=metric_name,\n", - " Dimensions=[\n", - " {\n", - " 'Name': 'MonitoringSchedule',\n", - " 'Value': schedule_name\n", - " },\n", - " {\n", - " 'Name': 'Endpoint',\n", - " 'Value': prd_endpoint_name\n", - " },\n", - " ],\n", - " StartTime=datetime.now() - timedelta(minutes=2),\n", - " EndTime=datetime.now(),\n", - " Period=1,\n", - " Statistics=['Average'],\n", - " Unit='None'\n", - " )\n", - " if 'Datapoints' in response and len(response['Datapoints']) 
> 0: \n", - " return response['Datapoints'][0]['Average']\n", - " return 0 \n", - "\n", - "print('Simluate drift on endpoint: {}'.format(prd_endpoint_name))\n", - "\n", - "while True:\n", - " put_drift_metric(round(random.uniform(metric_threshold, 1.0), 4))\n", - " drift_stats = get_drift_stats()\n", - " print('Average drift amount: {}'.format(get_drift_stats()))\n", - " if drift_stats > metric_threshold:\n", - " break\n", - " time.sleep(1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Click through to the Alarm and CodePipeline Execution history with the links below." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Output a html link to the cloudwatch dashboard\n", - "metric_alarm_name = 'mlops-{}-metric-gt-threshold'.format(model_name)\n", - "HTML('''CloudWatch Alarm triggers\n", - " Code Pipeline Execution'''.format(region, metric_alarm_name, pipeline_name))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Once the pipeline is running again you can jump back up to [Inspect Training Job](#Inspect-Training-Job)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create Synthetic Monitoring\n", - "\n", - "[Amazon CloudWatch Synthetics](https://aws.amazon.com/blogs/aws/new-use-cloudwatch-synthetics-to-monitor-sites-api-endpoints-web-workflows-and-more/) allows you to monitor sites, REST APIs, and other services deployed on AWS. You can set up a canary to test that your REST API is returning an expected value at a regular interval. This is a great way to validate that the blue/green deployment is not causing any downtime for your end-users.\n", - "\n", - "Use the code below to set up a canary to continuously test the production deployment. This canary simply pings the REST API to test if it is live, using code from `notebook/canary.js`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from urllib.parse import urlparse\n", - "from string import Template\n", - "from io import BytesIO\n", - "import zipfile\n", - "\n", - "# Format the canary_js with rest_api and payload\n", - "rest_url = urlparse(rest_api)\n", - "\n", - "with open('canary.js') as f:\n", - " canary_js = Template(f.read()).substitute(hostname=rest_url.netloc, path=rest_url.path, \n", - " data=payload.decode('utf-8').strip())\n", - "# Write the zip file\n", - "zip_buffer = BytesIO()\n", - "with zipfile.ZipFile(zip_buffer, 'w') as zf:\n", - " zip_path = 'nodejs/node_modules/apiCanaryBlueprint.js' # Set a valid path\n", - " zip_info = zipfile.ZipInfo(zip_path)\n", - " zip_info.external_attr = 0o0755 << 16 # Ensure the file is readable\n", - " zf.writestr(zip_info, canary_js)\n", - "zip_buffer.seek(0)\n", - "\n", - "# Create the canary\n", - "synth = boto3.client('synthetics')\n", - "\n", - "role = sagemaker.get_execution_role()\n", - "s3_canary_uri = 's3://{}/{}'.format(artifact_bucket, model_name)\n", - "canary_name = 'mlops-{}'.format(model_name)\n", - "\n", - "try:\n", - " response = synth.create_canary(\n", - " Name=canary_name,\n", - " Code={\n", - " 'ZipFile': bytearray(zip_buffer.read()),\n", - " 'Handler': 'apiCanaryBlueprint.handler'\n", - " },\n", - " ArtifactS3Location=s3_canary_uri,\n", - " ExecutionRoleArn=role,\n", - " Schedule={ \n", - " 'Expression': 'rate(10 minutes)', \n", - " 'DurationInSeconds': 0 },\n", - " RunConfig={\n", - " 'TimeoutInSeconds': 60,\n", - " 'MemoryInMB': 960\n", - " },\n", - " SuccessRetentionPeriodInDays=31,\n", - " FailureRetentionPeriodInDays=31,\n", - " RuntimeVersion='syn-nodejs-2.0',\n", - " )\n", - " print('Creating canary: {}'.format(canary_name)) \n", - "except ClientError as e:\n", - " if e.response[\"Error\"][\"Code\"] == \"AccessDeniedException\":\n", - " print('Canary not supported.') # Not supported in event engine\n", - " 
else:\n", - " raise(e)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now create a CloudWatch alarm which will trigger if the success rate of the canary drops below 90%. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "cloudwatch = boto3.client('cloudwatch')\n", - "\n", - "canary_alarm_name = '{}-synth-lt-threshold'.format(canary_name)\n", - "\n", - "response = cloudwatch.put_metric_alarm(\n", - " AlarmName=canary_alarm_name,\n", - " ComparisonOperator='LessThanThreshold',\n", - " EvaluationPeriods=1,\n", - " DatapointsToAlarm=1,\n", - " Period=600, # 10 minute interval\n", - " Statistic='Average',\n", - " Threshold=90.0,\n", - " ActionsEnabled=False,\n", - " AlarmDescription='SuccessPercent LessThanThreshold 90%',\n", - " Namespace='CloudWatchSynthetics',\n", - " MetricName='SuccessPercent',\n", - " Dimensions=[\n", - " {\n", - " 'Name': 'CanaryName',\n", - " 'Value': canary_name\n", - " },\n", - " ],\n", - " Unit='Seconds'\n", - ")\n", - "\n", - "print('Creating alarm: {}'.format(canary_alarm_name))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Run the code below to check if the canary is running succesfully. The cell will output a link to your CloudWatch Canaries UI, where you can watch the results over time (see screenshot). 
It can take a couple of minutes for the canary to deploy.\n", - "\n", - "![Canary graph in CloudWatch](../docs/canary-green-1hr.png)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "while True:\n", - " try:\n", - " response = synth.get_canary(Name=canary_name)\n", - " status = response['Canary']['Status']['State'] \n", - " print('Canary status: {}'.format(status))\n", - " if status == 'ERROR':\n", - " raise(Exception(response['Canary']['Status']['StateReason'])) \n", - " elif status == 'READY':\n", - " synth.start_canary(Name=canary_name)\n", - " elif status == 'RUNNING':\n", - " break \n", - " except ClientError as e:\n", - " if e.response[\"Error\"][\"Code\"] == \"ResourceNotFoundException\":\n", - " print('No canary found.')\n", - " break\n", - " elif e.response[\"Error\"][\"Code\"] == \"AccessDeniedException\":\n", - " print('Canary not supported.') # Not supported in event engine\n", - " break\n", - " print(e.response[\"Error\"][\"Message\"])\n", - " time.sleep(10)\n", - "\n", - "# Output a html link to the cloudwatch console\n", - "HTML('CloudWatch Canary'.format(region, canary_name))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create a CloudWatch dashboard\n", - "\n", - "Finally, use the code below to create a CloudWatch dashboard to visualize the key performance metrics and alarms which you have created during this demo. The cell will output a link to the dashboard. 
This dashboard shows 9 charts in three rows, where the first row displays Lambda metrics, the second row displays SageMaker metrics, and the third row (shown in the screenshot below) displays the alarms set up for the pipeline.\n", - "\n", - "![Graphs in CloudWatch dashboard](../docs/cloudwatch-dashboard.png)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sts = boto3.client('sts')\n", - "account_id = sts.get_caller_identity().get('Account')\n", - "dashboard_name = 'mlops-{}'.format(model_name)\n", - "\n", - "with open('dashboard.json') as f:\n", - " dashboard_body = Template(f.read()).substitute(region=region, account_id=account_id, model_name=model_name)\n", - " response = cloudwatch.put_dashboard(\n", - " DashboardName=dashboard_name,\n", - " DashboardBody=dashboard_body\n", - " )\n", - "\n", - "# Output a html link to the cloudwatch dashboard\n", - "HTML('CloudWatch Dashboard'.format(region, canary_name))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Congratulations! You have made it to the end of this notebook, and have automated a safe MLOps pipeline using a wide range of AWS services. \n", - "\n", - "You can use the other notebook in this repository [workflow.ipynb](workflow.ipynb) to implement your own ML model and deploy it as part of this pipeline. Or, if you are finished with the content, follow the instructions in the next section to clean up the resources you have deployed." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleanup\n", - "\n", - "Execute the following cell to delete the stacks created in the pipeline. For a model name of **nyctaxi** these would be:\n", - "\n", - "1. *nyctaxi*-deploy-prd\n", - "2. *nyctaxi*-deploy-dev\n", - "3. *nyctaxi*-workflow\n", - "4. 
sagemaker-custom-resource" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "cfn = boto3.client('cloudformation')\n", - "\n", - "# Delete the prod and then dev stack\n", - "for stack_name in [f'{pipeline_name}-deploy-prd', \n", - " f'{pipeline_name}-deploy-dev',\n", - " f'{pipeline_name}-workflow',\n", - " 'sagemaker-custom-resource']:\n", - " print('Deleting stack: {}'.format(stack_name))\n", - " cfn.delete_stack(StackName=stack_name)\n", - " cfn.get_waiter('stack_delete_complete').wait(StackName=stack_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following code will stop and delete the canary you created." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "while True:\n", - " try:\n", - " response = synth.get_canary(Name=canary_name)\n", - " status = response['Canary']['Status']['State'] \n", - " print('Canary status: {}'.format(status))\n", - " if status == 'ERROR':\n", - " raise(Exception(response['Canary']['Status']['StateReason'])) \n", - " elif status == 'STOPPED':\n", - " synth.delete_canary(Name=canary_name)\n", - " elif status == 'RUNNING':\n", - " synth.stop_canary(Name=canary_name)\n", - " except ClientError as e:\n", - " if e.response[\"Error\"][\"Code\"] == \"ResourceNotFoundException\":\n", - " print('Canary succesfully deleted.')\n", - " break\n", - " elif e.response[\"Error\"][\"Code\"] == \"AccessDeniedException\":\n", - " print('Canary not created.') # Not supported in event engine\n", - " break\n", - " print(e.response[\"Error\"][\"Message\"])\n", - " time.sleep(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following code will delete the dashboard." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "cloudwatch.delete_alarms(AlarmNames=[canary_alarm_name])\n", - "print('Alarm deleted')\n", - "\n", - "cloudwatch.delete_dashboards(DashboardNames=[dashboard_name])\n", - "print('Dashboard deleted')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, close this notebook and you can delete the CloudFormation you created to launch this MLOps sample." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "conda_python3", - "language": "python", - "name": "conda_python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.13" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} +{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Safe MLOps Deployment Pipeline\n", "\n", "\n", "## Overview\n", "\n", "In this notebook you will step through an MLOps pipeline to build, train, deploy and monitor an XGBoost regression model for predicting the expected taxi fare using the New York City Taxi [dataset](https://registry.opendata.aws/nyc-tlc-trip-records-pds/)\u21d7. This safe pipeline features a [canary deployment](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/canary-deployment.html) strategy with rollback on error. You will learn how to trigger and monitor the pipeline, inspect the training workflow, use model monitor to set up alerts, and create a canary deployment.\n", "\n", "
\n", " Note: This notebook assumes prior familiarity with the basics of training ML models on Amazon SageMaker. Data preparation and visualization, although present, will be kept to a minimum. If you are not familiar with the basic concepts and features of SageMaker, we recommend reading the SageMaker documentation\u21d7 and completing the workshops and samples in AWS SageMaker Examples GitHub\u21d7 and AWS Samples GitHub\u21d7. \n", "
\n", "\n", "### Contents\n", "\n", "This notebook has the following key sections:\n", "\n", "1. [Data Prep](#Data-Prep)\n", "2. [Build](#Build)\n", "3. [Train Model](#Train-Model)\n", "4. [Deploy Dev](#Deploy-Dev)\n", "5. [Deploy Prod](#Deploy-Prod)\n", "6. [Monitor](#Monitor)\n", "7. [Cleanup](#Cleanup)\n", "\n", "### Architecture\n", "\n", "The architecture diagram below shows the entire MLOps pipeline at a high level.\n", "\n", "Use the CloudFormation template provided in this repository (`pipeline.yml`) to build the demo in your own AWS account. If you are currently viewing this notebook from SageMaker in your AWS account, then you have already completed this step. CloudFormation deploys several resources:\n", " \n", "1. A customer-managed encryption key in AWS KMS for encrypting data and artifacts.\n", "1. A secret in AWS Secrets Manager to securely store your GitHub Access Token.\n", "1. Several AWS IAM roles so CloudFormation, SageMaker, and other AWS services can perform actions in your AWS account, following the principle of [least privilege](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#grant-least-privilege)\u21d7.\n", "1. A messaging service in Amazon SNS to notify you when CodeDeploy has successfully deployed the API, and to receive alerts for retraining and drift detection (signing up for these notifications is optional).\n", "1. Two Amazon CloudWatch event rules: one which schedules the pipeline to run every month, and one which triggers the pipeline to run when SageMaker Model Monitor detects certain metrics.\n", "1. An Amazon SageMaker Jupyter notebook with this workshop content pre-loaded.\n", "1. An Amazon S3 bucket for storing model artifacts.\n", "1. An AWS CodePipeline instance with several pre-defined stages. \n", "\n", "Take a moment to look at all of these resources now deployed in your account. 
\n", "\n", "![MLOps pipeline architecture](../docs/mlops-architecture.png)\n", "\n", "In this notebook, you will work through the CodePipeline instance created by the CloudFormation template. It has several stages:\n", "\n", "1. **Source** - The pipeline is already configured with two sources. If you upload a new dataset to a specific location in the S3 data bucket, this will trigger the pipeline to run. The Git source can be GitHub, or CodeCommit if you don\u2019t supply your access token. If you commit new code to your repository, this will trigger the pipeline to run. \n", "1. **Build** - In this stage, CodeBuild configured by the build specification `model/buildspec.yml` will execute `model/run_pipeline.py` to generate AWS CloudFormation templates for creating the AWS Step Function (including AWS Lambda custom resources), and deployment templates used in the following stages based on the data sets and hyperparameters specified for this pipeline run. You will take a closer look at these files later in this notebook. \n", "1. **Train** The Step Functions workflow created in the Build stage is run in this stage. The workflow creates a baseline for the model monitor using a SageMaker processing job, and trains an XGBoost model on the taxi ride dataset using a SageMaker training job.\n", "1. **Deploy Dev** In this stage, a CloudFormation template created in the build stage (from `assets/deploy-model-dev.yml`) deploys a dev endpoint. This will allow you to run tests on the model and decide if the model is of sufficient quality to deploy into production.\n", "1. **Deploy Production** The final stage of the pipeline is the only stage which does not run automatically as soon as the previous stage is complete. It waits for a user to manually approve the model which was previously deployed to dev. 
As soon as the model is approved, a CloudFormation template (packaged from `assets/deploy-model-prod.yml` to include the Lambda functions saved and uploaded as ZIP files in S3) deploys the production endpoint. It configures autoscaling and enables data capture. It creates a model monitoring schedule and sets CloudWatch alarms for certain metrics. It also sets up an AWS CodeDeploy instance which deploys a set of AWS Lambda functions and an Amazon API Gateway to sit in front of the SageMaker endpoint. This stage can make use of canary deployment to safely switch from an old model to a new model."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace `None` with the project name when creating SageMaker Project\n", "# You can find it from the left panel in Studio\n", "\n", "PROJECT_NAME = None\n", "\n", "assert PROJECT_NAME is not None and isinstance(\n", " PROJECT_NAME, str\n", "), \"Please specify the project name as string\""]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import boto3\n", "from IPython.core.display import HTML, display\n", "\n", "\n", "def get_provisioned_product_name(project_name):\n", " region = boto3.Session().region_name\n", " sc = boto3.client(\"servicecatalog\")\n", " products = sc.search_provisioned_products(\n", " Filters={\n", " \"SearchQuery\": [\n", " project_name,\n", " ]\n", " }\n", " )\n", " pp = products[\"ProvisionedProducts\"]\n", " if len(pp) != 1:\n", " print(\"Invalid provisioned product name. 
Open the link below and search manually\")\n", " display(\n", " HTML(\n", " f'Service Catalog'\n", " )\n", " )\n", " raise ValueError(\"Invalid provisioned product\")\n", "\n", " return pp[0][\"Name\"]\n", "\n", "\n", "PROVISIONED_PRODUCT_NAME = get_provisioned_product_name(PROJECT_NAME)\n", "print(\n", " f\"The Service Catalog Provisioned Product Name associated with this SageMaker project: {PROVISIONED_PRODUCT_NAME}\"\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In case of any errors, you can examine the Service Catalog console from the above link, find the associated provisioned product name (which looks something like `example-p-1v7hbpwe594n`), and assign it to `PROVISIONED_PRODUCT_NAME` manually."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Import the latest sagemaker and boto3 SDKs.\n", "import sys\n", "\n", "!{sys.executable} -m pip install --upgrade pip\n", "!{sys.executable} -m pip install -qU awscli boto3 \"sagemaker>=2.1.0,<3\" tqdm\n", "!{sys.executable} -m pip install -qU \"stepfunctions==2.0.0\"\n", "!{sys.executable} -m pip show sagemaker stepfunctions"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Restart your SageMaker kernel, then continue with this notebook."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Data Prep\n", " \n", "In this section of the notebook, you will download the publicly available New York Taxi dataset in preparation for uploading it to S3.\n", "\n", "### Download Dataset\n", "\n", "First, download a sample of the New York City Taxi [dataset](https://registry.opendata.aws/nyc-tlc-trip-records-pds/)\u21d7 to this notebook instance. This dataset contains information on trips taken by taxis and for-hire vehicles in New York City, including pick-up and drop-off times and locations, fares, distance traveled, and more. 
"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["!aws s3 cp 's3://nyc-tlc/trip data/green_tripdata_2018-02.csv' 'nyc-tlc.csv'"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now load the dataset into a pandas data frame, taking care to parse the dates correctly."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import pandas as pd\n", "\n", "parse_dates = [\"lpep_dropoff_datetime\", \"lpep_pickup_datetime\"]\n", "trip_df = pd.read_csv(\"nyc-tlc.csv\", parse_dates=parse_dates)\n", "\n", "trip_df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Data manipulation\n", "\n", "Instead of the raw date and time features for pick-up and drop-off, let's use these features to calculate the total time of the trip in minutes, which will be easier to work with for our model."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["trip_df[\"duration_minutes\"] = (\n", " trip_df[\"lpep_dropoff_datetime\"] - trip_df[\"lpep_pickup_datetime\"]\n", ").dt.seconds / 60"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The dataset contains a lot of columns we don't need, so let's select a sample of columns for our machine learning model. Keep only `total_amount` (fare), `duration_minutes`, `passenger_count`, and `trip_distance`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["cols = [\"total_amount\", \"duration_minutes\", \"passenger_count\", \"trip_distance\"]\n", "data_df = trip_df[cols]\n", "print(data_df.shape)\n", "data_df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Generate some quick statistics for the dataset to understand the quality."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["data_df.describe()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The table above shows some clear outliers, e.g. 
-400 or 2626 as fare, or 0 passengers. There are many intelligent methods for identifying and removing outliers, but data cleaning is not the focus of this notebook, so just remove the outliers by setting some min and max values which seem more reasonable. Removing the outliers results in a final dataset of 754,671 rows."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["data_df = data_df[\n", " (data_df.total_amount > 0)\n", " & (data_df.total_amount < 200)\n", " & (data_df.duration_minutes > 0)\n", " & (data_df.duration_minutes < 120)\n", " & (data_df.trip_distance > 0)\n", " & (data_df.trip_distance < 121)\n", " & (data_df.passenger_count > 0)\n", "].dropna()\n", "print(data_df.shape)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Data visualization\n", "\n", "Since this notebook will build a regression model for the taxi data, it's a good idea to check if there is any correlation between the variables in our data. Use scatter plots on a sample of the data to compare trip distance with duration in minutes, and total amount (fare) with duration in minutes."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import seaborn as sns\n", "\n", "sample_df = data_df.sample(1000)\n", "sns.scatterplot(data=sample_df, x=\"duration_minutes\", y=\"trip_distance\")"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["sns.scatterplot(data=sample_df, x=\"duration_minutes\", y=\"total_amount\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["These scatter plots look fine and show at least some correlation between our variables. \n", "\n", "### Data splitting and saving\n", "\n", "We are now ready to split the dataset into train, validation, and test sets. 
"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "\n", "train_df, val_df = train_test_split(data_df, test_size=0.20, random_state=42)\n", "val_df, test_df = train_test_split(val_df, test_size=0.05, random_state=42)\n", "\n", "# Reset the index for our test dataframe\n", "test_df.reset_index(inplace=True, drop=True)\n", "\n", "print(\n", " \"Size of\\n train: {},\\n val: {},\\n test: {} \".format(\n", " train_df.shape[0], val_df.shape[0], test_df.shape[0]\n", " )\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Save the train, validation, and test files as CSV locally on this notebook instance. Notice that you save the train file twice - once as the training data file and once as the baseline data file. The baseline data file will be used by [SageMaker Model Monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html)\u21d7 to detect data drift. Data drift occurs when the statistical nature of the data that your model receives while in production drifts away from the nature of the baseline data it was trained on, which means the model begins to lose accuracy in its predictions."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["train_cols = [\"total_amount\", \"duration_minutes\", \"passenger_count\", \"trip_distance\"]\n", "train_df.to_csv(\"train.csv\", index=False, header=False)\n", "val_df.to_csv(\"validation.csv\", index=False, header=False)\n", "test_df.to_csv(\"test.csv\", index=False, header=False)\n", "\n", "# Save test and baseline with headers\n", "train_df.to_csv(\"baseline.csv\", index=False, header=True)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now upload these CSV files to your default SageMaker S3 bucket. 
"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import sagemaker\n", "\n", "# Get the session and default bucket\n", "session = sagemaker.session.Session()\n", "bucket = session.default_bucket()\n", "\n", "# Specify data prefix and version\n", "prefix = \"nyc-tlc/v1\"\n", "\n", "s3_train_uri = session.upload_data(\"train.csv\", bucket, prefix + \"/data/training\")\n", "s3_val_uri = session.upload_data(\"validation.csv\", bucket, prefix + \"/data/validation\")\n", "s3_test_uri = session.upload_data(\"test.csv\", bucket, prefix + \"/data/test\")\n", "s3_baseline_uri = session.upload_data(\"baseline.csv\", bucket, prefix + \"/data/baseline\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["You will use the datasets which you have prepared and saved in this section to trigger the pipeline to train and deploy a model in the next section."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Build\n", "\n", "If you navigate to the CodePipeline instance created for this workshop, you will notice that the Source stage is initially in a `Failed` state. This happens because the dataset, which is one of the sources that can trigger the pipeline, has not yet been uploaded to the S3 location expected by the pipeline.\n", "\n", "![Failed code pipeline](../docs/pipeline_failed.png)\n", "\n", "### Trigger Build\n", "\n", "In this section, you will start a model build and deployment pipeline by packaging up the datasets you prepared in the previous section and uploading these to the S3 source location which triggers the CodePipeline instance created for this workshop. \n", "\n", "\n", "First, import some libraries and load some environment variables which you will need. 
These environment variables have been set through a [lifecycle configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html)\u21d7 script attached to this notebook."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import boto3\n", "from botocore.exceptions import ClientError\n", "import os\n", "import time\n", "\n", "\n", "def get_config(provisioned_product_name):\n", " sc = boto3.client(\"servicecatalog\")\n", " outputs = sc.get_provisioned_product_outputs(ProvisionedProductName=provisioned_product_name)[\n", " \"Outputs\"\n", " ]\n", " config = {}\n", " for out in outputs:\n", " config[out[\"OutputKey\"]] = out[\"OutputValue\"]\n", " return config\n", "\n", "\n", "config = get_config(PROVISIONED_PRODUCT_NAME)\n", "region = config[\"Region\"]\n", "artifact_bucket = config[\"ArtifactBucket\"]\n", "pipeline_name = config[\"PipelineName\"]\n", "model_name = config[\"ModelName\"]\n", "workflow_pipeline_arn = config[\"WorkflowPipelineARN\"]\n", "\n", "print(\"region: {}\".format(region))\n", "print(\"artifact bucket: {}\".format(artifact_bucket))\n", "print(\"pipeline: {}\".format(pipeline_name))\n", "print(\"model name: {}\".format(model_name))\n", "print(\"workflow: {}\".format(workflow_pipeline_arn))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["From the AWS CodePipeline [documentation](https://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-simple-s3.html)\u21d7:\n", "\n", "> When Amazon S3 is the source provider for your pipeline, you may zip your source file or files into a single .zip and upload the .zip to your source bucket. You may also upload a single unzipped file; however, downstream actions that expect a .zip file will fail.\n", "\n", "To train a model, you need multiple datasets (train, validation, and test) along with a file specifying the hyperparameters. 
In this example, you will create one JSON file which contains the S3 dataset locations and one JSON file which contains the hyperparameter values. Then you compress both files into a zip package to be used as input for the pipeline run. "]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from io import BytesIO\n", "import zipfile\n", "import json\n", "\n", "input_data = {\n", " \"TrainingUri\": s3_train_uri,\n", " \"ValidationUri\": s3_val_uri,\n", " \"TestUri\": s3_test_uri,\n", " \"BaselineUri\": s3_baseline_uri,\n", "}\n", "\n", "hyperparameters = {\"num_round\": 50}\n", "\n", "zip_buffer = BytesIO()\n", "with zipfile.ZipFile(zip_buffer, \"a\") as zf:\n", " zf.writestr(\"inputData.json\", json.dumps(input_data))\n", " zf.writestr(\"hyperparameters.json\", json.dumps(hyperparameters))\n", "zip_buffer.seek(0)\n", "\n", "data_source_key = \"{}/data-source.zip\".format(pipeline_name)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now upload the zip package to your artifact S3 bucket - this action will trigger the pipeline to train and deploy a model."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["s3 = boto3.client(\"s3\")\n", "s3.put_object(Bucket=artifact_bucket, Key=data_source_key, Body=bytearray(zip_buffer.read()))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Click the link below to open the AWS console at the Code Pipeline if you don't have it open in another tab.\n", "\n", "
\n", " Tip: You may need to wait a minute to see the DataSource stage turn green. The page will refresh automatically.\n", "
\n", "\n", "![Source Green](../docs/datasource-after.png)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from IPython.core.display import HTML\n", "\n", "HTML(\n", " 'Code Pipeline'.format(\n", " region, pipeline_name\n", " )\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Inspect Build Logs\n", "\n", "Once the build stage is running, you will see the AWS CodeBuild job turn blue with a status of **In progress**.\n", "\n", "![Failed code pipeline](../docs/codebuild-inprogress.png)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["You can click on the **Details** link displayed in the CodePipeline UI or click the link below to jump directly to the CodeBuild logs.\n", "\n", "
\n", " Tip: You may need to wait a few seconds for the pipeline to transition into the active (blue) state and for the build to start.\n", "
"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["codepipeline = boto3.client(\"codepipeline\")\n", "\n", "\n", "def get_pipeline_stage(pipeline_name, stage_name):\n", " response = codepipeline.get_pipeline_state(name=pipeline_name)\n", " for stage in response[\"stageStates\"]:\n", " if stage[\"stageName\"] == stage_name:\n", " return stage\n", "\n", "\n", "# Get last execution id\n", "build_stage = get_pipeline_stage(pipeline_name, \"Build\")\n", "if not \"latestExecution\" in build_stage:\n", " raise (Exception(\"Please wait. Build not started\"))\n", "\n", "build_url = build_stage[\"actionStates\"][0][\"latestExecution\"][\"externalExecutionUrl\"]\n", "\n", "# Output a link to the code build logs\n", "HTML('Code Build Logs'.format(build_url))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The AWS CodeBuild process is responsible for creating a number of AWS CloudFormation templates which we will explore in more detail in the next section. Two of these templates are used to set up the **Train** step by creating the AWS Step Functions workflow and the custom AWS Lambda functions used within this workflow."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Train Model\n", "\n", "### Inspect Training Job\n", "\n", "Wait until the pipeline has started running the Train step (see screenshot) before continuing with the next cells in this notebook. \n", "\n", "![Training in progress](../docs/train-in-progress.png)\n", "\n", "When the pipeline has started running the train step, you can click on the **Details** link displayed in the CodePipeline UI (see screenshot above) to view the Step Functions workflow which is running the training job. 
\n", "\n", "Alternatively, you can click on the Workflow link from the cell output below once it's available."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from stepfunctions.workflow import Workflow\n", "\n", "while True:\n", " try:\n", " workflow = Workflow.attach(workflow_pipeline_arn)\n", " break\n", " except ClientError as e:\n", " print(e.response[\"Error\"][\"Message\"])\n", " time.sleep(10)\n", "\n", "workflow"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Review Build Script\n", "\n", "While you wait for the training job to complete, let's take a look at the `run_pipeline.py` code which was used by the AWS CodeBuild process.\n", "\n", "This script takes all of the input parameters, including the dataset locations and hyperparameters which you saved to JSON files earlier in this notebook, and uses them to generate the templates which the pipeline needs to run the training job. It *does not* create the actual Step Functions instance - it only generates the templates which define the Step Functions workflow, as well as the CloudFormation input templates which CodePipeline uses to instantiate the Step Functions instance.\n", "\n", "Step-by-step, the script does the following:\n", "\n", "1. It collects all the input parameters it needs to generate the templates. This includes information about the environment container needed to run the training job, the input and output data locations, IAM roles needed by various components, encryption keys, and more. It then sets up some basic parameters like the AWS region and the function names.\n", "1. If the input parameters specify an environment container stored in ECR, it fetches that container. Otherwise, it fetches the URI of the AWS managed environment container needed for the training job.\n", "1. 
It reads the input data JSON file which you generated earlier in this notebook (and which was included in the zip source for the pipeline), thereby fetching the locations of the train, validation, and baseline data files. Then it formats more parameters which will be needed later in the script, including version IDs and output data locations.\n", "1. It reads the hyperparameter JSON file which you generated earlier in this notebook.\n", "1. It defines the Step Functions workflow, starting with the input schema, followed by each step of the workflow (i.e. Create Experiment, Baseline Job, Training Job), and finally combines those steps into a workflow graph. \n", "1. The workflow graph is saved to file, along with a file containing all of the input parameters saved according to the schema defined in the workflow.\n", "1. It saves parameters to file which will be used by CloudFormation to instantiate the Step Functions workflow."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["!pygmentize ../model/run_pipeline.py"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Customize Workflow (Optional)\n", "\n", "If you are interested in customising the workflow used in the Build Script, store the `input_data` to be used within the local [workflow.ipynb](workflow.ipynb) notebook. The workflow notebook can be used to experiment with the Step Functions workflow and training job definitions for your model."]}, {"cell_type": "code", "execution_count": null, "metadata": {"scrolled": true}, "outputs": [], "source": ["%store input_data PROVISIONED_PRODUCT_NAME"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Training Analytics\n", "\n", "Once the training and baseline jobs are complete (meaning they are displayed in a green color in the Step Functions workflow, this takes around 5 minutes), you can inspect the experiment metrics. The code below will display all experiments in a table. 
Note that the baseline processing job won't have RMSE metrics - it calculates metrics based on the training data, but does not train a machine learning model. \n", "\n", "You will [explore the baseline](#Explore-Baseline) results later in this notebook. "]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from sagemaker import analytics\n", "\n", "experiment_name = \"mlops-{}\".format(model_name)\n", "model_analytics = analytics.ExperimentAnalytics(experiment_name=experiment_name)\n", "analytics_df = model_analytics.dataframe()\n", "\n", "if analytics_df.shape[0] == 0:\n", " raise (Exception(\"Please wait. No training or baseline jobs\"))\n", "\n", "pd.set_option(\"display.max_colwidth\", 100) # Increase column width to show full component name\n", "cols = [\n", " \"TrialComponentName\",\n", " \"DisplayName\",\n", " \"SageMaker.InstanceType\",\n", " \"train:rmse - Last\",\n", " \"validation:rmse - Last\",\n", "] # return the last rmse for training and validation\n", "analytics_df[analytics_df.columns.intersection(cols)].head(2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Deploy Dev\n", "\n", "### Test Dev Deployment\n", "\n", "When the pipeline has finished training a model, it automatically moves to the next step, where the model is deployed as a SageMaker Endpoint. This endpoint is part of your dev deployment, so in this section you will run some tests on the endpoint to decide if you want to deploy this model into production.\n", "\n", "First, run the cell below to fetch the name of the SageMaker Endpoint."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["codepipeline = boto3.client(\"codepipeline\")\n", "\n", "deploy_dev = get_pipeline_stage(pipeline_name, \"DeployDev\")\n", "if not \"latestExecution\" in deploy_dev:\n", " raise (Exception(\"Please wait. 
Deploy dev not started\"))\n", "\n", "execution_id = deploy_dev[\"latestExecution\"][\"pipelineExecutionId\"]\n", "dev_endpoint_name = \"mlops-{}-dev-{}\".format(model_name, execution_id)\n", "\n", "print(\"endpoint name: {}\".format(dev_endpoint_name))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If you moved through the previous section very quickly, you will need to wait until the dev endpoint has been successfully deployed and the pipeline is waiting for approval to deploy to production (see screenshot). It can take up to 10 minutes for SageMaker to create an endpoint.\n", "\n", "![Deploying dev endpoint in code pipeline](../docs/dev-deploy-ready.png)\n", "\n", "Alternatively, run the code below to check the status of your endpoint. Wait until the status of the endpoint is 'InService'."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["sm = boto3.client(\"sagemaker\")\n", "\n", "while True:\n", " try:\n", " response = sm.describe_endpoint(EndpointName=dev_endpoint_name)\n", " print(\"Endpoint status: {}\".format(response[\"EndpointStatus\"]))\n", " if response[\"EndpointStatus\"] == \"InService\":\n", " break\n", " except ClientError as e:\n", " print(e.response[\"Error\"][\"Message\"])\n", " time.sleep(10)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now that your endpoint is ready, let's write some code to run the test data (which you split off from the dataset and saved to file at the start of this notebook) through the endpoint for inference. 
The code below uses v2 of the SageMaker SDK (the `Predictor` and `CSVSerializer` imports are only available in v2), which we recommend using in all of your future projects."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import numpy as np\n", "from tqdm import tqdm\n", "\n", "from sagemaker.predictor import Predictor\n", "from sagemaker.serializers import CSVSerializer\n", "\n", "\n", "def get_predictor(endpoint_name):\n", " xgb_predictor = Predictor(endpoint_name)\n", " xgb_predictor.serializer = CSVSerializer()\n", " return xgb_predictor\n", "\n", "\n", "def predict(predictor, data, rows=500):\n", " split_array = np.array_split(data, round(data.shape[0] / float(rows)))\n", " predictions = \"\"\n", " for array in tqdm(split_array):\n", " predictions = \",\".join([predictions, predictor.predict(array).decode(\"utf-8\")])\n", " return np.fromstring(predictions[1:], sep=\",\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now use the `predict` function, which was defined in the code above, to run the test data through the endpoint and generate the predictions."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["dev_predictor = get_predictor(dev_endpoint_name)\n", "predictions = predict(dev_predictor, test_df[test_df.columns[1:]].values)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Next, load the predictions into a data frame, and join it with your test data. Then, calculate absolute error as the difference between the actual taxi fare and the predicted taxi fare. 
Display the results in a table, sorted by the highest absolute error values."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["pred_df = pd.DataFrame({\"total_amount_predictions\": predictions})\n", "pred_df = test_df.join(pred_df) # Join on all\n", "pred_df[\"error\"] = abs(pred_df[\"total_amount\"] - pred_df[\"total_amount_predictions\"])\n", "\n", "pred_df.sort_values(\"error\", ascending=False).head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["From this table, we note that some short trip distances have large errors because the low predicted fare does not match the high actual fare. This could be the result of a generous tip which we haven't included in this dataset.\n", "\n", "You can also analyze the results by plotting the absolute error to visualize outliers. In this graph, we see that most of the outliers are cases where the model predicted a much lower fare than the actual fare. There are only a few outliers where the model predicted a higher fare than the actual fare."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["sns.scatterplot(data=pred_df, x=\"total_amount_predictions\", y=\"total_amount\", hue=\"error\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If you want one overall measure of quality for the model, you can calculate the root mean square error (RMSE) for the predicted fares compared to the actual fares. 
Compare this to the [results calculated on the validation set](#validation-results) at the end of the 'Inspect Training Job' section."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from math import sqrt\n", "from sklearn.metrics import mean_squared_error\n", "\n", "\n", "def rmse(pred_df):\n", " return sqrt(mean_squared_error(pred_df[\"total_amount\"], pred_df[\"total_amount_predictions\"]))\n", "\n", "\n", "print(\"RMSE: {}\".format(rmse(pred_df)))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Deploy Prod\n", "\n", "### Approve Deployment to Production\n", "\n", "If you are happy with the results of the model, you can go ahead and approve the model to be deployed into production. You can do so by clicking the **Review** button in the CodePipeline UI, leaving a comment to explain why you approve this model, and clicking on **Approve**. \n", "\n", "Alternatively, you can create a Jupyter widget which (when enabled) allows you to comment and approve the model directly from this notebook. Run the cell below to see this in action."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import ipywidgets as widgets\n", "\n", "\n", "def on_click(obj):\n", " result = {\"summary\": approval_text.value, \"status\": obj.description}\n", " response = codepipeline.put_approval_result(\n", " pipelineName=pipeline_name,\n", " stageName=\"DeployDev\",\n", " actionName=\"ApproveDeploy\",\n", " result=result,\n", " token=approval_action[\"token\"],\n", " )\n", " button_box.close()\n", " print(result)\n", "\n", "\n", "# Create the widget if we are ready for approval\n", "deploy_dev = get_pipeline_stage(pipeline_name, \"DeployDev\")\n", "if not \"latestExecution\" in deploy_dev[\"actionStates\"][-1]:\n", " raise (Exception(\"Please wait. 
Deploy dev not complete\"))\n", "\n", "approval_action = deploy_dev[\"actionStates\"][-1][\"latestExecution\"]\n", "if approval_action[\"status\"] == \"Succeeded\":\n", " print(\"Dev approved: {}\".format(approval_action[\"summary\"]))\n", "elif \"token\" in approval_action:\n", " approval_text = widgets.Text(placeholder=\"Optional approval message\")\n", " approve_btn = widgets.Button(description=\"Approved\", button_style=\"success\", icon=\"check\")\n", " reject_btn = widgets.Button(description=\"Rejected\", button_style=\"danger\", icon=\"close\")\n", " approve_btn.on_click(on_click)\n", " reject_btn.on_click(on_click)\n", " button_box = widgets.HBox([approval_text, approve_btn, reject_btn])\n", " display(button_box)\n", "else:\n", " raise (Exception(\"Please wait. No dev approval\"))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Test Production Deployment\n", "\n", "Within about a minute after approving the model deployment, you should see the pipeline start on the final step: deploying your model into production. In this section, you will check the deployment status and test the production endpoint after it has been deployed.\n", "\n", "![Deploy production endpoint in code pipeline](../docs/deploy-production.png)\n", "\n", "This step of the pipeline uses CloudFormation to deploy a number of resources on your behalf. In particular, it creates:\n", "\n", "1. A production-ready SageMaker Endpoint for your model, with [data capture](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-capture.html)\u21d7 (used by SageMaker Model Monitor) and [autoscaling](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html)\u21d7 enabled.\n", "1. 
A [model monitoring schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-scheduling.html)\u21d7 which outputs the results to CloudWatch metrics, along with a [CloudWatch Alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html)\u21d7 which will notify you when a violation occurs. \n", "1. A CodeDeploy instance which creates a simple app by deploying API Gateway, three Lambda functions, and an alarm to notify of the success or failure of this deployment. The code for the Lambda functions can be found in `api/app.py`, `api/pre_traffic_hook.py`, and `api/post_traffic_hook.py`. These functions update the endpoint to enable data capture, format and submit incoming traffic to the SageMaker endpoint, and capture the data logs.\n", "\n", "![Components of production deployment](../docs/cloud-formation.png)\n", "\n", "Let's check how the deployment is progressing. Use the code below to fetch the execution ID of the deployment step. Then generate a table which lists the resources created by the CloudFormation stack and their creation status. You can re-run the cell after a few minutes to see how the steps are progressing."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["deploy_prd = get_pipeline_stage(pipeline_name, \"DeployPrd\")\n", "if not \"latestExecution\" in deploy_prd or not \"latestExecution\" in deploy_prd[\"actionStates\"][0]:\n", " raise (Exception(\"Please wait. 
Deploy prd not started\"))\n", "\n", "execution_id = deploy_prd[\"latestExecution\"][\"pipelineExecutionId\"]"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from datetime import datetime, timedelta\n", "from dateutil.tz import tzlocal\n", "\n", "\n", "def get_event_dataframe(events):\n", " stack_cols = [\n", " \"LogicalResourceId\",\n", " \"ResourceStatus\",\n", " \"ResourceStatusReason\",\n", " \"Timestamp\",\n", " ]\n", " stack_event_df = pd.DataFrame(events)[stack_cols].fillna(\"\")\n", " stack_event_df[\"TimeAgo\"] = datetime.now(tzlocal()) - stack_event_df[\"Timestamp\"]\n", " return stack_event_df.drop(\"Timestamp\", axis=1)\n", "\n", "\n", "cfn = boto3.client(\"cloudformation\")\n", "\n", "stack_name = \"{}-deploy-prd\".format(pipeline_name)\n", "print(\"stack name: {}\".format(stack_name))\n", "\n", "# Get latest stack events\n", "while True:\n", " try:\n", " response = cfn.describe_stack_events(StackName=stack_name)\n", " break\n", " except ClientError as e:\n", " print(e.response[\"Error\"][\"Message\"])\n", " time.sleep(10)\n", "\n", "get_event_dataframe(response[\"StackEvents\"]).head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The resource of most interest to us is the endpoint. This takes on average 10 minutes to deploy. In the meantime, you can take a look at the Python code used for the application. \n", "\n", "The `app.py` is the main entry point invoking the Amazon SageMaker endpoint. 
It returns results along with a custom header for the endpoint we invoked."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["!pygmentize ../api/app.py"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The `pre_traffic_hook.py` Lambda is invoked prior to deployment and confirms the endpoint has data capture enabled."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["!pygmentize ../api/pre_traffic_hook.py"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The `post_traffic_hook.py` Lambda is invoked to perform any final checks, in this case to verify that we have received log data from data capture."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["!pygmentize ../api/post_traffic_hook.py"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Use the code below to fetch the name of the endpoint, then run a loop to wait for the endpoint to be fully deployed. 
You need the status to be 'InService'."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["prd_endpoint_name = \"mlops-{}-prd-{}\".format(model_name, execution_id)\n", "print(\"prod endpoint: {}\".format(prd_endpoint_name))"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["sm = boto3.client(\"sagemaker\")\n", "\n", "while True:\n", " try:\n", " response = sm.describe_endpoint(EndpointName=prd_endpoint_name)\n", " print(\"Endpoint status: {}\".format(response[\"EndpointStatus\"]))\n", " # Wait until the endpoint is in service with data capture enabled\n", " if (\n", " response[\"EndpointStatus\"] == \"InService\"\n", " and \"DataCaptureConfig\" in response\n", " and response[\"DataCaptureConfig\"][\"EnableCapture\"]\n", " ):\n", " break\n", " except ClientError as e:\n", " print(e.response[\"Error\"][\"Message\"])\n", " time.sleep(10)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["When the endpoint status is 'InService', you can continue. Earlier in this notebook, you created some code to send data to the dev endpoint. Reuse this code now to send a sample of the test data to the production endpoint. Since data capture is enabled on this endpoint, you want to send single records at a time, so the model monitor can map these records to the baseline. \n", "\n", "You will [inspect the model monitor](#Inspect-Model-Monitor) later in this notebook. 
For now, just check if you can send data to the endpoint and receive predictions in return."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["prd_predictor = get_predictor(prd_endpoint_name)\n", "sample_values = test_df[test_df.columns[1:]].sample(100).values\n", "predictions = predict(prd_predictor, sample_values, rows=1)\n", "predictions"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Test REST API\n", "\n", "Although you already tested the SageMaker endpoint in the previous section, it is also a good idea to test the application created with API Gateway. \n", "\n", "![Traffic shift between endpoints](../docs/lambda-deploy-create.png)\n", "\n", "Follow the link below to open the Lambda Deployment where you can see the in-progress and completed deployments. You can also click to expand the **SAM template** to see the packaged CloudFormation template used in the deployment."]}, {"cell_type": "code", "execution_count": null, "metadata": {"scrolled": true}, "outputs": [], "source": ["HTML(\n", " 'Lambda Deployment'.format(\n", " region, model_name\n", " )\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Run the code below to confirm that the endpoint is in service. 
It will complete once the REST API is available."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["def get_stack_status(stack_name):\n", " response = cfn.describe_stacks(StackName=stack_name)\n", " if response[\"Stacks\"]:\n", " stack = response[\"Stacks\"][0]\n", " outputs = None\n", " if \"Outputs\" in stack:\n", " outputs = dict([(o[\"OutputKey\"], o[\"OutputValue\"]) for o in stack[\"Outputs\"]])\n", " return stack[\"StackStatus\"], outputs\n", "\n", "\n", "outputs = None\n", "while True:\n", " try:\n", " status, outputs = get_stack_status(stack_name)\n", " response = sm.describe_endpoint(EndpointName=prd_endpoint_name)\n", " print(\"Endpoint status: {}\".format(response[\"EndpointStatus\"]))\n", " if outputs:\n", " break\n", " elif status.endswith(\"FAILED\"):\n", " raise (Exception(\"Stack status: {}\".format(status)))\n", " except ClientError as e:\n", " print(e.response[\"Error\"][\"Message\"])\n", " time.sleep(10)\n", "\n", "if outputs:\n", " print(\"deployment application: {}\".format(outputs[\"DeploymentApplication\"]))\n", " print(\"rest api: {}\".format(outputs[\"RestApi\"]))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If you are performing an update on your production deployment as a result of running [Trigger Retraining](#Trigger-Retraining) you will then be able to expand the Lambda Deployment tab to reveal the resources. Click on the **ApiFunctionAliaslive** link to see the Lambda Deployment in progress. \n", "\n", "![Traffic shift between endpoints](../docs/lambda-deploy-update.png)\n", "\n", "This page will be updated to list the deployment events. 
It also has a link to the Deployment Application which you can access in the output of the next cell."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["HTML(\n", " 'CodeDeploy application'.format(\n", " region, outputs[\"DeploymentApplication\"]\n", " )\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["CodeDeploy will perform a canary deployment and send 10% of the traffic to the new endpoint over a 5-minute period.\n", "\n", "![Traffic shift between endpoints](../docs/code-deploy.gif)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We can invoke the REST API and inspect the headers being returned to see which endpoint we are hitting. You will occasionally see the cell below show a different endpoint that settles to the new version once the stack is complete. "]}, {"cell_type": "code", "execution_count": null, "metadata": {"scrolled": false}, "outputs": [], "source": ["%%time\n", "\n", "from urllib import request\n", "\n", "headers = {\"Content-type\": \"text/csv\"}\n", "payload = test_df[test_df.columns[1:]].head(1).to_csv(header=False, index=False).encode(\"utf-8\")\n", "rest_api = outputs[\"RestApi\"]\n", "\n", "while True:\n", " try:\n", " resp = request.urlopen(request.Request(rest_api, data=payload, headers=headers))\n", " print(\n", " \"Response code: %d: endpoint: %s\"\n", " % (resp.getcode(), resp.getheader(\"x-sagemaker-endpoint\"))\n", " )\n", " status, outputs = get_stack_status(stack_name)\n", " if status.endswith(\"COMPLETE\"):\n", " print(\"Deployment complete\\n\")\n", " break\n", " elif status.endswith(\"FAILED\"):\n", " raise (Exception(\"Stack status: {}\".format(status)))\n", " except ClientError as e:\n", " print(e.response[\"Error\"][\"Message\"])\n", " time.sleep(10)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Monitor\n", "\n", "### Inspect Model Monitor\n", "\n", "When you prepared the datasets for model training at the start of this notebook, you saved a 
baseline dataset (a copy of the train dataset). Then, when you approved the model for deployment into production, the pipeline set up a SageMaker Endpoint with data capture enabled and a model monitoring schedule. In this section, you will take a closer look at the model monitor results.\n", "\n", "To start off, fetch the latest production deployment execution ID."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["deploy_prd = get_pipeline_stage(pipeline_name, \"DeployPrd\")\n", "if not \"latestExecution\" in deploy_prd:\n", " raise (Exception(\"Please wait. Deploy prod not complete\"))\n", "\n", "execution_id = deploy_prd[\"latestExecution\"][\"pipelineExecutionId\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Under the hood, SageMaker model monitor runs in SageMaker processing jobs. Use the execution ID to fetch the names of the processing job and the schedule."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["processing_job_name = \"mlops-{}-pbl-{}\".format(model_name, execution_id)\n", "schedule_name = \"mlops-{}-pms\".format(model_name)\n", "\n", "print(\"processing job name: {}\".format(processing_job_name))\n", "print(\"schedule name: {}\".format(schedule_name))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Explore Baseline\n", "\n", "Now fetch the baseline results from the processing job. This cell will throw an exception if the processing job is not complete - if that happens, just wait several minutes and try again. 
"]}, {"cell_type": "code", "execution_count": null, "metadata": {"scrolled": true}, "outputs": [], "source": ["import sagemaker\n", "from sagemaker.model_monitor import BaseliningJob, MonitoringExecution\n", "from sagemaker.s3 import S3Downloader\n", "\n", "sagemaker_session = sagemaker.Session()\n", "baseline_job = BaseliningJob.from_processing_name(sagemaker_session, processing_job_name)\n", "status = baseline_job.describe()[\"ProcessingJobStatus\"]\n", "if status != \"Completed\":\n", " raise (Exception(\"Please wait. Processing job not complete, status: {}\".format(status)))\n", "\n", "baseline_results_uri = baseline_job.outputs[0].destination"]}, {"cell_type": "markdown", "metadata": {}, "source": ["SageMaker model monitor generates two types of files. Take a look at the statistics file first. It calculates various statistics for each feature of the dataset, including the mean, standard deviation, minimum value, maximum value, and more. "]}, {"cell_type": "code", "execution_count": null, "metadata": {"scrolled": true}, "outputs": [], "source": ["import pandas as pd\n", "import json\n", "\n", "baseline_statistics = baseline_job.baseline_statistics().body_dict\n", "schema_df = pd.json_normalize(baseline_statistics[\"features\"])\n", "schema_df[\n", " [\n", " \"name\",\n", " \"numerical_statistics.mean\",\n", " \"numerical_statistics.std_dev\",\n", " \"numerical_statistics.min\",\n", " \"numerical_statistics.max\",\n", " ]\n", "].head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now look at the suggested [constraints files](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-byoc-constraints.html)\u21d7. As the name implies, these are constraints which SageMaker model monitor recommends. If the live data which is sent to your production SageMaker Endpoint violates these constraints, this indicates data drift, and model monitor can raise an alert to trigger retraining. 
Of course, you can set different constraints based on the statistics which you viewed previously."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["baseline_constraints = baseline_job.suggested_constraints().body_dict\n", "constraints_df = pd.json_normalize(baseline_constraints[\"features\"])\n", "constraints_df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### View data capture\n", "\n", "When the \"Deploy Production\" stage of the MLOps pipeline deploys a SageMaker endpoint, it also enables data capture. This means the incoming requests to the endpoint, as well as the results from the ML model, are stored in an S3 location. Model monitor can analyze this data and compare it to the baseline to ensure that no constraints are violated. \n", "\n", "Use the code below to check how many files have been created by the data capture, and view the latest file in detail. Note, data capture relies on data being sent to the production endpoint. 
If you don't see any files yet, wait several minutes and try again."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["bucket = sagemaker_session.default_bucket()\n", "data_capture_logs_uri = \"s3://{}/mlops-{}/datacapture/{}\".format(\n", " bucket, model_name, prd_endpoint_name\n", ")\n", "\n", "capture_files = S3Downloader.list(data_capture_logs_uri)\n", "print(\"Found {} files\".format(len(capture_files)))\n", "\n", "if capture_files:\n", " # Get the first line of the most recent file\n", " event = json.loads(S3Downloader.read_file(capture_files[-1]).split(\"\\n\")[0])\n", " print(\"\\nLast file:\\n{}\".format(json.dumps(event, indent=2)))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### View monitoring schedule\n", "\n", "There are some useful functions for plotting and rendering distribution statistics or constraint violations provided in a `utils` file in the [SageMaker Examples GitHub](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker_model_monitor/visualization)\u21d7. Grab a copy of this code to use in this notebook. "]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["!wget -O utils.py --quiet https://raw.githubusercontent.com/awslabs/amazon-sagemaker-examples/master/sagemaker_model_monitor/visualization/utils.py\n", "import utils as mu"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The [minimum scheduled run time](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-scheduling.html)\u21d7 for model monitor is one hour, which means you will need to wait at least an hour to see any results. Use the code below to check the schedule status and list the next run. If you are completing this notebook as part of a workshop, your host will have activities which you can complete while you wait. 
"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["sm = boto3.client(\"sagemaker\")\n", "\n", "response = sm.describe_monitoring_schedule(MonitoringScheduleName=schedule_name)\n", "print(\"Schedule Status: {}\".format(response[\"MonitoringScheduleStatus\"]))\n", "\n", "now = datetime.now(tzlocal())\n", "next_hour = (now + timedelta(hours=1)).replace(minute=0)\n", "scheduled_diff = (next_hour - now).seconds // 60\n", "print(\"Next schedule in {} minutes\".format(scheduled_diff))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["While you wait, you can take a look at the CloudFormation template which is used as a base for the CloudFormation template built by CodeDeploy to deploy the production application. \n", "\n", "Alternatively, you can jump ahead to [Trigger Retraining](#Trigger-Retraining) which will kick off another run of the code pipeline whilst you wait."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["!cat ../assets/deploy-model-prd.yml"]}, {"cell_type": "markdown", "metadata": {}, "source": ["A couple of minutes after the model monitoring schedule has run, you can use the code below to fetch the latest schedule status. A completed schedule run may have found violations. 
"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["processing_job_arn = None\n", "\n", "while processing_job_arn is None:\n", " try:\n", " response = sm.list_monitoring_executions(MonitoringScheduleName=schedule_name)\n", " except ClientError as e:\n", " print(e.response[\"Error\"][\"Message\"])\n", " for mon in response[\"MonitoringExecutionSummaries\"]:\n", " status = mon[\"MonitoringExecutionStatus\"]\n", " now = datetime.now(tzlocal())\n", " created_diff = (now - mon[\"CreationTime\"]).seconds // 60\n", " print(\"Schedule status: {}, Created: {} minutes ago\".format(status, created_diff))\n", " if status in [\"Completed\", \"CompletedWithViolations\"]:\n", " processing_job_arn = mon[\"ProcessingJobArn\"]\n", " break\n", " if status == \"InProgress\":\n", " break\n", " else:\n", " raise (Exception(\"Please wait. No Schedules executing\"))\n", " time.sleep(10)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### View monitoring results\n", "\n", "Once the model monitoring schedule has had a chance to run at least once, you can take a look at the results. First, load the monitoring execution results from the latest scheduled run."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["if processing_job_arn:\n", " execution = MonitoringExecution.from_processing_arn(\n", " sagemaker_session=sagemaker.Session(), processing_job_arn=processing_job_arn\n", " )\n", " exec_inputs = {inp[\"InputName\"]: inp for inp in execution.describe()[\"ProcessingInputs\"]}\n", " exec_results_uri = execution.output.destination\n", "\n", " print(\"Monitoring Execution results: {}\".format(exec_results_uri))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Take a look at the files which have been saved in the S3 output location. 
If violations were found, you should see a constraint violations file in addition to the statistics and constraints file which you viewed before."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["!aws s3 ls $exec_results_uri/"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, fetch the monitoring statistics and violations. Then use the utils code to visualize the results in a table. It will highlight any baseline drift found by the model monitor. Drift can happen for categorical features (for inferred string styles) or for numerical features (e.g. total fare amount)."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Get the baseline and monitoring statistics & violations\n", "baseline_statistics = baseline_job.baseline_statistics().body_dict\n", "execution_statistics = execution.statistics().body_dict\n", "violations = execution.constraint_violations().body_dict[\"violations\"]"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["mu.show_violation_df(\n", " baseline_statistics=baseline_statistics,\n", " latest_statistics=execution_statistics,\n", " violations=violations,\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Trigger Retraining\n", "\n", "The CodePipeline instance is configured with [CloudWatch Events](https://docs.aws.amazon.com/codepipeline/latest/userguide/create-cloudtrail-S3-source.html)\u21d7 to start the pipeline for retraining when the drift detection triggers specific metric alarms.\n", "\n", "You can simulate drift by putting a metric value above the threshold of `0.2` directly into CloudWatch. This will trigger the alarm, and start the code pipeline.\n", "\n", "
\n", " Tip: This alarm is configured only for the latest production endpoint, so re-training will only occur if you are putting metrics against the latest endpoint.\n", "
\n", "\n", "![Metric graph in CloudWatch](../docs/cloudwatch-alarm.png)\n", "\n", "Run the code below to trigger the metric alarm. The cell output will be a link to CloudWatch, where you can see the alarm (similar to the screenshot above), and a link to CodePipeline which you will see run again. Note that it can take a couple of minutes for everything to trigger."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from datetime import datetime\n", "import random\n", "\n", "cloudwatch = boto3.client(\"cloudwatch\")\n", "\n", "# Define the metric name and threshold\n", "metric_name = \"feature_baseline_drift_total_amount\"\n", "metric_threshold = 0.2\n", "\n", "# Put a new metric to trigger an alarm\n", "def put_drift_metric(value):\n", " print(\"Putting metric: {}\".format(value))\n", " response = cloudwatch.put_metric_data(\n", " Namespace=\"aws/sagemaker/Endpoints/data-metrics\",\n", " MetricData=[\n", " {\n", " \"MetricName\": metric_name,\n", " \"Dimensions\": [\n", " {\"Name\": \"MonitoringSchedule\", \"Value\": schedule_name},\n", " {\"Name\": \"Endpoint\", \"Value\": prd_endpoint_name},\n", " ],\n", " \"Timestamp\": datetime.now(),\n", " \"Value\": value,\n", " \"Unit\": \"None\",\n", " },\n", " ],\n", " )\n", "\n", "\n", "def get_drift_stats():\n", " response = cloudwatch.get_metric_statistics(\n", " Namespace=\"aws/sagemaker/Endpoints/data-metrics\",\n", " MetricName=metric_name,\n", " Dimensions=[\n", " {\"Name\": \"MonitoringSchedule\", \"Value\": schedule_name},\n", " {\"Name\": \"Endpoint\", \"Value\": prd_endpoint_name},\n", " ],\n", " StartTime=datetime.now() - timedelta(minutes=2),\n", " EndTime=datetime.now(),\n", " Period=1,\n", " Statistics=[\"Average\"],\n", " Unit=\"None\",\n", " )\n", " if \"Datapoints\" in response and len(response[\"Datapoints\"]) > 0:\n", " return response[\"Datapoints\"][0][\"Average\"]\n", " return 0\n", "\n", "\n", "print(\"Simulate drift on endpoint: 
{}\".format(prd_endpoint_name))\n", "\n", "while True:\n", " put_drift_metric(round(random.uniform(metric_threshold, 1.0), 4))\n", " drift_stats = get_drift_stats()\n", " print(\"Average drift amount: {}\".format(drift_stats))\n", " if drift_stats > metric_threshold:\n", " break\n", " time.sleep(1)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Click through to the Alarm and CodePipeline Execution history with the links below."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Output HTML links to the CloudWatch alarm and the CodePipeline execution history\n", "metric_alarm_name = \"mlops-{}-metric-gt-threshold\".format(model_name)\n", "HTML(\n", " \"\"\"CloudWatch Alarm triggers\n", " Code Pipeline Execution\"\"\".format(\n", " region, metric_alarm_name, pipeline_name\n", " )\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Once the pipeline is running again, you can jump back up to [Inspect Training Job](#Inspect-Training-Job)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Create a CloudWatch dashboard\n", "\n", "Finally, use the code below to create a CloudWatch dashboard to visualize the key performance metrics and alarms which you have created during this demo. The cell will output a link to the dashboard. 
This dashboard shows 9 charts in three rows, where the first row displays Lambda metrics, the second row displays SageMaker metrics, and the third row (shown in the screenshot below) displays the alarms set up for the pipeline.\n", "\n", "![Graphs in CloudWatch dashboard](../docs/cloudwatch-dashboard.png)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from string import Template\n", "\n", "sts = boto3.client(\"sts\")\n", "account_id = sts.get_caller_identity().get(\"Account\")\n", "dashboard_name = \"mlops-{0}-{1}\".format(model_name, config[\"SageMakerProjectId\"])\n", "\n", "with open(\"dashboard.json\") as f:\n", " dashboard_body = Template(f.read()).substitute(\n", " region=region, account_id=account_id, model_name=model_name\n", " )\n", " response = cloudwatch.put_dashboard(DashboardName=dashboard_name, DashboardBody=dashboard_body)\n", "\n", "# Output a html link to the cloudwatch dashboard\n", "HTML(\n", " 'CloudWatch Dashboard'.format(\n", " region, dashboard_name\n", " )\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Congratulations! You have made it to the end of this notebook, and have automated a safe MLOps pipeline using a wide range of AWS services. \n", "\n", "You can use the other notebook in this repository [workflow.ipynb](workflow.ipynb) to implement your own ML model and deploy it as part of this pipeline. Or, if you are finished with the content, follow the instructions in the next section to clean up the resources you have deployed."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Cleanup\n", "\n", "Execute the following cell to delete the stacks created in the pipeline. For a model name of **nyctaxi** these would be:\n", "\n", "1. *nyctaxi*-deploy-prd\n", "2. *nyctaxi*-deploy-dev\n", "3. *nyctaxi*-workflow\n", "4. 
sagemaker-custom-resource"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["cfn = boto3.client(\"cloudformation\")\n", "\n", "# Delete the prd, dev, workflow and custom resource stacks, in that order\n", "for stack_name in [\n", " f\"{pipeline_name}-deploy-prd\",\n", " f\"{pipeline_name}-deploy-dev\",\n", " f\"{pipeline_name}-workflow\",\n", " f\"mlops-{model_name}-{config['SageMakerProjectId']}-sagemaker-custom-resource\",\n", "]:\n", " print(\"Deleting stack: {}\".format(stack_name))\n", " cfn.delete_stack(StackName=stack_name)\n", " cfn.get_waiter(\"stack_delete_complete\").wait(StackName=stack_name)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The following code will delete the dashboard."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["cloudwatch.delete_dashboards(DashboardNames=[dashboard_name])\n", "print(\"Dashboard deleted\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Finally, close this notebook. You can then delete the CloudFormation stack you created to launch this MLOps sample."]}], "metadata": {"kernelspec": {"display_name": "conda_python3", "language": "python", "name": "conda_python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10"}}, "nbformat": 4, "nbformat_minor": 4} \ No newline at end of file diff --git a/notebook/workflow.ipynb b/notebook/workflow.ipynb index aaaa305..8f23fd3 100644 --- a/notebook/workflow.ipynb +++ b/notebook/workflow.ipynb @@ -1,1186 +1 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Workflow\n", - "\n", - "The following notebook contains the step functions workflow definition for training and baseline jobs.\n", - "\n", - "This can be run after you have started the [mlops](mlops.ipynb) build and have stored `input_data`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Import the latest sagemaker, stepfunctions and boto3 SDKs\n", - "import sys\n", - "!{sys.executable} -m pip install --upgrade pip\n", - "!{sys.executable} -m pip install -qU awscli boto3 \"sagemaker>=2.1.0<3\"\n", - "!{sys.executable} -m pip install -qU \"stepfunctions==2.0.0\"\n", - "!{sys.executable} -m pip show sagemaker stepfunctions" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import boto3\n", - "import json\n", - "import os\n", - "import time\n", - "import uuid\n", - "\n", - "import sagemaker\n", - "from sagemaker.image_uris import retrieve \n", - "from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput\n", - "from sagemaker.model_monitor.dataset_format import DatasetFormat\n", - "\n", - "import stepfunctions\n", - "from stepfunctions import steps\n", - "from stepfunctions.inputs import ExecutionInput\n", - "from stepfunctions.workflow import Workflow" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Load variables from environment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "region = boto3.Session().region_name\n", - "role = sagemaker.get_execution_role()\n", - "pipeline_name = os.environ['PIPELINE_NAME']\n", - "model_name = os.environ['MODEL_NAME']\n", - "workflow_role_arn = os.environ['WORKFLOW_ROLE_ARN']\n", - "\n", - "# Define the lambda function names for steps\n", - "create_experiment_function_name = 'mlops-create-experiment'\n", - "query_training_function_name = 'mlops-query-training'\n", - "transform_header_function_name = 'mlops-add-transform-header'\n", - "query_drift_function_name = 'mlops-query-drift'\n", - "\n", - "# Get the session and default bucket\n", - "session = sagemaker.session.Session()\n", - "bucket = session.default_bucket()\n", - 
"\n", - "print('region: {}'.format(region))\n", - "print('pipeline: {}'.format(pipeline_name))\n", - "print('model name: {}'.format(model_name))\n", - "print('bucket: {}'.format(bucket))\n", - "print('sagemaker role: {}'.format(role))\n", - "print('workflow role: {}'.format(workflow_role_arn))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Load the input data from the mlops notebook and print values" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store -r input_data \n", - "input_data " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Specify the training model and transform output base uri" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "output_data = {\n", - " 'ModelOutputUri': 's3://{}/{}/model'.format(bucket, model_name), \n", - "}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Define Training Resources\n", - "\n", - "### Input Schema\n", - "\n", - "Define the input schema for the step functions which can then be used as arguments to resources" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "execution_input = ExecutionInput(\n", - " schema={\n", - " \"GitBranch\": str,\n", - " \"GitCommitHash\": str,\n", - " \"DataVersionId\": str,\n", - " \"ExperimentName\": str,\n", - " \"TrialName\": str,\n", - " \"BaselineJobName\": str,\n", - " \"BaselineOutputUri\": str,\n", - " \"TrainingJobName\": str,\n", - " \"ModelName\": str\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Define the model monitor baseline\n", - "\n", - "Define the environment variables" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset_format = DatasetFormat.csv()\n", - "env = {\n", - " 
\"dataset_format\": json.dumps(dataset_format),\n", - " \"dataset_source\": \"/opt/ml/processing/input/baseline_dataset_input\",\n", - " \"output_path\": \"/opt/ml/processing/output\",\n", - " \"publish_cloudwatch_metrics\": \"Disabled\", # Have to be disabled from processing job?\n", - "}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Define the processing inputs and outputs " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "inputs = [\n", - " ProcessingInput(\n", - " source=input_data['BaselineUri'],\n", - " destination=\"/opt/ml/processing/input/baseline_dataset_input\",\n", - " input_name=\"baseline_dataset_input\",\n", - " ),\n", - "]\n", - "outputs = [\n", - " ProcessingOutput(\n", - " source=\"/opt/ml/processing/output\",\n", - " destination=execution_input[\"BaselineOutputUri\"],\n", - " output_name=\"monitoring_output\",\n", - " ),\n", - "]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create the baseline processing job using the sagemaker [model monitor](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_monitoring.html) container." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Get the default model monitor container\n", - "region = boto3.Session().region_name\n", - "monor_monitor_container_uri = retrieve(region=region, framework=\"model-monitor\", version=\"latest\")\n", - "\n", - "# Use the base processing where we pass through the \n", - "monitor_analyzer = Processor(\n", - " image_uri=monor_monitor_container_uri,\n", - " role=role, \n", - " instance_count=1,\n", - " instance_type=\"ml.m5.xlarge\",\n", - " max_runtime_in_seconds=1800,\n", - " env=env\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Test the model baseline processing job by running inline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#monitor_analyzer.run(inputs=inputs, outputs=outputs, wait=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Defining the Training Job\n", - "\n", - "Define the training job to run in paralell with the processing job" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "image_uri = sagemaker.image_uris.retrieve(region=region, framework=\"xgboost\", version=\"latest\")\n", - "\n", - "# Create the estimator\n", - "xgb = sagemaker.estimator.Estimator(\n", - " image_uri,\n", - " role,\n", - " instance_count=1,\n", - " instance_type=\"ml.m4.xlarge\",\n", - " output_path=output_data['ModelOutputUri'], # NOTE: Can't use execution_input here\n", - ")\n", - "\n", - "# Set the hyperparameters overriding with any defaults\n", - "hyperparameters = {\n", - " \"max_depth\": \"9\",\n", - " \"eta\": \"0.2\",\n", - " \"gamma\": \"4\",\n", - " \"min_child_weight\": \"300\",\n", - " \"subsample\": \"0.8\",\n", - " \"objective\": \"reg:linear\",\n", - " \"early_stopping_rounds\": \"10\",\n", - " \"num_round\": \"50\", # Don't stop to early or 
results are bad\n", - "}\n", - "xgb.set_hyperparameters(**hyperparameters)\n", - "\n", - "# Specify the data source\n", - "s3_input_train = sagemaker.inputs.TrainingInput(s3_data=input_data['TrainingUri'], content_type=\"csv\")\n", - "s3_input_val = sagemaker.inputs.TrainingInput(s3_data=input_data['ValidationUri'], content_type=\"csv\")\n", - "data = {\"train\": s3_input_train, \"validation\": s3_input_val}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Test the estimator directly in the notebook" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#xgb.fit(inputs=data)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Define Training Workflow\n", - "\n", - "### 1. Create the Experiment\n", - "\n", - "Define the create experiment lambda.\n", - "\n", - "In future add [ResultsPath](https://docs.aws.amazon.com/step-functions/latest/dg/input-output-resultpath.html) to filter the results." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "create_experiment_step = steps.compute.LambdaStep(\n", - " 'Create Experiment',\n", - " parameters={ \n", - " \"FunctionName\": create_experiment_function_name,\n", - " 'Payload': {\n", - " \"ExperimentName.$\": '$.ExperimentName',\n", - " \"TrialName.$\": '$.TrialName',\n", - " }\n", - " },\n", - " result_path='$.CreateTrialResults'\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2a. 
Run processing Job\n", - "\n", - "Define the processing job with a specific failure handling" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "baseline_step = steps.sagemaker.ProcessingStep(\n", - " \"Baseline Job\",\n", - " processor=monitor_analyzer,\n", - " job_name=execution_input[\"BaselineJobName\"],\n", - " inputs=inputs,\n", - " outputs=outputs,\n", - " experiment_config={\n", - " 'ExperimentName': execution_input[\"ExperimentName\"], # '$.ExperimentName', \n", - " 'TrialName': execution_input[\"TrialName\"],\n", - " 'TrialComponentDisplayName': \"Baseline\",\n", - " },\n", - " tags={\n", - " \"GitBranch\": execution_input[\"GitBranch\"],\n", - " \"GitCommitHash\": execution_input[\"GitCommitHash\"],\n", - " \"DataVersionId\": execution_input[\"DataVersionId\"],\n", - " },\n", - " result_path='$.BaselineJobResults'\n", - ")\n", - "\n", - "baseline_step.add_catch(steps.states.Catch(\n", - " error_equals=[\"States.TaskFailed\"],\n", - " next_step=stepfunctions.steps.states.Fail(\n", - " \"Baseline failed\", cause=\"SageMakerBaselineJobFailed\"\n", - " ),\n", - "))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2b. 
Run and query training Job\n", - "\n", - "Define the training job and add a validation step" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "training_step = steps.TrainingStep(\n", - " \"Training Job\",\n", - " estimator=xgb,\n", - " data=data,\n", - " job_name=execution_input[\"TrainingJobName\"],\n", - " experiment_config={\n", - " 'ExperimentName': execution_input[\"ExperimentName\"],\n", - " 'TrialName': execution_input[\"TrialName\"],\n", - " 'TrialComponentDisplayName': \"Training\",\n", - " },\n", - " tags={\n", - " \"GitBranch\": execution_input[\"GitBranch\"],\n", - " \"GitCommitHash\": execution_input[\"GitCommitHash\"],\n", - " \"DataVersionId\": execution_input[\"DataVersionId\"],\n", - " },\n", - " result_path='$.TrainingResults'\n", - ")\n", - "\n", - "training_step.add_catch(stepfunctions.steps.states.Catch(\n", - " error_equals=[\"States.TaskFailed\"],\n", - " next_step=stepfunctions.steps.states.Fail(\n", - " \"Training failed\", cause=\"SageMakerTrainingJobFailed\"\n", - " ),\n", - "))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a model from the training job, note this must follow training to retrieve the expected model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# Must follow the training test\n", - "model_step = steps.sagemaker.ModelStep(\n", - " 'Save Model',\n", - " input_path='$.TrainingResults',\n", - " model=training_step.get_expected_model(),\n", - " model_name=execution_input['ModelName'],\n", - " result_path='$.ModelStepResults'\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Query training results, and validate that the RMSE error is within an acceptable range " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "training_query_step = 
steps.compute.LambdaStep(\n", - " 'Query Training Results',\n", - " parameters={ \n", - " \"FunctionName\": query_training_function_name,\n", - " 'Payload':{\n", - " \"TrainingJobName.$\": '$.TrainingJobName'\n", - " }\n", - " },\n", - " result_path='$.QueryTrainingResults'\n", - ")\n", - "\n", - "check_accuracy_fail_step = steps.states.Fail(\n", - " 'Model Error Too Low',\n", - " comment='RMSE accuracy higher than threshold'\n", - ")\n", - "\n", - "check_accuracy_succeed_step = steps.states.Succeed('Model Error Acceptable')\n", - "\n", - "# TODO: Update query method to query validation error using better result path\n", - "threshold_rule = steps.choice_rule.ChoiceRule.NumericLessThan(\n", - " variable=training_query_step.output()['QueryTrainingResults']['Payload']['results']['TrainingMetrics'][0]['Value'], value=10\n", - ")\n", - "\n", - "check_accuracy_step = steps.states.Choice(\n", - " 'RMSE < 10'\n", - ")\n", - "\n", - "check_accuracy_step.add_choice(rule=threshold_rule, next_step=check_accuracy_succeed_step)\n", - "check_accuracy_step.default_choice(next_step=check_accuracy_fail_step)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3. Add the Error handling in the workflow\n", - "\n", - "We will use the [Catch Block](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/states.html#stepfunctions.steps.states.Catch) to perform error handling. If the Processing Job Step or Training Step fails, the flow will go into failure state." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sagemaker_jobs = steps.states.Parallel(\"SageMaker Jobs\")\n", - "sagemaker_jobs.add_branch(baseline_step)\n", - "sagemaker_jobs.add_branch(steps.states.Chain([training_step, model_step, training_query_step, check_accuracy_step]))\n", - "\n", - "# Do we need specific failure for the jobs for group?\n", - "sagemaker_jobs.add_catch(stepfunctions.steps.states.Catch(\n", - " error_equals=[\"States.TaskFailed\"],\n", - " next_step=stepfunctions.steps.states.Fail(\n", - " \"SageMaker Jobs failed\", cause=\"SageMakerJobsFailed\"\n", - " ),\n", - "))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Execute Training Workflow\n", - "\n", - "Create the training workflow." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "training_workflow_definition = steps.states.Chain([\n", - " create_experiment_step,\n", - " sagemaker_jobs\n", - "])\n", - "\n", - "training_workflow_name = '{}-training'.format(model_name)\n", - "training_workflow = Workflow(training_workflow_name, training_workflow_definition, workflow_role_arn)\n", - "training_workflow.create()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Render the graph of the workflow as defined by the graph" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "training_workflow.render_graph()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can also inspect the raw workflow definition and verify the execution variables are correctly passed in" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "print(training_workflow.definition.to_json(pretty=True))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " 
Now we define the inputs for the workflow" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define some dummy job and git params\n", - "job_id = uuid.uuid1().hex\n", - "git_branch = 'master'\n", - "git_commit_hash = 'xxx' \n", - "data_verison_id = 'yyy'\n", - "\n", - "# Define the experiment and trial name based on model name and job id\n", - "experiment_name = \"mlops-{}\".format(model_name)\n", - "trial_name = \"mlops-{}-{}\".format(model_name, job_id)\n", - "\n", - "workflow_inputs = {\n", - " \"ExperimentName\": experiment_name,\n", - " \"TrialName\": trial_name,\n", - " \"GitBranch\": git_branch,\n", - " \"GitCommitHash\": git_commit_hash, \n", - " \"DataVersionId\": data_verison_id, \n", - " \"BaselineJobName\": trial_name, \n", - " \"BaselineOutputUri\": f\"s3://{bucket}/{model_name}/monitoring/baseline/mlops-{model_name}-pbl-{job_id}\",\n", - " \"TrainingJobName\": trial_name,\n", - " \"ModelName\": trial_name,\n", - "}\n", - "print(json.dumps(workflow_inputs))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Then execute the workflow" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "execution = training_workflow.execute(\n", - " inputs=workflow_inputs\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Render workflow progress with the [render_progress](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Execution.render_progress).\n", - "\n", - "This generates a snapshot of the current state of your workflow as it executes. Run the cell again to refresh progress or jump to step functions in the console." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "execution.render_progress()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wait for the execution to complete, and output the last step." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "execution_output = execution.get_output(wait=True)\n", - "execution.list_events()[-1]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use [list_events](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Execution.list_events) to list all events in the workflow execution." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# execution.list_events(html=True) # Bug" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Execute Batch Transform\n", - "\n", - "Take the model we have trained and run a batch transform on the validation dataset.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "execution_input = ExecutionInput(\n", - " schema={\n", - " \"GitBranch\": str,\n", - " \"GitCommitHash\": str,\n", - " \"DataVersionId\": str,\n", - " \"ExperimentName\": str,\n", - " \"TrialName\": str,\n", - " \"ModelName\": str,\n", - " \"TransformJobName\": str,\n", - " \"MonitorJobName\": str,\n", - " \"MonitorOutputUri\": str,\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Define some new output paths for the transform and monitoring jobs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "output_data['TransformOutputUri'] = f\"s3://{bucket}/{model_name}/transform/mlops-{model_name}-{job_id}\"\n", - 
"output_data['MonitoringOutputUri'] = f\"s3://{bucket}/{model_name}/monitoring/mlops-{model_name}-{job_id}\"\n", - "output_data['BaselineOutputUri'] = workflow_inputs['BaselineOutputUri']" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. Run the Transform Job\n", - "\n", - "Define a transform job to take the test dataset as input. \n", - "\n", - "We can configured the batch transform to [associate prediction results](https://aws.amazon.com/blogs/machine-learning/associating-prediction-results-with-input-data-using-amazon-sagemaker-batch-transform/) with the input based in the `input_filter` and `output_filter` arguments." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "transform_step = steps.TransformStep(\n", - " 'Transform Input Dataset',\n", - " transformer=xgb.transformer(\n", - " instance_count=1,\n", - " instance_type='ml.m5.large',\n", - " assemble_with='Line', \n", - " accept = 'text/csv',\n", - " output_path=output_data['TransformOutputUri'], # NOTE: Can't use execution_input here\n", - " ),\n", - " job_name=execution_input['TransformJobName'], # TEMP\n", - " model_name=execution_input['ModelName'], \n", - " data=input_data['TestUri'],\n", - " content_type='text/csv',\n", - " split_type='Line',\n", - " input_filter='$[1:]', # Skip the first target column output_amount\n", - " join_source='Input',\n", - " output_filter='$[1:]', # Output all inputs excluding output_amount, followed by the predicted_output_amount\n", - " result_path='$.TransformJobResults'\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. Add the Transform Header\n", - "\n", - "The batch transform output does not include the header, so add this back to be able to run baseline." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "transform_file_name = 'test.csv'\n", - "header = 'duration_minutes,passenger_count,trip_distance,total_amount'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "transform_header_step = steps.compute.LambdaStep(\n", - " 'Add Transform Header',\n", - " parameters={ \n", - " \"FunctionName\": transform_header_function_name,\n", - " 'Payload': {\n", - " \"TransformOutputUri\": output_data['TransformOutputUri'],\n", - " \"FileName\": transform_file_name,\n", - " \"Header\": header,\n", - " }\n", - " },\n", - " result_path='$.TransformHeaderResults'\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3. Run the Model Monitor Processing Job\n", - "\n", - "Create a model monitor processing job that takes the output of the transform job.\n", - "\n", - "Reference the `constraints.json` and `statistics.json` from the output form the training baseline." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset_format = DatasetFormat.csv()\n", - "env = {\n", - " \"dataset_format\": json.dumps(dataset_format),\n", - " \"dataset_source\": \"/opt/ml/processing/input/baseline_dataset_input\",\n", - " \"output_path\": \"/opt/ml/processing/output\",\n", - " \"publish_cloudwatch_metrics\": \"Disabled\", # Have to be disabled from processing job?\n", - " \"baseline_constraints\": \"/opt/ml/processing/baseline/constraints/constraints.json\",\n", - " \"baseline_statistics\": \"/opt/ml/processing/baseline/stats/statistics.json\"\n", - "}\n", - "inputs = [\n", - " ProcessingInput(\n", - " source=os.path.join(output_data['TransformOutputUri'], transform_file_name), # Transform with header\n", - " destination=\"/opt/ml/processing/input/baseline_dataset_input\",\n", - " input_name=\"baseline_dataset_input\",\n", - " ),\n", - " ProcessingInput(\n", - " source=os.path.join(output_data['BaselineOutputUri'], 'constraints.json'),\n", - " destination=\"/opt/ml/processing/baseline/constraints\",\n", - " input_name=\"constraints\",\n", - " ),\n", - " ProcessingInput(\n", - " source=os.path.join(output_data['BaselineOutputUri'], 'statistics.json'),\n", - " destination=\"/opt/ml/processing/baseline/stats\",\n", - " input_name=\"baseline\",\n", - " ),\n", - "]\n", - "outputs = [\n", - " ProcessingOutput(\n", - " source=\"/opt/ml/processing/output\",\n", - " destination=output_data['MonitoringOutputUri'],\n", - " output_name=\"monitoring_output\",\n", - " ),\n", - "]\n", - "\n", - "# Get the default model monitor container\n", - "region = boto3.Session().region_name\n", - "monor_monitor_container_uri = retrieve(region=region, framework=\"model-monitor\", version=\"latest\")\n", - "\n", - "# Use the base processing where we pass through the \n", - "monitor_analyzer = Processor(\n", - " image_uri=monor_monitor_container_uri,\n", - " role=role, \n", - " instance_count=1,\n", - " 
instance_type=\"ml.m5.xlarge\",\n", - " max_runtime_in_seconds=1800,\n", - " env=env\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Test the monitor baseline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# monitor_analyzer.run(inputs=inputs, outputs=outputs, wait=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Add the monitor step" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "monitor_step = steps.sagemaker.ProcessingStep(\n", - " \"Monitor Job\",\n", - " processor=monitor_analyzer,\n", - " job_name=execution_input[\"MonitorJobName\"],\n", - " inputs=inputs,\n", - " outputs=outputs,\n", - " experiment_config={\n", - " 'ExperimentName': execution_input[\"ExperimentName\"],\n", - " 'TrialName': execution_input[\"TrialName\"],\n", - " 'TrialComponentDisplayName': \"Baseline\",\n", - " },\n", - " tags={\n", - " \"GitBranch\": execution_input[\"GitBranch\"],\n", - " \"GitCommitHash\": execution_input[\"GitCommitHash\"],\n", - " \"DataVersionId\": execution_input[\"DataVersionId\"],\n", - " },\n", - " result_path='$.MonitorJobResults'\n", - ")\n", - "\n", - "monitor_step.add_catch(stepfunctions.steps.states.Catch(\n", - " error_equals=[\"States.TaskFailed\"],\n", - " next_step=stepfunctions.steps.states.Fail(\n", - " \"Monitor failed\", cause=\"SageMakerMonitorJobFailed\"\n", - " ),\n", - "))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Add the lambda step to query for violations in the processing job." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "monitor_query_step = steps.compute.LambdaStep(\n", - " 'Query Monitoring Results',\n", - " parameters={ \n", - " \"FunctionName\": query_drift_function_name,\n", - " 'Payload':{\n", - " \"ProcessingJobName.$\": '$.MonitorJobName'\n", - " }\n", - " },\n", - " result_path='$.QueryMonitorResults'\n", - ")\n", - "\n", - "check_violations_fail_step = steps.states.Fail(\n", - " 'Completed with Violations',\n", - " comment='Processing job completed with violations'\n", - ")\n", - "\n", - "check_violations_succeed_step = steps.states.Succeed('Completed')\n", - "\n", - "# TODO: Check specific drift in violations\n", - "status_rule = steps.choice_rule.ChoiceRule.StringEquals(\n", - " variable=monitor_query_step.output()['QueryMonitorResults']['Payload']['results']['ProcessingJobStatus'], value='Completed'\n", - ")\n", - "\n", - "check_violations_step = steps.states.Choice(\n", - " 'Check Violations'\n", - ")\n", - "\n", - "check_violations_step.add_choice(rule=status_rule, next_step=check_violations_succeed_step)\n", - "check_violations_step.default_choice(next_step=check_violations_fail_step)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create the transform workflow definition" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "transform_workflow_definition = steps.states.Chain([\n", - " transform_step,\n", - " transform_header_step,\n", - " monitor_step, \n", - " monitor_query_step, \n", - " check_violations_step\n", - "])\n", - "\n", - "transform_workflow_name = '{}-transform'.format(model_name)\n", - "transform_workflow = Workflow(transform_workflow_name, transform_workflow_definition, workflow_role_arn)\n", - "transform_workflow.create()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Render the graph of the workflow as defined by the 
graph" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "transform_workflow.render_graph()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Define the workflow inputs based on the previous training run" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define unique names for the transform and monitor baseline jobs\n", - "transform_job_name = \"mlops-{}-trn-{}\".format(model_name, job_id)\n", - "monitor_job_name = \"mlops-{}-mbl-{}\".format(model_name, job_id)\n", - "\n", - "workflow_inputs = {\n", - " \"ExperimentName\": experiment_name,\n", - " \"TrialName\": trial_name,\n", - " \"GitBranch\": git_branch,\n", - " \"GitCommitHash\": git_commit_hash, \n", - " \"DataVersionId\": data_verison_id, \n", - " \"ModelName\": trial_name,\n", - " \"TransformJobName\": transform_job_name, \n", - " \"MonitorJobName\": monitor_job_name,\n", - "}\n", - "print(json.dumps(workflow_inputs))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Execute the workflow and render the progress. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "execution = transform_workflow.execute(\n", - " inputs=workflow_inputs\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "execution.render_progress()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wait for the execution to finish and list the last event." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "execution_output = execution.get_output(wait=True)\n", - "execution.list_events()[-1]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Inspect Transform Results\n", - "\n", - "Verify that we can load the transform output with header" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from io import StringIO\n", - "import pandas as pd\n", - "from sagemaker.s3 import S3Downloader\n", - "\n", - "# Get the output, and add header\n", - "transform_output_uri = os.path.join(output_data['TransformOutputUri'], transform_file_name)\n", - "transform_body = S3Downloader.read_file(transform_output_uri)\n", - "pred_df = pd.read_csv(StringIO(transform_body), sep=\",\")\n", - "pred_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Query monitoring output\n", - "\n", - "If this completed with violations, let's inspect the output to see why that is the case." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "violiations_uri = os.path.join(output_data['MonitoringOutputUri'], 'constraint_violations.json')\n", - "violiations = json.loads(S3Downloader.read_file(violiations_uri))\n", - "violiations" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleanup\n", - "\n", - "Delete the workflows that we created as part of this notebook" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "training_workflow.delete()\n", - "transform_workflow.delete()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "conda_python3", - "language": "python", - "name": "conda_python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.13" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} +{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Workflow\n", "\n", "The following notebook contains the step functions workflow definition for training and baseline jobs.\n", "\n", "This can be run after you have started the [mlops](mlops.ipynb) build and have stored `input_data`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Import the latest sagemaker, stepfunctions and boto3 SDKs\n", "import sys\n", "!{sys.executable} -m pip install --upgrade pip\n", "!{sys.executable} -m pip install -qU awscli boto3 \"sagemaker>=2.1.0<3\"\n", "!{sys.executable} -m pip install -qU \"stepfunctions==2.0.0\"\n", "!{sys.executable} -m pip show sagemaker stepfunctions"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import boto3\n", "import json\n", "import os\n", "import time\n", "import uuid\n", "from 
botocore.exceptions import ClientError\n", "\n", "import sagemaker\n", "from sagemaker.image_uris import retrieve \n", "from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput\n", "from sagemaker.model_monitor.dataset_format import DatasetFormat\n", "\n", "import stepfunctions\n", "from stepfunctions import steps\n", "from stepfunctions.inputs import ExecutionInput\n", "from stepfunctions.workflow import Workflow"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Load the input data from the `mlops.ipynb` notebook and print values"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["%store -r input_data PROVISIONED_PRODUCT_NAME\n", "input_data, PROVISIONED_PRODUCT_NAME"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Load variables from environment"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["def get_config(provisioned_product_name):\n", " sc = boto3.client(\"servicecatalog\")\n", " outputs = sc.get_provisioned_product_outputs(ProvisionedProductName=provisioned_product_name)[\n", " \"Outputs\"\n", " ]\n", " config = {}\n", " for out in outputs:\n", " config[out[\"OutputKey\"]] = out[\"OutputValue\"]\n", " return config\n", "\n", "\n", "config = get_config(PROVISIONED_PRODUCT_NAME)\n", "region = config[\"Region\"]\n", "model_name = config[\"ModelName\"]\n", "role = config[\"SageMakerRoleARN\"]\n", "workflow_role_arn = config[\"WorkflowRoleARN\"]\n", "\n", "\n", "# Define the lambda function names for steps\n", "create_experiment_function_name = 'mlops-create-experiment'\n", "query_training_function_name = 'mlops-query-training'\n", "transform_header_function_name = 'mlops-add-transform-header'\n", "query_drift_function_name = 'mlops-query-drift'\n", "\n", "# Get the session and default bucket\n", "session = sagemaker.session.Session()\n", "bucket = session.default_bucket()\n", "\n", "print('region: {}'.format(region))\n", "print('bucket: 
{}'.format(bucket))\n", "print('sagemaker role: {}'.format(role))\n", "print('workflow role: {}'.format(workflow_role_arn))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Specify the training model and transform output base uri"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["output_data = {\n", " 'ModelOutputUri': 's3://{}/{}/model'.format(bucket, model_name), \n", "}"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Define Training Resources\n", "\n", "### Input Schema\n", "\n", "Define the input schema for the step functions which can then be used as arguments to resources"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["execution_input = ExecutionInput(\n", " schema={\n", " \"GitBranch\": str,\n", " \"GitCommitHash\": str,\n", " \"DataVersionId\": str,\n", " \"ExperimentName\": str,\n", " \"TrialName\": str,\n", " \"BaselineJobName\": str,\n", " \"BaselineOutputUri\": str,\n", " \"TrainingJobName\": str,\n", " \"ModelName\": str\n", " }\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Define the model monitor baseline\n", "\n", "Define the environment variables"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["dataset_format = DatasetFormat.csv()\n", "env = {\n", " \"dataset_format\": json.dumps(dataset_format),\n", " \"dataset_source\": \"/opt/ml/processing/input/baseline_dataset_input\",\n", " \"output_path\": \"/opt/ml/processing/output\",\n", " \"publish_cloudwatch_metrics\": \"Disabled\", # Have to be disabled from processing job?\n", "}"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Define the processing inputs and outputs "]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["inputs = [\n", " ProcessingInput(\n", " source=input_data['BaselineUri'],\n", " destination=\"/opt/ml/processing/input/baseline_dataset_input\",\n", " 
input_name=\"baseline_dataset_input\",\n", " ),\n", "]\n", "outputs = [\n", " ProcessingOutput(\n", " source=\"/opt/ml/processing/output\",\n", " destination=execution_input[\"BaselineOutputUri\"],\n", " output_name=\"monitoring_output\",\n", " ),\n", "]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Create the baseline processing job using the SageMaker [model monitor](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_monitoring.html) container."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Get the default model monitor container\n", "monor_monitor_container_uri = retrieve(region=region, framework=\"model-monitor\", version=\"latest\")\n", "\n", "# Use the base processing where we pass through the \n", "monitor_analyzer = Processor(\n", " image_uri=monor_monitor_container_uri,\n", " role=role, \n", " instance_count=1,\n", " instance_type=\"ml.m5.xlarge\",\n", " max_runtime_in_seconds=1800,\n", " env=env\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Test the model baseline processing job by running inline"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# monitor_analyzer.run(inputs=inputs, outputs=outputs, wait=True)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Defining the Training Job\n", "\n", "Define the training job to run in parallel with the processing job"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["image_uri = sagemaker.image_uris.retrieve(region=region, framework=\"xgboost\", version=\"latest\")\n", "\n", "# Create the estimator\n", "xgb = sagemaker.estimator.Estimator(\n", " image_uri,\n", " role,\n", " instance_count=1,\n", " instance_type=\"ml.m4.xlarge\",\n", " output_path=output_data['ModelOutputUri'], # NOTE: Can't use execution_input here\n", ")\n", "\n", "# Set the hyperparameters, overriding any defaults\n", "hyperparameters = {\n", " 
\"max_depth\": \"9\",\n", " \"eta\": \"0.2\",\n", " \"gamma\": \"4\",\n", " \"min_child_weight\": \"300\",\n", " \"subsample\": \"0.8\",\n", " \"objective\": \"reg:linear\",\n", " \"early_stopping_rounds\": \"10\",\n", " \"num_round\": \"50\", # Don't stop to early or results are bad\n", "}\n", "xgb.set_hyperparameters(**hyperparameters)\n", "\n", "# Specify the data source\n", "s3_input_train = sagemaker.inputs.TrainingInput(s3_data=input_data['TrainingUri'], content_type=\"csv\")\n", "s3_input_val = sagemaker.inputs.TrainingInput(s3_data=input_data['ValidationUri'], content_type=\"csv\")\n", "data = {\"train\": s3_input_train, \"validation\": s3_input_val}"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Test the estimator directly in the notebook"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# xgb.fit(inputs=data)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Define Training Workflow\n", "\n", "### 1. Create the Experiment\n", "\n", "Define the create experiment lambda.\n", "\n", "In future add [ResultsPath](https://docs.aws.amazon.com/step-functions/latest/dg/input-output-resultpath.html) to filter the results."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["create_experiment_step = steps.compute.LambdaStep(\n", " 'Create Experiment',\n", " parameters={ \n", " \"FunctionName\": create_experiment_function_name,\n", " 'Payload': {\n", " \"ExperimentName.$\": '$.ExperimentName',\n", " \"TrialName.$\": '$.TrialName',\n", " }\n", " },\n", " result_path='$.CreateTrialResults'\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 2a. 
Run processing Job\n", "\n", "Define the processing job with specific failure handling"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["baseline_step = steps.sagemaker.ProcessingStep(\n", " \"Baseline Job\",\n", " processor=monitor_analyzer,\n", " job_name=execution_input[\"BaselineJobName\"],\n", " inputs=inputs,\n", " outputs=outputs,\n", " experiment_config={\n", " 'ExperimentName': execution_input[\"ExperimentName\"], # '$.ExperimentName', \n", " 'TrialName': execution_input[\"TrialName\"],\n", " 'TrialComponentDisplayName': \"Baseline\",\n", " },\n", " tags={\n", " \"GitBranch\": execution_input[\"GitBranch\"],\n", " \"GitCommitHash\": execution_input[\"GitCommitHash\"],\n", " \"DataVersionId\": execution_input[\"DataVersionId\"],\n", " },\n", " result_path='$.BaselineJobResults'\n", ")\n", "\n", "baseline_step.add_catch(steps.states.Catch(\n", " error_equals=[\"States.TaskFailed\"],\n", " next_step=stepfunctions.steps.states.Fail(\n", " \"Baseline failed\", cause=\"SageMakerBaselineJobFailed\"\n", " ),\n", "))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 2b. 
Run and query training Job\n", "\n", "Define the training job and add a validation step"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["training_step = steps.TrainingStep(\n", " \"Training Job\",\n", " estimator=xgb,\n", " data=data,\n", " job_name=execution_input[\"TrainingJobName\"],\n", " experiment_config={\n", " 'ExperimentName': execution_input[\"ExperimentName\"],\n", " 'TrialName': execution_input[\"TrialName\"],\n", " 'TrialComponentDisplayName': \"Training\",\n", " },\n", " tags={\n", " \"GitBranch\": execution_input[\"GitBranch\"],\n", " \"GitCommitHash\": execution_input[\"GitCommitHash\"],\n", " \"DataVersionId\": execution_input[\"DataVersionId\"],\n", " },\n", " result_path='$.TrainingResults'\n", ")\n", "\n", "training_step.add_catch(stepfunctions.steps.states.Catch(\n", " error_equals=[\"States.TaskFailed\"],\n", " next_step=stepfunctions.steps.states.Fail(\n", " \"Training failed\", cause=\"SageMakerTrainingJobFailed\"\n", " ),\n", "))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Create a model from the training job, note this must follow training to retrieve the expected model"]}, {"cell_type": "code", "execution_count": null, "metadata": {"scrolled": true}, "outputs": [], "source": ["# Must follow the training test\n", "model_step = steps.sagemaker.ModelStep(\n", " 'Save Model',\n", " input_path='$.TrainingResults',\n", " model=training_step.get_expected_model(),\n", " model_name=execution_input['ModelName'],\n", " result_path='$.ModelStepResults'\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Query training results, and validate that the RMSE error is within an acceptable range "]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["training_query_step = steps.compute.LambdaStep(\n", " 'Query Training Results',\n", " parameters={ \n", " \"FunctionName\": query_training_function_name,\n", " 'Payload':{\n", " \"TrainingJobName.$\": 
'$.TrainingJobName'\n", " }\n", " },\n", " result_path='$.QueryTrainingResults'\n", ")\n", "\n", "check_accuracy_fail_step = steps.states.Fail(\n", " 'Model Error Too Low',\n", " comment='RMSE accuracy higher than threshold'\n", ")\n", "\n", "check_accuracy_succeed_step = steps.states.Succeed('Model Error Acceptable')\n", "\n", "# TODO: Update query method to query validation error using better result path\n", "threshold_rule = steps.choice_rule.ChoiceRule.NumericLessThan(\n", " variable=training_query_step.output()['QueryTrainingResults']['Payload']['results']['TrainingMetrics'][0]['Value'], value=10\n", ")\n", "\n", "check_accuracy_step = steps.states.Choice(\n", " 'RMSE < 10'\n", ")\n", "\n", "check_accuracy_step.add_choice(rule=threshold_rule, next_step=check_accuracy_succeed_step)\n", "check_accuracy_step.default_choice(next_step=check_accuracy_fail_step)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 3. Add the Error handling in the workflow\n", "\n", "We will use the [Catch Block](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/states.html#stepfunctions.steps.states.Catch) to perform error handling. 
If the Processing Job Step or Training Step fails, the flow will go into failure state."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["sagemaker_jobs = steps.states.Parallel(\"SageMaker Jobs\")\n", "sagemaker_jobs.add_branch(baseline_step)\n", "sagemaker_jobs.add_branch(steps.states.Chain([training_step, model_step, training_query_step, check_accuracy_step]))\n", "\n", "# Do we need specific failure for the jobs for group?\n", "sagemaker_jobs.add_catch(stepfunctions.steps.states.Catch(\n", " error_equals=[\"States.TaskFailed\"],\n", " next_step=stepfunctions.steps.states.Fail(\n", " \"SageMaker Jobs failed\", cause=\"SageMakerJobsFailed\"\n", " ),\n", "))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Execute Training Workflow\n", "\n", "Create the training workflow."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["training_workflow_definition = steps.states.Chain([\n", " create_experiment_step,\n", " sagemaker_jobs\n", "])\n", "\n", "training_workflow_name = 'mlops-{}-training'.format(model_name)\n", "training_workflow = Workflow(training_workflow_name, training_workflow_definition, workflow_role_arn)\n", "training_workflow.create()\n", "training_workflow"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We can also inspect the raw workflow definition and verify the execution variables are correctly passed in"]}, {"cell_type": "code", "execution_count": null, "metadata": {"scrolled": true}, "outputs": [], "source": ["print(training_workflow.definition.to_json(pretty=True))"]}, {"cell_type": "markdown", "metadata": {}, "source": [" Now we define the inputs for the workflow"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Define some dummy job and git params\n", "job_id = uuid.uuid1().hex\n", "git_branch = 'main'\n", "git_commit_hash = 'xxx' \n", "data_verison_id = 'yyy'\n", "\n", "# Define the experiment 
and trial name based on model name and job id\n", "experiment_name = \"mlops-{}\".format(model_name)\n", "trial_name = \"mlops-{}-{}\".format(model_name, job_id)\n", "\n", "workflow_inputs = {\n", " \"ExperimentName\": experiment_name,\n", " \"TrialName\": trial_name,\n", " \"GitBranch\": git_branch,\n", " \"GitCommitHash\": git_commit_hash, \n", " \"DataVersionId\": data_verison_id, \n", " \"BaselineJobName\": trial_name, \n", " \"BaselineOutputUri\": f\"s3://{bucket}/{model_name}/monitoring/baseline/mlops-{model_name}-pbl-{job_id}\",\n", " \"TrainingJobName\": trial_name,\n", " \"ModelName\": trial_name,\n", "}\n", "print(json.dumps(workflow_inputs))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Then execute the workflow"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["execution = training_workflow.execute(\n", " inputs=workflow_inputs\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Wait for the execution to complete, and output the last step."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["execution_output = execution.get_output(wait=True)\n", "execution.list_events()[-1]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Use [list_events](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Execution.list_events) to list all events in the workflow execution."]}, {"cell_type": "code", "execution_count": null, "metadata": {"scrolled": true}, "outputs": [], "source": ["# execution.list_events(html=True) # Bug"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Execute Batch Transform\n", "\n", "Take the model we have trained and run a batch transform on the validation dataset.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["execution_input = ExecutionInput(\n", " schema={\n", " \"GitBranch\": str,\n", " \"GitCommitHash\": str,\n", 
" \"DataVersionId\": str,\n", " \"ExperimentName\": str,\n", " \"TrialName\": str,\n", " \"ModelName\": str,\n", " \"TransformJobName\": str,\n", " \"MonitorJobName\": str,\n", " \"MonitorOutputUri\": str,\n", " }\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Define some new output paths for the transform and monitoring jobs"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["output_data['TransformOutputUri'] = f\"s3://{bucket}/{model_name}/transform/mlops-{model_name}-{job_id}\"\n", "output_data['MonitoringOutputUri'] = f\"s3://{bucket}/{model_name}/monitoring/mlops-{model_name}-{job_id}\"\n", "output_data['BaselineOutputUri'] = workflow_inputs['BaselineOutputUri']"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 1. Run the Transform Job\n", "\n", "Define a transform job to take the test dataset as input. \n", "\n", "We can configured the batch transform to [associate prediction results](https://aws.amazon.com/blogs/machine-learning/associating-prediction-results-with-input-data-using-amazon-sagemaker-batch-transform/) with the input based in the `input_filter` and `output_filter` arguments."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["transform_step = steps.TransformStep(\n", " 'Transform Input Dataset',\n", " transformer=xgb.transformer(\n", " instance_count=1,\n", " instance_type='ml.m5.large',\n", " assemble_with='Line', \n", " accept = 'text/csv',\n", " output_path=output_data['TransformOutputUri'], # NOTE: Can't use execution_input here\n", " ),\n", " job_name=execution_input['TransformJobName'], # TEMP\n", " model_name=execution_input['ModelName'], \n", " data=input_data['TestUri'],\n", " content_type='text/csv',\n", " split_type='Line',\n", " input_filter='$[1:]', # Skip the first target column output_amount\n", " join_source='Input',\n", " output_filter='$[1:]', # Output all inputs excluding output_amount, followed by the 
predicted_output_amount\n", " result_path='$.TransformJobResults'\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 2. Add the Transform Header\n", "\n", "The batch transform output does not include the header, so add this back to be able to run baseline."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["transform_file_name = 'test.csv'\n", "header = 'duration_minutes,passenger_count,trip_distance,total_amount'"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["transform_header_step = steps.compute.LambdaStep(\n", " 'Add Transform Header',\n", " parameters={ \n", " \"FunctionName\": transform_header_function_name,\n", " 'Payload': {\n", " \"TransformOutputUri\": output_data['TransformOutputUri'],\n", " \"FileName\": transform_file_name,\n", " \"Header\": header,\n", " }\n", " },\n", " result_path='$.TransformHeaderResults'\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 3. 
Run the Model Monitor Processing Job\n", "\n", "Create a model monitor processing job that takes the output of the transform job.\n", "\n", "Reference the `constraints.json` and `statistics.json` files from the output of the training baseline."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["dataset_format = DatasetFormat.csv()\n", "env = {\n", " \"dataset_format\": json.dumps(dataset_format),\n", " \"dataset_source\": \"/opt/ml/processing/input/baseline_dataset_input\",\n", " \"output_path\": \"/opt/ml/processing/output\",\n", " \"publish_cloudwatch_metrics\": \"Disabled\", # Have to be disabled from processing job?\n", " \"baseline_constraints\": \"/opt/ml/processing/baseline/constraints/constraints.json\",\n", " \"baseline_statistics\": \"/opt/ml/processing/baseline/stats/statistics.json\"\n", "}\n", "inputs = [\n", " ProcessingInput(\n", " source=os.path.join(output_data['TransformOutputUri'], transform_file_name), # Transform with header\n", " destination=\"/opt/ml/processing/input/baseline_dataset_input\",\n", " input_name=\"baseline_dataset_input\",\n", " ),\n", " ProcessingInput(\n", " source=os.path.join(output_data['BaselineOutputUri'], 'constraints.json'),\n", " destination=\"/opt/ml/processing/baseline/constraints\",\n", " input_name=\"constraints\",\n", " ),\n", " ProcessingInput(\n", " source=os.path.join(output_data['BaselineOutputUri'], 'statistics.json'),\n", " destination=\"/opt/ml/processing/baseline/stats\",\n", " input_name=\"baseline\",\n", " ),\n", "]\n", "outputs = [\n", " ProcessingOutput(\n", " source=\"/opt/ml/processing/output\",\n", " destination=output_data['MonitoringOutputUri'],\n", " output_name=\"monitoring_output\",\n", " ),\n", "]\n", "\n", "# Get the default model monitor container\n", "monor_monitor_container_uri = retrieve(region=region, framework=\"model-monitor\", version=\"latest\")\n", "\n", "# Use the base processing where we pass through the \n", "monitor_analyzer = Processor(\n", 
" image_uri=monor_monitor_container_uri,\n", " role=role, \n", " instance_count=1,\n", " instance_type=\"ml.m5.xlarge\",\n", " max_runtime_in_seconds=1800,\n", " env=env\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Test the monitor baseline"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# monitor_analyzer.run(inputs=inputs, outputs=outputs, wait=True)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Add the monitor step"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["monitor_step = steps.sagemaker.ProcessingStep(\n", " \"Monitor Job\",\n", " processor=monitor_analyzer,\n", " job_name=execution_input[\"MonitorJobName\"],\n", " inputs=inputs,\n", " outputs=outputs,\n", " experiment_config={\n", " 'ExperimentName': execution_input[\"ExperimentName\"],\n", " 'TrialName': execution_input[\"TrialName\"],\n", " 'TrialComponentDisplayName': \"Baseline\",\n", " },\n", " tags={\n", " \"GitBranch\": execution_input[\"GitBranch\"],\n", " \"GitCommitHash\": execution_input[\"GitCommitHash\"],\n", " \"DataVersionId\": execution_input[\"DataVersionId\"],\n", " },\n", " result_path='$.MonitorJobResults'\n", ")\n", "\n", "monitor_step.add_catch(stepfunctions.steps.states.Catch(\n", " error_equals=[\"States.TaskFailed\"],\n", " next_step=stepfunctions.steps.states.Fail(\n", " \"Monitor failed\", cause=\"SageMakerMonitorJobFailed\"\n", " ),\n", "))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Add the lambda step to query for violations in the processing job."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["monitor_query_step = steps.compute.LambdaStep(\n", " 'Query Monitoring Results',\n", " parameters={ \n", " \"FunctionName\": query_drift_function_name,\n", " 'Payload':{\n", " \"ProcessingJobName.$\": '$.MonitorJobName'\n", " }\n", " },\n", " result_path='$.QueryMonitorResults'\n", ")\n", "\n", 
"check_violations_fail_step = steps.states.Fail(\n", " 'Completed with Violations',\n", " comment='Processing job completed with violations'\n", ")\n", "\n", "check_violations_succeed_step = steps.states.Succeed('Completed')\n", "\n", "# TODO: Check specific drift in violations\n", "status_rule = steps.choice_rule.ChoiceRule.StringEquals(\n", " variable=monitor_query_step.output()['QueryMonitorResults']['Payload']['results']['ProcessingJobStatus'], value='Completed'\n", ")\n", "\n", "check_violations_step = steps.states.Choice(\n", " 'Check Violations'\n", ")\n", "\n", "check_violations_step.add_choice(rule=status_rule, next_step=check_violations_succeed_step)\n", "check_violations_step.default_choice(next_step=check_violations_fail_step)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Create the transform workflow definition"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["transform_workflow_definition = steps.states.Chain([\n", " transform_step,\n", " transform_header_step,\n", " monitor_step, \n", " monitor_query_step, \n", " check_violations_step\n", "])\n", "\n", "transform_workflow_name = 'mlops-{}-transform'.format(model_name)\n", "transform_workflow = Workflow(transform_workflow_name, transform_workflow_definition, workflow_role_arn)\n", "transform_workflow.create()\n", "transform_workflow"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Define the workflow inputs based on the previous training run"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Define unique names for the transform and monitor baseline jobs\n", "transform_job_name = \"mlops-{}-trn-{}\".format(model_name, job_id)\n", "monitor_job_name = \"mlops-{}-mbl-{}\".format(model_name, job_id)\n", "\n", "workflow_inputs = {\n", " \"ExperimentName\": experiment_name,\n", " \"TrialName\": trial_name,\n", " \"GitBranch\": git_branch,\n", " \"GitCommitHash\": git_commit_hash, \n", " 
\"DataVersionId\": data_verison_id, \n", " \"ModelName\": trial_name,\n", " \"TransformJobName\": transform_job_name, \n", " \"MonitorJobName\": monitor_job_name,\n", "}\n", "print(json.dumps(workflow_inputs))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Execute the workflow and render the progress. "]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["execution = transform_workflow.execute(\n", " inputs=workflow_inputs\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Wait for the execution to finish and list the last event."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["execution_output = execution.get_output(wait=True)\n", "execution.list_events()[-1]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Inspect Transform Results\n", "\n", "Verify that we can load the transform output with header"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from io import StringIO\n", "import pandas as pd\n", "from sagemaker.s3 import S3Downloader\n", "\n", "# Get the output, and add header\n", "transform_output_uri = os.path.join(output_data['TransformOutputUri'], transform_file_name)\n", "transform_body = S3Downloader.read_file(transform_output_uri)\n", "pred_df = pd.read_csv(StringIO(transform_body), sep=\",\")\n", "pred_df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Query monitoring output\n", "\n", "If this completed with violations, let's inspect the output to see why that is the case."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["violiations_uri = os.path.join(output_data['MonitoringOutputUri'], 'constraint_violations.json')\n", "violiations = json.loads(S3Downloader.read_file(violiations_uri))\n", "violiations"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Cleanup\n", "\n", "Delete the workflows that we created as 
part of this notebook"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["training_workflow.delete()\n", "transform_workflow.delete()"]}], "metadata": {"instance_type": "ml.t3.medium", "kernelspec": {"display_name": "conda_python3", "language": "python", "name": "conda_python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10"}}, "nbformat": 4, "nbformat_minor": 4} \ No newline at end of file diff --git a/pipeline.yml b/pipeline.yml index 0a57e23..0c530b6 100644 --- a/pipeline.yml +++ b/pipeline.yml @@ -1,17 +1,4 @@ -# Delete the stack: -# -# aws cloudformation delete-stack --stack-name sagemaker-safe-deployment -# -# Create the stack: -# -# aws cloudformation create-stack --stack-name sagemaker-safe-deployment \ -# --template-body file://pipeline.yml \ -# --capabilities CAPABILITY_IAM \ -# --parameters \ -# ParameterKey=GitHubUser,ParameterValue= \ -# ParameterKey=GitHubToken,ParameterValue= \ -# ParameterKey=ModelName,ParameterValue= - +AWSTemplateFormatVersion: "2010-09-09" Description: Create an Amazon SageMaker safe deployment pipeline Metadata: AWS::CloudFormation::Interface: @@ -21,37 +8,47 @@ Metadata: Parameters: - ModelName - DatasetBucket - - NotebookInstanceType - Label: - default: Optional GitHub Parameters + default: Git Config Parameters: - - GitHubRepo - - GitHubBranch - - GitHubUser - - GitHubToken + - GitBranch - Label: default: Optional Notification Settings Parameters: - EmailAddress ParameterLabels: + SageMakerProjectName: + default: Project Name + SageMakerProjectId: + default: Project ID + ProjectPrefix: + default: Unique prefix to bind the components ModelName: default: Model Name DatasetBucket: default: S3 Bucket for Dataset - NotebookInstanceType: - default: Notebook Instance Type - GitHubRepo: - default: GitHub Repository 
- GitHubBranch: - default: GitHub Branch - GitHubUser: - default: GitHub Username - GitHubToken: - default: GitHub Access Token + GitBranch: + default: Git Branch EmailAddress: default: Email Address + TimeoutInMinutes: + default: Train and build timeout Parameters: + SageMakerProjectName: + Type: String + Description: Name of the project + MinLength: 1 + AllowedPattern: ^[a-zA-Z](-*[a-zA-Z0-9])* + SageMakerProjectId: + Type: String + Description: Service generated Id of the project + ProjectPrefix: + Type: String + Description: | + Unique prefix to make resource privileges scoped-limited. + Changing the default must be done with care + Default: PROJECT_PREFIX ModelName: Default: nyctaxi Type: String @@ -60,57 +57,35 @@ Parameters: MaxLength: 15 # Limited to this due to mlops-{model}-{dev/prd}-{pipeline-executionid} AllowedPattern: ^[a-z0-9](-*[a-z0-9])* # no UPPERCASE due to S3 naming restrictions ConstraintDescription: Must be lowercase or numbers with a length of 1-15 characters. - NotebookInstanceType: - Type: String - Default: ml.t3.medium - Description: Select Instance type for the SageMaker Notebook - AllowedValues: - - ml.t3.medium - - ml.t3.large - - ml.t3.2xlarge - - ml.m5.large - - ml.m5.xlarge - - ml.m5.2xlarge - ConstraintDescription: Must select a valid notebook instance type. DatasetBucket: Default: nyc-tlc Description: S3 dataset bucket. Type: String - GitHubUser: - Default: aws-samples - Description: Your GitHub username - Type: String - GitHubRepo: - Default: amazon-sagemaker-safe-deployment-pipeline - Type: String - Description: Name of the GitHub repository. - GitHubBranch: - Default: master + GitBranch: + Default: main Type: String Description: Name of the GitHub branch where the code is located. - GitHubToken: - Default: "" - NoEcho: true - Description: Optional Github OAuthToken with access to your Repo. Leave blank to pull the public repository into local CodeCommit. 
- Type: String EmailAddress: - Default: "" + Default: "example@example.com" Description: Email address to notify on successful or failed deployments. Type: String + TimeoutInMinutes: + Default: 30 + Description: Train and build timeout in minutes + Type: String Conditions: EmailAddressNotEmpty: !Not [!Equals [!Ref EmailAddress, ""]] - GitHubTokenEmpty: !Equals [!Ref GitHubToken, ""] Resources: KMSKey: Type: AWS::KMS::Key Properties: - Description: !Sub KMS Key for mlops pipeline ${ModelName} + Description: !Sub KMS Key for mlops pipeline ${ProjectPrefix}-${ModelName}-${SageMakerProjectId} EnableKeyRotation: true KeyPolicy: Version: "2012-10-17" - Id: !Ref ModelName + Id: !Sub ${ProjectPrefix}-${ModelName} Statement: - Sid: Allows admin of the key Effect: Allow @@ -123,13 +98,13 @@ Resources: KMSAlias: Type: AWS::KMS::Alias Properties: - AliasName: !Sub alias/mlops-${ModelName} + AliasName: !Sub alias/${ProjectPrefix}-${ModelName}-${SageMakerProjectId} TargetKeyId: !Ref KMSKey ArtifactBucket: Type: AWS::S3::Bucket Properties: - BucketName: !Sub mlops-${ModelName}-artifact-${AWS::Region}-${AWS::AccountId} + BucketName: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId} AccessControl: Private VersioningConfiguration: Status: Enabled @@ -161,13 +136,13 @@ Resources: DependsOn: ArtifactBucketPolicy Type: AWS::CloudTrail::Trail Properties: - TrailName: !Sub mlops-${ModelName} + TrailName: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId} S3BucketName: !Ref ArtifactBucket EventSelectors: - DataResources: - Type: AWS::S3::Object Values: - - !Sub ${ArtifactBucket.Arn}/${ModelName}/data-source.zip + - !Sub ${ArtifactBucket.Arn}/${ProjectPrefix}-${ModelName}-${SageMakerProjectId}/data-source.zip ReadWriteType: WriteOnly IncludeGlobalServiceEvents: true IsLogging: true @@ -176,7 +151,7 @@ Resources: NotificationTopic: Type: AWS::SNS::Topic Properties: - TopicName: !Sub mlops-${ModelName}-notification + TopicName: !Sub 
${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-notification NotificationTopicPolicy: Type: AWS::SNS::TopicPolicy @@ -198,89 +173,24 @@ Resources: Condition: EmailAddressNotEmpty Properties: Endpoint: !Ref EmailAddress - Protocol: Email + Protocol: "email" TopicArn: !Ref NotificationTopic - GitHubSecret: - Type: AWS::SecretsManager::Secret - Properties: - Description: !Sub GitHub Secret for ${GitHubRepo} - KmsKeyId: !Ref KMSKey - SecretString: !Sub '{"username":"${GitHubUser}","password":"${GitHubToken}"}' - CodeCommitRepository: Type: AWS::CodeCommit::Repository - Condition: GitHubTokenEmpty - Properties: - RepositoryName: !Ref GitHubRepo - RepositoryDescription: !Sub SageMaker safe deployment pipeline for ${ModelName} - - SageMakerCodeRepository: - Type: AWS::SageMaker::CodeRepository Properties: - CodeRepositoryName: !Join ["-", !Split ["_", !Ref GitHubRepo]] - GitConfig: - RepositoryUrl: - Fn::If: - - GitHubTokenEmpty - - !GetAtt CodeCommitRepository.CloneUrlHttp - - !Sub https://github.com/${GitHubUser}/${GitHubRepo}.git - Branch: !Ref GitHubBranch - SecretArn: - Fn::If: - - GitHubTokenEmpty - - !Ref "AWS::NoValue" - - !Ref GitHubSecret - - NotebookInstanceLifecycleConfig: - Type: AWS::SageMaker::NotebookInstanceLifecycleConfig - Properties: - NotebookInstanceLifecycleConfigName: !Sub ${ModelName}-lifecycle-config - OnCreate: - - Content: - Fn::If: - - GitHubTokenEmpty - - Fn::Base64: - Fn::Sub: | - #!/bin/bash - # Clone the public github repo, and push it to a local codecommit branch - export HOME=/root/ - echo "Configuring github for AWS credentials" - git config --global credential.helper '!aws codecommit credential-helper $@' - git config --global credential.UseHttpPath true - cp /root/.gitconfig /home/ec2-user/ && chown ec2-user:ec2-user /home/ec2-user/.gitconfig - echo "Clone the public repo and push it to codecommit repo" - git clone -b ${GitHubBranch} "https://github.com/${GitHubUser}/${GitHubRepo}.git" /tmp/mlops-repo - cd /tmp/mlops-repo - git 
remote add codecommit ${CodeCommitRepository.CloneUrlHttp} - git push --set-upstream codecommit ${GitHubBranch} - - Ref: AWS::NoValue - OnStart: - - Content: - Fn::Base64: - Fn::Sub: | - #!/bin/bash - touch /etc/profile.d/jupyter-env.sh - echo "export ARTIFACT_BUCKET=${ArtifactBucket}" >> /etc/profile.d/jupyter-env.sh - echo "export PIPELINE_NAME=${ModelName}" >> /etc/profile.d/jupyter-env.sh - echo "export MODEL_NAME=${ModelName}" >> /etc/profile.d/jupyter-env.sh - echo "export WORKFLOW_PIPELINE_ARN=arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${ModelName}" >> /etc/profile.d/jupyter-env.sh - echo "export WORKFLOW_ROLE_ARN=${WorkflowExecutionRole.Arn}" >> /etc/profile.d/jupyter-env.sh - - NotebookInstance: - Type: AWS::SageMaker::NotebookInstance - Properties: - NotebookInstanceName: !Sub ${ModelName}-notebook - InstanceType: !Ref NotebookInstanceType - LifecycleConfigName: !GetAtt NotebookInstanceLifecycleConfig.NotebookInstanceLifecycleConfigName - DefaultCodeRepository: !GetAtt SageMakerCodeRepository.CodeRepositoryName - KmsKeyId: !Ref KMSKey - RoleArn: !GetAtt SageMakerRole.Arn + RepositoryName: !Sub "amazon-sagemaker-safe-deployment-pipeline-${SageMakerProjectName}-${SageMakerProjectId}" # !Sub ${GitHubRepo} + RepositoryDescription: !Sub SageMaker safe deployment pipeline for project ${SageMakerProjectName} with id ${SageMakerProjectId}, prefix ${ProjectPrefix} and model name ${ModelName} + Code: + S3: + Bucket: "S3_BUCKET_NAME" + Key: "project.zip" + BranchName: !Ref GitBranch BuildProject: Type: AWS::CodeBuild::Project Properties: - Name: !Sub ${ModelName}-build + Name: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-build Description: Builds the assets required for executing the rest of pipeline ServiceRole: !GetAtt SageMakerRole.Arn Artifacts: @@ -290,12 +200,33 @@ Resources: ComputeType: BUILD_GENERAL1_SMALL Image: aws/codebuild/amazonlinux2-x86_64-standard:3.0 EnvironmentVariables: + - Name: SAGEMAKER_PROJECT_NAME + Type: 
PLAINTEXT + Value: !Ref SageMakerProjectName + - Name: SAGEMAKER_PROJECT_ID + Type: PLAINTEXT + Value: !Ref SageMakerProjectId + - Name: AWS_REGION + Type: PLAINTEXT + Value: !Ref AWS::Region + - Name: AWS_ACCOUNT_ID + Type: PLAINTEXT + Value: !Ref AWS::AccountId + - Name: IMAGE_TAG + Type: PLAINTEXT + Value: "latest" - Name: GIT_BRANCH Type: PLAINTEXT - Value: !Ref GitHubBranch + Value: !Ref GitBranch + - Name: PREFIX + Type: PLAINTEXT + Value: !Ref ProjectPrefix - Name: MODEL_NAME Type: PLAINTEXT Value: !Ref ModelName + - Name: PIPELINE_NAME + Type: PLAINTEXT + Value: !Sub ${ModelName}-${SageMakerProjectId} - Name: ARTIFACT_BUCKET Type: PLAINTEXT Value: !Ref ArtifactBucket @@ -313,19 +244,19 @@ Resources: Value: !GetAtt SageMakerRole.Arn - Name: SAGEMAKER_BUCKET Type: PLAINTEXT - Value: !Sub "sagemaker-${AWS::Region}-${AWS::AccountId}" + Value: !Sub "sagemaker-${AWS::Region}-${AWS::AccountId}" # match ArtifactBucket - Name: WORKFLOW_ROLE_ARN Type: PLAINTEXT Value: !GetAtt WorkflowExecutionRole.Arn Source: Type: CODEPIPELINE BuildSpec: model/buildspec.yml - TimeoutInMinutes: 30 + TimeoutInMinutes: !Ref TimeoutInMinutes DeployPipeline: Type: "AWS::CodePipeline::Pipeline" Properties: - Name: !Sub ${ModelName} + Name: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId} RoleArn: !GetAtt PipelineRole.Arn ArtifactStore: Type: S3 @@ -337,33 +268,18 @@ Resources: Stages: - Name: Source Actions: - - Fn::If: - - GitHubTokenEmpty - - Name: GitSource - ActionTypeId: - Category: Source - Owner: AWS - Version: "1" - Provider: CodeCommit - Configuration: - PollForSourceChanges: false # Triggered by CodeCommitEventRule - RepositoryName: !Ref GitHubRepo - BranchName: !Ref GitHubBranch - OutputArtifacts: - - Name: ModelSourceOutput - - Name: GitSource - ActionTypeId: - Category: Source - Owner: ThirdParty - Version: "1" - Provider: GitHub # Explore CodeStarSourceConnection: https://docs.aws.amazon.com/codepipeline/latest/userguide/update-github-action-connections.html - 
OutputArtifacts: - - Name: ModelSourceOutput - Configuration: - Owner: !Ref GitHubUser - Repo: !Ref GitHubRepo - Branch: !Ref GitHubBranch - OAuthToken: !Ref GitHubToken + - Name: GitSource + ActionTypeId: + Category: Source + Owner: AWS + Version: "1" + Provider: CodeCommit + Configuration: + PollForSourceChanges: false # Triggered by CodeCommitEventRule + RepositoryName: !GetAtt CodeCommitRepository.Name + BranchName: !Ref GitBranch + OutputArtifacts: + - Name: ModelSourceOutput - Name: DataSource ActionTypeId: Category: Source @@ -375,8 +291,7 @@ Resources: Configuration: PollForSourceChanges: false # Triggered by S3EventRule S3Bucket: !Ref ArtifactBucket - S3ObjectKey: !Sub ${ModelName}/data-source.zip - PollForSourceChanges: false + S3ObjectKey: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}/data-source.zip RunOrder: 1 - Name: Build Actions: @@ -407,7 +322,7 @@ Resources: ActionMode: REPLACE_ON_FAILURE RoleArn: !GetAtt DeployRole.Arn Capabilities: CAPABILITY_NAMED_IAM - StackName: !Sub ${ModelName}-workflow + StackName: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-workflow TemplatePath: BuildOutput::workflow-graph.yml RunOrder: 2 - Name: CreateCustomResources @@ -422,7 +337,7 @@ Resources: ActionMode: REPLACE_ON_FAILURE RoleArn: !GetAtt DeployRole.Arn Capabilities: CAPABILITY_NAMED_IAM,CAPABILITY_AUTO_EXPAND - StackName: sagemaker-custom-resource # Use global name to re-use across templates + StackName: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-sagemaker-custom-resource # Use global name to re-use across templates TemplatePath: BuildOutput::packaged-custom-resource.yml RunOrder: 2 - Name: Train @@ -433,12 +348,12 @@ Resources: ActionTypeId: Category: Invoke Owner: AWS - Version: 1 + Version: "1" Provider: StepFunctions OutputArtifacts: - Name: TrainWorkflow Configuration: - StateMachineArn: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${ModelName}" + StateMachineArn: !Sub 
"arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${ProjectPrefix}-${ModelName}" InputType: FilePath Input: workflow-input.json RunOrder: 1 @@ -455,7 +370,7 @@ Resources: Configuration: ActionMode: REPLACE_ON_FAILURE RoleArn: !GetAtt DeployRole.Arn - StackName: !Sub ${ModelName}-deploy-dev + StackName: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-deploy-dev TemplateConfiguration: BuildOutput::deploy-model-dev.json TemplatePath: BuildOutput::deploy-model-dev.yml RunOrder: 1 @@ -466,7 +381,6 @@ Resources: Version: "1" Provider: Manual Configuration: - ExternalEntityLink: !Sub https://${ModelName}-notebook.notebook.${AWS::Region}.sagemaker.aws/notebooks/${GitHubRepo}/notebook/mlops.ipynb CustomData: "Shall this model be put into production?" RunOrder: 2 - Name: DeployPrd @@ -485,16 +399,15 @@ Resources: ActionMode: CREATE_UPDATE RoleArn: !GetAtt DeployRole.Arn Capabilities: CAPABILITY_IAM,CAPABILITY_AUTO_EXPAND - StackName: !Sub ${ModelName}-deploy-prd + StackName: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-deploy-prd TemplateConfiguration: BuildOutput::deploy-model-prd.json TemplatePath: BuildOutput::packaged-model-prd.yml RunOrder: 1 CodeCommitEventRule: Type: AWS::Events::Rule - Condition: GitHubTokenEmpty Properties: - Name: !Sub mlops-${ModelName}-codecommit-pipeline + Name: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-codecommit-pipeline Description: "AWS CodeCommit change to trigger AWS Code Pipeline" EventPattern: source: @@ -507,7 +420,7 @@ Resources: referenceType: - "branch" referenceName: - - !Ref GitHubBranch + - !Ref GitBranch Targets: - Arn: !Sub "arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:${DeployPipeline}" RoleArn: !GetAtt CloudWatchEventRole.Arn @@ -518,6 +431,7 @@ Resources: S3EventRule: Type: AWS::Events::Rule Properties: + Name: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-s3-event-rule EventPattern: source: - aws.s3 @@ -534,7 +448,7 @@ Resources: bucketName: - !Ref 
ArtifactBucket key: - - !Sub ${ModelName}/data-source.zip + - !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}/data-source.zip Targets: - Arn: !Sub "arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:${DeployPipeline}" RoleArn: !GetAtt CloudWatchEventRole.Arn @@ -543,7 +457,7 @@ Resources: RetrainRule: Type: AWS::Events::Rule Properties: - Name: !Sub mlops-${ModelName}-retrain-pipeline + Name: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-retrain-pipeline Description: "Retrain rule for the AWS Code Pipeline" EventPattern: source: @@ -552,7 +466,7 @@ Resources: - "CloudWatch Alarm State Change" detail: alarmName: - - !Sub mlops-${ModelName}-metric-gt-threshold + - !Sub ${ProjectPrefix}-${ModelName}-metric-gt-threshold state: value: - "ALARM" @@ -560,20 +474,23 @@ Resources: - Arn: !Sub "arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:${DeployPipeline}" RoleArn: !GetAtt CloudWatchEventRole.Arn Id: !Sub codepipeline-${DeployPipeline} - - Arn: !Ref NotificationTopic - Id: "RetrainRule" - InputTransformer: - InputPathsMap: - alarmName: $.detail.alarmName - reason: $.detail.state.reason - InputTemplate: | - "The retrain rule for alarm: has been triggered." - "Reason: ." + # TODO: Studio doesn't accept empty CFT parameter so the SNS email + # if not verified prevents the retraining to complete + # enable SNS when studio accepts empty parameter + # - Arn: !Ref NotificationTopic + # Id: "RetrainRule" + # InputTransformer: + # InputPathsMap: + # alarmName: $.detail.alarmName + # reason: $.detail.state.reason + # InputTemplate: | + # "The retrain rule for alarm: has been triggered." + # "Reason: ." ScheduledRule: Type: AWS::Events::Rule Properties: - Name: !Sub mlops-${ModelName}-schedule-pipeline + Name: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-schedule-pipeline Description: "Scheduled rule for the AWS Code Pipeline" ScheduleExpression: "cron(0 12 1 * ? 
*)" # 1 day of month State: DISABLED # Disable by default @@ -589,9 +506,8 @@ Resources: CodeCommitPolicy: Type: AWS::IAM::Policy - Condition: GitHubTokenEmpty Properties: - PolicyName: !Sub mlops-codecommit-policy + PolicyName: !Sub ${ProjectPrefix}-codecommit-policy PolicyDocument: Version: "2012-10-17" Statement: @@ -607,7 +523,7 @@ Resources: StepFunctionsPolicy: Type: AWS::IAM::Policy Properties: - PolicyName: !Sub mlops-sfn-policy + PolicyName: !Sub ${ProjectPrefix}-sfn-policy PolicyDocument: Version: "2012-10-17" Statement: @@ -616,8 +532,8 @@ Resources: Action: - states:* Resource: - - !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${ModelName}*" - - !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:execution:${ModelName}*:*" + - !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${ProjectPrefix}-${ModelName}*" + - !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:execution:${ProjectPrefix}-${ModelName}*:*" - Sid: AllowPassRoleStepFunctions Effect: Allow Action: @@ -634,7 +550,7 @@ Resources: CodePipelinePolicy: Type: AWS::IAM::Policy Properties: - PolicyName: !Sub mlops-codepipeline-policy + PolicyName: !Sub ${ProjectPrefix}-codepipeline-policy PolicyDocument: Version: "2012-10-17" Statement: @@ -651,7 +567,7 @@ Resources: SageMakerPolicy: Type: AWS::IAM::Policy Properties: - PolicyName: !Sub mlops-sagemaker-policy + PolicyName: !Sub ${ProjectPrefix}-sagemaker-policy PolicyDocument: Version: "2012-10-17" Statement: @@ -683,8 +599,8 @@ Resources: - sagemaker:StopTransformJob - kms:CreateGrant Resource: - - !Sub arn:aws:sagemaker:${AWS::Region}:${AWS::AccountId}:*/mlops-${ModelName}* - - !Sub arn:aws:sagemaker:${AWS::Region}:${AWS::AccountId}:*/model-monitoring-* + - !Sub arn:aws:sagemaker:${AWS::Region}:${AWS::AccountId}:*/${ProjectPrefix}-${ModelName}* + - !Sub arn:aws:sagemaker:${AWS::Region}:${AWS::AccountId}:*/${ProjectPrefix}-model-monitoring-* - Sid: AllowSageMakerSearch Effect: Allow Action: @@ -699,7 
+615,7 @@ Resources: S3Policy: Type: AWS::IAM::Policy Properties: - PolicyName: !Sub mlops-s3-policy + PolicyName: !Sub ${ProjectPrefix}-s3-policy PolicyDocument: Version: "2012-10-17" Statement: @@ -747,7 +663,7 @@ Resources: CloudWatchEventRole: Type: AWS::IAM::Role Properties: - RoleName: !Sub mlops-${ModelName}-cwe-role + RoleName: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-cwe-role AssumeRolePolicyDocument: Version: "2012-10-17" Statement: @@ -758,7 +674,7 @@ Resources: Action: sts:AssumeRole Path: / Policies: - - PolicyName: "mlops-cwe-pipeline-execution" + - PolicyName: !Sub "${ProjectPrefix}-cwe-pipeline-execution" PolicyDocument: Version: "2012-10-17" Statement: @@ -769,7 +685,7 @@ Resources: SageMakerRole: Type: AWS::IAM::Role Properties: - RoleName: !Sub mlops-${ModelName}-sagemaker-role + RoleName: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-sagemaker-role AssumeRolePolicyDocument: Version: "2012-10-17" Statement: @@ -794,7 +710,7 @@ Resources: ManagedPolicyArns: - "arn:aws:iam::aws:policy/CloudWatchSyntheticsFullAccess" Policies: - - PolicyName: "mlops-notebook-policy" + - PolicyName: !Sub ${ProjectPrefix}-sagemaker-policy PolicyDocument: Version: "2012-10-17" Statement: @@ -803,8 +719,7 @@ Resources: Action: - cloudformation:* Resource: - - !Sub arn:aws:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${ModelName}* - - !Sub arn:aws:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/sagemaker-custom-resource* + - !Sub arn:aws:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${ProjectPrefix}-${ModelName}* - Sid: AllowCloudWatch Effect: Allow Action: @@ -812,7 +727,6 @@ Resources: - cloudwatch:PutMetricData - cloudwatch:PutMetricAlarm - cloudwatch:DeleteAlarms - - cloudwatch:PutDashboard - cloudwatch:DeleteDashboards - iam:GetRole Resource: "*" @@ -829,7 +743,7 @@ Resources: WorkflowExecutionRole: Type: AWS::IAM::Role Properties: - RoleName: !Sub mlops-${ModelName}-sfn-execution-role + RoleName: !Sub 
${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-sfn-execution-role AssumeRolePolicyDocument: Statement: - Action: @@ -842,7 +756,7 @@ Resources: ManagedPolicyArns: - "arn:aws:iam::aws:policy/CloudWatchEventsFullAccess" Policies: - - PolicyName: "mlops-sfn-sagemaker" + - PolicyName: !Sub ${ProjectPrefix}-sfn-sagemaker PolicyDocument: Statement: - Sid: AllowLambda @@ -850,7 +764,7 @@ Resources: Action: - lambda:InvokeFunction Resource: - - arn:aws:lambda:*:*:function:mlops-* + - !Sub "arn:aws:lambda:*:*:function:${ProjectPrefix}-*" - Sid: AllowEvents Effect: Allow Action: @@ -874,7 +788,7 @@ Resources: PipelineRole: Type: "AWS::IAM::Role" Properties: - RoleName: !Sub mlops-${ModelName}-pipeline-role + RoleName: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-pipeline-role AssumeRolePolicyDocument: Version: "2012-10-17" Statement: @@ -886,7 +800,7 @@ Resources: - "sts:AssumeRole" Path: "/" Policies: - - PolicyName: "mlops-pipeline" + - PolicyName: !Sub "${ProjectPrefix}-pipeline" PolicyDocument: Version: "2012-10-17" Statement: @@ -909,11 +823,16 @@ Resources: Action: - iam:PassRole Resource: !GetAtt DeployRole.Arn + - Sid: AllowCodeCommit + Effect: Allow + Action: + - codecommit:GetBranch + Resource: !GetAtt CodeCommitRepository.Arn DeployRole: Type: "AWS::IAM::Role" Properties: - RoleName: !Sub mlops-${ModelName}-deploy-role + RoleName: !Sub ${ProjectPrefix}-${ModelName}-${SageMakerProjectId}-deploy-role AssumeRolePolicyDocument: Version: "2012-10-17" Statement: @@ -933,7 +852,7 @@ Resources: ManagedPolicyArns: - Fn::Sub: "arn:${AWS::Partition}:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly" Policies: - - PolicyName: "mlops-deploy" + - PolicyName: !Sub ${ProjectPrefix}-deploy PolicyDocument: Version: "2012-10-17" Statement: @@ -1026,6 +945,25 @@ Outputs: DeployPipeline: Description: The deployment pipeline Value: !Ref DeployPipeline - NotebookInstance: - Description: The sagemaker notebook - Value: !Ref NotebookInstance + SageMakerProjectName: + 
Value: !Ref SageMakerProjectName + SageMakerProjectId: + Value: !Ref SageMakerProjectId + ArtifactBucket: + Value: !Ref ArtifactBucket + PipelineName: + Value: !Ref DeployPipeline + ModelName: + Value: !Ref ModelName + WorkflowPipelineARN: + Value: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${ProjectPrefix}-${ModelName}" + WorkflowRoleARN: + Value: !GetAtt WorkflowExecutionRole.Arn + SageMakerRoleARN: + Value: !GetAtt SageMakerRole.Arn + Region: + Value: !Ref AWS::Region + KMSKey: + Value: !Ref KMSKey + NotificationTopic: + Value: !Ref NotificationTopic diff --git a/pyproject.toml b/pyproject.toml new file mode 100644 index 0000000..c265102 --- /dev/null +++ b/pyproject.toml @@ -0,0 +1,3 @@ +[tool.black] +line-length = 100 +target-version = ['py36', 'py37', 'py38'] diff --git a/scripts/build.sh b/scripts/build.sh new file mode 100755 index 0000000..d44204f --- /dev/null +++ b/scripts/build.sh @@ -0,0 +1,52 @@ +#!/bin/bash +set -euxo pipefail + +NOW=$(date +"%x %r %Z") +echo "Time: $NOW" + +if [ $# -lt 4 ]; then + echo "Please provide the solution name as well as the base S3 bucket name and the region to run build script." + echo "For example: chmod a+x build.sh && ./build.sh S3_BUCKET_NAME STACK_NAME REGION STUDIO_ROLE_NAME" + echo "STUDIO_ROLE_NAME is just the name not ARN from studio console example: AmazonSageMaker-ExecutionRole-20210112T085906" + exit 1 +fi + +PREFIX="mlops" # should match the ProjectPrefix parameter in pipeline.yml and studio.yml additional ARN privileges +BUCKET="$PREFIX-$1" +STACK_NAME="$PREFIX-$2" +REGION=$3 +ROLE=$4 + +aws cloudformation delete-stack --stack-name "$STACK_NAME" + +rm -rf build +mkdir build +rsync -av --progress . 
build \ + --exclude build \ + --exclude "*.git*" \ + --exclude .pre-commit-config.yaml +cd build +# binding resources of pipeline.yml and studio.yml together with common PREFIX +sed -i -e "s/PROJECT_PREFIX/$PREFIX/g" assets/*.yml pipeline.yml +sed -i -e "s/S3_BUCKET_NAME/$BUCKET/g" pipeline.yml +find . -type f -iname "*.yml-e" -delete + +bash scripts/lint.sh || exit 1 + +rm -rf scripts # used in development only + +zip -r project.zip . + +aws s3 mb "s3://$BUCKET" --region "$REGION" +aws s3 cp --region "$REGION" project.zip "s3://$BUCKET/" +aws s3 cp --region "$REGION" pipeline.yml "s3://$BUCKET/" +aws s3 cp --region "$REGION" studio.yml "s3://$BUCKET/" + +aws cloudformation wait stack-delete-complete --stack-name "$STACK_NAME" + +aws cloudformation create-stack --stack-name "$STACK_NAME" \ + --template-url "https://$BUCKET.s3.$REGION.amazonaws.com/studio.yml" \ + --capabilities CAPABILITY_IAM \ + --parameters ParameterKey=ProjectPrefix,ParameterValue="$PREFIX" \ + ParameterKey=SageMakerStudioRoleName,ParameterValue="$ROLE" \ + ParameterKey=PipelineBucket,ParameterValue="$BUCKET" diff --git a/scripts/lint.sh b/scripts/lint.sh new file mode 100755 index 0000000..93bba76 --- /dev/null +++ b/scripts/lint.sh @@ -0,0 +1,11 @@ +#!/bin/bash + +BASE_DIR="$(pwd)" + +python -m pip install black==21.5b1 black-nb==0.5.0 -q +black . 
+black-nb --exclude "/(outputs|\.ipynb_checkpoints)/" --include "$BASE_DIR"/notebook/mlops.ipynb # workflow.ipynb fails + +for nb in "$BASE_DIR"/notebook/*.ipynb; do + python "$BASE_DIR"/scripts/set_kernelspec.py --notebook "$nb" --display-name "conda_python3" --kernel "conda_python3" +done diff --git a/scripts/set_kernelspec.py b/scripts/set_kernelspec.py new file mode 100644 index 0000000..00a11ee --- /dev/null +++ b/scripts/set_kernelspec.py @@ -0,0 +1,24 @@ +#!/usr/bin/env python3 + +import argparse +import json + + +def set_kernel_spec(notebook_filepath, display_name, kernel_name): + with open(notebook_filepath, "r") as openfile: + notebook = json.load(openfile) + kernel_spec = {"display_name": display_name, "language": "python", "name": kernel_name} + if "metadata" not in notebook: + notebook["metadata"] = {} + notebook["metadata"]["kernelspec"] = kernel_spec + with open(notebook_filepath, "w") as openfile: + json.dump(notebook, openfile) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--notebook") + parser.add_argument("--display-name") + parser.add_argument("--kernel") + args = parser.parse_args() + set_kernel_spec(args.notebook, args.display_name, args.kernel) diff --git a/studio.yml b/studio.yml new file mode 100644 index 0000000..e2777dd --- /dev/null +++ b/studio.yml @@ -0,0 +1,208 @@ +AWSTemplateFormatVersion: "2010-09-09" +Parameters: + ProjectPrefix: + Type: String + Description: | + Unique prefix to make resource privileges scoped-limited. 
+ Changing the default must be done with care + Default: mlops + SageMakerStudioRoleName: + Type: String + Description: Name of the role used by SageMaker Studio + MinLength: 1 + AllowedPattern: ^[a-zA-Z](-*[a-zA-Z0-9])* + PortfolioName: + Type: String + Default: SageMaker Safe Deployment Pipeline Templates + TemplateName: + Type: String + Default: SageMaker Safe Deployment Pipeline + PipelineBucket: + Type: String + Description: S3 bucket to store the pipeline cloudformation template + +Resources: + SageMakerStudioOrganizationProjectsPortfolio: + Type: AWS::ServiceCatalog::Portfolio + Properties: + Description: "Creates a new portfolio with a template for SageMakerProjects" + DisplayName: !Ref PortfolioName + ProviderName: "SageMaker Safe Deployment Pipeline Templates" + + SageMakerSafeDeploymentProduct: + Type: AWS::ServiceCatalog::CloudFormationProduct + Properties: + Description: Deploys the template for creating a new SageMaker setup + Name: !Ref TemplateName + Owner: MLOps + ProvisioningArtifactParameters: + - Description: "SageMaker Safe Deployment Pipeline Project Templates" + DisableTemplateValidation: false + Info: + LoadTemplateFromURL: !Sub "https://${PipelineBucket}.s3.${AWS::Region}.amazonaws.com/pipeline.yml" + Tags: + - Key: sagemaker:studio-visibility + Value: "true" + + LaunchRoleConstraint: + Type: AWS::ServiceCatalog::LaunchRoleConstraint + Properties: + Description: "This is a launch constraint restriction for the SageMaker Launch Role" + PortfolioId: !Ref SageMakerStudioOrganizationProjectsPortfolio + ProductId: !Ref SageMakerSafeDeploymentProduct + RoleArn: !Sub "arn:aws:iam::${AWS::AccountId}:role/service-role/AmazonSageMakerServiceCatalogProductsLaunchRole" + DependsOn: + - ProductAssociation + - AdditionalPrivilegesForStudio + + PrincipalAssociation: + Type: AWS::ServiceCatalog::PortfolioPrincipalAssociation + Properties: + PortfolioId: !Ref SageMakerStudioOrganizationProjectsPortfolio + PrincipalType: IAM + PrincipalARN: !Sub 
arn:aws:iam::${AWS::AccountId}:role/${SageMakerStudioRoleName} + + ProductAssociation: + Type: AWS::ServiceCatalog::PortfolioProductAssociation + Properties: + PortfolioId: !Ref SageMakerStudioOrganizationProjectsPortfolio + ProductId: !Ref SageMakerSafeDeploymentProduct + + AdditionalPrivilegesForStudio: + Type: AWS::IAM::Policy + Properties: + PolicyName: SageMakerSafeDeploymentAdditional + PolicyDocument: + Version: '2012-10-17' + Statement: + - Effect: Allow + Action: + - "cloudtrail:*" + Resource: + - "*" + - Effect: Allow + Action: + - "cloudformation:DescribeStacks" + - "cloudformation:DescribeStackEvents" + - "cloudformation:DeleteStack" + Resource: + - !Sub "arn:aws:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${ProjectPrefix}-*" + - Effect: Allow + Action: + - "codecommit:CreateRepository" + - "codecommit:GetRepository" + - "codecommit:GetBranch" + - "codecommit:DeleteRepository" + - "codecommit:TagResource" + - "codecommit:CreateCommit" + Resource: + - !Sub "arn:aws:codecommit:${AWS::Region}:${AWS::AccountId}:amazon-sagemaker-safe-deployment-pipeline*" + - Effect: Allow + Action: + - "codebuild:CreateProject" + - "codebuild:DeleteProject" + Resource: + - !Sub "arn:aws:codebuild:${AWS::Region}:${AWS::AccountId}:project/${ProjectPrefix}-*" + - Effect: Allow + Action: + - "codepipeline:CreatePipeline" + - "codepipeline:DeletePipeline" + - "codepipeline:GetPipeline" + - "codepipeline:GetPipelineState" + - "codepipeline:TagResource" + - "codepipeline:StartPipelineExecution" + Resource: + - !Sub "arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:${ProjectPrefix}-*" + - Effect: Allow + Action: + - "codepipeline:PutApprovalResult" + Resource: + - !Sub "arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:${ProjectPrefix}-*/DeployDev/ApproveDeploy" + - Effect: Allow + Action: + - "kms:*" + Resource: + - "*" + - Effect: Allow + Action: + - "iam:GetRole" + - "iam:PassRole" + - "iam:CreateRole" + - "iam:DeleteRole" + - "iam:GetRolePolicy" + - 
"iam:PutRolePolicy" + - "iam:AttachRolePolicy" + - "iam:DetachRolePolicy" + - "iam:DeleteRolePolicy" + - "iam:ListRoleTags" + - "iam:TagRole" + Resource: + - !Sub "arn:aws:iam::${AWS::AccountId}:role/${ProjectPrefix}-*" + - Effect: Allow + Action: + - "sns:GetTopicAttributes" + - "sns:SetTopicAttributes" + - "sns:CreateTopic" + - "sns:DeleteTopic" + - "sns:TagResource" + - "sns:UntagResource" + - "sns:Subscribe" + - "sns:Unsubscribe" + Resource: + !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${ProjectPrefix}-*" + - Effect: Allow + Action: + - "s3:*" + Resource: + - !Sub "arn:aws:s3:::${ProjectPrefix}-*" + - Effect: Allow + Action: + - "events:PutRule" + - "events:DescribeRule" + - "events:PutTargets" + Resource: + - !Sub "arn:aws:events:${AWS::Region}:${AWS::AccountId}:rule/${ProjectPrefix}-*" + - !Sub "arn:aws:events:${AWS::Region}:${AWS::AccountId}:rule/SC-${AWS::AccountId}*" + - Effect: Allow + Action: + - "servicecatalog:GetProvisionedProductOutputs" + Resource: + - !Sub "arn:aws:servicecatalog:${AWS::Region}:${AWS::AccountId}:*" + - Effect: Allow + Action: + - "states:DescribeStateMachine" + - "states:ListExecutions" + - "states:GetExecutionHistory" + - "states:CreateStateMachine" + - "states:StartExecution" + - "states:DescribeExecution" + - "states:DeleteStateMachine" + Resource: + - !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:*:${ProjectPrefix}-*" + - Effect: Allow + Action: + - "cloudwatch:PutDashboard" + - "cloudwatch:GetDashboard" + - "cloudwatch:DeleteAlarms" + - "cloudwatch:DeleteDashboards" + Resource: + - !Sub "arn:aws:cloudwatch::${AWS::AccountId}:dashboard/${ProjectPrefix}-*" + - Effect: Allow + Action: + - "cloudwatch:PutMetricData" + - "cloudwatch:GetMetricStatistics" + Resource: + "*" + Condition: + StringEquals: + cloudwatch:namespace: "aws/sagemaker/Endpoints/data-metrics" + - Effect: Allow + Action: + - "lambda:CreateFunction" + - "lambda:InvokeFunction" + - "lambda:PublishLayerVersion" + Resource: + - !Sub 
"arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:*:${ProjectPrefix}-*" + Roles: + - !Ref SageMakerStudioRoleName + - AmazonSageMakerServiceCatalogProductsLaunchRole
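
Review note: most statements in `AdditionalPrivilegesForStudio` are scoped to resources whose names begin with `${ProjectPrefix}-`, which is presumably why `build.sh` requires `PREFIX` to match the `ProjectPrefix` parameter. The sketch below illustrates that scoping. It assumes the `*` wildcard in a resource ARN can be approximated with Python's `fnmatch` module; this is an illustration only, not IAM's actual policy evaluation. The account ID, region, and resource names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical values for illustration only; none come from the diff.
PREFIX = "mlops"
REGION = "us-east-1"
ACCOUNT = "123456789012"

# Resource patterns as studio.yml would render them after !Sub substitution.
PATTERNS = {
    "iam": f"arn:aws:iam::{ACCOUNT}:role/{PREFIX}-*",
    "lambda": f"arn:aws:lambda:{REGION}:{ACCOUNT}:*:{PREFIX}-*",
    "s3": f"arn:aws:s3:::{PREFIX}-*",
}


def is_in_scope(service: str, arn: str) -> bool:
    """Return True if the ARN matches the prefix-scoped pattern for the service."""
    return fnmatchcase(arn, PATTERNS[service])


# A role created by pipeline.yml with the shared prefix falls inside the policy scope.
print(is_in_scope("iam", f"arn:aws:iam::{ACCOUNT}:role/mlops-nyctaxi-abc123-deploy-role"))  # True
# A role without the prefix does not.
print(is_in_scope("iam", f"arn:aws:iam::{ACCOUNT}:role/other-deploy-role"))  # False
```

If `PREFIX` in `build.sh` and `ProjectPrefix` in the templates diverged, resources created by `pipeline.yml` would fall outside these patterns and the Studio role would lack permission to manage them.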