The titanic-survival-prediction.py
sample runs a Spark ML pipeline to train a classfication model using random forest on AWS Elastic Map Reduce(EMR).
Titanic: Machine Learning from Disaster copy of Jeffwan's code for gitops workshop Also pipeline sample from kubeflow/pipeline
# Set default region
export AWS_REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region') # For ec2 client or cloud9
export AWS_DEFAULT_REGION=$AWS_REGION
# I encountered issues with EMR in eu-west-2 but it works fine in eu-west-1.
# Use this variable to set the region to run the EMR cluster in if you encounter issues in your local/default region
export EMR_REGION=$AWS_REGION
export BUCKET_NAME=mlops-kubeflow-pipeline-data2
aws iam create-user --user-name mlops-user
aws iam create-access-key --user-name mlops-user > $HOME/mlops-user.json
export THE_ACCESS_KEY_ID=$(jq '."AccessKey"["AccessKeyId"]' $HOME/mlops-user.json)
echo $THE_ACCESS_KEY_ID
export THE_SECRET_ACCESS_KEY=$(jq '."AccessKey"["SecretAccessKey"]' $HOME/mlops-user.json)
export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
aws iam create-policy --policy-name mlops-s3-access \
--policy-document https://raw.githubusercontent.com/paulcarlton-ww/ml-workshop/master/resources/s3-policy.json > s3-policy.json
aws iam create-policy --policy-name mlops-emr-access \
--policy-document https://raw.githubusercontent.com/paulcarlton-ww/ml-workshop/master/resources/emr-policy.json > emr-policy.json
aws iam create-policy --policy-name mlops-iam-access \
--policy-document https://raw.githubusercontent.com/paulcarlton-ww/ml-workshop/master/resources/iam-policy.json > iam-policy.json
aws iam attach-user-policy --user-name mlops-user --policy-arn $(jq '."Policy"["Arn"]' s3-policy.json)
aws iam attach-user-policy --user-name mlops-user --policy-arn $(jq '."Policy"["Arn"]' emr-policy.json)
curl https://raw.githubusercontent.com/paulcarlton-ww/ml-workshop/master/resources/kubeflow-aws-secret.yaml | \
sed s/YOUR_BASE64_SECRET_ACCESS/$(echo -n "$THE_SECRET_ACCESS_KEY" | base64)/ | \
sed s/YOUR_BASE64_ACCESS_KEY/$(echo -n "$THE_ACCESS_KEY_ID" | base64)/ | kubectl apply -f -;echo
aws s3api create-bucket --bucket $BUCKET_NAME --region $AWS_DEFAULT_REGION --create-bucket-configuration LocationConstraint=$AWS_DEFAULT_REGION
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java
sdk install sbt
git clone [email protected]:paulcarlton-ww/mlops-titanic
cd mlops-titanic/
sbt clean package
aws s3api put-object --bucket $BUCKET_NAME --key emr/titanic/titanic-survivors-prediction_2.11-1.0.jar --body target/scala-2.11/titanic-survivors-prediction_2.11-1.0.jar
Note: EMR has all spark libariries and this project doesn't reply on third-party library. We don't need to build fat jars.
Check Kaggle Titanic: Machine Learning from Disaster for more details about this problem. 70% training dataset is used to train model and rest 30% for validation.
A copy of train.csv is included in this repository, it needs to be uploaded to S3.
aws s3api put-object --bucket $BUCKET_NAME --key emr/titanic/train.csv --body train.csv
See building a pipeline to install the Kubeflow Pipelines SDK. The following command will install the tools required
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source /home/ec2-user/.bashrc
conda create --name mlpipeline python=3.7
pip3 install --user kfp --upgrade
rm Miniconda3-latest-Linux-x86_64.sh
cd aws-titanic
sed s/mlops-kubeflow-pipeline-data/$BUCKET_NAME/g titanic-survival-prediction.py | sed s/aws-region/$EMR_REGION/ > build/titanic-survival-prediction.py
dsl-compile --py build/titanic-survival-prediction.py --output build/titanic-survival-prediction.tar.gz
aws s3api put-object --bucket $BUCKET_NAME --key emr/titanic/titanic-survival-prediction.tar.gz --body build/titanic-survival-prediction.tar.gz
Using the portforward access to the Kubeflow UI the upload from URL option does not work so it is necessary to download the file to your workstation. A shell script is provided to generate the commands required.
get-tar-cmds.sh
Now use the kubeflow UI to upload the pipeline file and run an experiment.
Open the Kubeflow pipelines UI. Create a new pipeline, and then upload the compiled specification (.tar.gz
file) as a new pipeline template.
Once the pipeline done, you can go to the S3 path specified in output
to check your prediction results. There're three columes, PassengerId
, prediction
, Survived
(Ground True value)
...
4,1,1
5,0,0
6,0,0
7,0,0
...
Find the result file name:
aws s3api list-objects --bucket $BUCKET_NAME --prefix emr/titanic/output
Download it and analyse:
export RESULT_FILE=<result file>
aws s3api get-object --bucket $BUCKET_NAME --key emr/titanic/output/$RESULT_FILE \$HOME/$RESULT_FILE.csv
grep ",1,1\|,0,0" $HOME/$RESULT_FILE | wc -l # To count correct results
wc -l $RESULT_FILE # To count items in file
Create Cluster: source code
Submit Spark Job: source code
Delete Cluster: source code