From 485ac5672c98ac7d67aa3f7f29c4835b1b0a9cf2 Mon Sep 17 00:00:00 2001
From: aman-ebay
Date: Wed, 9 Jan 2019 11:53:01 -0800
Subject: [PATCH 1/2] Create python-api-walkthrough.md

This Google Cloud Shell walkthrough is linked to Cloud Dataproc
documentation to be published at:
https://cloud.google.com/dataproc/docs/tutorials/python-library-example
---
 dataproc/python-api-walkthrough.md | 158 +++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 dataproc/python-api-walkthrough.md

diff --git a/dataproc/python-api-walkthrough.md b/dataproc/python-api-walkthrough.md
new file mode 100644
index 000000000000..66eb2aa2ff6a
--- /dev/null
+++ b/dataproc/python-api-walkthrough.md
@@ -0,0 +1,158 @@
# Use the Python Client Library to call Cloud Dataproc APIs

Estimated completion time:

## Overview

This [Cloud Shell](https://cloud.google.com/shell/docs/) walkthrough leads you
through the steps to use the
[Google APIs Client Library for Python](http://code.google.com/p/google-api-python-client/)
to programmatically interact with [Cloud Dataproc](https://cloud.google.com/dataproc/docs/).

As you follow this walkthrough, you run Python code that calls
[Cloud Dataproc REST API](https://cloud.google.com/dataproc/docs/reference/rest/)
methods to:

* create a Cloud Dataproc cluster
* submit a small PySpark word sort job to run on the cluster
* get job status
* tear down the cluster after job completion

## Using the walkthrough

The `submit_job_to_cluster.py` file used in this walkthrough is opened in the
Cloud Shell editor when you launch the walkthrough. You can view
the code as you follow the walkthrough steps.

**For more information**: See [Cloud Dataproc→Use the Python Client Library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example) for
an explanation of how the code works.

**To reload this walkthrough:** Run the following command from the
`~/python-docs-samples/dataproc` directory in Cloud Shell:

    cloudshell launch-tutorial python-api-walkthrough.md

**To copy and run commands**: Click the "Paste in Cloud Shell" button
  ()
  on the side of a code box, then press `Enter` to run the command.

## Prerequisites (1)

1. Create or select a Google Cloud Platform project to use for this tutorial.

1. Enable the Cloud Dataproc, Compute Engine, and Cloud Storage APIs in your
   project.

## Prerequisites (2)

1. This walkthrough uploads a PySpark file (`pyspark_sort.py`) to a
   [Cloud Storage bucket](https://cloud.google.com/storage/docs/key-terms#buckets) in
   your project.
    * You can use the [Cloud Storage browser page](https://console.cloud.google.com/storage/browser)
      in Google Cloud Platform Console to view existing buckets in your project.
    * To create a new bucket, run the following command. Your bucket name must be unique.
      ```bash
      gsutil mb -p {{project-id}} gs://your-bucket-name
      ```

1. Set environment variables.

    * Set the name of your bucket.
      ```bash
      BUCKET=your-bucket-name
      ```

## Prerequisites (3)

1. Set up a Python
   [virtual environment](https://virtualenv.readthedocs.org/en/latest/)
   in Cloud Shell.

    * Create the virtual environment.
      ```bash
      virtualenv ENV
      ```
    * Activate the virtual environment.
      ```bash
      source ENV/bin/activate
      ```

1. Install library dependencies in Cloud Shell.
    ```bash
    pip install -r requirements.txt
    ```
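Before moving on, it can help to see the shape of the calls that
`submit_job_to_cluster.py` makes. The sketch below is **not** part of the
walkthrough script: it assumes the dependencies installed above and
Application Default Credentials (available by default in Cloud Shell), and
`your-project-id` is a placeholder. It builds a Dataproc client with the
Google APIs Client Library and lists the clusters in a project.

```python
# Minimal sketch (not the walkthrough script): build a Cloud Dataproc client
# with the discovery-based Google APIs Client Library and list clusters.
# PROJECT_ID is a placeholder; 'global' is the region used by the sample.
import google.auth
from googleapiclient import discovery

PROJECT_ID = 'your-project-id'  # placeholder
REGION = 'global'

# Application Default Credentials are picked up automatically in Cloud Shell.
credentials, _ = google.auth.default()
dataproc = discovery.build('dataproc', 'v1', credentials=credentials)

# Call the clusters.list REST method to confirm the client works.
response = dataproc.projects().regions().clusters().list(
    projectId=PROJECT_ID, region=REGION).execute()
for cluster in response.get('clusters', []):
    print(cluster['clusterName'], cluster['status']['state'])
```

The walkthrough script uses the same client object for cluster creation, job
submission, and tear-down; the steps that follow run it end to end.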
## Create a cluster and submit a job

1. Set a name for your new cluster.
    ```bash
    CLUSTER=new-cluster-name
    ```

1. Set a [zone](https://cloud.google.com/compute/docs/regions-zones/#available)
   where your new cluster will be located. You can change the
   "us-central1-a" zone that is pre-set in the following command.
    ```bash
    ZONE=us-central1-a
    ```

1. Run `submit_job_to_cluster.py` with the `--create_new_cluster` flag
   to create a new cluster and submit the `pyspark_sort.py` job
   to the cluster.
    ```bash
    python submit_job_to_cluster.py \
    --project_id={{project-id}} \
    --cluster_name=$CLUSTER \
    --zone=$ZONE \
    --gcs_bucket=$BUCKET \
    --create_new_cluster
    ```

## Job Output

Job output in Cloud Shell shows cluster creation, job submission, job
completion, and then tear-down of the cluster.

```bash
...
Creating cluster...
Cluster created.
Uploading pyspark file to GCS
new-cluster-name - RUNNING
Submitted job ID ...
Waiting for job to finish...
Job finished.
Downloading output file
.....
['Hello,', 'dog', 'elephant', 'panther', 'world!']
...
Tearing down cluster
```
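The `Submitted job ID ...` and `Waiting for job to finish...` lines above come
from the script polling the job through the same REST API. As a rough sketch
of that step (project, cluster, and bucket names are placeholders; this is not
the walkthrough script itself), submitting the uploaded PySpark file and
waiting for a terminal state might look like this:

```python
# Sketch: submit pyspark_sort.py from Cloud Storage and poll the job status.
# PROJECT_ID, CLUSTER_NAME, and BUCKET are placeholders.
import time

import google.auth
from googleapiclient import discovery

PROJECT_ID = 'your-project-id'
REGION = 'global'
CLUSTER_NAME = 'new-cluster-name'
BUCKET = 'your-bucket-name'

credentials, _ = google.auth.default()
dataproc = discovery.build('dataproc', 'v1', credentials=credentials)

# jobs.submit takes a SubmitJobRequest body naming the cluster and the
# main Python file to run.
request = {
    'job': {
        'placement': {'clusterName': CLUSTER_NAME},
        'pysparkJob': {
            'mainPythonFileUri': 'gs://{}/pyspark_sort.py'.format(BUCKET),
        },
    },
}
result = dataproc.projects().regions().jobs().submit(
    projectId=PROJECT_ID, region=REGION, body=request).execute()
job_id = result['reference']['jobId']
print('Submitted job ID {}'.format(job_id))

# Poll jobs.get until the job reaches a terminal state.
while True:
    job = dataproc.projects().regions().jobs().get(
        projectId=PROJECT_ID, region=REGION, jobId=job_id).execute()
    state = job['status']['state']
    if state in ('DONE', 'ERROR', 'CANCELLED'):
        print('Job finished with state {}'.format(state))
        break
    time.sleep(5)
```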
## Congratulations on Completing the Walkthrough!

---

### Next Steps:

* **View job details from the Console.** View job details by selecting the
  PySpark job from the Cloud Dataproc
  [Jobs page](https://console.cloud.google.com/dataproc/jobs)
  in the Google Cloud Platform Console.

* **Delete resources used in the walkthrough.**
  The `submit_job_to_cluster.py` job deletes the cluster that it created for this
  walkthrough. You can run the following command to delete the
  Cloud Storage bucket used in this walkthrough (the bucket must be empty).
  ```bash
  gsutil rb gs://$BUCKET
  ```
  You can run the following command to delete the bucket **and all
  objects within it. Note: the deleted objects cannot be recovered.**
  ```bash
  gsutil rm -r gs://$BUCKET
  ```

* **For more information.** See the [Cloud Dataproc documentation](https://cloud.google.com/dataproc/docs/)
  for API reference and product feature information.

From 02e8b8132eaaf8ce69e84b8e6b7b661e3c8e3891 Mon Sep 17 00:00:00 2001
From: aman-ebay
Date: Thu, 10 Jan 2019 12:16:06 -0800
Subject: [PATCH 2/2] Update python-api-walkthrough.md

---
 dataproc/python-api-walkthrough.md | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/dataproc/python-api-walkthrough.md b/dataproc/python-api-walkthrough.md
index 66eb2aa2ff6a..0004e2419cd1 100644
--- a/dataproc/python-api-walkthrough.md
+++ b/dataproc/python-api-walkthrough.md
@@ -51,6 +51,9 @@ an explanation of how the code works.
    your project.
     * You can use the [Cloud Storage browser page](https://console.cloud.google.com/storage/browser)
       in Google Cloud Platform Console to view existing buckets in your project.
+
+     **OR**
+
     * To create a new bucket, run the following command. Your bucket name must be unique.
       ```bash
       gsutil mb -p {{project-id}} gs://your-bucket-name
@@ -100,6 +103,7 @@ an explanation of how the code works.
 1. Run `submit_job_to_cluster.py` with the `--create_new_cluster` flag
    to create a new cluster and submit the `pyspark_sort.py` job
    to the cluster.
+
    ```bash
    python submit_job_to_cluster.py \
    --project_id={{project-id}} \
@@ -114,7 +118,6 @@ an explanation of how the code works.
 Job output in Cloud Shell shows cluster creation, job submission,
  job completion, and then tear-down of the cluster.
 
- ```bash
 ...
 Creating cluster...
 Cluster created.
@@ -143,8 +146,11 @@ Job output in Cloud Shell shows cluster creation, job submission,
 
 * **Delete resources used in the walkthrough.**
   The `submit_job_to_cluster.py` job deletes the cluster that it created for this
-  walkthrough. You can run the following command to delete the
-  Cloud Storage bucket used in this walkthrough (the bucket must be empty).
+  walkthrough.
+
+  If you created a bucket to use for this walkthrough,
+  you can run the following command to delete the
+  Cloud Storage bucket (the bucket must be empty).
   ```bash
   gsutil rb gs://$BUCKET
   ```
@@ -156,3 +162,4 @@ Job output in Cloud Shell shows cluster creation, job submission,
 
 * **For more information.** See the [Cloud Dataproc documentation](https://cloud.google.com/dataproc/docs/)
   for API reference and product feature information.
+
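For reference, the cluster tear-down mentioned in the walkthrough corresponds
to a single REST call. This is a hedged sketch under the same assumptions as
the earlier snippets (placeholder names, Application Default Credentials), not
the walkthrough script:

```python
# Sketch: delete the walkthrough cluster through the Dataproc REST API.
# PROJECT_ID, REGION, and CLUSTER_NAME are placeholders.
import google.auth
from googleapiclient import discovery

PROJECT_ID = 'your-project-id'
REGION = 'global'
CLUSTER_NAME = 'new-cluster-name'

credentials, _ = google.auth.default()
dataproc = discovery.build('dataproc', 'v1', credentials=credentials)

# clusters.delete returns a long-running operation; the cluster is removed
# once that operation completes.
operation = dataproc.projects().regions().clusters().delete(
    projectId=PROJECT_ID, region=REGION, clusterName=CLUSTER_NAME).execute()
print('Tearing down cluster: {}'.format(operation.get('name', '')))
```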