Add docs for GCP Dataproc deployment #4393
base: main
Signed-off-by: Abhishek Bhatia <[email protected]>
Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Thanks a lot for this contribution @abhi8893! 🙏🏼 We'll give it a look shortly.
Thanks @astrojuanlu! I will also revisit it to improve the flow and address any comments you may have 🙂
Thanks for this extensive contribution @abhi8893 ⭐
I've done a very quick initial review mostly just looking at wording/spelling. I'll do a more thorough review and will try to test this as well.
@@ -0,0 +1,556 @@
# GCP Dataproc

`Dataproc serverless` lets you run Spark workloads without requiring you to provision and manage your own Dataproc cluster. An advantage over `Dataproc compute engine` is that `Dataproc serverless` supports custom containers allowing you package your dependencies at build time. Refer [here](https://cloud.google.com/dataproc-serverless/docs/overview#s8s-compared) for the official comparison between Dataproc serverless and compute engine.

Suggested change:

`Dataproc serverless` lets you run Spark workloads without requiring you to provision and manage your own Dataproc cluster. An advantage over `Dataproc compute engine` is that `Dataproc serverless` supports custom containers allowing you to package your dependencies at build time. Refer to [the Dataproc serverless documentation](https://cloud.google.com/dataproc-serverless/docs/overview#s8s-compared) for the official comparison between Dataproc serverless and compute engine.

The guide details kedro pipeline deployment steps for `Dataproc serverless`.

Suggested change:

This guide describes the steps needed to deploy a Kedro pipeline with `Dataproc Serverless`.

## Overview

The below diagram details the dataproc serverless dev and prod deployment workflows.

Suggested change:

The below sections and diagrams detail the dataproc serverless dev and prod deployment workflows.

### DEV deployment (and experimentation)

The following are the steps:

Suggested change:

The following steps are needed to do a DEV deployment on Dataproc Serverless:

### PROD deployment

The following are the steps:

Suggested change:

The following steps are needed to do a PROD deployment on Dataproc Serverless:

1. **Cut a release from develop**: A release branch is cut from the `develop` branch as `release/v0.2.0`
2. **Prepare release**: Minor fixes, final readiness and release notes are added to prepare the release.
3. **Merge into main**: After all checks passes and necessary approvals, the release branch is merged into main, and the commit is tagged with the version

Suggested change:

3. **Merge into main**: After all checks pass and necessary approvals are received, the release branch is merged into main, and the commit is tagged with the version

NOTE:

> 1. The service account creation method below assigns all permissions needed for this walkthrough in one service account.
> 2. Different teired environments may have their own GCP Projects.

Suggested change:

> 2. Different tiered environments may have their own GCP Projects.

#### Authorize with service account

Nit: we use British spelling in the Kedro docs 🤓

Suggested change:

#### Authorise with service account

`deployment/dataproc/serverless/build_push_docker.sh`

- This script builds and pushes the docker image for user dev workflows by tagging each custom build with the branch name (or a custom tag).
- The developer can experiment with any customizations to the docker image in their feature branches.

Suggested change:

- The developer can experiment with any customisations to the docker image in their feature branches.
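
For readers wondering how a branch name can safely become an image tag, here is a minimal Python sketch. It does not reproduce `build_push_docker.sh`. The function name `docker_tag_from_branch` and the `latest` fallback are hypothetical, and the only rule it relies on is Docker's documented tag format (letters, digits, `_`, `.`, `-`, no leading `.` or `-`, at most 128 characters).

```python
import re


def docker_tag_from_branch(branch, custom_tag=None):
    """Derive a Docker-compatible tag from a git branch name.

    Hypothetical helper for illustration; the script in the PR may
    sanitise names differently.
    """
    raw = custom_tag or branch
    # Replace anything outside the allowed tag alphabet (e.g. "/") with "-".
    tag = re.sub(r"[^A-Za-z0-9_.-]", "-", raw)
    # A tag must not start with a period or a hyphen.
    tag = tag.lstrip(".-")
    # Tags are limited to 128 characters; "latest" is an assumed fallback.
    return tag[:128] or "latest"
```

With this sketch, a branch such as `feature/add-dataproc` maps to the tag `feature-add-dataproc`, while an explicit custom tag takes precedence over the branch name.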
`Dataproc serverless` lets you run Spark workloads without requiring you to provision and manage your own Dataproc cluster. An advantage over `Dataproc compute engine` is that `Dataproc serverless` supports custom containers allowing you package your dependencies at build time. Refer [here](https://cloud.google.com/dataproc-serverless/docs/overview#s8s-compared) for the official comparison between Dataproc serverless and compute engine.

Suggested change:

`Dataproc Serverless` lets you run Spark workloads without requiring you to provision and manage your own Dataproc cluster. An advantage over `Dataproc compute engine` is that `Dataproc Serverless` supports custom containers allowing you package your dependencies at build time. Refer [here](https://cloud.google.com/dataproc-serverless/docs/overview#s8s-compared) for the official comparison between Dataproc Serverless and compute engine.

To respond to your point about the parsing of the Kedro CLI args:
Your implementation looks fine to me. In Kedro we use Click for the CLI, which can be a tricky library to work with at times. So depending on the format you receive the arguments in, it is indeed difficult to parse. Did you find any issues with this implementation? That is, is there anything a user can't do now?
Description
This PR adds docs for the deployment of Kedro projects to GCP Dataproc (Serverless).
What does this guide include? ✅
What does this guide NOT include? ❌
(WIP) Checklist:
Please note that the current docs are very much WIP, and aren't verbose enough for developers unfamiliar with GCP. I will refine them soon!
Review guidance needed
In addition to a review of the overall approach, please provide guidance on the following:
Q1: Kedro entrypoint script arguments
The recommended entrypoint script invokes kedro's built in cli `main` entrypoint as follows:
With kedro package wheel install:
Without kedro package wheel install:
However, the implementation in this PR relies on passing the arbitrary kedro args from one `py` script i.e. `deployment/dataproc/serverless/submit_batches.py` to the main entrypoint script `deployment/dataproc/entrypoint.py`.
As I was unable to implement parsing arbitrary args with dashes `--`, I implemented it as a single `--kedro-run-args` named arg.
Requesting for a review to enable a better implementation here.
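
To make the two options in Q1 concrete, here is a minimal sketch using only the Python standard library. It is not the code from `submit_batches.py` or `entrypoint.py`. The function names are hypothetical, and `--release` is an invented wrapper option used only to show the pass-through case. The first function mirrors the single `--kedro-run-args` approach described above. The second shows `argparse`'s `parse_known_args`, which returns unrecognised arguments untouched so they could be forwarded to Kedro.

```python
import argparse
import shlex


def parse_single_arg(argv):
    """Single-string approach: all Kedro args travel in one named arg."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--kedro-run-args", default="")
    namespace = parser.parse_args(argv)
    # shlex.split honours shell-style quoting inside the string.
    return shlex.split(namespace.kedro_run_args)


def parse_passthrough(argv):
    """Alternative: keep wrapper options, forward everything else as-is."""
    # allow_abbrev=False stops argparse from treating a Kedro option as an
    # abbreviation of a wrapper option.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--release", default=None)  # hypothetical wrapper option
    known, kedro_args = parser.parse_known_args(argv)
    return known, kedro_args
```

In the single-string form, passing the value as `--kedro-run-args=--pipeline=ds --env=prod` (with `=`) avoids `argparse` mistaking a value that begins with dashes for another option. In the pass-through form, the list of leftover arguments could be handed to the Kedro CLI entrypoint unchanged, assuming that entrypoint accepts a list of arguments.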
Q2: Incorporating spark configs while submitting jobs
Spark configs can be divided into 2 parts:
- Configs tied to the `SparkContext` => These can't be set / overridden in a `SparkSession` by a kedro hook (if implemented), e.g. `spark.driver.memory`, `spark.executor.instances`
- Configs that can be set after the `SparkContext` is created and overridden for any new `SparkSession`
Since the proposed implementation does NOT read in the `spark.yml` config for the project when submitting the job to dataproc, this requires duplicating some of the configs in the submission script (outside kedro).
How do we enable passing of these spark configs at job/batches submission time?
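
One possible direction for Q2, sketched under assumptions: if the submission script had the project's Spark settings available as a flat dictionary (for example after loading `spark.yml`), it could split out the submission-time keys and render them for the `--properties` flag of `gcloud dataproc batches submit`. The helper names and the contents of `SUBMIT_TIME_KEYS` are illustrative. Only the two keys named above come from the PR description, and this is not what the PR implements.

```python
# Keys assumed to be fixed once the SparkContext exists (illustrative set;
# only these two are named in the PR description).
SUBMIT_TIME_KEYS = {"spark.driver.memory", "spark.executor.instances"}


def split_spark_conf(conf):
    """Split a flat Spark config dict into submit-time and runtime parts."""
    submit = {k: str(v) for k, v in conf.items() if k in SUBMIT_TIME_KEYS}
    runtime = {k: str(v) for k, v in conf.items() if k not in SUBMIT_TIME_KEYS}
    return submit, runtime


def to_properties_flag(props):
    """Render properties as a single --properties=key=value,... argument."""
    pairs = ",".join(f"{key}={value}" for key, value in sorted(props.items()))
    return f"--properties={pairs}"
```

The submit-time part would go on the command line at batch submission, while the runtime part could still be applied inside the project by a hook. Values containing commas would need different handling, which this sketch does not attempt.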
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
Updated the documentation to reflect the code changes (NA) `RELEASE.md` file Added tests to cover my changes (NA) Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team (NA)