-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ADDITIONAL_SUFFIX environment variable #18
Conversation
- include in env.sh.example - adjust xgboost training pipeline - adjust Makefile
/gcbrun |
/gcbrun |
/gcbrun |
cloudbuild/e2e-test.yaml
Outdated
@@ -45,6 +45,7 @@ steps: | |||
- PIPELINE_TEMPLATE=${_PIPELINE_TEMPLATE} | |||
- VERTEX_PIPELINE_ROOT=${_TEST_VERTEX_PIPELINE_ROOT} | |||
- PIPELINE_FILES_GCS_PATH=${_PIPELINE_PUBLISH_GCS_PATH}/${COMMIT_SHA} | |||
- ADDITIONAL_SUFFIX= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would suggest to create the suffix dynamically from cloud build variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will create new BQ tables and GCS file paths for assets for every PR that's made - I don't think that's the best solution as it will raise cost/mess and require a clean up process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there are two scenarios here that we are trying to address:
- Multiple developers working on pipelines in the same dev project - a user-provided suffix can be used to avoid clashing (as you have done here) OR a suffix can be autogenerated for each run (but then you need to clean up)
- Concurrent CI runs in the same project - suffix would need to be autogenerated, but then cleanup becomes an issue
My preference would be to not require the user to provide their own suffix, for example using the pipeline ID or a timestamp of some sort (or git commit hash for CI-triggered runs perhaps)
Clean-up of assets files in GCS is less of an issue as they are not very big (alternatively, you could set up object lifecycle management on the bucket)
Clean-up of bigquery tables could be managed by setting an expiration time on the table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also implement a teardown strategy for e2e tests which removes resources which were created as part of the test. This addresses (2) well while supporting auto-generated suffixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm onboard with that. For this story, I'll implement the autogenerated suffix approach, and create another story for implementing the clean up 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@browningjp-datatonic @felix-datatonic I've been working on the concurrent pipelines capability using an auto-generated suffix and running into a couple of issues
- Generating the suffix in env.sh doesn't work because you can't load it directly into the Makefile. You have to source
env.sh
in the Makefile which would break e2e tests - Generating the suffx and assigning it to a variable in the Makefile doesn't work because it gets re-generated in the make commands so you'd end up with different values going into
sync-assets
andcompile-pipeline
commands
The options I've come up with are
- Combine
sync-assets
andcompile-pipeline
into one command reduces modularity but means we can use one suffix for both - We could use arguments in the make commands and pass through the commit hash but it would introduce a lot of if statements depending on whether to use suffix or not for
sync-assets
,compile-pipeline
andrun
commands - Allow users to specify suffix themselves as is already the case in this PR
I think number 3 would be the cleanest and easiest 😬 but I may have missed something up to now, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As decided in the call, I'll implement the following
- Use the git username as suffix and store in env.sh as
RESOURCE_SUFFIX
env variable - For e2e tests in CI, use the commit hash for the
RESOURCE_SUFFIX
env variable - Clean up to be implemented in a later story/PR
- Add documentation around functionality in main README.md
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
Makefile
Outdated
@@ -73,7 +74,8 @@ test-all-components-coverage: ## Run tests with coverage | |||
sync-assets: ## Sync assets folder to GCS. | |||
@if [ -d "./pipelines/assets/" ] ; then \ | |||
echo "Syncing assets to GCS" && \ | |||
gsutil -m rsync -r -d ./pipelines/assets ${PIPELINE_FILES_GCS_PATH}/assets ; \ | |||
. ./env.sh && \ | |||
gsutil -m rsync -r -d ./pipelines/assets $$PIPELINE_FILES_GCS_PATH; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$PIPELINE_FILES_GCS_PATH
should be the parent folder of the assets directory, because (in CI/CD) the compiled pipelines are copied directly to $PIPELINE_FILES_GCS_PATH
and the contents of the assets folder are copied to the assets
subdirectory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the environment variable to include the /assets subdirectory to be able to construct it more easily with the RESOURCE_SUFFIX
. I've updated in the cloudbuild/e2e-test.yaml
as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break release.yaml
because training.json
and prediction.json
are copied to $PIPELINE_FILES_GCS_PATH
and assets
is a directory within
I suggest you keep assets
as a subdirectory, and instead have (in env.sh
):
export PIPELINE_FILES_GCS_PATH=gs://${VERTEX_PROJECT_ID}-pl-assets/${RESOURCE_SUFFIX}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't break the release.yaml
because the compiled pipeline jsons and assets are copied into ${_PIPELINE_PUBLISH_GCS_PATHS}
. It doesn't use the Makefile either. @browningjp-datatonic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I still think you should keep it consistent though - ${_PIPELINE_PUBLISH_GCS_PATHS} in release.yaml
is assumed not to contain /assets
in the path, while $PIPELINE_FILES_GCS_PATH
now is expected to contain /assets
in the path
I've also realised we might need to update terraform/modules/scheduled_pipelines/scheduled_jobs.auto.tfvars.example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the release.yaml
pipeline, the compiled pipelines (training.json
and prediction.json
) are published to _PIPELINE_PUBLISH_GCS_PATHS
, and the assets
folder is then a subdirectory within _PIPELINE_PUBLISH_GCS_PATHS
(With KFP V2 we will be publishing pipelines to Artifact Registry instead of GCS, so this won't really be a problem in the future).
I was thinking the following
In env.sh
:
export RESOURCE_SUFFIX=_yourname # replace with a suffix unique to you
export PIPELINE_FILES_GCS_PATH=gs://${VERTEX_PROJECT_ID}-pl-assets/${RESOURCE_SUFFIX}
In pipeline.py
:
@pipeline
def pipeline(
...
pipeline_files_gcs_path: str = os.environ.get("PIPELINE_FILES_GCS_PATH")
...
):
...
train_script_uri = f"{pipeline_files_gcs_path}/assets/train_xgb_model.py"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(CI/CD pipelines can then be left as-is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted PIPELINE_FILES_GCS_PATH to your suggestion.
I made an update to the sync-assets
make command to handle the case of empty RESOURCE_SUFFIX
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would put this logic into env.sh.example
instead
PIPELINE_FILES_GCS_PATH = gs://${VERTEX_PROJECT_ID}-pl-assets
# If $RESOURCE_SUFFIX is not empty, append it to the end of $PIPELINE_FILES_GCS_PATH
if [[ ! -z "$RESOURCE_SUFFIX" ]]; then
PIPELINE_FILES_GCS_PATH = $PIPELINE_FILES_GCS_PATH/${RESOURCE_SUFFIX}
fi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't work unfortunately because the bash syntax is different to Makefile syntax (at least when using include env.sh
). The fix would be to include Makefile syntax in env.sh.example i.e.:
# If $RESOURCE_SUFFIX is not empty, append it to the end of $PIPELINE_FILES_GCS_PATH
ifdef RESOURCE_SUFFIX
export PIPELINE_FILES_GCS_PATH := $(PIPELINE_FILES_GCS_PATH)/$(RESOURCE_SUFFIX)
endif
I can update sync-assets
to update the $PIPELINE_FILES_GCS_PATH
on the fly as in your suggestion, would only add one line of code:
sync-assets: ## Sync assets folder to GCS.
@if [ -d "./pipelines/assets/" ]; then \
echo "Syncing assets to GCS"; \
PIPELINE_FILES_GCS_PATH=$${PIPELINE_FILES_GCS_PATH}$(if $(strip $(RESOURCE_SUFFIX)),/$(RESOURCE_SUFFIX)); \
gsutil -m rsync -r -d ./pipelines/assets "$${PIPELINE_FILES_GCS_PATH}/assets"; \
else \
echo "No assets folder found"; \
wdyt?
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
terraform/modules/scheduled_pipelines/scheduled_jobs.auto.tfvars.example
Outdated
Show resolved
Hide resolved
/gcbrun |
/gcbrun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
Adding an additional environment variable
RESOURCE_SUFFIX
so that GCS asset paths and BQ tables that get overwritten can be suffixed with it, permitting multiple developers on the same project to run pipelines without deleting each others' changes.I have also tried using pipeline arguments in the compiler for the pipeline. In my opinion this is a worse solution as:
If approach is ok, I will extend to other pipelines (as draft only includes XGBoost training) and update documentation where needed.05/06/2023: Updated for all pipelines and updated comment in env.sh.example to be comprehensiveHow has this been tested?
2 successful pipelines that (with/without ADDITIONAL SUFFIX environment variable set )
Checklist
Pipeline run links: