Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ADDITIONAL_SUFFIX environment variable #18

Merged
23 commits merged into from
Jun 23, 2023
Merged

Conversation

jan-zajac-dt
Copy link

@jan-zajac-dt jan-zajac-dt commented May 31, 2023

  • include in env.sh.example
  • adjust xgboost training pipeline
  • adjust Makefile

Description

Adding an additional environment variable RESOURCE_SUFFIX so that GCS asset paths and BQ tables that get overwritten can be suffixed with it, permitting multiple developers on the same project to run pipelines without deleting each others' changes.

I have also tried using pipeline arguments in the compiler for the pipeline. In my opinion this is a worse solution as:

  1. It introduces a lot of additional args to be ran in the Makefile for compile-pipeline, sync-assets and run commands
  2. We already use environment variables for many of the pipeline arguments.

If approach is ok, I will extend to other pipelines (as draft only includes XGBoost training) and update documentation where needed. 05/06/2023: Updated for all pipelines and updated comment in env.sh.example to be comprehensive

How has this been tested?

2 successful pipelines that (with/without ADDITIONAL SUFFIX environment variable set )

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have successfully run the E2E tests, and have included the links to the pipeline runs below
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated any relevant documentation to reflect my changes
  • I have assigned a reviewer and messaged them

Pipeline run links:

@jan-zajac-dt jan-zajac-dt marked this pull request as ready for review June 5, 2023 09:54
@jan-zajac-dt
Copy link
Author

/gcbrun

@jan-zajac-dt
Copy link
Author

/gcbrun

@jan-zajac-dt
Copy link
Author

/gcbrun

@jan-zajac-dt jan-zajac-dt requested review from a user and felix-datatonic June 5, 2023 11:23
@@ -45,6 +45,7 @@ steps:
- PIPELINE_TEMPLATE=${_PIPELINE_TEMPLATE}
- VERTEX_PIPELINE_ROOT=${_TEST_VERTEX_PIPELINE_ROOT}
- PIPELINE_FILES_GCS_PATH=${_PIPELINE_PUBLISH_GCS_PATH}/${COMMIT_SHA}
- ADDITIONAL_SUFFIX=
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would suggest to create the suffix dynamically from cloud build variables

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will create new BQ tables and GCS file paths for assets for every PR that's made - I don't think that's the best solution as it will raise cost/mess and require a clean up process

Copy link

@ghost ghost Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there are two scenarios here that we are trying to address:

  1. Multiple developers working on pipelines in the same dev project - a user-provided suffix can be used to avoid clashing (as you have done here) OR a suffix can be autogenerated for each run (but then you need to clean up)
  2. Concurrent CI runs in the same project - suffix would need to be autogenerated, but then cleanup becomes an issue

My preference would be to not require the user to provide their own suffix, for example using the pipeline ID or a timestamp of some sort (or git commit hash for CI-triggered runs perhaps)

Clean-up of assets files in GCS is less of an issue as they are not very big (alternatively, you could set up object lifecycle management on the bucket)
Clean-up of bigquery tables could be managed by setting an expiration time on the table

Copy link
Collaborator

@felix-datatonic felix-datatonic Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also implement a teardown strategy for e2e tests which removes resources which were created as part of the test. This addresses (2) well while supporting auto-generated suffixes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm onboard with that. For this story, I'll implement the autogenerated suffix approach, and create another story for implementing the clean up 👍

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@browningjp-datatonic @felix-datatonic I've been working on the concurrent pipelines capability using an auto-generated suffix and running into a couple of issues

  • Generating the suffix in env.sh doesn't work because you can't load it directly into the Makefile. You have to source env.sh in the Makefile which would break e2e tests
  • Generating the suffx and assigning it to a variable in the Makefile doesn't work because it gets re-generated in the make commands so you'd end up with different values going into sync-assets and compile-pipeline commands

The options I've come up with are

  1. Combine sync-assets and compile-pipeline into one command reduces modularity but means we can use one suffix for both
  2. We could use arguments in the make commands and pass through the commit hash but it would introduce a lot of if statements depending on whether to use suffix or not for sync-assets, compile-pipeline and run commands
  3. Allow users to specify suffix themselves as is already the case in this PR

I think number 3 would be the cleanest and easiest 😬 but I may have missed something up to now, what do you think?

Copy link
Author

@jan-zajac-dt jan-zajac-dt Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As decided in the call, I'll implement the following

  • Use the git username as suffix and store in env.sh as RESOURCE_SUFFIX env variable
  • For e2e tests in CI, use the commit hash for the RESOURCE_SUFFIX env variable
  • Clean up to be implemented in a later story/PR
  • Add documentation around functionality in main README.md

env.sh.example Outdated Show resolved Hide resolved
@jan-zajac-dt
Copy link
Author

/gcbrun

@jan-zajac-dt
Copy link
Author

/gcbrun

@jan-zajac-dt
Copy link
Author

/gcbrun

@jan-zajac-dt
Copy link
Author

/gcbrun

@jan-zajac-dt
Copy link
Author

/gcbrun

Makefile Outdated Show resolved Hide resolved
Makefile Outdated
@@ -73,7 +74,8 @@ test-all-components-coverage: ## Run tests with coverage
sync-assets: ## Sync assets folder to GCS.
@if [ -d "./pipelines/assets/" ] ; then \
echo "Syncing assets to GCS" && \
gsutil -m rsync -r -d ./pipelines/assets ${PIPELINE_FILES_GCS_PATH}/assets ; \
. ./env.sh && \
gsutil -m rsync -r -d ./pipelines/assets $$PIPELINE_FILES_GCS_PATH; \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$PIPELINE_FILES_GCS_PATH should be the parent folder of the assets directory, because (in CI/CD) the compiled pipelines are copied directly to $PIPELINE_FILES_GCS_PATH and the contents of the assets folder are copied to the assets subdirectory

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the environment variable to include the /assets subdirectory to be able to construct it more easily with the RESOURCE_SUFFIX. I've updated in the cloudbuild/e2e-test.yaml as well

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break release.yaml because training.json and prediction.json are copied to $PIPELINE_FILES_GCS_PATH and assets is a directory within

I suggest you keep assets as a subdirectory, and instead have (in env.sh):

export PIPELINE_FILES_GCS_PATH=gs://${VERTEX_PROJECT_ID}-pl-assets/${RESOURCE_SUFFIX}

Copy link
Author

@jan-zajac-dt jan-zajac-dt Jun 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't break the release.yaml because the compiled pipeline jsons and assets are copied into ${_PIPELINE_PUBLISH_GCS_PATHS}. It doesn't use the Makefile either. @browningjp-datatonic

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I still think you should keep it consistent though - ${_PIPELINE_PUBLISH_GCS_PATHS} in release.yaml is assumed not to contain /assets in the path, while $PIPELINE_FILES_GCS_PATH now is expected to contain /assets in the path

I've also realised we might need to update terraform/modules/scheduled_pipelines/scheduled_jobs.auto.tfvars.example

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the release.yaml pipeline, the compiled pipelines (training.json and prediction.json) are published to _PIPELINE_PUBLISH_GCS_PATHS, and the assets folder is then a subdirectory within _PIPELINE_PUBLISH_GCS_PATHS

(With KFP V2 we will be publishing pipelines to Artifact Registry instead of GCS, so this won't really be a problem in the future).

I was thinking the following

In env.sh:

export RESOURCE_SUFFIX=_yourname # replace with a suffix unique to you
export PIPELINE_FILES_GCS_PATH=gs://${VERTEX_PROJECT_ID}-pl-assets/${RESOURCE_SUFFIX}

In pipeline.py:

@pipeline
def pipeline(
  ...
  pipeline_files_gcs_path: str = os.environ.get("PIPELINE_FILES_GCS_PATH")
  ...
):
  ...
  train_script_uri = f"{pipeline_files_gcs_path}/assets/train_xgb_model.py"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(CI/CD pipelines can then be left as-is)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted PIPELINE_FILES_GCS_PATH to your suggestion.

I made an update to the sync-assets make command to handle the case of empty RESOURCE_SUFFIX wdyt?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would put this logic into env.sh.example instead

PIPELINE_FILES_GCS_PATH = gs://${VERTEX_PROJECT_ID}-pl-assets
# If $RESOURCE_SUFFIX is not empty, append it to the end of $PIPELINE_FILES_GCS_PATH
if [[ ! -z "$RESOURCE_SUFFIX" ]]; then
  PIPELINE_FILES_GCS_PATH = $PIPELINE_FILES_GCS_PATH/${RESOURCE_SUFFIX}
fi

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't work unfortunately because the bash syntax is different to Makefile syntax (at least when using include env.sh). The fix would be to include Makefile syntax in env.sh.example i.e.:

# If $RESOURCE_SUFFIX is not empty, append it to the end of $PIPELINE_FILES_GCS_PATH
ifdef RESOURCE_SUFFIX
  export PIPELINE_FILES_GCS_PATH := $(PIPELINE_FILES_GCS_PATH)/$(RESOURCE_SUFFIX)
endif

I can update sync-assets to update the $PIPELINE_FILES_GCS_PATH on the fly as in your suggestion, would only add one line of code:

sync-assets: ## Sync assets folder to GCS.
	@if [ -d "./pipelines/assets/" ]; then \
		echo "Syncing assets to GCS"; \
		PIPELINE_FILES_GCS_PATH=$${PIPELINE_FILES_GCS_PATH}$(if $(strip $(RESOURCE_SUFFIX)),/$(RESOURCE_SUFFIX)); \
		gsutil -m rsync -r -d ./pipelines/assets "$${PIPELINE_FILES_GCS_PATH}/assets"; \
	else \
		echo "No assets folder found"; \

wdyt?

@jan-zajac-dt
Copy link
Author

/gcbrun

env.sh.example Outdated Show resolved Hide resolved
cloudbuild/e2e-test.yaml Outdated Show resolved Hide resolved
@jan-zajac-dt
Copy link
Author

/gcbrun

@jan-zajac-dt
Copy link
Author

/gcbrun

@jan-zajac-dt
Copy link
Author

/gcbrun

@jan-zajac-dt
Copy link
Author

/gcbrun

@ghost
Copy link

ghost commented Jun 20, 2023

/gcbrun

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants