ScheduledWorkflow CRD: Investigate need for retries beyond the ones provided by Argo #5

Closed
vicaire opened this issue May 26, 2018 · 5 comments

Comments

@vicaire
Contributor

vicaire commented May 26, 2018

Currently, the ScheduledWorkflow CRD reliably starts Argo workflows, but does not monitor whether they complete successfully. It relies on retries embedded in the Argo workflow itself.

The ScheduledWorkflow CRD could provide its own retry functionality.

@vicaire changed the title from "ScheduledWorkflow CRD: Make backfill reliable." to "ScheduledWorkflow CRD: Retry workflows until success." on May 26, 2018
@vicaire
Contributor Author

vicaire commented May 30, 2018

Notes:

The current implementation works in the following cases:

  • It reliably starts workflows.
  • Individual workflows can use the retry functionality of Argo to retry the whole workflow in case of failures.
  • If the Argo controller fails temporarily, it should continue execution of the Argo workflow where it left off.

This issue tracks the case where retries should be performed by the ScheduledWorkflow instead of the Workflow.
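
To make the idea concrete, here is a minimal sketch of what a retry performed at the ScheduledWorkflow level could look like. It is not the controller's actual code: `run_until_success`, `RetryResult`, and the `launch` callable are hypothetical names introduced only for illustration, and the sketch assumes that `launch` blocks until the workflow finishes and returns its final phase as a string such as "Succeeded" or "Failed".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetryResult:
    """Outcome of a retry loop (hypothetical type, for illustration only)."""
    succeeded: bool
    attempts: int
    phases: List[str]


def run_until_success(launch: Callable[[], str], max_attempts: int = 3) -> RetryResult:
    """Re-launch a workflow until it reports "Succeeded" or attempts run out.

    `launch` is a hypothetical callable that starts one workflow, waits for it
    to finish, and returns its final phase.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    phases: List[str] = []
    for attempt in range(1, max_attempts + 1):
        phase = launch()
        phases.append(phase)
        if phase == "Succeeded":
            return RetryResult(True, attempt, phases)
    # Every attempt ended in a non-successful phase.
    return RetryResult(False, max_attempts, phases)


# Usage with a fake launcher that fails twice before succeeding.
outcomes = iter(["Failed", "Error", "Succeeded"])
result = run_until_success(lambda: next(outcomes), max_attempts=5)
print(result)
```

A real controller would watch workflow status asynchronously rather than block, so this only shows the retry policy, not the mechanics.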

@vicaire
Contributor Author

vicaire commented Jun 13, 2018

ynqa@, I am not sure I understand your question.

So far the Scheduled Workflow CRD reliably launches Argo workflows.

I am not sure which use cases would need retries in the Scheduled Workflow controller:

  • Argo already takes care of retries.
  • If the Argo controller is temporarily down, it picks up the work where it left off once it is restarted.

As part of this issue, we first need to identify whether there are any use cases where retrying in the Scheduled Workflow CRD is needed. If there are none, let's close this issue.
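
For reference, the Argo-side retries mentioned above are configured on a workflow template through a `retryStrategy` with a `limit`. The sketch below builds such a manifest as a plain Python dictionary. The template name, image, and `generateName` prefix are made-up placeholders, and `with_retry` is a hypothetical helper, not part of any Argo or Kubeflow library.

```python
import json


def with_retry(template: dict, limit: int) -> dict:
    """Return a copy of an Argo template with a retryStrategy limit set.

    Hypothetical helper for illustration; the input template is not modified.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    updated = dict(template)
    updated["retryStrategy"] = {"limit": limit}
    return updated


# Placeholder training step; the name and image are assumptions.
train_step = {
    "name": "train-model",
    "container": {
        "image": "example.com/trainer:latest",
        "command": ["python", "train.py"],
    },
}

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "scheduled-train-"},
    "spec": {
        "entrypoint": "train-model",
        # Argo retries this template up to `limit` times when it fails.
        "templates": [with_retry(train_step, limit=2)],
    },
}

print(json.dumps(workflow, indent=2))
```

Because the retry is declared inside the workflow, the ScheduledWorkflow only has to start the workflow; this is the division of labor the bullets above describe.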

@ynqa
Contributor

ynqa commented Jun 13, 2018

Ah, sorry, I misread. I now understand the objective of this issue.
By the way, what is the main use case? I assume it is training an ML model at regular intervals.

@vicaire
Contributor Author

vicaire commented Jun 14, 2018

Yes, this is the main use case

@vicaire changed the title from "ScheduledWorkflow CRD: Retry workflows until success." to "ScheduledWorkflow CRD: Investigate need for retries beyond the ones provided by Argo" on Jun 14, 2018
@vicaire
Contributor Author

vicaire commented Mar 26, 2019

Resolving since Argo provides the needed retries.

@vicaire closed this as completed on Mar 26, 2019
RedbackThomson pushed a commit to RedbackThomson/pipelines that referenced this issue May 15, 2020
# This is the 1st commit message:

Add initial scripts

# This is the commit message kubeflow#2:

Add working pytest script

# This is the commit message kubeflow#3:

Add initial scripts

# This is the commit message kubeflow#4:

Add environment variable files

# This is the commit message kubeflow#5:

Remove old cluster script
k8s-ci-robot pushed a commit that referenced this issue May 20, 2020
* # This is a combination of 5 commits.
# This is the 1st commit message:

Add initial scripts

# This is the commit message #2:

Add working pytest script

# This is the commit message #3:

Add initial scripts

# This is the commit message #4:

Add environment variable files

# This is the commit message #5:

Remove old cluster script

* Add initial scripts

Add working pytest script

Add initial scripts

Add environment variable files

Remove old cluster script

Update pipeline credentials to OIDC

Add initial scripts

Add working pytest script

Add initial scripts

Add working pytest script

* Remove debugging mark

* Update example EKS cluster name

* Remove quiet from Docker build

* Manually pass env

* Update env list vars as string

* Update use array directly

* Update variable array to export

* Update to using read for splitting

* Move to helper script

* Update export from CodeBuild

* Add wait for minio

* Update kubectl wait timeout

* Update minor changes for PR

* Update integration test buildspec to quiet build

* Add region to delete EKS

* Add wait for pods

* Updated README

* Add fixed interval wait

* Fix CodeBuild step order

* Add file lock for experiment ID

* Fix missing pytest parameter

* Update run create only once

* Add filelock to conda env

* Update experiment name ensuring creation each time

* Add try/catch with create experiment

* Remove caching from KFP deployment

* Remove disable KFP caching

* Move .gitignore changes to inside component

* Add blank line to default .gitignore
kumare3 referenced this issue in EngHabu/pipelines May 27, 2020
* [UI] Show step pod yaml and events in RunDetails page (#3304)

* [UI Server] Pod info handler

* [UI] Pod info tab in run details page

* Change pod info preview to use yaml editor

* Fix namespace

* Adds error handling for PodInfo

* Adjust to warning message

* [UI] Pod events in RunDetails page

* Adjust error message

* Refactor k8s helper to get rid of in cluster limit

* Tests for pod info handler

* Tests for pod event list handler

* Move pod yaml viewer related components to separate file.

* Unit tests for PodYaml component

* Fix react unit tests

* Fix error message

* Address CR comments

* Add permission to ui role

* [Backend]Cache - Cache logic with db interaction (#3266)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* Add initial server logic

* Add const

* Change folder name

* Change execution key name

* Fix unit test

* Add Dockerfile and OWNERS file

This commit adds Dockerfile for building source code and OWNERS file for
easy review. This commit also renames some functions.

* fix go.sum

This PR fixes changes on go.sum

* Add local deployment scripts

This commit adds local deployment scripts which can deploy cache service
to an existing cluster with KFP installed.

* refactor src code

* Add standalone deployment scripts and yamls

This commit adds execution cache deployment scripts and yaml files in
KFP standalone deployment. Including a deployer which will generate the
certification and mutatingwebhookconfiguration and execution cache
deployment.

* Minor fix

* Add execution cache image build in test folder

* fix test cloudbuild

* Fix cloudbuild

* Add execution cache deployer image to test folder

* Add copyright

* Fix deployer build

* Add license for execution cache and cloudbuild for execution cache
images

This commit adds licenses for execution cache source code. Also adds
cloud build step for building cache image and cache deployer image.
Change the manifest name based on changed image.

* Refactor license intermediate data

* Fix execution cache image manifest

* Typo fix for cache and cache deployer images

* Add arguments in ca generation scripts and change deployer base image to google/cloud

* minor fix

* fix arg

* Mirror source code with MPL in execution_cache image

* Minor fix

* minor refactor on error handling

* Refactor cache source code, Docker image and manifest

* Fix variable names

* Add images in .release.cloudbuild.yaml

* Change execution_cache to generic name

* revise readme

* Move deployer job out of upgrade script

* fix tests

* fix tests

* Separate cache service and cache deployer job

* mysql set up

* wip

* WIP

* WIP

* work mysql connection

* initial cache logic

* watcher

* WIP pod watching with mysql

* worked crud

* Add sql unit test

* fix manifest

* Add copyright

* Add watcher check and update cache key generation logic

* test replace container images

* work cache service

* Add configmap for cache service

* refactor

* fix manifest

* Add unit tests

* Remove delete table

* Fix sql dialect

* Add cached step log

* Add metadata execution id

* minor fix

* revert go.mod and go.sum

* revert go.sum and go.mod

* revert go.sum and go.mod

* revert go.mod and go.sum

* SDK - Added support for maxCacheStaleness (#3318)

* SDK - Added support for maxCacheStaleness

* Added the vendor prefix to the annotation

* Update Watson ML example to take output param path (#3316)

* update watson components with output path args to support tekton

* fix store bug and stop batch logs

* update pipeline with explicit helper function

* add missing commit

* SDK - Moved python op pipeline compilation test to bridge tests (#3323)

* SDK - Moved the @python_component decorator test to dsl tests (#3324)

* SDK - Moved the @python_component decorator test to dsl tests

* Deprecate @python_component

* Release be497983cda7a1d17f3883c67e39a969cf0868a9 (#3327)

* Updated component images to version be497983cda7a1d17f3883c67e39a969cf0868a9

* Updated components to version 2df775a28045bda15372d6dd4644f71dcfe41bfe

* update setup.py

* Style - Moved imports to the start of the file (#3325)

* SDK - Support kubernetes client v11 (#3319)

Fixes https://github.com/kubeflow/pipelines/issues/3275

* Bump version to 0.3.0 (#3329)

* Bump version to 0.3.0

* Fix formatting

* More formatting fixes

* More formatting fixes

* update requirements.txt

* update version

* Reduce steps for release cloud build yaml (#3331)

* Reduce steps for release cloud build yaml

* Update .release.cloudbuild.yaml

* Disables cache and cache-deployer temporarily because they block upgrade tests (#3333)

* Add namespace to experiment SDK calls (#3272)

* Post-submit test for Hosted/MKP (mpdev verify) (#3193)

* try generate MKP binary for each submit

* try run

* fix format

* fix format

* fix format

* it works, gcloud builds submit --config test/cloudbuild/mkp_verify.yaml --project ml-pipeline-test

* test commit trigger

* backup codes

* test

* fix

* pass manual test before submit

* 0.3.0

Co-authored-by: Renmin Gu <[email protected]>

* Update CHANGELOG for 0.3.0 (#3349)

* kfp UI node server support preview and handles gzip, tarball, and raw artifacts in a consistent manner. (#2992)

* Fix README formatting. (#3348)

* Fix README formatting.

* more fixes

* [UI Server] Blocks non public KFP report APIs (#3334)

* [UI Server] Blocks reportMetrics KFP api

* Also reject report workflow endpoint

* Also block report swf endpoint

* Add hostNetwork for marketplace proxy-agent manifest (#3330)

* SDK - Tests - Improved tests for serializing lists containing objects (#3326)

Added test_fail_on_handling_list_arguments_containing_python_objects
Added test_handling_list_arguments_containing_serializable_python_objects
Moved test_handling_list_arguments_containing_pipelineparam to component_bridge_tests

* [UI] Stops experiment list from leaking previous error message (#3350)

* [UI] Stops experiment list from leaking previous error message

* Move the fix to Page component so it's more generic

* Update ExperimentList.test.tsx

* [UI] Add namespace filter for All and Archived Runs page (#3351)

* [UI] Stops experiment list from leaking previous error message

* Move the fix to Page component so it's more generic

* [UI] Add namespace to AllRunsList api request

* [UI] Add namespace to archived run page

* Fix snapshot

* Fix tensorboard image parsing (#3356)

I introduced a bug when parsing the image for Tensorboard in
https://github.com/kubeflow/pipelines/pull/3235. This fixes it.

* Integration test fix (#3357)

* try generate MKP binary for each submit

* try run

* fix format

* fix format

* fix format

* it works, gcloud builds submit --config test/cloudbuild/mkp_verify.yaml --project ml-pipeline-test

* test commit trigger

* backup codes

* test

* fix

* pass manual test before submit

* 0.3.0

* quick fix for test path

Co-authored-by: Renmin Gu <[email protected]>

* [Manifest] Cache - Fix upgrade manifest (#3338)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* Change cache deployer job to stateful set

* Delete cache deployer job

* Delete cache deployer job after it completes

* minor fix

* fix indention

* Change cache deployer job to statefulset

* Remove extra cluster role for cache deployer

* remove cache in base kustomize file for upgrade test

* minor fix

* Add authorization check on ListExperiments (#3341)

* apiserver: Handle BucketExists() error (#3132)

* [UI] Tensorboard support for multi user (#3355)

* [UI Server] Add namespace argument for tensorboard endpoints

* Allow local node server to talk to minio in cluster

* Use tensorboard namespace in UI

* Add unit tests for tensorboard UI server

* Fix tests

* Fix tensorboard proxy url

* Fix tensorboard proxy failure

* Fix tests

* Remove unnecessary encodeURIComponent

* Add old comment back

* [Test] Add argo retry in sample/integration tests to reduce flakiness. (#3365)

* add retry

* test

* revert test only change

* add retry to e2e tests

* try to parameterize retry limit

* Revert "try to parameterize retry limit"

This reverts commit 46451e3a

* update the retry limit to 2

* update e2e retry

* Manifests: Rename metadata gRPC server's resources to metadata-grpc-* (#3108)

* Manifests: Rename metadata gRPC server's resources to metadata-grpc-*

The metadata service deployed is a gRPC server.

Proper KF installation deploys both an HTTP server, naming the required
resources as 'metadata-deployment' and 'metadata-service', as well as a
gRPC server, naming the corresponding resources
'metadata-grpc-deployment' and 'metadata-grpc-service'.

KFP standalone installation manifests deploy solely the gRPC server, but
use naming identical to the KF's HTTP server one.
Applying them on top of an existing KF cluster breaks Metadata service.

In this PR we change the naming making it not diverge from a proper KF
installation. We also make MetadataWriter aware of that change.

Closes #2889.

Signed-off-by: Ilias Katsakioris <[email protected]>

* Fix ConfigMaps' label

* metadata-configmap
* metadata-mysql-configmap

* README: Link to KF installation & reference KFP version

* [Sample] CI Sample: Kaggle (#3021)

* kaggle sample

* code path

* fix typo

* visualize table component

* visualize html

* train model step

* submit result

* real image

* fix typo

* push before use

* sed to replace image in component.yaml

* general instructions

* typos; more robust; better code style

* notice about gcp sa and workload identity choice

* [Backend][Multi-user] Adjust/implement run api for multiuser support (#3337)

* Adjust/implement run api for multiuser support

* Fix error message

* use consistent run name in test

* add unit test

* ListRuns must specify filter either by namespace or by experiment

* fix comments

* SDK - Added pinned dependency snapshot (#3303)

* SDK - Added pinned dependency snapshot

* Downgraded zipp

The zipp package has dropped support for python3.5. https://zipp.readthedocs.io/en/latest/history.html#v2-0-0
https://github.com/jaraco/zipp/issues/28

* Fixing sample building in the backend Dockerfile

Installing SDK using pip.
Using SDK's requirements.txt.

* Enabled kubernetes v11

* Reverted the backend/Dockerfile for now

* Fixed the version of kfp-server-api

* pass token outside of SDK for server-to-server case (#3363)

* pass token outside of SDK for server-to-server case

* add more docs

* fix merge issue

* fix merge issue

Co-authored-by: Renmin Gu <[email protected]>

* Fix lstrip + regex bug in the KFP client (#3396)

* [Backend] Cache - Add cache_enabled label for cache filtering (#3352)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* Add cache enabled annotation to pod annotation for cache filtering

* fix go.sum

* Add cache disable annotation value for future use

* Rename annotation key to cache qualified

* revert cache_qualified to cache_enabled

* Fix code comment

* Change cache_enabled annotation to label

* Add value check

* Read cache_enabled flag from config

* Add comments on set template labels

* Testing - Upgraded GKE master version to fix tests (#3404)

* [Backend]Cache - KFP pod filter logic looking for cache_enabled = true label selector (#3368)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* fix go.sum

* Change kfp annotation for pod filtering

* update filter logic

* Remove unused const

* [Manifest]Cache - mkp deployment (#3343)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* Add cache manifests for mkp deployment

* revert go.sum

* Add helm on delete policy for cache deployer job

* Change cache deployer job to statefulset

* remove unnecessary cluster role

* separate clusterrole and role

* add role and rolebinding to mkp

* change secret role to clusterrole

* Add cloudsql support to cache

* Fix presubmit failure by avoiding license downloading when building image (#3406)

* Commit licenses of visualization server dependencies into repo to avoid flakiness from download during image building

* Fix script

* Add licenses

* Remove avro

* Remove two packages

* remove two licenses

* update image (#3395)

* quick fix envoy (#3413)

Co-authored-by: Renmin Gu <[email protected]>

* [Manifest]Fix - Cache mkp deployment (#3414)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* Add cache manifests for mkp deployment

* revert go.sum

* Add helm on delete policy for cache deployer job

* Change cache deployer job to statefulset

* remove unnecessary cluster role

* separate clusterrole and role

* add role and rolebinding to mkp

* change secret role to clusterrole

* Add cloudsql support to cache

* fix comma

* [UI] No longer pass namespace to createRun api (#3403)

* revert kfp-cache from Hosted/MKP (#3416)

Co-authored-by: Renmin Gu <[email protected]>

* enable native Keras + TFMA (#3409)

Co-authored-by: Renmin Gu <[email protected]>

* [Manifest] Cache - Enable cache and cache deployer in base kustomization file (#3376)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* Change cache deployer job to stateful set

* Delete cache deployer job

* Delete cache deployer job after it completes

* minor fix

* fix indention

* Change cache deployer job to statefulset

* Remove extra cluster role for cache deployer

* remove cache in base kustomize file for upgrade test

* minor fix

* Enable cache and cache-deployer in base kustomization file

* fix

* fix

* test

* test

* test

* Refactor cluster scope resources

* refactor

* Add namespace for sa

* Fix

* Add crds folder to cluster kustomization yaml

* namespace change

* fix

* fix

* fix

* update test

* Rename cluster to cluster-scoped-resource

* test adding namespace in kustomization file

* revert namespace for clusterrolebinding

* fix

* Add db_name in cache_deployment manifest

* rename

* change secret cluster role to role

* [Backend][Multi-user] support multi-user mode for job APIs (#3384)

* Backend multi-user support for job

* Fix UT

* Clean up unused code.

* cleanup, merge duplicate code

* Skip host name preprocess for the IAP case (#3427)

* [Backend]Cache - Fix flag parse (#3429)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* fix go.sum

* Add flag.Parse() to read in flags

* Add new instructions to ensure compatibility for managed ai platform … (#3400)

* Add new instructions to ensure compatibility for managed ai platform pipeline

* change description to AI Platform Pipelines

* add instruction and clarification for AI Platform Pipeline in the first setup notebook

Co-authored-by: luoshixin <[email protected]>

* [Fix]Cache - Revert objectSelector in mutatingwebhookconfiguration (#3433)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* fix go.sum

* Change kfp annotation for pod filtering

* update filter logic

* Remove unused const

* revert objectSelector in mutatingwebhookconfig

* remove objectSelector

* remove recursive pipeline in e2e test to prevent infinite loop with cache

* updated version (#3421)

* enable CloudSQL+GCSObjStore without default credential (#3378)

* enable CloudSQL+GCSObjStore without default credential

* refresh document

* fix schema

* minio project ID is required

* fix several

* self-throttle GitHub requests to keep the build stable

* can work now

* upsize and lowercase for bucket name

Co-authored-by: Renmin Gu <[email protected]>

* [SDK][Multi-user] refine sdk for multi-user support (#3417)

* Allow writing/reading user namespace to/from local context file

* update docstring

* Move LOCAL_KFP_CONTEXT into Client class

* Fix docstring

Signed-off-by: Chen Sun <[email protected]>

* Make context_setting an instance variable and load from file during init.

* [Backend]Cache - Max cache staleness support (#3411)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* fix go.sum

* Initial max_cache_staleness

* Add max_cache_staleness=-1 support

* Unit tests

* fix test key

* Revise getCacheEntry logic

* minor fix

* [SDK/CLI] Add version param to run_pipeline (#3339)

* [SDK/CLI] Add version param to run_pipeline

* Set PIPELINE_VERSION relationship to CREATOR

Also adds a note about pipeline_id taking precedence over version_id

* SDK  - Components - Fixed bug in loading input-less graph components (#3446)

* [Manifest] Cache - MKP deployment (#3430)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* Add cache manifests for mkp deployment

* revert go.sum

* Add helm on delete policy for cache deployer job

* Change cache deployer job to statefulset

* remove unnecessary cluster role

* separate clusterrole and role

* add role and rolebinding to mkp

* change secret role to clusterrole

* Add cloudsql support to cache

* fix comma

* Change cache secret clusterrole to role

* Adjust sequences of resources

* Update values and schema

* remove extra tab

* Change statefulset to job

* Add pod delete permission to cache deployer role

* Test changing cache deployer job to deployment

* remove extra permission

* remove statefulset check

* Change cache-deployer to strategy recreate (#3456)

* AWS sagemaker : Added license files and updated Dockerfile to use AmazonLinux (#3397)

* Added new LICENSE file

* added 2 more license files

* copy license files into the docker image

* pinned pip packages and rearranged the dockerfile

* [Backend] Keep workflow service account when not default or empty (#3435)

* [Backend] Keep workflow service account when not default or empty

* Fix unit tests

* Rename const to be consistent in style

* Refactor the legacy way of using pipeline id to create run in KFP backend (#3437)

* For legacy interface, we switch to the new presentation under the hood

* when creating a run, if the user specifies a pipeline, we substitute it with the pipeline's default version

* Add a case where a version and a pipeline are both specified

* comment; get ready pipeline

* comments

* fix upgrade integration test

* comments of todo; expected run/job now has resource references

* fix upgrade test expected value according to the new response

* fix a typo

* a quick hack for upgrade test

* surface err from conversion

* AWS Sagemaker : Updated documents  (#3440)

* Initial readme for Train component

* example input

* add train pipeline

* added simple_train_pipeline

* Updated readme to include kmeans-hpo-pipeline.py

* Updated train component readme

* fix typo

* Update details about how to get sample data for Train component

* update comment and give a default path for output

* change s3 bucket to match other sample pipelines

* Release eb69a6b8ae2d82cd8574ed11f04af4607756061c (#3466)

* Updated component images to version eb69a6b8ae2d82cd8574ed11f04af4607756061c

* Updated components to version 0e794e8a0eff6f81ddc857946ee8311c7c431ec2

* update version number

* Make endpoint_url None (#3374)

* update version (#3467)

* presubmit for MKP/Hosted (#3438)

* presubmit for MKP

* activate service account

Co-authored-by: Renmin <[email protected]>

* version bump fix (#3472)

* Release 0.4.0: Update change log (#3468)

* Updated component images to version eb69a6b8ae2d82cd8574ed11f04af4607756061c

* Updated components to version 0e794e8a0eff6f81ddc857946ee8311c7c431ec2

* update version number

* update change log

* [Test] fix upgrade test (#3469)

* update deploy-pipeline-lite.sh

* fix

* fix?

* revert

* [UI - multiuser] Fix pod log namespace source (#3477)

* Fix pod log namespace source

* Fix unit tests

* Update documentation for AWS components (#3410)

* deploy_createModel_readme

* readme for batch and minor updates to deploy and create_model

* updates based on review comments 1

* correct SageMaker typo

* [Fix]Fix release (#3476)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* fix go.sum

* Add cache server and cache deployer in release cloud build file

* backend/metadata_writer: Pin python dependencies (#3408)

* [Frontend] Fix npm reported vulnerabilities (#3480)

* Fix server vulnerability

* Fix vulnerability in frontend

* Fix frontend vulnerabilities

* pin webpack version

* [UI] Show execution details in Run Details Page ML Metadata tab of steps (#3457)

* ML metadata tab in run details page

* Show execution details UI in run details step tab

* Fix tests

* Revert unnecessary changes

* SDK - Components - Restored the yaml formatting style (#3488)

Fixing compatibility with PyYaml 5.3

* [Test]Fix e2e test (#3471)

* Initial execution cache

This commit adds initial execution cache service. Including http service
and execution key generation.

* fix master

* fix go.sum

* Fix e2e test

* Add max_cache_staleness for flipA

* add comments

* [UI] Redirect to experiment list page when namespace changes in RunDetails or Compare page (#3483)

* Redirect to experiment list page when namespace changes

* Fix namespace initialize case

* Update KubeflowClient.tsx

* Components - Add model URL to AutoML - Create model/dataset for tables  (#3486)

* Re-generated the components

* Components - Add model URL to AutoML - Create model for tables

Fixes https://github.com/kubeflow/pipelines/issues/3246

* Added dataset URL to the AutoML - Create dataset for tables component

* Qwiklab caip updates (#3423)

* updates to AI Platform sample for Qwiklabs

* notebook updates

* Qwiklab changes

* Update backend BUILD files (#3455)

* Update BUILD files

* add workspace file

* reenable visualization server in multi-user mode (#3475)

* [Deployment] Move crds to cluster-scoped kustomize folders (#3498)

* [Deployment] Move crds to cluster-scoped kustomize folders

* Fix naming

* Rename folder

* Add STRUCTURE.md, fix bug

* fix

* one project share one default bucket (#3478)

* pass projectID from env/configmap without user input (#3458)

* Metadata Writer - Log pod names (#3479)

Fixes https://github.com/kubeflow/pipelines/issues/3462

* [Testing] Reduce image build flakiness by share and retry cloudbuild jobs (#3492)

* Let presubmit tests share and retry cloudbuild

* Fix ongoing_build_ids

* Add retry for workload identity binding

* Fix build id

* fix

* Parallelize image building for api server and others

* Fix

* fix

* fix

* Fix again

* Allow retry twice instead

* Update deploy-pipeline-lite.sh

* Update batch_build.yaml

* Refine log and retry tests

* Update log and retry

* Update and retry

* Update build-images.sh

* [UI] Groups executions by run if it exists (#3485)

* Fix concurrent IAM policy changes flakiness (#3504)

* Enable NFS dynamic PVC (#3314)

* GPU with Kubeflow Pipeline Standalone (#3484)

* GPU with Kubeflow Pipeline Standalone

* done

* don't check in compiled pipeline

* gpu tpu preemptible

* done

* scope and quota comment

* Update metadata-envoy-deployment.yaml (#3502)

* AWS sagemaker: fixed a bug in ground_truth and updated all components to use images from new docker hub repo (#3474)

* Don't leave active_learning_model_arn.txt empty

* updated readme for ground_truth_pipeline_demo

* update docker repo

* Small changes to readme of ground truth sample pipeline

* [SDK] Make service account configurable for build_image_from_working_dir (#3419)

* Add kfp-container-builder sa

* Allow service account to be configurable

* Fix tests

* Fix test

* Use documentation for service account to introduce compatibility with different types of installation

* updated doc

* clean up

* Update container_builder_test.py

* Update _build_image_api.py

* Update kustomization.yaml

* Add executable permission for presubmit tests mkp.sh

* add user agent header to boto3 client for aws components (#3487)

* add user agent header to boto client

* add component version according to license file

* fetch version from license file at runtime

* Add archive experiment feature in backend (#3359)

* add new field in db schema and api schema

* auto generated types for experiment storage state

* add archive and unarchive methods to backend for experiments.

* auto generated archive/unarchive methods for experiments

* add archive and unarchive to client

* set proper storage state when creating experiment

* retrieve storage state when we get/list experiment(s)

* change expectation in test to have storage state

* add storage state in resource manager test

* revise experiment server test

* revise api converter test

* integration test of experiment archive

* archive/unarchive experiment affect the storage state of runs in it

* test all the runs in archive/unarchive experiment

* test all runs are archived/unarchived with their experiment in experiment server

* integration test

* integration test: value type mismatch in assertion

* unused import; default value for storage state

* autogen code for frontend

* reorder the fields in api experiment schema

* switch the position of the two enum to verify a hypothesis

* Put a placeholder to prevent any valid item from taking the value 0

* Get rid of the placeholder since the cause of the issue related to value 0 is found and fixed.

* The returned api experiment now has storage state field

* create experiment return doesn't contain storage state

* Cleanup needs to clean runs and pipelines now

* a missing client

* use resource reference as filter instead of experiment uuid

* use same namespace in archive unit test

* Leave archive/unarchive experiment integration test to a separate PR

* also need to update jobs when experiments are archived

* Change of unarchiving logic. When experiment is unarchived, jobs/runs in
it stay archived

* add unit test for the job status in archived/unarchived experiment

* change archive state to 3 value enum; add experiment integration test

* make archive state 3 value enum to avoid 0 value mapped to available; add integration test

* run swagger autogen

* fix an expected value

* fix experiment server test

* add job check in experiment server test

* update job crds

* fix a typo

* remove accidentally included irrelevant changes

* add missing licenses for viz server (#3529)

* add missing licenses for viz server

* removing unused licenses.

* SDK - Made YAML dumping more awesome (#3520)

See the root cause explanation in https://github.com/kubeflow/pipelines/issues/3519

* Components - Fixed BigQuery - Query component (#3514)

Working around an Argo bug.
Revert this when we upgrade to Argo version which has the fix: https://github.com/argoproj/argo/pull/1653

* Revise run_pipeline comment (as expected) (#3506)

* revise run_pipeline comment (as expected)

* add the explanation of behavior if old approach is used

* add periods to the end of the sentences

* Upload pipeline/pipeline version with a description (#3511)

* add description to client interface

* autogen

* version doesn't have description field

* swagger autogen

* remove two accidentally committed local python package

* Fix confusing .gitignore config

* Testing - Fixed python requirements for sample tests (#3536)

Cleaned up requirements.in
Included kfp package requirements.
Fixed version conflicts.
Generated requirements.txt using the improved script:
https://github.com/kubeflow/pipelines/pull/3535

* Infra - Improved the update_requirements script (#3535)

This helps generating requirements for multiple requirements.in files.
It also fixes the locking issues that seem to be caused by mounting directories using Docker.

* use reflect.DeepEqual to compare run structs (#3546)

* Fix list_run bug (#3539)

* add missing license for component image (#3543)

* [Backend][Multi-user] Support creating visualization in user namespace (#3495)

* Add namespace field to CreateVisualizationRequest

* Support getting visualization service URL with namespace

* fix typo

* Add auth checking & allow empty namespace

* Update check-build-image-status.sh (#3533)

* Services - Metadata Writer - Added support for custom_properties in all helper methods (#3556)

Fixes https://github.com/kubeflow/pipelines/issues/3552

* Sort resource references before checking for run struct equality (#3547)

* OSS 1.0 Kustomize part-2 parameterize & fix CloudSQL (#3540)

submit without wait for fix for following as no dependency
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_pipelines/3540/kubeflow-pipeline-e2e-test/1252173721301422090

* use better sample name (#3558)

* Fix test which uses Kustomize edit image but can't work with valueRef (#3572)

pass upgrade / installation test. submitting. now.

The e2e test fails but not due to this PR. Submit this PR to unlock KIR side

* Clusterrolebinding is using a namespace which is not parameterized (#3573)

submit quickly to make sure others won't get confused

* Update ml-pipeline-ui-deployment.yaml (#3586)

* In single-user mode, experiment APIs shouldn't contain user namespace. (#3544)

* Update kustomization.yaml (#3582)

* Minor updates to packages (#3428)

* Qwiklab caip updates (#3512)

* updates to AI Platform sample for Qwiklabs

* notebook updates

* Qwiklab changes

* minor naming changes in sample notebook

* SDK - CLI - Fixed incompatibility with Python 3.5 (#3567)

* Enable cache-deployer as fixed the root cause in other PR (#3574)

* default to kubeflow

* done

* include cache as we found root cause is namespace

* fix

* change the default to kubeflow, more for manual upgrade

* Release ad9bd5648dd0453005225779f25d8cebebc7ca00 (#3560)

* Updated component images to version ad9bd5648dd0453005225779f25d8cebebc7ca00

* Updated components to version 01a23ae8672d3b18e88adf3036071496aca3552d

* update version to 0.5.0 (#3566)

* Metadata  Writer - Fixed pod name property setting (#3563)

* Remove compiled manifests (#3592)

* [UI] Multi user permission separation for artifact api (#3522)

* [UI Server] Proxy /namespaces/:namespace/artifacts/get requests to namespace specific artifact services

* [UI] Show artifacts by namespace

* Fix minio artifact link tests

* Fix DetailsTable tests

* Fix OutputArtifactLoader.test

* Change artifact proxy to use query param instead

* Add integration tests for artifact proxy

* Fix unit tests

* Rename service name

* Add comment

* add more comments

* Fix import

* Refactored how to spy on internal methods from tests

* [UI] Get execution details from metadata writer (#3553)

* [UI] fix - create run missing-param warning not reacting to param value changes (#3559)

* SDK - Removed the ArtifactLocation feature (#3517)

* SDK - Removed the ArtifactLocation feature

The feature was deprecated in v0.1.34 https://github.com/kubeflow/pipelines/pull/2326

* Removed the artifact_location sample

* fix #2802: Set ImagePullPolicy per pipeline.  (#3534)

* bump version

* default image pull policy

* Update sdk/python/kfp/dsl/_pipeline.py

* task setting should dominate

* Update sdk/python/kfp/dsl/_pipeline.py

* fixed merge mistake

* Add method to schedule a recurring run to python client (#2978)

* python_kfp_client: add method to create recurring run

* client: add list_recurring_runs, get_recurring_run

* kfp_client: swap _create_job_config <-> run_pipeline

* kfp_client: make proper trigger

* [UI Server] Enable strict type checking and fix errors (#3593)

* wip

* Fix typing

* Fix build error

* Add type checking to tests

* Fix server typing

* Clean up

* Fix server typing

* AWS Sagemaker : Use json.dumps() to better organize the input and remove data_locations (#3518)

* construct channel input using json.dumps()

* remove data_location parameters

* add changelog

* Update version in license file and small changes to readme

* [API] Include namespace in visualization.swagger.json (#3588)

* include namespace in CreateVisualization

* include namespace in post body

* put namespace in the path and in front of visualization

* post /apis/v1beta1/visualizations/{namespace}

* [UI] Migrate to namespaced visualization request (#3603)

* Regenerate and use new visualization api

* [UI] Support namespaced visualization api

* [UI] Show cached steps (#3602)

* [UI] Show cached steps

* Tests for parseNodePhase

* Complete unit tests

* Update StatusUtils.test.tsx

* Add index on run details on experiment UUID & conditions & finished time (#3610)

* SDK - Compiler - Include the SDK version information in the compiled workflows (#3583)

* SDK - Compiler - Include the SDK version information in the compiled workflows

* Fixed the unit tests

* Removed the sdk_version annotation.

* [UI] get run id from both property and custom property (#3501)

* [UI] Groups executions by run if it exists

* Get run_id from both custom property and property

* [UI] Fix artifact url for multi namespace (#3605)

* Regenerate and use new visualization api

* [UI] Support namespaced visualization api

* Fix minio artifact link

* Fix tests

* Fixing volume size default value from 1 to 30 (#3598)

* SDK - Components - Task objects now have the .output attribute when component has only one output (#3622)

* [Caching] Add a cached label for cached pods (#3623)

* Update mutation.go

* Update mutation.go

* Update mutation.go

* Update mutation.go

* fix issue of creating default bucket (#3626)

* [Viewer] Service needs port name for istio (#3619)

* [Backend] Authorization service (#3627)

* Authorization service proto

* implement auth service

* Add unit tests

* [SDK] Add pod labels for telemetry purpose. (#3578)

* add telemetry pod labels

* revert the id label

* update compiler tests

* update cli arg

* bypass tfx

* update docstring

* SDK - Enabled file inputs to be optional (#3620)

* SDK - Enabled file inputs to be optional

* Added unit tests

* Components - Added readme for TFX components (#3637)

* Components - Added readme for TFX components

* Resolved review feedback

* Add two scripts to load test our api endpoints with measurement of run durations and api call latencies (#3587)

* script to profile pipeline api endpoint

* two plots

* another run api test

* clear cell output

* add some comments

* pipeline uses create pipeline

* add client

* checkpoint

* polish two scripts

* remove accidentally committed files

* added a success vs non-success plot; only measure run durations for succeeded runs

* Fix source for mlpipeline-ui-metadata in WorkflowParser (#3379)

* WorkflowParser->loadNodeOutputPaths
source: s3.endpoint === 's3.amazonaws.com' ? StorageService.S3 : StorageService.MINIO

* Use isS3Endpoint (server/aws-helper.ts) to identify artifact source

* npm run format

* created src/lib/AwsHelper.ts (for frontend code), because frontend client and frontend server do not share code for now.

* [UI] authorize tensorboard actions (#3639)

* Authorization service proto

* implement auth service

* Add unit tests

* Generate auth api client

* Authorization checks for tensorboard apis

* UI Server authorization checks

* Clean up error parsing

* Revert changes

* Fix portable-fetch not found bug

* Fix unit test

* Include portable-fetch required by api client

* Fix portable-fetch module import error

* Fix portable-fetch again

* Add unit tests

* Address CR comments

* add unit test for header

* Update readme

* Components - Upgraded the TFX components to 0.21.4 (#3641)

* Updated and synced the generated code

There is only 1 line of component-specific code in each component function (apart from the function signature).

* Updated some components that had older version of the generated code. The generated code is now the same everywhere.
* `input_channels_with_splits` is now generated based on the input artifact types
* TFX broke back compat: Removed `.split` from the artifacts. The components seem to now assume there is a single artifact in the channel.
* TFX broke back compat: changed the way artifact instances are created
* Updated container image to 0.21.4. There might have been backwards incompatible input/output changes - need to check and update.

* Updated component signatures

* Updated the generated component.yaml files

* Updated the sample notebook

* Removed the optional output in Evaluator

Optional outputs are not supported yet. I'm not sure they're even correct according to MLMD.

* Updated the sample

* Sort job resource references before equality check (#3561)

* Cache - Stabilized the deployer script (#3634)

* Cache - Stabilized the deployer script

Fixed the bug in the waiting forever mechanism. Previously it would restart the script if there was any Kubernetes connectivity problem. Should fix https://github.com/kubeflow/pipelines/issues/3609
The script no longer reinstalls the resources if the MutatingWebhookConfiguration already exists.

* Fixed the webhook detection

* Resolved review feedback

* Fix container.set_image_pull_policy documentation string (#3653)

The API function seemed to be not correct, which is now fixed.

Signed-off-by: Sascha Grunert <[email protected]>

* kfp_client: fix wrong check (#3652)

* SDK - Components - Split load_component functions into loading the spec and creating task factory (#3614)

The PR is a refactoring.
Split all load_component* methods in _components and _component_store into _load_component_spec* and creating task factory from that spec.
This makes it easier to load the spec without having to create task factory functions.

* [Sample] Update base images used in pre-built samples (#3422)

* update base image

* refactor

* fix typo

* Revert "refactor"

This reverts commit 499f9604

* Revert "fix typo"

This reverts commit e2faeb46

* Revert "update base image"

This reverts commit 4f8d0977

* update docker file

* test

* insert line for changing default image

* Backend - Upgraded MLMD client to fix Metadata Writer (#3657)

There was a backwards-incompatible MLMD server-side change that caused the get_artifacts_by_uri API to start silently returning empty lists of artifacts.
Fixes https://github.com/kubeflow/pipelines/issues/3656

* [Backend] Add service account field to run and job api objects (#3649)

* Add service account field to run and job api objects

* Update description

* Fix field casing

* Add comment about the next field number

* Move namespace to cluster-scoped (#3662)

* move namespace to cluster-scoped-resource

* fix doc

* AWS Sagemaker : Add unit tests (#3642)

* Initial changes

* add one test for each component

* Add readme for unit tests

* add empty string test and dockerfile

* added dockerfile

* use python3 in dockerfile

* add coverage report to unit tests

* update readme for PR

* small changes to resolve git comments

* copy requirements.txt separately in dockerfile

* small changes

* pin pip package versions in unit_tests

* fix proxy URL issue (#3663)

* fix proxy URL issue

* fix another issue in same PR

* done (#3665)

* Add an operator to configure toleration of the GKE GPU node taints (#3671)

* Add an operator to configure toleration of the GKE GPU taints

https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#gpu_pool

* Make sure of the google coding style with yapf

* Change the function name to add_gpu_toleration

* raise error when LRO has an error (#3666)

* [AWS SageMaker] Add CodeBuild Steps (#3668)

* Add initial unit test buildspec

* Add docker log output

* Add force no pytest color

* Update docker build to be quiet

* Add pass all environment variables

* Update unit test container env file

* Update env to use different syntax

* Remove daemon mode

* Remove TTY from docker run

* Add dryrun and dockercfg setup

* Update dryrun into CodeBuild logic

* Add mkdir for Docker config

* Update app version temporarily

* Revert app version temporarily

* Update unit test log file

* Add tag minor and major versions

* Update version temporarily

* Add print for major and minor tags

* Revert version back down

* Add deploy version override

* Update path to testing directories

* Fix tab formatting

* Fix pytest log directory

* gpu example wording cleanup (#3682)

* Katib Launcher Experiment Name Conflict (#3508)

* added uuid to experiment name so that it does not conflict when trying katib experiment with same name

* added uuid to experiment name so that it does not conflict when trying katib experiment with same name

* experiment name limited to 64 characters

* updated user name and email for the repo

* fixed the typo

Co-authored-by: Sandhya <[email protected]>

* SDK - Components - Fixed bug in _strip_type_hints_using_lib2to3 (#3679)

* Support execution throttling for executing the pipelines (#3346) (#3439)

* Add parallelism limits to pipeline in kfp sdk

* fix lint error

* Add Nodeselector to pipelineconfig fix issue #2863 (#3616)

* updated version

* added pipeline nodeselector

* removed old legacy

* renaming

* update test

* Update sdk/python/kfp/compiler/compiler.py

* Travis - Disabled coveralls (#3659)

* Fixed the TFX conflicts in tests (#3690)

Fixes https://github.com/kubeflow/pipelines/issues/3689

* Update @kubeflow/frontend with fixes for #3625 and #3648 (#3695)

* [Backend] Use service account passed from api object (#3650)

* Add service account field to run and job api objects

* Update description

* Fix field casing

* Use service account from api object

* Fix bug and add unit test

* Save patched workflow spec into db instead

* Save service account to db model

* Fix unit tests

* Fix integration tests

* Fix upgrade test

* Update upgrade_test.go

* Experiment archiving related UI changes (#3615)

* archive button to experiment details page

* list of archived exp

* remove a prop

* restore; new run only in unarchived experiment

* in all experiments tab, only unarchived ones are listed

* refine dialog messages; disable selection of runs in archived experiment list

* unit test for experiment list component

* more unit tests

* remove unnecessary methods and props

* using tabs instead of radio buttons to switch between archived runs and archived experiments

* added tests

* fix rundetails test

* add tests

* sidenav snapshot

* sidenav snapshot update

* address comments

* format

* Cache - Fixed the cache deployer script (#3700)

The grep version seems to be different in the base image

* Cache - Add namespace to webhook config's name (#3702)

* Cache - Add namespace to webhook config's name

* Update deploy-cache-service.sh

* Updated the ComponentSpec schema (#3698)

* Components - Added support for Dataflow in TFX components (#3684)

* Components - Added support for Dataflow in TFX components

To use Dataflow, pass beam_pipeline_args to a component.
```
transformer_op(
    ...,
    beam_pipeline_args = [
        '--runner=DataflowRunner',
        '--experiments=shuffle_mode=auto',
        '--project=' + project_id,
        '--temp_location=' + gcs_bucket + '/tmp',
        '--region=' + gcp_region,
        '--disk_size_gb=50',
    ],
)
```

These components use URI-based I/O since TFX with Beam's DataflowRunner only supports GCS URIs for inputs and outputs. With URI-based I/O, the user must specify all output URIs themselves (e.g. `CsvExampleGen(..., output_examples_uri=...)`). Do not forget to do so. The `kfp.dsl.EXECUTION_ID_PLACEHOLDER` object can help construct execution-unique URIs, but if the component has multiple URIs, you will need to add some prefixes that are different for each output.
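
One way to keep the output URIs distinct is to derive them all from a shared execution-unique root plus a per-output path segment. The sketch below is illustrative only: `EXECUTION_ID` is a plain string standing in for `kfp.dsl.EXECUTION_ID_PLACEHOLDER`, and `make_output_uris`, the bucket, and the output names are hypothetical.

```python
# Stand-in for kfp.dsl.EXECUTION_ID_PLACEHOLDER so the sketch runs
# without kfp installed (assumption, not the real placeholder value).
EXECUTION_ID = '{{execution-id}}'


def make_output_uris(root, execution_id, output_names):
    """Return one distinct URI per output under an execution-unique root."""
    base = root.rstrip('/') + '/' + execution_id
    return {name: base + '/' + name for name in output_names}


uris = make_output_uris(
    'gs://my-bucket/tfx',         # hypothetical bucket
    EXECUTION_ID,
    ['examples', 'statistics'],   # hypothetical output names
)
for name, uri in uris.items():
    print(name, '->', uri)
```

Each value could then be passed to the matching `output_*_uri` argument of a component.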

There is a bug in TFX+Beam which prevents using DataflowRunner, but these components contain a workaround. The workaround can be removed when the fixed version of TFX is released: https://github.com/tensorflow/tfx/commit/ddb01c02426d59e8bd541e3fd3cbaaf68779b2df

* Added the TFX on KFP Dataflow sample

* Updated the README.md file

* Enabled the blessing output of the Evaluator

The Evaluator does not always write to that URI, but for components with URI-based I/O this does not matter.

* Fixed the indent in YAML

* Addressed the review feedback

* Updated the sample after the component changes

* Fixed the Dataflow casing in the sample name

* Using channel_utils.unwrap_channel_dict

* Updated the sample pipeline

* Shortened the .get expressions

* Updated the sample

* Make it possible to upload a version of the pipeline with CLI (#3672)

* Add upload_pipeline_version to kfp.Client

* Add the 'kfp pipeline upload_version' command to CLI

* Make sure to use upload_pipeline_version without a helper func

* Make sure of the google coding style with yapf

* Fix up the pipeline id option

* Make the version and id options required

* 3674: check if gcs bucket exist before creation (#3675)

* update version (#3694)

* [UI] textbox to select KSA when creating runs/jobs (#3651)

* Add service account field to run and job api objects

* Update description

* Fix field casing

* Use service account from api object

* Fix bug and add unit test

* [UI] Allow choosing Kubernetes service account

* fix unit tests

* fix format

* Also clone service account

* service account UI features

* Add unit test for cloning service account

* Fix frontend integration tests

* Integration tests for AWS SageMaker Components (#3654)

* integration tests for aws sagemaker components with comment

* address comment related to S3 dataset creation

* rev3: bug fix in conda env yaml and reuse sagemaker method to get image URI

* Add createModel test

	- reduce code duplication
	- add some utility methods

* 0.5.1 changelog (#3706)

* [UI] Improve TFX artifact visualization speed (#3712)

* Improve TFX artifact visualization speed

* Update OutputArtifactLoader.ts

* Fix loading

* Add nodeId back as an identifier

* [UI] Hide empty resource op manifest tab in run details page (#3713)

* Hide manifest tab when empty

* fix unit tests

* Fix tests

* [UI] Cleanup, remove types from urls in artifact/execution details page (#3715)

* Remove types from urls in artifact/execution details page

* Remove unused route params

* Fix snapshot tests

* Fixed small syntax error in a sample notebook (#3721)

* remove an accidentally committed debugging log (#3716)

* When patching the {{}} placeholder in parameter, check for possible nil pointer (#3714)

* check nil for a pointer before using it

* if parameter's value is nil pointer, use parameter's default

* [UI] Make visualization tab easier to understand (#3717)

* Rename artifacts tab to visualizations and add documentation link

* Show a banner when no visualizations

* Clean up code

* Update snapshots

* Fix banner tests

* Add unit test for visualization creator

* Update VisualizationCreator.tsx

* Update VisualizationCreator.tsx

* [frontend] Show artifact preview in UI (#2172) (#3707)

* show a preview of an artifact in the ui

* Add styling to preview box

* fix minor typo in unit test

* minor fixes + DetailsTable now accepts a ValueComponent again

* encode uricomponent for generate artifact url

* fix classname typo

* Added valueComponentProps for DetailsTable for better type checking.

* fix mock bug in test. peek -> maxbytes & maxlines for MinioArtifactPreview

* fix format

* mock Editor

* Travis - Made flake8 test optional (#3739)

* SDK - Annotate pods with component_ref (#3727)

* SDK - Annotate pods with component_ref

This preserves the information about the digest of the component and the location from which the component was loaded.

* Fixed compiler tests

* Travis - Use latest pip version (#3732)

* SDK - Prioritize lib2to3 when stripping type annotations (#3724)

* SDK - Prioritize lib2to3 when stripping type annotations

It's a standard python library (although not well supported) and it does not leave trailing spaces.

* Fixed compiler test data

* [AWS SageMaker] Specify component input types (#3683)

* Replace all string types with Python types

* Update HPO yaml

* Update Batch YAML

* Update Deploy YAML

* Update GroundTruth YAML

* Update Model YAML

* Update Train YAML

* Update WorkTeam YAML

* Updated samples to remove strings

* Update to temporary image

* Remove unnecessary imports

* Update image to newer image

* Update components to python3

* Update bool parser type

* Remove empty ContentType in samples

* Update to temporary image

* Update to version 0.3.1

* Update deploy to login

* Update deploy load config path

* Fix export environment variable in deploy

* Fix env name

* Update deploy reflow env paths

* Add debug config line

* Use username and password directly

* Updated to 0.3.1

* Update field types to JsonObject and JsonArray

* Upgraded Argo to v2.7.5 (#3537)

* Upgraded Argo to v2.7.4

* Downgraded the Argo CLI version to 2.4.3

See https://github.com/argoproj/argo/issues/2793

* Removed the argo cli arg that had been removed

* Updated to Argo 2.7.5

* Added workflowtemplates and cronworkflows to the Role

* Added the new Argo CRDs

* [Backend] Allow capital letters and underscore in metric names (#3741)

* Allow capital letters and underscore in metric names

* Fix tests, add comments

* Update run_metric_util.go

* Fix bug in #3707 - href should show full artifact content instead of preview (#3745)

* MetadataStore: Upgrade tool (#3295)

* Show version tag in UI (#3743)

* Show version tag in UI

* Add new arguments to test cloudbuild configuration

* backend cloudbuild should use commit_sha as argument

* Fix minor bug during dev

* [UI] Wrap parameter/urls on overflow (#3747)

* [UI] Wrap parameter/urls on overflow

* Add comment about css

* Let artifact preview take over full width

* SDK - Components - Calculate component hash digest (#3726)

* SDK - Components - Calculate component hash digest

The digest is calculated when loading the component from URL, file or text.
Slightly refactored component loading - streams are no longer used, only bytes.
TODO: Calculate the digest if missing
TODO: Report possible digest conflicts
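
A minimal sketch of the bytes-only approach described above. The function names are hypothetical and SHA-256 is an assumption; the commit message does not state which hash the SDK uses.

```python
import hashlib


def component_digest(data: bytes) -> str:
    """Return a hex digest of the raw component bytes (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


def digest_from_text(text: str) -> str:
    # Text is encoded to bytes first so every source shares one code path.
    return component_digest(text.encode('utf-8'))


def digest_from_file(path: str) -> str:
    # Read in binary mode: the digest is defined over bytes, not decoded text.
    with open(path, 'rb') as f:
        return component_digest(f.read())
```

Hashing the bytes rather than a parsed structure means the same component text yields the same digest whether it arrives from a URL, a file or a string.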

* Updated the test graph component

* Using the actual digest in the test

* SDK - Made outputs with original names available in ContainerOp.outputs (#3734)

* SDK - Made outputs with original names available in ContainerOp.outputs

Previously, ContainerOp had strict requirements for the output names, so we had to convert all the names before passing them to the ContainerOp constructor. Outputs with non-pythonic names could not be accessed using their original names.
Now ContainerOp supports any output names, so we're now using the original output names.
However to support legacy pipelines, we're also adding output references with pythonic names.
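
The dual registration can be pictured roughly as follows. This is a sketch, not the SDK code: `to_pythonic`, `build_outputs`, and the `'ref:'` strings are hypothetical stand-ins for the real sanitizer and output references.

```python
import re


def to_pythonic(name: str) -> str:
    """Hypothetical sanitizer: lowercase, non-alphanumerics become underscores."""
    return re.sub(r'[^0-9a-z]+', '_', name.lower()).strip('_')


def build_outputs(output_names):
    """Map original names to references, then add pythonic aliases for legacy use."""
    outputs = {name: 'ref:' + name for name in output_names}
    for name in output_names:
        # An alias never overrides an entry registered under an original name.
        outputs.setdefault(to_pythonic(name), outputs[name])
    return outputs


outputs = build_outputs(['Output 1', 'model-path'])
print(sorted(outputs))
```

Both `outputs['Output 1']` and `outputs['output_1']` then resolve to the same reference.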

* Fixed the compiler test data

* Fixed the duplicate parameter outputs in the compiled workflow

* Fixed long line

* Stabilized the output naming conflict resolution

* Fix case of missing special outputs

* [UI] Fix artifact preview with outdated content (#3749)

* Fix preview with outdated content

* Update snapshot

* [UI] Show tooltip on long version names (#3750)

* [UI] Show tooltip on version name in selector

* Update snapshots

* Add tooltip for pipeline version in run list

* refactor code

* Metadata Writer - Preserve all Argo artifact information (#3725)

* Metadata Writer - Preserve all information in artifact URI

Previously only s3 artifacts were supported and only bucket and key were included (not endpoint, for example).

* Move Argo artifact information to artifact's custom_property

* [UI Server] Refactor for configurable auth header (#3753)

* [UI] Make auth header configurable

* Refactor authorizeFn to move side effects out

* Refactor tests to reduce duplication

* SDK - Components - Improved stability of the input and output renaming (#3738)

In some cases the input and output names need to be converted (for example, the input names need to be converted to python function parameter names).
With naive renaming, multiple inputs might be mapped to the same parameter name in some edge cases. The `generate_unique_name_conversion_table` creates a correct mapping.

However, in some really rare cases the resulting mapping could be confusing since it might rename an input whose name was already a correct parameter name and map a different input name to that parameter. E.g. {'AAA' -> 'aaa', 'aaa' -> 'aaa_2'}.
This PR fixes that. Names that do not change when applying the conversion_func will remain unchanged in the mapping. {'AAA' -> 'aaa_2', 'aaa' -> 'aaa'}.
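
The stable behavior can be sketched as a two-pass mapping. This is an illustrative re-implementation under assumptions, not the SDK's `generate_unique_name_conversion_table`; the function name and the `_2` suffix scheme are chosen to match the example above.

```python
def unique_name_conversion_table(names, conversion_func):
    """Map each name to a unique converted name, keeping already-valid names stable."""
    table = {}
    used = set()
    # First pass: names that the conversion leaves unchanged keep their name.
    for name in names:
        if conversion_func(name) == name and name not in used:
            table[name] = name
            used.add(name)
    # Second pass: remaining names get the converted name, suffixed on conflict.
    for name in names:
        if name in table:
            continue
        base = conversion_func(name)
        candidate, index = base, 1
        while candidate in used:
            index += 1
            candidate = '{}_{}'.format(base, index)
        table[name] = candidate
        used.add(candidate)
    return table


table = unique_name_conversion_table(['AAA', 'aaa'], str.lower)
print(table)
```

Reserving the unchanged names first is what prevents `'aaa'` from being renamed just because `'AAA'` happened to come earlier in the list.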

* [AWS SageMaker] Unit tests for Training component (#3722)

* Added additional training unit tests

* Add main training function tests

* Add full training test coverage

* Fix import sys

* Fix poorly named test

* Components - Tensorboard visualization (#3760)

* [Servers] Add liveness and readiness probes (#3757)

* probes for ml-pipeline-ui

* clean up comments

* Use wget instead of curl, because wget is included in alpine

* Also update marketplace manifest

* Add readiness/liveness probe for api server

* Add probes for python vis server

* Add probes to metadata grpc service (#3765)

* Add probes to metadata grpc service

* Fix port name length limit

* Update README.md

* manual merge as the change itself is correct

but MKP mpdev:latest has an issue blocking our tests

* SDK - Moved some data from the component_ref annotation to the component_spec annotation (#3751)

Removing the component spec from component_ref (since it would be a duplicate), but making sure the whole spec is available in component_spec.

* AWS Sagemaker Components - enhance integration test coverage (#3720)

* AWS Sagemaker Components - enhance integration test coverage
	- Add tests for create endpoint, hpo job and batch transform
	- Minor bug fixes and documentation

* rev2: Address comments and clean up generated artifacts

* rev3: address more comments

* rev4: add canary test marker

* Trigger Build

* Add more approvers in AWS sagemaker components (#3740)

* SDK - Components - Removed the deprecated _python_op.get_default_base_image and set_default_base_image functions (#3773)

* SDK - Moved the tests closer to the code (#3774)

This makes switching from code to tests easier

* fix(testing) - Fix "1.14.10-gke.27" is unsupported (#3781)

* [Manifest] Use kustomize native image transformer to override image (#3776)

* [Manifest] Use kustomize native image transformer to override image

* Revert unintended changes

* Fix kustomization.yaml location

* Fix inverse proxy image

* SDK - Tests - Use relative imports (#3784)

This makes testing easier to run in local dev scenarios.

* [Backend] Make user identity header configurable (#3772)

* Make user identity header configurable

* use constants in UT.

* Allow PipelineParams in dict keys too. (#3565)

Co-authored-by: Thi Nguyen <[email protected]>

* [ScheduledWorkflow] Fix events permission missing (#3785)

* Infer artifact store endpoint in metadata writer (#3530)

Signed-off-by: Jiaxin Shan <[email protected]>

* Changing the default volume size to 30 (#3792)

* Client - Added documentation for the generated members (#3787)

* [AWS SageMaker] Integration tests automation (#3768)

* Add initial scripts

Add working pytest script

Add initial scripts

Add environment variable files

Remove old cluster script

Update pipeline credentials to OIDC

Add initial scripts

Add working pytest script

Add initial scripts

Add working pytest script

* Remove debugging mark

* Update example EKS cluster name

* Remove quiet from Docker build

* Manually pass env

* Update env list vars as string

* Update use array directly

* Update variable array to export

* Update to using read for splitting

* Move to helper script

* Update export from CodeBuild

* Add wait for minio

* Update kubectl wait timeout

* Update minor changes for PR

* Update integration test buildspec to quiet build

* Add region to delete EKS

* Add wait for pods

* Updated README

* Add fixed interval wait

* Fix CodeBuild step order

* Add file lock for experiment ID

* Fix missing pytest parameter

* Update run create only once

* Add filelock to conda env

* Update experiment name ensuring creation each time

* Add try/catch with create experiment

* Remove caching from KFP deployment

* Remove disable KFP caching

* Move .gitignore changes to inside component

* Add blank line to default .gitignore

* Add the 'kfp experiment' commands (#3705)

* Add the 'kfp experiment list' command

* Add the 'kfp experiment get' command

* Add the 'kfp experiment create' command

* Add the 'kfp experiment delete' command

* Add a caution to 'kfp experiment delete'

* Use directly the backend api method to list experiments

* Update a message based on the suggestion

https://github.com/kubeflow/pipelines/pull/3705#discussion_r424751792

* AWS SageMaker : Use IAM Roles for Service Account (#3719)

* don't use aws-secret and update readme for sample pipelines

* Addressed comments on PR and few more readme changes

* small changes to readme

* nit change

* Address comments

* [UI] Fix confusion matrix wrong axes (#3817)

* [UI] Fix confusion matrix wrong axes

* Fix confusion matrix background opacity

* Docs - Added kfp.dsl placeholders to docs (#3813)

* Adding HPO unit test (#3791)

* Adding HPO unit test

* Adding best training job

* Addressing comment

* Client - Allow specifying pipeline description when uploading (#3828)

* Client - Allow specifying pipeline description when uploading

Fixes https://github.com/kubeflow/pipelines/issues/3825

* Implemented review feedback

* [UI] Also cloning recurring run schedule, fixes #3761 (#3840)

* [UI] Also cloning recurring run schedule

* Fix unit test for trigger and utils

* Add and fix unit tests for Trigger

* Add NewRun page unit tests

* Fix unit tests

* Fix jest test timezone

* Testing - Pin numpy version to fix TFX installation instability in Travis tests (#3833)

The TFX package has inconsistent dependencies, which makes the installation flaky: it installs a different numpy version every time, leading to failures.

* [AWS SageMaker] Integration Test for AWS SageMaker GroundTruth Component (#3830)

* Integration Test for AWS SageMaker GroundTruth Component

* Unfix already fixed bug

* Fix the README I overwrote by mistake

* Remove use of aws-secret for OIDC

* Rev 2: Fix linting errors

* Components - Moved TFX components to deprecated directory (#3854)

* Added README for Amazon SageMaker Components for Kubeflow Pipelines (#3824)

* Create README.md

* Added README

Updated page to include information on Amazon SageMaker components

* Update README.md

* Integrated feedback

* A more accurate grpc error code for duplicate pipeline/pipeline version/experiment names (#3846)

* a more accurate grpc error code

* remove accidentally checked in file

* Add labels to plots (#3811)

* 5 runs

* 50 runs

* (1) add labels (2) instead of plotting kde, plotting histogram and rug

Co-authored-by: Yuan (Bob) Gong <[email protected]>
Co-authored-by: Rui Fang <[email protected]>
Co-authored-by: Alexey Volkov <[email protected]>
Co-authored-by: Tommy Li <[email protected]>
Co-authored-by: Ajay Gopinathan <[email protected]>
Co-authored-by: IronPan <[email protected]>
Co-authored-by: Chen Sun <[email protected]>
Co-authored-by: Renmin <[email protected]>
Co-authored-by: Renmin Gu <[email protected]>
Co-authored-by: Eterna2 <[email protected]>
Co-authored-by: Rafael Barreto <[email protected]>
Co-authored-by: Johannes 'fish' Ziemke <[email protected]>
Co-authored-by: Jiaxiao Zheng <[email protected]>
Co-authored-by: Ilias Katsakioris <[email protected]>
Co-authored-by: dldaisy <[email protected]>
Co-authored-by: Samuel Ngahane <[email protected]>
Co-authored-by: Alexey Volkov <[email protected]>
Co-authored-by: Shixin <[email protected]>
Co-authored-by: luoshixin <[email protected]>
Co-authored-by: Niklas Hansson <[email protected]>
Co-authored-by: Paul Selden <[email protected]>
Co-authored-by: Kartik Kalamadi <[email protected]>
Co-authored-by: jingzhang36 <[email protected]>
Co-authored-by: Suraj Kota <[email protected]>
Co-authored-by: dhodun <[email protected]>
Co-authored-by: Kartik Kalamadi <[email protected]>
Co-authored-by: Suraj Kota <[email protected]>
Co-authored-by: hongye-sun <[email protected]>
Co-authored-by: Mark Mirchandani <[email protected]>
Co-authored-by: Jonas De Beukelaer <[email protected]>
Co-authored-by: faweis <[email protected]>
Co-authored-by: frozeNinK <[email protected]>
Co-authored-by: Gautam Kumar <[email protected]>
Co-authored-by: Dmitry B <[email protected]>
Co-authored-by: Sascha Grunert <[email protected]>
Co-authored-by: Shotaro Kohama <[email protected]>
Co-authored-by: Nicholas Thomson <[email protected]>
Co-authored-by: Amy <[email protected]>
Co-authored-by: Sandhya Gopchandani <[email protected]>
Co-authored-by: Sandhya <[email protected]>
Co-authored-by: Pavel Taraskin <[email protected]>
Co-authored-by: dushyanthsc <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Thi Nguyen <[email protected]>
Co-authored-by: Thi Nguyen <[email protected]>
Co-authored-by: Gautam Kumar <[email protected]>
Co-authored-by: Meghna Baijal <[email protected]>
Co-authored-by: IvyBazan <[email protected]>
dstnluong added a commit to dstnluong/pipelines that referenced this issue Jul 28, 2020
# This is the 1st commit message:

parent 1551e34
author Dustin Luong <[email protected]> 1593625846 -0700
committer Dustin Luong <[email protected]> 1595961236 -0700

Unit test passes

Updated unit tests, working on integration tests

Updated READMEs, Logged rules status and errors to UI

    squash a94f809 Unfinished sample pipeline for debugger
    squash 8128572 Edge case: empty rules config

Edge case: empty rules config

Cleaned up example pipeline and fixed empty case of debug rules

Finished integration test, debug rule logs only print after training has completed, refactored various code

Unhardcoded rules registry

New sample pipeline, minor changes to utils

Refactored wait_for_debug_rules, added unit tests, updated readme for debugger demo, fixed typos and small errors

rm .gz

Changed defaults for train.template.yaml, updated example pipeline, removed exceptions from utils which are handled by boto3

removed trust.json

minor clean ups

Minor fixes

Refactored code to incorporate changes from design review, notably removing collection_config

# This is the commit message #2:

custom_rules.py

# This is the commit message #3:

Added tensorboard

# This is the commit message #4:

restored run_integration_tests

# This is the commit message #5:

removed custom rule
Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020
* # This is a combination of 5 commits.
# This is the 1st commit message:

Add initial scripts

# This is the commit message #2:

Add working pytest script

# This is the commit message #3:

Add initial scripts

# This is the commit message #4:

Add environment variable files

# This is the commit message #5:

Remove old cluster script

* Add initial scripts

Add working pytest script

Add initial scripts

Add environment variable files

Remove old cluster script

Update pipeline credentials to OIDC

Add initial scripts

Add working pytest script

Add initial scripts

Add working pytest script

* Remove debugging mark

* Update example EKS cluster name

* Remove quiet from Docker build

* Manually pass env

* Update env list vars as string

* Update use array directly

* Update variable array to export

* Update to using read for splitting

* Move to helper script

* Update export from CodeBuild

* Add wait for minio

* Update kubectl wait timeout

* Update minor changes for PR

* Update integration test buildspec to quiet build

* Add region to delete EKS

* Add wait for pods

* Updated README

* Add fixed interval wait

* Fix CodeBuild step order

* Add file lock for experiment ID

* Fix missing pytest parameter

* Update run create only once

* Add filelock to conda env

* Update experiment name ensuring creation each time

* Add try/catch with create experiment

* Remove caching from KFP deployment

* Remove disable KFP caching

* Move .gitignore changes to inside component

* Add blank line to default .gitignore
magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this issue Oct 22, 2023
* # This is a combination of 6 commits.
# This is the 1st commit message:

Modify agent test case for code coverage (kubeflow#1849)

* Modifies the test case for sync models config

Signed-off-by: Andrews Arokiam <[email protected]>

# This is the commit message #2:

Add test cases for agent storage utils (kubeflow#1849)

* Add test case for FileExists function
* Add test case for RemoveDir function

Signed-off-by: Andrews Arokiam <[email protected]>

# This is the commit message #3:

Add test case for agent storage utils

* Add test case for GetProvider function

Signed-off-by: Andrews Arokiam <[email protected]>

# This is the commit message #4:

Add test case for gcs model downloader (kubeflow#1849)

* Add test case for gcs model downloader in agent

Signed-off-by: Andrews Arokiam <[email protected]>

# This is the commit message #5:

Add test cases for agent downloader

Signed-off-by: Andrews Arokiam <[email protected]>

# This is the commit message #6:

Add test cases for configmap (kubeflow#1849)

* Add test cases for v1beta1 configmap

Signed-off-by: Andrews Arokiam <[email protected]>

* Add test cases for inference service defaults (kubeflow#1849)

* Add test cases for all model runtimes
* Add test cases for all runtime defaults

Signed-off-by: Andrews Arokiam <[email protected]>

* fmt (kubeflow#1849)

Signed-off-by: Andrews Arokiam <[email protected]>

* Modify agent test case for code coverage (kubeflow#1849)

* Modifies the test case for sync models config

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for agent storage utils (kubeflow#1849)

* Add test case for FileExists function
* Add test case for RemoveDir function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test case for agent storage utils

* Add test case for GetProvider function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test case for gcs model downloader (kubeflow#1849)

* Add test case for gcs model downloader in agent

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for agent downloader

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for configmap (kubeflow#1849)

* Add test cases for v1beta1 configmap

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for inference service defaults (kubeflow#1849)

* Add test cases for all model runtimes
* Add test cases for all runtime defaults

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for predictor model (kubeflow#1849)

* Add test cases for isFrameworkSupported function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for sklearn predictor (kubeflow#1849)

* Add test cases for GetProtocol function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for configmap (kubeflow#1849)

* Add test case for creating empty model config

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for utils (kubeflow#1849)

* Add test cases for IncludesArg function
* Add test cases for IsGPUEnabled function
* Add test cases for FirstNonNilError function
* Add test cases for RemoveString function
* Add test cases for IsPrefixSupported function

Signed-off-by: Andrews Arokiam <[email protected]>

* Add test cases for creds_utils (kubeflow#1849)

* Add test cases for set_gcs_credentials function
* Add test cases for create_secret function
* Add test cases for set_service_account function
* Add test cases for create_service_account function
* Add test cases for patch_service_account function
* Add test cases for get_creds_name_from_config_map function

Signed-off-by: Andrews Arokiam <[email protected]>

* Add test cases for creds_utils (kubeflow#1849)

* Add test cases for set_gcs_credentials function
* Add test cases for create_secret function
* Add test cases for set_service_account function
* Add test cases for create_service_account function
* Add test cases for patch_service_account function
* Add test cases for get_creds_name_from_config_map function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for creds_utils (kubeflow#1849)

* Add test cases for set_s3_credentials function
* Add test cases for set_azure_credentials function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for v1beta1 component (kubeflow#1849)

* Add test cases for validateStorageSpec function
* Add test cases for validateLogger function
* Add test cases for FirstNonNilComponent function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for inference service status (kubeflow#1849)

* Add test cases for PropagateRawStatus function
* Add test cases for PropagateStatus function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for predictor (kubeflow#1849)

* Add test cases for GetPredictorImplementations function
* Add test cases for GetPredictorImplementation function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for agent_injector (kubeflow#1849)

* Add test cases for getLoggerConfigs function
* Add test cases for getAgentConfigs function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for batcher_injector (kubeflow#1849)

* Add test cases for getBatcherConfigs function

Signed-off-by: Andrews Arokiam <[email protected]>

* Add test cases for storage initializer injector (kubeflow#1849)

* Add test cases for getStorageInitializerConfigs function
* Add test cases for parsePvcUri function

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for storage initializer injector (kubeflow#1849)

* Add test cases for parsePvcUri function

Signed-off-by: Andrews Arokiam <[email protected]>

fmt (kubeflow#1849)

Signed-off-by: Andrews Arokiam <[email protected]>

Add test cases for controller utils (kubeflow#1849)

* Add test cases for GetDeploymentMode function

Signed-off-by: Andrews Arokiam <[email protected]>

Remove double import of same package (kubeflow#1849)

Signed-off-by: Andrews Arokiam <[email protected]>

temp commit

Signed-off-by: Andrews Arokiam <[email protected]>

Updated coverage for inference_service_default_test

Signed-off-by: Andrews Arokiam <[email protected]>

Added scripts for code coverage

Signed-off-by: Andrews Arokiam <[email protected]>

Updated make to track coverage including subpackages

Added more coverage

Added ignore to client package - generated code
Added coverage script to workflow

Signed-off-by: Andrews Arokiam <[email protected]>

* Temporarily commenting out a couple of test cases

Signed-off-by: Andrews Arokiam <[email protected]>

Temporary changes to debug e2e

Signed-off-by: Andrews Arokiam <[email protected]>

Commented configmap test
Reverted accidental commit of generated code.

Signed-off-by: Andrews Arokiam <[email protected]>

Added -v to debug failing tests

Signed-off-by: Andrews Arokiam <[email protected]>

Updated tests to remove dependency on k8s cluster

Signed-off-by: Andrews Arokiam <[email protected]>

* Updated readme to show coverage
HumairAK referenced this issue in HumairAK/data-science-pipelines Jan 19, 2024
UPSTREAM: <carry>: add mlmd grpc dockerfile