
feat(proxy): make the proxy resilient on mlmd failure #700

Open

Al-Pragliola wants to merge 2 commits into main

Conversation

Al-Pragliola (Contributor) commented on Jan 14, 2025:

Description

This PR aims to improve the resiliency of the model registry. Previously, if mlmd was down when the proxy started, the proxy would exit with an error. With this change, the proxy starts anyway and attaches a dynamic router that answers every request with a 503 status until mlmd is up and running; once mlmd becomes available, the router switches over to the real one.

There is also a time limit of ~5 minutes: if mlmd is not up and running within that window, the proxy still exits with an error.

E2E automated testing will follow in another PR, using the strategy described in #194 (comment).
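
For reference, the switching behaviour follows roughly the pattern below. This is a minimal, self-contained sketch for illustration only, not the actual code in this PR; buildRouter, the hard-coded addresses, and the polling interval are placeholders.

package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"sync/atomic"
	"time"
)

// buildRouter stands in for the construction of the real REST-to-MLMD router.
func buildRouter() http.Handler {
	return http.NewServeMux()
}

func main() {
	var ready atomic.Bool
	realRouter := buildRouter()

	// Dynamic router: answer 503 until MLMD is reachable, then delegate to the real router.
	dynamic := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			realRouter.ServeHTTP(w, r)
			return
		}
		http.Error(w, "MLMD server is down or unavailable. Please check that the database is reachable and try again later.", http.StatusServiceUnavailable)
	})

	// Poll MLMD in the background and flip the switch once it answers; give up after ~5 minutes.
	go func() {
		deadline := time.Now().Add(5 * time.Minute)
		for time.Now().Before(deadline) {
			conn, err := net.DialTimeout("tcp", "localhost:9090", 2*time.Second) // placeholder MLMD address
			if err == nil {
				conn.Close()
				ready.Store(true)
				return
			}
			time.Sleep(5 * time.Second)
		}
		log.Fatal("MLMD did not become available within ~5 minutes")
	}()

	log.Fatal(http.ListenAndServe(fmt.Sprintf("%s:%d", "0.0.0.0", 8080), dynamic))
}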

How Has This Been Tested?

Testing scenarios:

MR UP AND RUNNING - DB GOES DOWN

TIME 0

  • MR is up and running
  • MLMD is up and running
  • DB is up and running

TIME 1

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 0}}'
  • DB goes down

TIME 2

kubectl get pod -n kubeflow

NAME                                        READY   STATUS    RESTARTS   AGE
model-registry-deployment-cb6987594-psbj2   2/2     Running   0          11m

curl -v "localhost:8080/api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC"

* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: application/json; charset=UTF-8
< Vary: Origin
< Date: Tue, 14 Jan 2025 17:42:05 GMT
< Content-Length: 102
<
{"code":"","message":"rpc error: code = Internal desc = mysql_real_connect failed: errno: , error: "}
* Connection #0 to host localhost left intact
  • MR is up and running
  • MLMD is up and running

TIME 3

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 1}}'

kubectl get pod -n kubeflow

NAME                                        READY   STATUS    RESTARTS   AGE
model-registry-db-7c4bb9f76f-lkmmb          1/1     Running   0          8s
model-registry-deployment-cb6987594-psbj2   2/2     Running   0          21m

curl -v "localhost:8080/api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC"

* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Vary: Origin
< Date: Tue, 14 Jan 2025 17:49:59 GMT
< Content-Length: 54
<
{"items":[],"nextPageToken":"","pageSize":0,"size":0}
* Connection #0 to host localhost left intact

MR STARTING UP WHILE DB IS DOWN

TIME 0

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 0}}'
kubectl patch deployment -n kubeflow model-registry-deployment --patch '{"spec": {"replicas": 0}}'

kubectl get pod -n kubeflow

No resources found in kubeflow namespace.
  • MR is down
  • MLMD is down
  • DB is down

TIME 1

kubectl patch deployment -n kubeflow model-registry-deployment --patch '{"spec": {"replicas": 1}}'

kubectl get pod -n kubeflow

NAME                                        READY   STATUS             RESTARTS      AGE
model-registry-deployment-cb6987594-gkrf8   1/2     CrashLoopBackOff   1 (20s ago)   21s

kubectl describe pod model-registry-deployment-cb6987594-gkrf8

....
Warning  BackOff    3s (x8 over 40s)   kubelet            Back-off restarting failed container grpc-container in pod model-registry-deployment-cb6987594-gkrf8_kubeflow(1bdd5e06-5939-4dd9-b1c9-4e8e68190245)

kubectl logs model-registry-deployment-cb6987594-gkrf8 -c grpc-container

WARNING: Logging before InitGoogleLogging() is written to STDERR
E0114 17:59:16.438417     1 mysql_metadata_source.cc:174] MySQL database was not initialized. Please ensure your MySQL server is running. Also, this error might be caused by starting from MySQL 8.0, mysql_native_password used by MLMD is not supported as a default for authentication plugin. Please follow <https://dev.mysql.com/blog-archive/upgrading-to-mysql-8-0-default-authentication-plugin-considerations/>to fix this issue.
F0114 17:59:16.438586     1 metadata_store_server_main.cc:555] Check failed: absl::OkStatus() == status (OK vs. INTERNAL: mysql_real_connect failed: errno: , error:  [mysql-error-info='']) MetadataStore cannot be created with the given connection config.
*** Check failure stack trace: ***

curl -v "localhost:8080/api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC"

* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 14 Jan 2025 18:02:39 GMT
< Content-Length: 101
<
MLMD server is down or unavailable. Please check that the database is reachable and try again later.
* Connection #0 to host localhost left intact
  • MR is up and running
  • MLMD is down
  • DB is down

TIME 2

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 1}}'

kubectl get pod -n kubeflow

NAME                                        READY   STATUS             RESTARTS      AGE
model-registry-db-7c4bb9f76f-qwp5m          1/1     Running            0             8s
model-registry-deployment-cb6987594-gkrf8   1/2     CrashLoopBackOff   1 (18s ago)   19s
  • MR is up and running
  • MLMD is restarting
  • DB is up and running

TIME 3

kubectl get pod -n kubeflow

NAME                                        READY   STATUS    RESTARTS      AGE
model-registry-db-7c4bb9f76f-qwp5m          1/1     Running   0             1m
model-registry-deployment-cb6987594-gkrf8   2/2     Running   2 (38s ago)   2m
  • MR is up and running
  • MLMD is up and running
  • DB is up and running

Merge criteria:

  • All the commits have been signed-off (To pass the DCO check)
  • The commits have meaningful messages; the author will squash them after approval or, in the case of a manual merge, will ask for a squash merge.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • Code changes follow the kubeflow contribution guidelines.

Al-Pragliola marked this pull request as ready for review on January 15, 2025 at 21:42

Al-Pragliola (Contributor, Author) commented:

/cc @tarilabs

google-oss-prow bot requested a review from tarilabs on January 15, 2025 at 21:44

tarilabs (Member) commented:

love this @Al-Pragliola ❤️ thanks a lot !!

Al-Pragliola (Contributor, Author) commented:

/cc @pboyd

google-oss-prow bot commented:

@Al-Pragliola: GitHub didn't allow me to request PR reviews from the following users: pboyd.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @pboyd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pboyd (Contributor) left a comment:


One nit, but /lgtm.


err := http.ListenAndServe(fmt.Sprintf("%s:%d", cfg.Hostname, cfg.Port), router)
if err != nil {
	errChan <- err
pboyd (Contributor) commented on these lines:

Since the other goroutine closes this channel, this send might be a problem (admittedly a rare one, but it might be an issue someday if that goroutine ever panics early).

You could perhaps add the first goroutine to the WaitGroup and close errCh in the parent. Or, it looks like ListenAndServe errors were fatal before, maybe just make it fatal again? What do you think?
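
For illustration, a self-contained sketch of the WaitGroup option (not code from this PR; the mux, hostname, and port stand in for the proxy's real router and config):

package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"
)

func main() {
	router := http.NewServeMux()      // stand-in for the real proxy router
	hostname, port := "0.0.0.0", 8080 // stand-in for cfg.Hostname / cfg.Port

	var wg sync.WaitGroup
	errChan := make(chan error, 1)

	// Track the HTTP server goroutine in the WaitGroup like the other senders.
	wg.Add(1)
	go func() {
		defer wg.Done()
		if err := http.ListenAndServe(fmt.Sprintf("%s:%d", hostname, port), router); err != nil {
			errChan <- err // buffered send: never blocks, and the channel cannot be closed before this goroutine's Done runs
		}
	}()

	// Only the parent closes errChan, and only after every sender has returned.
	go func() {
		wg.Wait()
		close(errChan)
	}()

	for err := range errChan {
		log.Printf("proxy error: %v", err)
	}
}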

Al-Pragliola (Contributor, Author) replied:

Great catch @pboyd, I think we can just revert to it being a Fatal error like before.
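
A minimal sketch of that option, assuming the standard library logger rather than whatever the proxy actually uses; the mux, hostname, and port are placeholders:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	router := http.NewServeMux()      // stand-in for the real proxy router
	hostname, port := "0.0.0.0", 8080 // stand-in for cfg.Hostname / cfg.Port

	// Treat a listener failure as fatal again instead of reporting it on errChan.
	if err := http.ListenAndServe(fmt.Sprintf("%s:%d", hostname, port), router); err != nil {
		log.Fatalf("proxy failed to listen on %s:%d: %v", hostname, port, err)
	}
}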

google-oss-prow bot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pboyd
Once this PR has been reviewed and has the lgtm label, please assign ckadner for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
