feat(proxy): make the proxy resilient on mlmd failure #700
base: main
Conversation
Signed-off-by: Alessio Pragliola <[email protected]>
/cc @tarilabs
love this @Al-Pragliola ❤️ thanks a lot !!
/cc @pboyd
@Al-Pragliola: GitHub didn't allow me to request PR reviews from the following users: pboyd. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: /cc @pboyd
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
One nit, but /lgtm.
err := http.ListenAndServe(fmt.Sprintf("%s:%d", cfg.Hostname, cfg.Port), router)
if err != nil {
	errChan <- err
Since the other goroutine closes this channel, this send might be a problem (admittedly a rare one, but it might be an issue someday if that goroutine ever panics early). You could perhaps add the first goroutine to the WaitGroup and close errCh in the parent. Or, since ListenAndServe errors were fatal before, maybe just make it fatal again? What do you think?
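For illustration, here is a minimal sketch of the parent-owned close pattern being suggested, assuming an error channel shared by the server goroutine and one other worker (the channel name, buffer size, and stand-in router below are assumptions, not the PR's actual code):

```go
package main

import (
	"log"
	"net/http"
	"sync"
)

func main() {
	router := http.NewServeMux() // stand-in for the real proxy router
	addr := "localhost:8080"

	errChan := make(chan error, 2) // buffered so senders never block
	var wg sync.WaitGroup

	// Server goroutine: only ever sends on errChan, never closes it.
	wg.Add(1)
	go func() {
		defer wg.Done()
		if err := http.ListenAndServe(addr, router); err != nil {
			errChan <- err
		}
	}()

	// Any other worker goroutine would be registered on the same WaitGroup here.

	// The parent closes the channel only after every sender has returned,
	// so a send can never race with (or follow) the close.
	go func() {
		wg.Wait()
		close(errChan)
	}()

	for err := range errChan {
		log.Printf("shutting down: %v", err)
	}
}
```

Closing in the parent after wg.Wait() is what removes the send-on-closed-channel hazard, even if one of the worker goroutines exits early.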
Great catch @pboyd, I think we can just revert to it being a Fatal error like before.
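A sketch of what reverting to a fatal error could look like; the config struct, router, and logger call below are stand-ins inferred from the quoted snippet, not the final change:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// Stand-ins for the proxy's real config and router; only the error-handling
// shape matters here.
type config struct {
	Hostname string
	Port     int
}

func main() {
	cfg := config{Hostname: "localhost", Port: 8080}
	router := http.NewServeMux()

	// A listen failure is fatal again, so there is no error channel to send
	// on and nothing for another goroutine to close.
	addr := fmt.Sprintf("%s:%d", cfg.Hostname, cfg.Port)
	log.Fatal(http.ListenAndServe(addr, router))
}
```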
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: pboyd. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Description
This PR aims to improve the resiliency of the model registry. Previously, if mlmd was down when the proxy started, the proxy would exit with an error. With this change, the proxy starts anyway and attaches a dynamic router that responds with a 503 status to every request until mlmd is reachable; once mlmd is up and running, the dynamic router swaps in the real router.
There is also a time limit of roughly 5 minutes: if mlmd is not up and running within that window, the proxy still exits with an error.
E2E automated testing will follow in another PR, using the strategy described in #194 (comment).
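To make the described behaviour concrete, here is a minimal sketch of a 503-until-ready router with a ~5 minute deadline. Everything below (DynamicRouter, SetHandler, buildRealRouter, the port, and the retry interval) is illustrative and assumed, not the PR's actual implementation:

```go
package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

// DynamicRouter answers 503 until a real handler is swapped in.
type DynamicRouter struct {
	mu      sync.RWMutex
	handler http.Handler
}

func NewDynamicRouter() *DynamicRouter {
	return &DynamicRouter{
		handler: http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			http.Error(w, "mlmd is not available yet", http.StatusServiceUnavailable)
		}),
	}
}

// SetHandler swaps in the real router once mlmd becomes reachable.
func (d *DynamicRouter) SetHandler(h http.Handler) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.handler = h
}

func (d *DynamicRouter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	d.mu.RLock()
	h := d.handler
	d.mu.RUnlock()
	h.ServeHTTP(w, r)
}

// buildRealRouter is a placeholder for dialing mlmd and building the actual REST routes.
func buildRealRouter() (http.Handler, error) {
	return nil, http.ErrServerClosed // stand-in: always "not ready" in this sketch
}

func main() {
	router := NewDynamicRouter()

	// Retry mlmd in the background; give up with a fatal error after ~5 minutes.
	go func() {
		deadline := time.After(5 * time.Minute)
		tick := time.NewTicker(5 * time.Second)
		defer tick.Stop()
		for {
			select {
			case <-deadline:
				log.Fatal("mlmd did not become available within 5 minutes")
			case <-tick.C:
				if real, err := buildRealRouter(); err == nil {
					router.SetHandler(real)
					return
				}
			}
		}
	}()

	log.Fatal(http.ListenAndServe(":8080", router))
}
```

A mutex-guarded handler swap keeps the request path cheap while letting the background goroutine replace the 503 placeholder exactly once.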
How Has This Been Tested?
Testing scenarios:
MR UP AND RUNNING - DB GOES DOWN
TIME 0: (screenshot omitted)
TIME 1: scale the DB down:
kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 0}}'
TIME 2: (screenshot omitted)
TIME 3: (screenshot omitted)
MR STARTING UP WHILE DB IS DOWN
TIME 0: (screenshot omitted)
TIME 1: (screenshot omitted)
TIME 2: scale the DB back up and check the pods:
kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 1}}'
kubectl get pod -n kubeflow
NAME                                        READY   STATUS             RESTARTS      AGE
model-registry-db-7c4bb9f76f-qwp5m          1/1     Running            0             8s
model-registry-deployment-cb6987594-gkrf8   1/2     CrashLoopBackOff   1 (18s ago)   19s
TIME 3: (screenshot omitted)
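One way to spot-check the behaviour from the scenarios above, assuming the proxy is port-forwarded to localhost:8080 and that this REST path is the one exposed (both are assumptions for illustration):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// While mlmd is down the proxy is expected to answer 503; 200 once it recovers.
	resp, err := http.Get("http://localhost:8080/api/model_registry/v1alpha3/registered_models")
	if err != nil {
		log.Fatalf("proxy unreachable: %v", err)
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusServiceUnavailable:
		fmt.Println("proxy is up, mlmd still unavailable (503)")
	case http.StatusOK:
		fmt.Println("proxy and mlmd are both healthy (200)")
	default:
		fmt.Printf("unexpected status: %d\n", resp.StatusCode)
	}
}
```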
Merge criteria:
- All commits are signed off (DCO check)