Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Add TorchServe FVT #294

Merged
merged 4 commits into from
Feb 21, 2023
Merged

test: Add TorchServe FVT #294

merged 4 commits into from
Feb 21, 2023

Conversation

rafvasq
Copy link
Member

@rafvasq rafvasq commented Dec 22, 2022

Motivation

Support for TorchServe was added in #250 and kserve/modelmesh-runtime-adapter#34. A test should be added for it as well.

Modifications

  • Adds basic FVT for load/inference with a TorchServe MAR model using the native TorchServe gRPC API
  • Disables OVMS runtime and tests to allow TorchServe to be tested due to resource constraints

Result

Closes #280

@rafvasq rafvasq removed the request for review from tedhtchang January 6, 2023 21:22
@rafvasq rafvasq force-pushed the torch-fvt branch 3 times, most recently from 8b6ae3b to 43d1c75 Compare January 9, 2023 20:19
@rafvasq rafvasq marked this pull request as ready for review January 9, 2023 20:28
@rafvasq rafvasq added do-not-merge/hold test testing related bugs and fixes labels Jan 9, 2023
@rafvasq
Copy link
Member Author

rafvasq commented Jan 12, 2023

To unblock this PR, will work with @ckadner to add the TorchServe model used for the new FVT test to the modelmesh-minio-dev-examples image. From there, I can confirm whether the PR passes the FVTs as expected.

@rafvasq rafvasq self-assigned this Jan 18, 2023
@ckadner
Copy link
Member

ckadner commented Jan 20, 2023

To unblock this PR, will work with @ckadner to add the TorchServe model used for the new FVT test to the modelmesh-minio-dev-examples image. From there, I can confirm whether the PR passes the FVTs as expected.

@rafvasq -- I pushed a new modelmesh-minio-dev-examples:latest image with the FVT models you added in PR #300

https://hub.docker.com/r/kserve/modelmesh-minio-dev-examples/tags

@ckadner
Copy link
Member

ckadner commented Jan 20, 2023

Even though the FVTs are failing, I no longer see the pytorch-mar predictor tests fail, like this one 2 weeks ago

Summarizing 2 Failures:
  [FAIL] Predictor update pytorch-mar predictor [BeforeEach] should successfully update and reload the model
  /home/runner/work/modelmesh-serving/modelmesh-serving/fvt/helpers.go:290
  [FAIL] Predictor create pytorch-mar predictor [It] should successfully load a model
  /home/runner/work/modelmesh-serving/modelmesh-serving/fvt/helpers.go:290

Ran 42 of 136 Specs in 200.237 seconds
FAIL! - Interrupted by Other Ginkgo Process -- 40 Passed | 2 Failed | 0 Pending | 94 Skipped

So I assume the new modelmesh-minio-dev-examples image works okay, no?

@rafvasq
Copy link
Member Author

rafvasq commented Jan 20, 2023

@ckadner, that's right. I can see that the pytorch-mar predictor is being loaded in the test:

1.6742378737294984e+09	INFO	Watcher got event with object	{"name": "pytorch-mar-predictor-4vnh9", "status.available": true, "status.activeModelState": "Loaded", "status.targetModelState": "", "status.transitionStatus": "UpToDate", "status.lastFailureInfo": null}

I did have to update .github/workflows/run-fvt.yml to load the TorchServe runtime too since it was disabled. I'm aware that there could be resource issues related to loaded all 4 runtimes at once, but I'm debugging to figure out if it's failing because of that.

Seems like it's getting past the TorchServe tests though so I believe the new modelmesh-minio-dev-examples image is working just fine. :)

@rafvasq rafvasq force-pushed the torch-fvt branch 4 times, most recently from 828b052 to 8bf62f9 Compare January 25, 2023 20:51
- TS FVTs have been proven to pass but inconsistently due to resource constraints

Signed-off-by: Rafael Vasquez <[email protected]>
@ckadner
Copy link
Member

ckadner commented Jan 25, 2023

Hi @rafvasq -- I see you are still battling FVT errors. When I checked yesterday, I saw this error which I assume was caused by resource restriction.

https://github.com/kserve/modelmesh-serving/actions/runs/4002549941/jobs/6869875571#step:13:3630

  Predictor pytorch-mar-predictor-46866 state should not be changed from 'Loading' to 'FailedToLoad'
  Expected
      <bool>: false
  to be true
  In [It] at: /home/runner/work/modelmesh-serving/modelmesh-serving/fvt/helpers.go:286

Should we try to increase the resources on the FVT cluster to accommodate the additional workload?

@rafvasq
Copy link
Member Author

rafvasq commented Jan 31, 2023

I've created #320 to address FVT errors going forward and as @ckadner noted there, the Github constraints are separate from the nightly e2e-pipeline on IBM Cloud.

Re: this PR, TorchServe FVTs have been proven to pass and will stay disabled for now due to resource-constraints.

Copy link
Member

@ckadner ckadner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rafvasq LGTM, just a small comment.

@@ -112,6 +112,15 @@ var predictorsArray = []FVTPredictor{
differentPredictorName: "xgboost",
differentPredictorFilename: "xgboost-predictor.yaml",
},
// TorchServe test is currently disabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the reason here?

@kserve-oss-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ckadner, njhill, rafvasq

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kserve-oss-bot kserve-oss-bot merged commit ff0d455 into kserve:main Feb 21, 2023
@rafvasq rafvasq deleted the torch-fvt branch February 21, 2023 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm test testing related bugs and fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add FVT for TorchServe runtime
4 participants