KServe is a standard model inference platform on Kubernetes, built for highly scalable use cases. KServe ModelMesh Serving is a recently added feature intended to increase KServe's scalability. It is designed to handle large numbers of models where the set of deployed models changes frequently. It loads and unloads models to balance responsiveness to users against computational footprint. By leveraging existing third-party model servers, it supports a number of standard ML/DL model formats out of the box, with more to follow: TensorFlow, PyTorch ScriptModule, ONNX, scikit-learn, XGBoost, LightGBM, and OpenVINO IR. It can also be extended with custom runtimes, such as the Watson NLP Runtime, to support arbitrary model formats.
This tutorial will walk you through the steps to deploy a Watson NLP model to the KServe ModelMesh Serving sandbox environment on IBM Technology Zone (TechZone).
- Create a KServe ModelMesh Serving sandbox environment on TechZone
- Install IBM Cloud CLI: ibmcloud
- Install Kubernetes CLI: kubectl
- Install Minio Client CLI: mc
- Install gRPCurl
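Before going further, it can help to confirm that the CLIs are actually on your PATH. The following is a minimal sketch, not part of the sandbox setup: the function names and the REQUIRED_TOOLS variable are made up for illustration, and the list also includes jq and jar because later commands in this tutorial pipe through them.

```shell
# Print the name of each given tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# CLIs invoked by the commands in this tutorial (hypothetical variable name).
REQUIRED_TOOLS="ibmcloud kubectl mc grpcurl jq jar"

# Report missing tools on stderr and return non-zero if any are absent.
check_prerequisites() {
  # Word splitting of $REQUIRED_TOOLS is intentional here.
  missing="$(missing_tools $REQUIRED_TOOLS)"
  if [ -n "$missing" ]; then
    echo "Missing tools:" $missing >&2
    return 1
  fi
  echo "All required tools found."
}
```

Run check_prerequisites after installing the tools; it only inspects PATH and does not verify versions.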
When you first reserve a TechZone sandbox environment for KServe ModelMesh Serving, an instance of ModelMesh Serving is created for you in a dedicated Kubernetes namespace on an IBM Cloud Kubernetes Service (IKS) cluster.
When the sandbox environment is ready, you will receive an email that includes a link to the Kubernetes Dashboard. Clicking on this link will open the Dashboard in your browser and show the Kubernetes service resources in your namespace.
- New users will receive an email invitation from IBM Cloud to join the account.
- You need to have an active login session on IBM Cloud Console before you can open the Kubernetes Dashboard.
A sample Watson NLP application, written in Dash, is also deployed in your sandbox environment. On the Kubernetes Dashboard, find a service named dash-app-lb, which has an external endpoint. Click on the link to the external endpoint to open it in your browser. The sample Dash App lets you provide some input text and get visualized Emotion Classification prediction results, using a pretrained model served by the ModelMesh Serving instance with the Watson NLP Runtime in your sandbox environment.
- It might take a few minutes for the DNS record of the Dash App's external endpoint to propagate across the Internet.
You need to log in to the IKS cluster with the CLI tools to run the kubectl commands in this tutorial.
ibmcloud plugin install ks
ibmcloud login
If you have a federated ID, use the ibmcloud login --sso command to log in.
The following command will update the kubeconfig file specified by the KUBECONFIG environment variable, or ~/.kube/config by default.
ibmcloud ks cluster config --cluster <iks-cluster-name>
- The name of the IKS cluster can be found in the email from TechZone when the sandbox environment is ready.
kubectl config set-context --current --namespace=<your-namespace>
- The name of your namespace can be found in the email from TechZone when the sandbox environment is ready.
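To confirm that the context now points at your namespace, you could read it back from the kubeconfig. This is a small sketch; the function names are hypothetical, and it relies only on kubectl config view with a JSONPath output.

```shell
# Print the namespace of the current kubectl context (empty if none is set).
current_namespace() {
  kubectl config view --minify --output 'jsonpath={..namespace}'
}

# Succeed only if the current context points at the expected namespace.
# Usage: check_namespace <your-namespace>
check_namespace() {
  expected="$1"
  actual="$(current_namespace)"
  if [ "$actual" != "$expected" ]; then
    echo "current namespace is '${actual}', expected '${expected}'" >&2
    return 1
  fi
}
```

For example, check_namespace <your-namespace> returns non-zero if the set-context step above was skipped or mistyped.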
Your TechZone sandbox environment comes with 3 pretrained Watson NLP models. They are stored in an AWS S3 compatible IBM Cloud Object Storage (COS) bucket, so that they can be served by KServe ModelMesh Serving. A Kubernetes custom resource named InferenceService must also be created to register the model with the service.
You should be able to see the InferenceService predictors created for those 3 pretrained Watson NLP models in your namespace.
$ kubectl get inferenceservice
ensemble-classification-wf-en-emotion-stock-predictor grpc://modelmesh-serving.ibmid-6620037hpc-669mq7e2:8033 True 93m
sentiment-document-cnn-workflow-en-stock-predictor grpc://modelmesh-serving.ibmid-6620037hpc-669mq7e2:8033 True 93m
syntax-izumo-en-stock-predictor grpc://modelmesh-serving.ibmid-6620037hpc-669mq7e2:8033 True 93m
KServe ModelMesh Serving currently provides a gRPC API using a ClusterIP service on port 8033, which is on the internal network of the Kubernetes cluster.
$ kubectl get service/modelmesh-serving
modelmesh-serving ClusterIP None <none> 8033/TCP,8008/TCP,2112/ 4h34m
You can run the kubectl port-forward command to forward a port on your local machine to the modelmesh-serving service on port 8033:
kubectl port-forward service/modelmesh-serving <local-port>:8033
- You can let kubectl choose an available local port for you by omitting <local-port>, that is, by passing :8033 as the port argument.
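The two variants of the port argument can be wrapped in a small helper. This is only a sketch with hypothetical function names; the service name and the remote port 8033 come from the steps above.

```shell
# Build the port-forward mapping. An empty local port yields ":8033",
# which tells kubectl to pick a free local port itself.
port_mapping() {
  local_port="${1:-}"
  printf '%s:8033\n' "$local_port"
}

# Forward a local port to the modelmesh-serving service.
# Usage: forward_modelmesh [local-port]
forward_modelmesh() {
  kubectl port-forward service/modelmesh-serving "$(port_mapping "${1:-}")"
}
```

forward_modelmesh 18033 forwards a fixed port, while forward_modelmesh with no argument lets kubectl choose and print the port it selected. The command keeps running in the foreground, so use a separate terminal for the following steps.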
You can interact with the gRPC service using the grpcurl CLI tool on your local machine. With this tool you can browse the schema for gRPC services, either by querying a server that supports server reflection, or by reading Protocol Buffers (Protobufs) source files, also known as .proto files. Since modelmesh-serving doesn't support server reflection, we'll use the .proto files here.
The protobuf files are included in the Watson NLP Runtime container image. You can extract them from the running pods and save them to a local directory.
Create a directory named protos and make it your current working directory:
mkdir protos && cd protos
Run the following command to extract the protobuf files:
kubectl exec deployment/modelmesh-serving-watson-nlp-runtime -c watson-nlp-runtime -- jar cM -C /app/protos . | jar x
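The command above needs the jar tool both in the container and on your machine. If you don't have a JDK installed locally, a tar-based variant may work instead. This is an untested sketch under the assumption that tar is available inside the Watson NLP Runtime container image; the function name is hypothetical.

```shell
# Copy /app/protos out of the runtime container into a local directory via tar.
# Assumes tar exists both locally and inside the container image.
# Usage: extract_protos_with_tar [dest-dir]
extract_protos_with_tar() {
  dest="${1:-.}"
  kubectl exec deployment/modelmesh-serving-watson-nlp-runtime -c watson-nlp-runtime -- \
    tar cf - -C /app/protos . | tar xf - -C "$dest"
}
```

Calling extract_protos_with_tar from inside the protos directory unpacks the files into the current directory, matching the layout the rest of this tutorial expects.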
You should be able to see the .proto files in the current directory.
$ ls
category-types.proto emotion-types.proto nounphrases-types.proto target-mention-types.proto
classification-types.proto entity-types.proto producer-types.proto text-primitive-types.proto
clustering-types.proto keyword-types.proto relation-types.proto text-similarity-types.proto
common-service.proto lang-detect-types.proto rules-types.proto topic-types.proto
concept-types.proto language-types.proto sentiment-types.proto vectorization-types.proto
embedding-types.proto matrix-types.proto syntax-types.proto
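With the files extracted, grpcurl can browse the schema offline, since its list and describe verbs work from .proto sources without contacting a server. The sketch below uses hypothetical function and variable names; the file name and the service name are the ones used elsewhere in this tutorial.

```shell
# Run from inside the protos directory created above.
PROTO_FILE="./common-service.proto"
NLP_SERVICE="watson.runtime.nlp.v1.NlpService"

# List the gRPC services declared in the .proto file (no server needed).
list_services() {
  grpcurl -proto "$PROTO_FILE" list
}

# Show the methods of the NLP service, such as SyntaxPredict.
describe_nlp_service() {
  grpcurl -proto "$PROTO_FILE" describe "$NLP_SERVICE"
}
```

The describe output is a convenient way to find the request and response message types for a method before composing a call.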
Assuming local port 18033 was chosen for port forwarding, you should now be able to send an inference call at localhost:18033 to one of the pretrained models loaded into the Watson NLP Runtime on KServe ModelMesh Serving.
grpcurl -plaintext -proto ./common-service.proto \
-H 'mm-vmodel-id: syntax-izumo-en-stock-predictor' \
-d '
{
  "parsers": [
    "TOKEN"
  ],
  "rawDocument": {
    "text": "This is a test."
  }
}
' \
localhost:18033 watson.runtime.nlp.v1.NlpService.SyntaxPredict
If you get a response like the following, the Watson NLP Runtime is working properly.
{
  "text": "This is a test.",
  "producerId": {
    "name": "Izumo Text Processing",
    "version": "0.0.1"
  },
  "tokens": [
    {
      "span": {
        "end": 4,
        "text": "This"
      }
    },
    {
      "span": {
        "begin": 5,
        "end": 7,
        "text": "is"
      }
    },
    {
      "span": {
        "begin": 8,
        "end": 9,
        "text": "a"
      }
    },
    {
      "span": {
        "begin": 10,
        "end": 14,
        "text": "test"
      }
    },
    {
      "span": {
        "begin": 14,
        "end": 15,
        "text": "."
      }
    }
  ],
  "sentences": [
    {
      "span": {
        "end": 15,
        "text": "This is a test."
      }
    }
  ],
  "paragraphs": [
    {
      "span": {
        "end": 15,
        "text": "This is a test."
      }
    }
  ]
}
The KServe ModelMesh Serving instance in TechZone comes with a dedicated COS bucket, where you can store your own models and serve them through the KServe ModelMesh Serving instance. Several CLI tools can be used to upload your models to the COS bucket. We'll use the Minio Client here as an example.
You will need the HMAC credential stored in a Kubernetes secret object named storage-config to access the COS bucket. Here is how you can retrieve it.
kubectl get secret/storage-config -o json | jq -r '."data"."'$BUCKET'"' | base64 -d
- Replace $BUCKET with the name of the dedicated COS bucket, which should be the same as your Kubernetes namespace.
$ kubectl get secret/storage-config -o json | jq -r '."data"."'$BUCKET'"' | base64 -d
{
  "type": "s3",
  "access_key_id": "683a3fb50e0a49d5ae2463725b3e83f5",
  "secret_access_key": "86b13e59da3a28d1b134d11ace6913705043c4289d976e37",
  "endpoint_url": "https://s3.us-south.cloud-object-storage.appdomain.cloud",
  "region": "us-south",
  "default_bucket": "ibmid-6620037hpc-669mq7e2"
}
To add an entry in your Minio Client configuration for your COS bucket, run the following command:
- Replace ${ALIAS} with a short alias for referencing Object Storage in commands.
- Replace the endpoint placeholder with the endpoint_url of the HMAC credential.
- Replace the access key placeholder with the access_key_id of the HMAC credential.
- Replace the secret key placeholder with the secret_access_key of the HMAC credential.
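One way to register the bucket, assuming a recent MinIO Client release where aliases are managed with mc alias set (older releases used mc config host add), is sketched below. The function name and the default alias cos are hypothetical; the three arguments are the values from the HMAC credential retrieved earlier.

```shell
# Alias used in later mc commands; "cos" is only an illustrative default.
ALIAS="${ALIAS:-cos}"

# Register the COS endpoint with the MinIO Client under $ALIAS.
# Usage: add_cos_alias <endpoint_url> <access_key_id> <secret_access_key>
add_cos_alias() {
  if [ "$#" -ne 3 ]; then
    echo "usage: add_cos_alias <endpoint_url> <access_key_id> <secret_access_key>" >&2
    return 2
  fi
  mc alias set "$ALIAS" "$1" "$2" "$3"
}
```

Passing the secret on the command line leaves it in your shell history, so you may prefer to clear the history afterwards or enter the values interactively.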
Use the mc cp --recursive command to upload your model.
mc cp --recursive /path/to/mymodel ${ALIAS}/${BUCKET}
Check the content of your COS bucket with mc tree --files:
mc tree --files ${ALIAS}/${BUCKET}
More details regarding Minio and other tools can be found in the following IBM Cloud docs:
- https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-minio
- https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-upload
A Kubernetes custom resource can be created for the uploaded model as follows.
kubectl create -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: $NAME
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: watson-nlp
      storage:
        path: $PATH_TO_MODEL
        key: $BUCKET
        parameters:
          bucket: $BUCKET
EOF
- Replace $NAME with any valid unique name.
- Replace $PATH_TO_MODEL with the folder path inside the bucket.
- Replace $BUCKET with the name of the COS bucket.
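Because the here-document delimiter is unquoted, the shell expands variables such as $NAME and $BUCKET before kubectl sees the manifest, and an unset variable silently becomes an empty string. A small guard like the following sketch (the function name is hypothetical) can catch that before the resource is created.

```shell
# Fail if any of the named shell variables is unset or empty.
# Usage: require_vars NAME BUCKET
require_vars() {
  for var in "$@"; do
    # Indirect lookup of the variable whose name is stored in $var.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "variable $var is not set" >&2
      return 1
    fi
  done
}
```

For example, run require_vars NAME BUCKET first and only proceed with kubectl create if it succeeds. Only pass variable names you control, since the lookup uses eval.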
Once the model is successfully loaded, the READY status will be True when checked with the following command:
kubectl get inferenceservice
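Instead of polling by hand, you could block until the model is ready. This is a sketch that assumes the InferenceService exposes a Ready condition in its status, which is what the READY column reflects; the function name and the default timeout are hypothetical.

```shell
# Block until the named InferenceService reports Ready, or time out.
# Usage: wait_for_isvc <name> [timeout], e.g. wait_for_isvc my-model 120s
wait_for_isvc() {
  name="$1"
  timeout="${2:-300s}"
  kubectl wait --for=condition=Ready "inferenceservice/${name}" --timeout="$timeout"
}
```

kubectl wait exits non-zero on timeout, so the function can gate a follow-up inference call in a script.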
You should now be able to make inference calls to your own custom model from your local machine, in a way similar to what you did with the pretrained model. More details regarding custom serving runtimes for KServe ModelMesh Serving can be found here.
After you've completed this tutorial, you can go to the My reservations page on TechZone and delete your Sandbox Environment for KServe ModelMesh Serving.