MLServer aims to provide an easy way to start serving your machine learning models through REST and gRPC interfaces, fully compliant with KFServing's V2 Dataplane spec. The list of cool features includes:
- Adaptive batching, to group inference requests together on the fly (see the configuration sketch right after this list).
- Parallel inference serving, for vertical scaling across multiple models through a pool of inference workers.
- Multi-model serving, to run multiple models within the same process.
- Support for the standard V2 Inference Protocol, in both its gRPC and REST flavours.
- Scalability with deployment in Kubernetes-native frameworks, including Seldon Core and KServe, where MLServer is the core Python inference server used to serve machine learning models.
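The first two features are enabled purely through configuration. As a rough sketch (the values below are illustrative assumptions, not required defaults), parallel inference is controlled by parallel_workers in settings.json, while adaptive batching is controlled by max_batch_size and max_batch_time (in seconds) in model-settings.json; both files are introduced in more detail later on.

settings.json (snippet):
{
  "parallel_workers": 4
}

model-settings.json (snippet):
{
  "max_batch_size": 32,
  "max_batch_time": 0.5
}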
Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice. MLServer also ships with out-of-the-box inference runtimes for many popular frameworks, such as Scikit-Learn, XGBoost, LightGBM, and MLflow.
In this exercise, we will deploy a sentiment analysis Hugging Face Transformer model. Since MLServer does not provide out-of-the-box support for PyTorch or Transformer models, we will write a custom inference runtime to deploy it.
pip install mlserver
# to install an out-of-the-box inference runtime
pip install mlserver-sklearn # or any of the frameworks supported above
It's very easy to extend MLServer to any framework beyond the supported ones by writing a custom inference runtime. To add support for our framework, we extend the mlserver.MLModel abstract class and override two main methods:
- load(self) -> bool: responsible for loading any artifacts related to the model (e.g. model weights, pickle files, etc.).
- predict(self, payload: InferenceRequest) -> InferenceResponse: responsible for using the model to perform inference on an incoming data point.
import json
from collections import defaultdict

import numpy as np
import torch
import torch.nn.functional as F
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

from mlserver import MLModel, types
from mlserver.codecs import StringCodec
from mlserver.utils import get_model_uri


class SentimentModel(MLModel):
    """
    Implementation of the MLModel interface to load and serve custom
    Hugging Face transformer models.
    """

    # load the model and tokenizer from the model URI
    async def load(self) -> bool:
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model_uri = await get_model_uri(self._settings)
        self.model_name = model_uri
        self.model = DistilBertForSequenceClassification.from_pretrained(
            self.model_name
        )
        self.model.eval()
        self.model.to(self.device)
        self.tokenizer = DistilBertTokenizer.from_pretrained(self.model_name)
        self.ready = True
        return self.ready

    # output predictions as a V2 inference response
    async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
        input_id, attention_mask = self._preprocess_inputs(payload)
        prediction = self._model_predict(input_id, attention_mask)
        return types.InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[
                types.ResponseOutput(
                    name="predictions",
                    shape=prediction.shape,
                    datatype="FP32",
                    data=np.asarray(prediction).tolist(),
                )
            ],
        )

    # decode and tokenize the incoming request payload
    def _preprocess_inputs(self, payload: types.InferenceRequest):
        inp_text = defaultdict()
        for inp in payload.inputs:
            inp_text[inp.name] = json.loads(
                "".join(self.decode(inp, default_codec=StringCodec))
            )
        inputs = self.tokenizer(inp_text["text"], return_tensors="pt")
        input_id = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]
        return input_id, attention_mask

    # run inference and return class probabilities
    def _model_predict(self, input_id, attention_mask):
        with torch.no_grad():
            outputs = self.model(
                input_id.to(self.device), attention_mask.to(self.device)
            )
            probs = F.softmax(outputs.logits, dim=1).cpu().numpy()[0]
        return probs
The next step will be to create two configuration files:
- settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
- model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
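As a minimal sketch (the model name sentiment-model, the module path serving.SentimentModel, and the artifact URI ./model are assumptions; adjust them to your project layout), the two files could look like this:

settings.json:
{
  "debug": false,
  "http_port": 8080,
  "grpc_port": 8081
}

model-settings.json:
{
  "name": "sentiment-model",
  "implementation": "serving.SentimentModel",
  "parameters": {
    "uri": "./model",
    "version": "v0.1.0"
  }
}

The implementation field points MLServer at our custom SentimentModel runtime, and the ports match the ones exposed in the Docker commands below.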
Test the sentiment classifier model
docker build -t sentiment -f sentiment/Dockerfile.sentiment sentiment/
docker run --rm -it sentiment
Test MLServer locally
# download trained models
bash get_models.sh
# create a docker image
mlserver build . -t 'sentiment-app:1.0.0'
docker run -it --rm -p 8080:8080 -p 8081:8081 sentiment-app:1.0.0
In a separate terminal, run:
# test inference request (REST)
python3 test_local_http_endpoint.py
# test inference request (gRPC)
python3 test_local_grpc_endpoint.py
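For reference, here is a minimal sketch of what the REST test could look like (the model name sentiment-model and the example text are assumptions; the actual script in the repo may differ):

import json

import requests

# V2 inference endpoint exposed on MLServer's REST port (8080 by default);
# "sentiment-model" is an assumed name -- use the one from model-settings.json.
endpoint = "http://localhost:8080/v2/models/sentiment-model/infer"

# The custom runtime decodes each input with StringCodec and then json.loads it,
# so the text is sent as a JSON-encoded string.
payload = {
    "inputs": [
        {
            "name": "text",
            "shape": [1],
            "datatype": "BYTES",
            "data": [json.dumps("MLServer makes serving models straightforward!")],
        }
    ]
}

response = requests.post(endpoint, json=payload)
print(response.json())  # class probabilities are returned under outputs[0].data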
- Deploy the MLServer application on Seldon Core or KServe.