This document shows how to serve a LitGPT model for deployment.
This section illustrates how to set up a minimal, highly scalable inference server for a phi-2 LLM using litgpt serve.
# 1) Download a pretrained model (alternatively, use your own finetuned model)
litgpt download microsoft/phi-2
# 2) Start the server
litgpt serve microsoft/phi-2
Tip: Use litgpt serve --help to display additional options, including the port, devices, LLM temperature setting, and more.
You can now send requests to the inference server you started in step 2. For example, in a new Python session:
import requests

# Send the prompt as JSON to the server's /predict endpoint
response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Fix typos in the following sentence: Exampel input"}
)
print(response.json()["output"])
Executing the code above prints the following output:
Example input.
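For repeated queries, it can help to wrap the request in a small function with a timeout and basic error handling. The following is a minimal sketch, not part of LitGPT: the helper names (build_request, parse_output, predict) and the 60-second timeout are assumptions made for illustration, and it uses Python's standard-library urllib instead of requests so that it has no extra dependencies. It assumes the same /predict endpoint and the "prompt"/"output" JSON fields shown above.

```python
import json
import urllib.error
import urllib.request

# Address used in the example above; adjust it if the server runs elsewhere.
DEFAULT_URL = "http://127.0.0.1:8000/predict"


def build_request(prompt, url=DEFAULT_URL):
    """Build a POST request carrying the prompt as a JSON body (hypothetical helper)."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_output(raw):
    """Extract the "output" field from a raw JSON response body."""
    return json.loads(raw.decode("utf-8"))["output"]


def predict(prompt, url=DEFAULT_URL, timeout=60.0):
    """Send one prompt to the server and return the generated text.

    The timeout value is an assumption; long generations may need more time.
    """
    request = build_request(prompt, url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_output(response.read())
    except urllib.error.URLError as err:
        # Covers connection failures as well as HTTP error statuses.
        raise RuntimeError(f"Request to {url} failed: {err}") from err
```

With the server from step 2 running, predict("Fix typos in the following sentence: Exampel input") would return the generated text as a string. This helper targets the non-streaming server only.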
The 2-step procedure described above returns the complete response all at once. If you want to stream the response on a token-by-token basis, start the server with the streaming option enabled:
litgpt serve microsoft/phi-2 --stream true
Then, use the following updated code to query the inference server:
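import requests

response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Fix typos in the following sentence: Exampel input"},
    stream=True
)
# The streamed body is a sequence of JSON objects, so response.json() cannot parse it;
# print each raw chunk as it arrives instead
for chunk in response.iter_content(chunk_size=None):
    print(chunk, end="")
Executing the code above prints the streamed chunks as they arrive: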
b'{"output": "The"}'b'{"output": " corrected"}'b'{"output": " sentence"}'b'{"output": " is"}'b'{"output": ":"}'b'{"output": " Example"}'b'{"output": " input"}'
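The raw chunks shown above are individual JSON objects, each carrying one piece of text in its "output" field. To turn them back into readable text, the objects can be decoded one at a time and their "output" values joined. The sketch below is one possible way to do this under the assumption that the stream is a plain concatenation of JSON objects, as the output above suggests; iter_outputs is a hypothetical helper name, not a LitGPT function. It buffers incoming bytes so that an object split across two chunks is still decoded correctly.

```python
import codecs
import json


def iter_outputs(chunks):
    """Yield the "output" string of each JSON object in a stream of byte chunks."""
    json_decoder = json.JSONDecoder()
    # Incremental decoding handles multi-byte characters split across chunks.
    utf8_decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer = (buffer + utf8_decoder.decode(chunk)).lstrip()
        while buffer:
            try:
                obj, end = json_decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # Incomplete object: wait for the next chunk.
            yield obj["output"]
            buffer = buffer[end:].lstrip()


# Demo with the chunks from the example output above.
sample_chunks = [
    b'{"output": "The"}',
    b'{"output": " corrected"}',
    b'{"output": " sentence"}',
    b'{"output": " is"}',
    b'{"output": ":"}',
    b'{"output": " Example"}',
    b'{"output": " input"}',
]
print("".join(iter_outputs(sample_chunks)))
# prints: The corrected sentence is: Example input
```

Against a live streaming server, the same generator could be fed from the response directly, for example by looping over iter_outputs(response.iter_content(chunk_size=None)) and printing each piece with end="" and flush=True.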