Build model inference APIs and multi-model serving systems with any open-source or custom AI models. Join our Slack community!
BentoML is an open-source model serving framework that simplifies how AI/ML models get into production:
- Easily build APIs for any AI/ML model. Turn any model inference script into a REST API server with just a few lines of code and standard Python type hints.
- Docker containers made simple. No more dependency hell! Manage your environments, dependencies, and models with a simple config file. BentoML automatically generates Docker images, ensures reproducibility, and simplifies how you run inference across different environments.
- Maximize CPU/GPU utilization. Improve API throughput and latency with built-in serving optimizations such as dynamic batching, model parallelism, multi-stage pipelines, and multi-model inference-graph orchestration.
- Build custom AI applications. BentoML is highly flexible for advanced customizations. Implement your own API specifications and asynchronous inference tasks, customize pre/post-processing and model inference logic, and define model composition, all in Python code. Supports any ML framework, modality, and inference runtime.
- Build for production. Develop, run, and debug locally. Seamlessly deploy to production with Docker containers or BentoCloud.
Install BentoML:
    # Requires Python≥3.8
    pip install bentoml torch transformers
Define APIs in a service.py file:
    import bentoml
    from transformers import pipeline
    from typing import List


    @bentoml.service
    class Summarization:
        def __init__(self):
            # Load the summarization pipeline when the service starts
            self.pipeline = pipeline('summarization')

        @bentoml.api(batchable=True)
        def summarize(self, texts: List[str]) -> List[str]:
            # The pipeline returns one dict per input text
            results = self.pipeline(texts)
            return [res['summary_text'] for res in results]
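With `batchable=True`, `summarize` receives a list of texts and is expected to return a list of the same length in the same order. The sketch below illustrates that contract without BentoML or a model download. Both names are assumptions made for illustration: `fake_pipeline` is a hypothetical stand-in that mimics the list-of-dicts output shape used above, and `run_batched` is an illustrative helper (not BentoML's implementation) that merges several callers' inputs into one call and splits the results back.

```python
from typing import Callable, Dict, List


def fake_pipeline(texts: List[str]) -> List[Dict[str, str]]:
    """Hypothetical stand-in for the summarization pipeline.

    Returns one dict per input with a 'summary_text' key; the "summary"
    is simply the first five words of the input.
    """
    return [{"summary_text": " ".join(t.split()[:5])} for t in texts]


def summarize(texts: List[str]) -> List[str]:
    """Same post-processing as the service method: one summary per input."""
    results = fake_pipeline(texts)
    return [res["summary_text"] for res in results]


def run_batched(
    requests: List[List[str]],
    fn: Callable[[List[str]], List[str]],
) -> List[List[str]]:
    """Merge several requests into one call to fn, then split the outputs.

    Illustrative only: it shows why a batchable function must preserve
    length and order, not how BentoML schedules batches.
    """
    merged = [text for request in requests for text in request]
    outputs = fn(merged)
    if len(outputs) != len(merged):
        raise ValueError("a batchable function must return one output per input")
    split, start = [], 0
    for request in requests:
        split.append(outputs[start:start + len(request)])
        start += len(request)
    return split
```

Calling `run_batched([["first text"], ["second text", "third text"]], summarize)` returns one list of summaries per original request, which only works because `summarize` keeps inputs and outputs aligned.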
Run the service code locally (serving at http://localhost:3000 by default):
    bentoml serve service.py:Summarization
Now you can run inference from your browser at http://localhost:3000 or with a Python script:
    import bentoml

    with bentoml.SyncHTTPClient('http://localhost:3000') as client:
        text_to_summarize: str = input("Enter text to summarize: ")
        summarized_text: str = client.summarize([text_to_summarize])[0]
        print(f"Summarized text: {summarized_text}")
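Because the service is a plain REST API, it can also be called without the BentoML client. The following is a minimal sketch using only the standard library. It assumes the endpoint path matches the API method name (`/summarize`) and that the JSON body is keyed by the parameter name (`texts`); check the page served at http://localhost:3000 if your service differs. The helper names `build_summarize_request` and `summarize_over_http` are hypothetical.

```python
import json
import urllib.request
from typing import List


def build_summarize_request(
    texts: List[str], base_url: str = "http://localhost:3000"
) -> urllib.request.Request:
    """Build a POST request for the summarize endpoint.

    Assumptions: the route is /summarize (the API method name) and the
    JSON body is keyed by the parameter name, "texts".
    """
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/summarize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_over_http(
    texts: List[str], base_url: str = "http://localhost:3000"
) -> List[str]:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_summarize_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `summarize_over_http(["Text to summarize goes here."])` should return a list with one summary string.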
To deploy your BentoML Service code, first create a bentofile.yaml file to define its dependencies and environments. The full list of bentofile options is in the documentation.
    service: "service:Summarization" # Entry service import path
    include:
      - "*.py" # Include all .py files in current directory
    python:
      packages: # Python dependencies to include
        - torch
        - transformers
Then, choose one of the following ways for deployment:
Docker Container
Run bentoml build to package the necessary code, models, and dependency configs into a Bento, the standardized deployable artifact in BentoML:
    bentoml build
Ensure Docker is running. Generate a Docker container image for deployment:
    bentoml containerize summarization:latest
Run the generated image:
    docker run --rm -p 3000:3000 summarization:latest
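A container can take a while to load the model before it accepts traffic, so scripts that call it right after `docker run` may want to wait first. The sketch below polls a readiness URL until it answers with HTTP 200. The `/readyz` route is an assumption about the server's health endpoints, and `http_probe` and `wait_until_ready` are hypothetical helper names; adjust the URL if your deployment exposes a different health check.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet
        return False


def wait_until_ready(
    url: str = "http://localhost:3000/readyz",  # assumed readiness route
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll url until probe succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The `probe` parameter exists so the polling loop can be exercised without a running container, for example by passing a function that returns `True` on its third call.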
BentoCloud
BentoCloud is the AI inference platform for fast-moving AI teams. It lets you easily deploy your BentoML code on fast-scaling infrastructure. Sign up for BentoCloud for personal access; for enterprise use cases, contact our team.
    # After signup, follow login instructions upon API token creation:
    bentoml cloud login --api-token <your-api-token>
    # Deploy from current directory:
    bentoml deploy .
For detailed explanations, read Quickstart.
- LLMs: Llama 3, Mixtral, Solar, Mistral, and more
- Image Generation: Stable Diffusion, Stable Video Diffusion, Stable Diffusion XL Turbo, ControlNet, LCM LoRAs
- Text Embeddings: SentenceTransformers
- Audio: XTTS, WhisperX, Bark
- Computer Vision: YOLO
- Multimodal: BLIP, CLIP
- Compound AI systems: Serving RAG with custom models
Check out the examples folder for more sample code and usage.
- Model composition
- Workers and model parallelization
- Adaptive batching
- GPU inference
- Distributed serving systems
- Concurrency and autoscaling
- Model packaging and Model Store
- Observability
- BentoCloud deployment
See Documentation for more tutorials and guides.
Get involved and join our Community Slack, where thousands of AI/ML engineers help each other, contribute to the project, and talk about building AI products.
To report a bug or request a feature, use GitHub Issues.
There are many ways to contribute to the project:
- Report bugs and give a "thumbs up" to issues that are relevant to you.
- Investigate issues and review other developers' pull requests.
- Contribute code or documentation to the project by submitting a GitHub pull request.
- Check out the Contributing Guide and Development Guide to learn more.
- Share your feedback and discuss roadmap plans in the #bentoml-contributors channel.
Thanks to all of our amazing contributors!
The BentoML framework collects anonymous usage data that helps our community improve the product. Only BentoML's internal API calls are reported. This excludes any sensitive information, such as user code, model data, model names, or stack traces. The code used for usage tracking is available in the BentoML repository. You can opt out of usage tracking with the --do-not-track CLI option:
    bentoml [command] --do-not-track
Or by setting the environment variable:
    export BENTOML_DO_NOT_TRACK=True
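When BentoML commands are launched from a Python script rather than a shell, the same environment variable can be passed to the child process. This is a small sketch under that assumption; `with_tracking_disabled` is a hypothetical helper, and it only builds the environment mapping without starting anything itself.

```python
import os
from typing import Dict, Mapping, Optional


def with_tracking_disabled(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of env (default: os.environ) with BENTOML_DO_NOT_TRACK=True.

    The input mapping is left unmodified.
    """
    new_env = dict(os.environ if env is None else env)
    new_env["BENTOML_DO_NOT_TRACK"] = "True"
    return new_env
```

The result can be handed to `subprocess.run(["bentoml", "serve", "service.py:Summarization"], env=with_tracking_disabled())`, so the opt-out applies to that process only.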