A course covering practical aspects of deploying, optimizing, and monitoring Generative AI models. The course is divided into three modules: Deployment, Model Optimization, and Monitoring and Maintaining Deployments.
Covers various strategies for deploying Generative AI models, starting with local deployment on a laptop or workstation, followed by on-premise server-based deployments, then edge deployments, and finishing with cloud-based deployments. Also covers the pros and cons of each strategy and the factors to consider when choosing one.
- llama.cpp: Enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud. See also llama-cpp-python for Python bindings.
- llamafile: Makes open-source LLMs more accessible to both developers and end users. Combines llama.cpp with Cosmopolitan Libc into one framework that collapses all the complexity of LLMs down to a single-file executable (called a "llamafile") that runs locally on most computers, with no installation.
- Ollama (GitHub): Get up and running with Llama 3, Mistral, Gemma 2, and other large language models. Uses llama.cpp as the backend.
- Open WebUI (GitHub): Extensible, self-hosted interface for AI that adapts to your workflow, all while operating entirely offline.
- Jupyter AI: A generative AI extension for JupyterLab.
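Once a local tool like Ollama is running, it exposes an HTTP API on `localhost:11434`. The sketch below builds a request for Ollama's `/api/generate` endpoint; the model name and prompt are illustrative, and it assumes the server is already serving a pulled model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

if __name__ == "__main__":
    req = build_generate_request("llama3", "Why is the sky blue?")
    # Requires `ollama serve` (and `ollama pull llama3`) to be running locally:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.loads(resp.read())["response"])
```

Because the endpoint speaks plain JSON over HTTP, the same request shape works from any language, which is part of what makes local-first deployment attractive for prototyping.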
Additional relevant material:
- DeepLearning AI: Open Source Models with HuggingFace
- DeepLearning AI: Building Generative AI Apps
- Blog Post: Emerging UI/UX patterns for AI applications
- Latent Space Podcast: Tiny Model Revolution
- Awesome LLMs on Device
- LitServe
- DeepLearning AI: Introduction to Device AI
- Machine Learning Compilation (GitHub)
- Deploying LLMs in your Web Browser
- NVIDIA Orin SDK
- NVIDIA Holoscan SDK (GitHub)
- NVIDIA Holohub
- DeepLearning AI: Serverless LLM Apps using Amazon Bedrock
- DeepLearning AI: Developing Generative AI Apps using Microsoft Semantic Kernel
- DeepLearning AI: Understanding and Applying Text Embeddings with Vertex AI
- DeepLearning AI: Pair Programming with LLMs
Covers techniques for optimizing Generative AI models for deployment, such as model pruning, quantization, and distillation. Also covers the trade-offs between model size, speed, and performance.
- DeepLearning AI: Quantization Fundamentals
- DeepLearning AI: Quantization in Depth
- GGUF My Repo
- Blog Post: Running Hugging Face GGUF models with Ollama (https://www.markhneedham.com/blog/2023/10/18/ollama-hugging-face-gguf-models/)
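To make the size/accuracy trade-off concrete, here is a minimal, pure-Python sketch of symmetric 8-bit quantization: each float weight is mapped to an integer in [-127, 127] via a single scale factor, shrinking storage 4x (fp32 to int8) at the cost of a bounded rounding error. Real toolchains (GGUF, bitsandbytes) use more sophisticated per-block schemes; this only illustrates the core idea.

```python
def quantize_int8(values):
    """Symmetric 8-bit quantization: map floats to ints in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from quantized ints."""
    return [x * scale for x in q]

weights = [0.4, -1.0, 0.25]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Round-trip error per weight is at most scale / 2 (about 0.004 here)
```

The error bound of half a quantization step per weight is what model optimization trades against: coarser scales mean smaller models and faster inference, but larger deviations from the original weights.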
Covers the importance of monitoring the performance of deployed models and updating them as needed. Discusses potential issues that might arise during deployment and how to troubleshoot them.
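As a sketch of what such monitoring can look like in practice, the snippet below tracks rolling latency and error rate for an inference endpoint and flags threshold breaches. The window size, p95 budget, and error-rate threshold are illustrative assumptions, not recommendations.

```python
from collections import deque

class InferenceMonitor:
    """Track rolling latency and error rate for a deployed model endpoint.

    Window size and alert thresholds below are illustrative assumptions.
    """

    def __init__(self, window=100, p95_budget_s=2.0, max_error_rate=0.05):
        self.latencies = deque(maxlen=window)
        self.outcomes = deque(maxlen=window)  # True = request succeeded
        self.p95_budget_s = p95_budget_s
        self.max_error_rate = max_error_rate

    def record(self, latency_s, ok):
        self.latencies.append(latency_s)
        self.outcomes.append(ok)

    def p95_latency(self):
        ordered = sorted(self.latencies)
        return ordered[int(0.95 * (len(ordered) - 1))]

    def error_rate(self):
        return 1 - sum(self.outcomes) / len(self.outcomes)

    def alerts(self):
        out = []
        if self.p95_latency() > self.p95_budget_s:
            out.append("p95 latency over budget")
        if self.error_rate() > self.max_error_rate:
            out.append("error rate over threshold")
        return out
```

In a real deployment these numbers would feed a metrics backend (e.g. Prometheus) rather than in-process deques, but the same two signals, tail latency and error rate, are usually the first things to watch and the first clues when troubleshooting a misbehaving model.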