Multimodal LLM with Hugging Face Transformers
Tip
Please see this slide presentation
Multimodal Large Language Models (LLMs) are advanced AI systems capable of processing and generating content across multiple data modalities, such as text, images, audio, and video. These models learn the relationships between different types of data, enabling tasks that combine modalities, such as generating descriptive text from an image or answering questions based on both text and images. Typical multimodal tasks include (a minimal captioning sketch follows this list):
- Image Captioning: Automatically generating descriptive text for an image.
- Visual Question Answering (VQA): Answering questions about an image based on its content.
- Text-to-Image Generation: Creating images from text descriptions (e.g., using models like DALL-E).
- Image-to-Text Generation: Generating textual descriptions or stories from images.
- Speech-to-Text and Text-to-Speech: Converting spoken language into text or vice versa.
- Audio-Visual Analysis: Understanding and generating responses based on both audio and visual inputs, such as in video content.
- Cross-Modal Retrieval: Retrieving relevant data (e.g., images) based on a query in a different modality (e.g., text).
- Multimodal Sentiment Analysis: Determining sentiment by analyzing both text and visual elements together.
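As a quick illustration of the first two tasks, here is a minimal image-captioning sketch using the high-level pipeline API (the model identifier Salesforce/blip-image-captioning-base and the file name cat.jpg are assumptions; substitute any captioning model and image you prefer):

from transformers import pipeline

# Image captioning ("image-to-text") with a pre-trained BLIP model
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# The pipeline accepts a local file path, a URL, or a PIL image
result = captioner("cat.jpg")
print(result)  # e.g., [{'generated_text': 'a cat sitting on a couch'}]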
Hugging Face Transformers offers several advantages for building multimodal applications (a short visual question answering sketch follows this list):
- Versatility: Hugging Face provides models that can handle multiple modalities, enabling the development of complex applications like VQA or image captioning with minimal setup.
- Pre-trained Multimodal Models: Access to state-of-the-art pre-trained models like CLIP, DALL-E, and others that can be fine-tuned for specific multimodal tasks, saving time and computational resources.
- Unified API: A consistent and user-friendly API that allows for easy integration of multimodal tasks, making it simpler to build and deploy multimodal applications.
- Extensive Documentation and Tutorials: Comprehensive resources for learning and troubleshooting multimodal tasks, backed by a large community of developers and researchers.
- Cross-Modal Interoperability: Models that seamlessly integrate text, images, and other data types, enabling innovative applications that require understanding multiple data types together.
- Rapid Prototyping: The ability to quickly prototype multimodal applications by leveraging pre-built models and pipelines.
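For example, a visual question answering prototype takes only a few lines with the unified pipeline API (the model identifier dandelin/vilt-b32-finetuned-vqa and the file name cat.jpg are assumptions):

from transformers import pipeline

# Visual question answering with a pre-trained ViLT model
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

answer = vqa(image="cat.jpg", question="What animal is in the picture?")
print(answer)  # e.g., [{'score': 0.98, 'answer': 'cat'}]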
Books:
- "Deep Learning for Computer Vision" by Rajalingappaa Shanmugamani (covers some multimodal aspects).
- "Speech and Language Processing" by Daniel Jurafsky and James H. Martin (includes multimodal LLM aspects) (PDF).
- "Transformers for NLP and Computer Vision", 3rd Ed. by Denis Rothman (includes chapters on multimodal tasks).
Online Courses:
- Hugging Face Course – Specific chapters on multimodal models and tasks.
- Stanford CS231n – Convolutional Neural Networks for Visual Recognition (touches on multimodal learning).
- Coursera Multimodal Machine Learning Specialization – Focused on multimodal learning.
Documentation:
- Hugging Face Multimodal Documentation – Guidelines on using Hugging Face for multimodal tasks.
- OpenAI CLIP Documentation – Details on the CLIP model for image-text multimodal tasks.
Tutorials and Blogs:
- Hugging Face Blog – Updates and tutorials on multimodal models and applications.
- Papers with Code – Latest research and code implementations in multimodal learning.
- Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020.
- Ramesh, A., et al. (2021). Zero-Shot Text-to-Image Generation. arXiv preprint arXiv:2102.12092.
- Tsai, Y.-H. H., et al. (2019). Multimodal Transformer for Unaligned Multimodal Language Sequences. arXiv preprint arXiv:1906.00295.
- Awesome Multimodal Large Language Models. BradyFU.
- deepmind/multimodal-perceiver. Hugging Face.
- Multimodal Challenge. Aman.AI.
- Multimodal Deep Learning. Papers with Code.
- Multimodal Emotion Recognition. Papers with Code.
- Multimodal Sentiment Analysis. Papers with Code.
- Vision Language Models. Aman.AI.
- Visual Question Answering (VQA). Papers with Code.
Note
📔 Read and execute the next Jupyter Notebook in Google Colab.
You will need to upload the following test files to your Colab session in order to run the examples in the notebook:
This workshop will introduce participants to the foundational concepts of multimodal LLMs, the tasks they enable, and how Hugging Face Transformers can be harnessed to build powerful multimodal applications.
import torch
from PIL import Image
from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel

# Basic notions: multimodal learning, image-text matching, fusion
# Advantages of Hugging Face Transformers for multimodal: pre-trained models, flexibility
# Trends: multimodal AI, vision-language models, generative models

# Load a pre-trained CLIP model and its processor
# (the processor wraps the text tokenizer and the image processor)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Load an image-caption dataset (e.g., Flickr30k). The Hub identifier and the
# column names used below ("image", "caption") depend on the dataset you choose.
dataset = load_dataset("nlphuji/flickr30k", split="test")

# Preprocess the data: encode images and captions into model inputs
def preprocess_function(examples):
    images = [img.convert("RGB") for img in examples["image"]]
    texts = [caps[0] for caps in examples["caption"]]  # use the first caption per image
    return processor(text=texts, images=images, padding="max_length", truncation=True)

tokenized_datasets = dataset.map(preprocess_function, batched=True)
# Fine-tune the model
# ... (similar to NLP and CV fine-tuning)
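# A minimal fine-tuning sketch (assumptions: the mapped dataset exposes
# "input_ids", "attention_mask", and "pixel_values" columns produced by the
# processor above; batch size, learning rate, and the single pass are illustrative).
from torch.utils.data import DataLoader

tokenized_datasets.set_format(
    type="torch", columns=["input_ids", "attention_mask", "pixel_values"]
)
train_loader = DataLoader(tokenized_datasets, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
model.train()
for batch in train_loader:
    # CLIP returns its contrastive image-text loss when return_loss=True
    outputs = model(**batch, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()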
# Activity: Experiment with different multimodal tasks, such as image captioning or visual question answering.
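As a starting point for the activity, the same CLIP model and processor can score how well candidate captions match an image, which is the image-text matching idea mentioned in the comments above. A minimal sketch, assuming a local image file cat.jpg:

# Zero-shot image-text matching with CLIP
model.eval()
image = Image.open("cat.jpg")
captions = ["a photo of a cat", "a photo of a dog", "a city skyline at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to each caption; softmax turns it into probabilities
probs = outputs.logits_per_image.softmax(dim=1)
for caption, prob in zip(captions, probs[0].tolist()):
    print(f"{prob:.3f}  {caption}")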
Created: 08/16/2024 (C. Lizárraga); Last update: 09/18/2024 (C. Lizárraga)
UArizona DataLab, Data Science Institute, University of Arizona, 2024.