
Multimodal LLM with Hugging Face Transformers


Overview of Multimodal LLM and Hugging Face Transformers



1. Introduction to Multimodal LLM

Multimodal Large Language Models (LLMs) are advanced AI systems capable of processing and generating content across multiple data modalities, such as text, images, audio, and video. These models learn the relationships between different types of data, enabling tasks that combine modalities, such as generating descriptive text from an image or answering questions based on both text and images.
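
As a concrete example of the first task, the pipeline API in Hugging Face Transformers can caption an image in a few lines. This is a minimal sketch; the BLIP checkpoint and the image filename are assumed here for illustration.

from transformers import pipeline

# Image captioning: generate descriptive text from an image
# (the checkpoint is an assumed example; any image-to-text model on the Hub works)
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("test.jpeg"))  # e.g., [{'generated_text': 'a dog sitting on the grass'}]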

2. Main Multimodal LLM Tasks

3. Advantages of Hugging Face Transformers for Multimodal LLM Tasks

  • Versatility: Hugging Face provides models that can handle multiple modalities, enabling the development of complex applications like VQA or image captioning with minimal setup.
  • Pre-trained Multimodal Models: Access to state-of-the-art pre-trained models like CLIP, DALL-E, and others that can be fine-tuned for specific multimodal tasks, saving time and computational resources.
  • Unified API: A consistent and user-friendly API that allows for easy integration of multimodal tasks, making it simpler to build and deploy multimodal applications.
  • Extensive Documentation and Tutorials: Comprehensive resources for learning and troubleshooting multimodal tasks, backed by a large community of developers and researchers.
  • Cross-Modal Interoperability: Models that seamlessly integrate text, images, and other data types, enabling innovative applications that require understanding multiple data types together.
  • Rapid Prototyping: The ability to quickly prototype multimodal applications by leveraging pre-built models and pipelines.
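
As a quick illustration of the unified API and rapid-prototyping points above, the sketch below answers a question about an image with a single pipeline call; the ViLT checkpoint, the image filename, and the question are assumptions for demonstration.

from transformers import pipeline

# Visual question answering (VQA) through the same pipeline interface
# (the checkpoint is an assumed example; any VQA model on the Hub works)
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
print(vqa(image="test.jpeg", question="What is shown in the picture?"))
# Returns a ranked list of candidate answers with confidence scores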

4. Learning Resources

5. References


6. Jupyter Notebook Example

Note

📔 Read and execute the following Jupyter Notebook in Google Colab.

You will need to upload the following test files to your Colab session to run the examples in the notebook:

  1. test.jpeg
  2. harvard.wav
  3. twister.wav
  4. sample.m4a

This workshop will introduce participants to the foundational concepts of multimodal LLMs, the tasks they enable, and how Hugging Face Transformers can be harnessed to build powerful multimodal applications.
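
Two of the test files above are speech recordings, so the notebook presumably includes an automatic speech recognition step. The sketch below transcribes harvard.wav with a pre-trained pipeline; the Whisper checkpoint is an assumption, not necessarily the model used in the notebook.

from transformers import pipeline

# Automatic speech recognition on one of the uploaded audio files
# (the Whisper checkpoint is an assumed example; ffmpeg must be available to decode the audio)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
print(asr("harvard.wav")["text"])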


import torch
from transformers import CLIPModel, CLIPProcessor

# Basic notions: multimodal learning, image-text matching, fusion
# Advantages of Hugging Face Transformers for multimodal work: pre-trained models, flexibility
# Trends: multimodal AI, vision-language models, generative models

# Load a pre-trained CLIP model and its processor (the processor handles both images and text)
model_id = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

# Load an image-caption dataset (e.g., a Flickr30K variant from the Hugging Face Hub;
# the dataset identifier and column names below depend on the variant you choose)
from datasets import load_dataset

dataset = load_dataset("nlphuji/flickr30k", split="test")

# Preprocess the data: encode images and captions into model-ready tensors
def preprocess_function(examples):
    images = [img.convert("RGB") for img in examples["image"]]
    # Some variants store several captions per image; keep the first one
    captions = [c[0] if isinstance(c, (list, tuple)) else c for c in examples["caption"]]
    return processor(text=captions, images=images,
                     padding="max_length", truncation=True, return_tensors="pt")

processed_dataset = dataset.map(preprocess_function, batched=True)

# Fine-tune the model
# ... (set up a training loop or Trainer, similar to NLP and CV fine-tuning)

# Activity: Experiment with different multimodal tasks, such as image captioning or visual question answering.
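
Before any fine-tuning, the same pre-trained CLIP checkpoint can already perform zero-shot image-text matching. The sketch below scores a local image against a few candidate captions; the image filename (test.jpeg, one of the notebook's test files) and the captions are assumed for illustration.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

image = Image.open("test.jpeg")  # assumed local test image
candidates = ["a photo of a cat", "a photo of a dog", "a city street at night"]

# Score the image against each caption and normalize the scores to probabilities
inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(candidates, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")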

Created: 08/16/2024 (C. Lizárraga); Last update: 09/18/2024 (C. Lizárraga)

CC BY-NC-SA

UArizona DataLab, Data Science Institute, University of Arizona, 2024.