Introduction to Hugging Face
Hugging Face is an open-source platform for machine learning (ML), natural language processing (NLP), computer vision, and audio, that allows users to create, train, and deploy AI models. It's also a collaborative community where developers can share and test their work. Hugging Face is sometimes called the "GitHub of machine learning".
Transformers are a form of neural network architecture used in machine learning. They transform input data into a format that a model can process by learning the context and relationships within sequential data. Transformers employ strategies known as attention or self-attention to discern how data components influence and rely on each other. For instance, a transformer model can comprehend a word's meaning by evaluating the preceding and following words. This capability allows the model to accurately translate words, even when they have multiple meanings.
(Transformer Explainer, A. Cho et al.)
Transformers represent an evolution of the encoder-decoder architecture. Originally, they were developed for neural machine translation. Their applications range widely, encompassing tasks such as text and speech translation, DNA and protein understanding, drug discovery, and medical research.
Transformers is a library of pretrained state-of-the-art models for natural language processing (NLP), computer vision, and audio and speech processing tasks. Not only does the library contain Transformer models, but it also has non-Transformer models like modern convolutional networks for computer vision tasks.
Natural Language Processing (NLP) tasks involve converting text into a model-recognizable format through tokenization, which involves dividing text into separate words or subwords and converting these into numbers. This process allows a sequence of text to be represented as a sequence of numbers, enabling models to solve various NLP tasks.
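As a minimal sketch of tokenization (the `bert-base-uncased` checkpoint is an arbitrary example, not one mentioned above):

```python
from transformers import AutoTokenizer

# Load the tokenizer for an example checkpoint (bert-base-uncased is an arbitrary choice)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hugging Face makes NLP easier."
print(tokenizer.tokenize(text))        # subword tokens
print(tokenizer(text)["input_ids"])    # the corresponding numeric input ids
```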
Text classification labels a sequence of text from a predefined set of classes and has practical applications such as sentiment analysis and content classification. Sentiment analysis labels text according to polarity, aiding decision-making in politics, finance, and marketing. Content classification labels text according to topic, helping organize and filter information in news and social media feeds.
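A minimal sketch of sentiment analysis with the high-level `pipeline` API (if no checkpoint is given, the pipeline downloads a task default):

```python
from transformers import pipeline

# Sentiment analysis: label text by polarity (positive/negative)
classifier = pipeline("sentiment-analysis")
print(classifier("I really enjoyed this workshop!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```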
In Natural Language Processing (NLP), text is preprocessed into tokens, which are then classified into predefined classes. Two common types of token classification are named entity recognition (NER), which labels tokens according to categories like organization or person, and part-of-speech tagging (POS), which labels tokens according to grammatical categories like noun or verb.
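For example, named entity recognition can be sketched with the `pipeline` API (again relying on the task's default checkpoint):

```python
from transformers import pipeline

# Named entity recognition: group subword tokens into entities like PER, ORG, LOC
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Hugging Face was founded in New York City."))
```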
Question answering is a token-level task that provides answers to queries, often used in virtual assistants, customer support, and search engines. There are two types: extractive, where the answer is a span of text extracted from the provided context, and abstractive, where the answer is generated from the context.
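A minimal sketch of extractive question answering, where the answer is a span of the supplied context:

```python
from transformers import pipeline

# Extractive question answering: the answer is a span copied from the context
qa = pipeline("question-answering")
result = qa(
    question="What does the Transformers library provide?",
    context="The Transformers library provides pretrained models for NLP, vision, and audio tasks.",
)
print(result["answer"], result["score"])
```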
Summarization reduces longer texts into shorter versions, preserving the main points. It's a sequence-to-sequence task applicable to various long-form documents like legislative bills and scientific papers. There are two types: extractive summarization, which pulls key sentences from the original text, and abstractive summarization, which generates a summary that may include new words not found in the original document.
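A minimal sketch of abstractive summarization (the `sshleifer/distilbart-cnn-12-6` checkpoint is one example choice):

```python
from transformers import pipeline

# Abstractive summarization: generate a shorter text that keeps the main points
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
long_text = (
    "Hugging Face hosts thousands of pretrained models for natural language processing, "
    "computer vision, and audio. The Transformers library gives a unified API to download, "
    "fine-tune, and run these models on tasks such as classification, translation, and summarization."
)
print(summarizer(long_text, max_length=40, min_length=10)[0]["summary_text"])
```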
Translation, a sequence-to-sequence task, is crucial for cross-cultural communication, content accessibility, and language learning. While early translation models were primarily monolingual, recent trends show a growing interest in multilingual models capable of translating between multiple language pairs.
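A minimal sketch of English-to-Spanish translation (the `Helsinki-NLP/opus-mt-en-es` checkpoint is one example of a single language pair):

```python
from transformers import pipeline

# Translation for one language pair; multilingual checkpoints cover many pairs at once
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
result = translator("Machine learning models can translate between many languages.")
print(result[0]["translation_text"])
```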
Language modeling, a task that predicts words in a text sequence, is a popular NLP task because a model pretrained on it can be fine-tuned for many downstream tasks. Large language models (LLMs) are of particular interest for their zero- or few-shot learning capabilities, allowing them to solve tasks they weren't explicitly trained for. While they can generate fluent and convincing text, caution is advised, as the generated text may not always be accurate.
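A minimal sketch of causal language modeling with a small model (`gpt2` here, chosen only for speed):

```python
from transformers import pipeline

# Causal language modeling: predict the next tokens that continue a prompt
generator = pipeline("text-generation", model="gpt2")
print(generator("Hugging Face Transformers makes it easy to", max_new_tokens=30)[0]["generated_text"])
```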
One of the earliest successes in computer vision was the recognition of zip code numbers using a convolutional neural network (CNN). This was possible because images, composed of pixels with numerical values, can be represented as a matrix, with each combination of pixel values describing the image's colors.
Computer vision tasks can be solved either by using convolutions to learn the hierarchical features of an image from low-level to high-level abstracts, or by splitting an image into patches and using a Transformer to understand the relationship between each patch, similar to gradually bringing a blurry image into focus.
Image classification, which labels an entire image from a predefined set of classes, has numerous practical applications. These include healthcare for disease detection or patient monitoring, environmental uses like monitoring deforestation or detecting wildfires, agricultural purposes like monitoring plant health or land use, and ecological applications such as tracking wildlife populations or endangered species.
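A minimal sketch of image classification (the `google/vit-base-patch16-224` checkpoint and the COCO sample image URL are example choices; `Pillow` must be installed):

```python
from transformers import pipeline

# Image classification with a Vision Transformer checkpoint
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image of two cats
print(classifier(url)[:3])  # top predicted labels with scores
```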
Object detection identifies multiple objects and their positions in an image, with applications in self-driving vehicles, remote sensing, and defect detection for tasks like identifying traffic objects, disaster monitoring, urban planning, and detecting structural damage or manufacturing defects.
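A minimal sketch of object detection (the `facebook/detr-resnet-50` checkpoint is one example choice and may require the `timm` package):

```python
from transformers import pipeline

# Object detection: returns a label, confidence score, and bounding box per object
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image
for obj in detector(url):
    print(obj["label"], round(obj["score"], 3), obj["box"])
```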
Image segmentation is a pixel-level task that assigns each pixel in an image to a class. It's more granular than object detection. Types of image segmentation include instance segmentation, which labels each distinct instance of an object, and panoptic segmentation, which combines semantic and instance segmentation.
Image segmentation tasks are crucial in self-driving vehicles for navigation, medical imaging for identifying abnormalities, and e-commerce for virtual try-ons and augmented reality experiences.
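A minimal sketch of panoptic segmentation (the `facebook/detr-resnet-50-panoptic` checkpoint is one example choice):

```python
from transformers import pipeline

# Panoptic segmentation: every pixel gets a class label and, where applicable, an instance
segmenter = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image
for segment in segmenter(url):
    print(segment["label"], segment["score"])  # each entry also carries a binary mask image
```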
Depth estimation, a crucial computer vision task, predicts the distance of each pixel in an image from the camera. It's vital for scene understanding, reconstruction, and applications like self-driving cars and 3D representation construction. The two approaches to depth estimation are stereo, which compares two images from slightly different angles, and monocular, which estimates depths from a single image.
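A minimal sketch of monocular depth estimation (the `Intel/dpt-large` checkpoint is one example choice):

```python
from transformers import pipeline

# Monocular depth estimation: predict a per-pixel depth map from a single image
depth = pipeline("depth-estimation", model="Intel/dpt-large")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image
result = depth(url)
print(result["depth"].size)  # "depth" is a PIL image holding the predicted depth map
```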
Multimodal tasks involve processing multiple data types, such as text, images, audio, and video. An example is image captioning, where a model describes an image or its properties. Internally, these models convert all data types into meaningful embeddings, learning relationships between different types of embeddings, such as image and text.
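A minimal sketch of image captioning (the `nlpconnect/vit-gpt2-image-captioning` checkpoint is one example choice):

```python
from transformers import pipeline

# Image captioning: an image goes in, a text description comes out
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image
print(captioner(url)[0]["generated_text"])
```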
Document question answering is a task that answers questions from an image of a document, extracting key information from structured documents, such as the total amount and change due from a receipt.
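A minimal sketch of document question answering (the `impira/layoutlm-document-qa` checkpoint is one example choice; `receipt.png` is a placeholder for your own document image, and the OCR step needs `pytesseract`):

```python
from transformers import pipeline

# Document question answering over an image of a structured document
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
print(doc_qa(image="receipt.png", question="What is the total amount?"))  # placeholder image path
```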
Audio and speech processing tasks involve sampling raw audio signals at regular intervals, with a higher sampling rate resulting in a more accurate representation of the original audio source. While previous methods involved preprocessing the audio to extract features, it's now more common to feed the raw audio waveform directly to a feature encoder, simplifying preprocessing and allowing the model to learn essential features.
Audio classification labels audio data into predefined classes (a short example follows this list), such as:
- acoustic scene classification: assigning scene labels like "office", "beach", "stadium"
- acoustic event detection: identifying sound events like "car horn", "whale calling", "glass breaking"
- tagging: labeling audio with multiple sounds such as birdsongs, speaker identification in meetings
- music classification: categorizing music by genre ("metal", "hip-hop", "country")
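A minimal sketch of audio classification (the `MIT/ast-finetuned-audioset-10-10-0.4593` checkpoint is one example choice; `sample.wav` is a placeholder path, and decoding audio files may require ffmpeg):

```python
from transformers import pipeline

# Audio classification with an Audio Spectrogram Transformer checkpoint
classifier = pipeline("audio-classification", model="MIT/ast-finetuned-audioset-10-10-0.4593")
print(classifier("sample.wav"))  # "sample.wav" is a placeholder; use your own audio file
```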
Automatic speech recognition (ASR) transcribes speech into text and is commonly used in smart technology. Transformer architectures have improved ASR performance, particularly in low-resource languages. By pretraining on large amounts of speech data, these models can produce high-quality results with significantly less labeled data.
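A minimal sketch of automatic speech recognition (the `openai/whisper-tiny` checkpoint is one example choice; `sample.wav` is a placeholder audio file):

```python
from transformers import pipeline

# Automatic speech recognition: transcribe speech audio into text
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("sample.wav")["text"])  # "sample.wav" is a placeholder; use your own audio file
```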
- Open a Jupyter Notebook and install the required libraries:
!pip install -q torch
!pip install -q transformers
- Next, import the `AutoTokenizer` and `AutoModelForCausalLM` classes from `transformers`:
from transformers import AutoTokenizer, AutoModelForCausalLM
Think of classes as code recipes for creating objects. With these classes we will create two objects: a `model` object and a `tokenizer` object.
- Now, define the model we want to use by assigning its identifier to a variable. For example, we can use Microsoft's Phi-2 model:
model_id = "microsoft/phi-2"
- With this in mind, we create a model object and load the model:
model = AutoModelForCausalLM.from_pretrained(model_id)
where the `.from_pretrained()` method of the `AutoModelForCausalLM` class creates the `model` object.
- Create a tokenizer object and load the tokenizer with:
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True, padding_side='left')
A tokenizer is a tool that splits sentences into smaller pieces of text (tokens) and assigns each token a numeric value called an input id. Models only understand numbers, and each model has its own tokenizer vocabulary.
- Create the inputs for the model to process:
input_text = "Who are you?"
input_ids = tokenizer(input_text, return_tensors="pt")
- Run generation and decode the output:
outputs = model.generate(input_ids["input_ids"], max_new_tokens=100)
decoded_outputs = tokenizer.decode(outputs[0])
print(decoded_outputs)
Models only understand numbers, so when we provided our `input_ids` as tensors, the model returned its output in the same numeric format. To turn those outputs back into text, we reverse the initial encoding using the tokenizer's `decode` method.
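As an aside, the same tokenize-generate-decode loop can be sketched in one call with the high-level `pipeline` API, which wraps the tokenizer and model used above:

```python
from transformers import pipeline

# The pipeline wraps tokenization, generation, and decoding in a single call
generator = pipeline("text-generation", model="microsoft/phi-2")
print(generator("Who are you?", max_new_tokens=100)[0]["generated_text"])
```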
See Jupyter Notebook Example: Introduction to Hugging Face Transformers
- A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT. Cao, Y., Li, S., Liu, Y., Yan, Z., Dai, Y., Yu, P. S., & Sun, L. (2023). arXiv:2303.04226.
- Transformer Explainer. Aeree Cho, Grace C. Kim, Alexander Karpekov, Alec Helbling, Jay Wang, Seongmin Lee, Benjamin Hoover, and Polo Chau at the Georgia Institute of Technology.
- Transformers. Hugging Face Docs.
- What Transformers can do. Hugging Face Docs.
Created: 06/11/2024 (C. Lizárraga); Last update: 08/29/2024 (C. Lizárraga)
UArizona DataLab, Data Science Institute, University of Arizona, 2025.
Fall 2024
- Introduction to NLP with Hugging Face Transformers
- Computer Vision with Hugging Face Transformers
- Multimodal LLM with Hugging Face Transformers
- Running LLM locally: Ollama
- Introduction to Langchain
- Getting started with Phi-3
- Getting started with Gemini/Gemma
- Introduction to Gradio
- Introduction to Retrieval Augmented Generation (RAG)
Spring 2025