07 Oct 17:42

ArthurZucker

53fad64

Release v4.45.2 Latest

Latest

Patch release v4.45.2

Mostly some warnings that were not properly removed ⚠️ :

Ignore keys on validate_rope #33753 by @zucchini-nlp
remove warning v2 #33761 by @itazap
Config: lower save_pretrained exception to warning #33906 by @gante

🔴 Had a small regression with dynamic Cache 🔴
*Cache: revert DynamicCache init for BC #33861 by @gante

A small fix for idefic 🐩 :

Fixes for issue #33763 in idefics2 model #33766 by @aroun-coumar

And a fix for Siglip 🤧 !

hot fix self.position_embeddings->self.position_embedding #33958 and properly fix and RUN_SLOW #33965 thanks to @mranzinger

Contributors

mranzinger, gante, and 3 other contributors

Assets 2

26 Sep 18:07

ArthurZucker

v4.45.1

e71a01a

Patch Release v4.45.1

Patches for v4.45.1

[MllamaProcessor] Update errors and API with multiple image (#33715) by @ArthurZucker
Generate: can_generate() recursive check (#33718) by @gante
clean_up_tokenization_spaces=False if unset (#31938) by @itazap

Contributors

gante, itazap, and ArthurZucker

Assets 2

25 Sep 18:11

LysandreJik

v4.45.0

2ef31de

Llama 3.2, mllama, Qwen2-Audio, Qwen2-VL, OLMoE, Llava Onevision, Pixtral, FalconMamba, Modular Transformers

New model additions

mllama

The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

Add MLLama #33703, by @qubvel, @zucchini-nlp, @ArthurZucker

Qwen2-VL

The Qwen2-VL is a major update from the previous Qwen-VL by the Qwen team.

An extract from the Qwen2-VL blogpost available here is as follows:

Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model familities. Compared with Qwen-VL, Qwen2-VL has the capabilities of:

SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

support qwen2-vl by @simonJJJ in #32318

Qwen2-Audio

The Qwen2-Audio is the new model series of large audio-language models from the Qwen team. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.

They introduce two distinct audio interaction modes:

voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input
audio analysis: users could provide audio and text instructions for analysis during the interaction

Add Qwen2-Audio by @faychu in #32137

OLMoE

OLMoE is a series of Open Language Models using sparse Mixture-of-Experts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.

Add OLMoE by @Muennighoff in #32406

Llava Onevision

LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of SigLIP vision encoder and a Qwen2 language backbone. The images are processed with anyres-9 technique where the image is split into 9 patches to better process high resolution images and capture as much details as possible. However, videos are pooled to a total sequence length of 196 tokens each frame for more memory efficient computation. LLaVA-Onevision is available in three sizes: 0.5B, 7B and 72B and achieves remarkable performance on benchmark evaluations.

Llava Onevision: add model by @zucchini-nlp in #32673

FalconMamba

The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.

The model has been trained on approximtely 6T tokens consisting a mixture of many data sources such as RefineWeb, Cosmopedia and Math data.

The team releases an accompanying blog post.

Add new model by @younesbelkada in #32615

Granite Language Models

he Granite model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in the size categories across various benchmarks, including natural language multi-choices, code generation, and math reasoning.

Granite language models by @mayank31398 in #31502

Granite MOE

The GraniteMoe model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to other dense models with 2x activate parameters across various benchmarks, including natural language multi-choices, code generation, and math reasoning.

Granitemoe by @mayank31398 in #33207

Descript-Audio-Codec

The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 KHz audio into tokens at just 8kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.

Add Descript-Audio-Codec model by @kamilakesbi in #31494

Pixtral

The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the Llava family, meaning image embeddings are placed instead of the [IMG] token placeholders.

The model uses PixtralVisionModel for its vision encoder, and MistralForCausalLM for its language decoder. The main contribution is the 2d ROPE (rotary postiion embeddings) on the images, and support for arbitrary image sizes (the images are not padded together nor are they resized).

Add support for Pixtral by @ArthurZucker in #33449

Mimi

The Mimi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team, that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.

Codec integration by @ylacombe in #33565

OmDet-Turbo

The OmDet-Turbo model was proposed in Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head by Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee. OmDet-Turbo incorporates components from RT-DETR and introduces a swift multimodal fusion module to achieve real-time open-vocabulary object detection capabilities while maintaining high accuracy. The base model achieves performance of up to 100.2 FPS and 53.4 AP on COCO zero-shot.

Add OmDet-Turbo by @yonigozlan in #31843

Quantization

GGUF

GGUF support continues to be enhanced in the library by offering a way to load GGUF models within transformers by unquantizing them, before re-quantizing them for re-use within the GGUF/GGML ecosystem.

Add Qwen2Moe GGUF loading support by @VladOS95-cyber in #33264
Fix incorrect vocab size retrieval in GGUF config by @Isotr0py in #32551
Add chat_template for tokenizer extracted from GGUF model by @Isotr0py in #32908
🚨 Support dequantization for most GGML types by @Isotr0py in #32625
Add support for GGUF Phi-3 by @a8nova in #31844

Torch AO

An ongoing effort is to add the ability to use torchao as a quantization backend. Future PRs will enable saving and fine-tuning with peft.

Add TorchAOHfQuantizer by @jerryzh168 in #32306

Liger Kernel

The Liger kernel is now supported in the Trainer class.

Integrate Liger (Linkedin GPU Efficient Runtime) Kernel to Trainer by @JasonZhu1313 in #32860

Modular Transformers

This PR i...

Contributors

jpizarrom, ankane, and 167 other contributors

Assets 2

22 Aug 16:56

ArthurZucker

v4.44.2

1748902

Release v4.44.2

Patch release v4.44.2, mostly 2 regressions that were not caught for Jamba and for processors!

Fix: Jamba cache fails to use torch.nn.module (#32894) Authored by @xgal
Fix: No need to dtype A in Jamba (#32924) @xgal
Fix: Regression on Processor.save_pretrained caused by #31691 (#32921) Authored by @leloykun

Contributors

xgal and leloykun

Assets 2

20 Aug 17:51

ArthurZucker

v4.44.1

ca56cd7

Patch release v4.44.1

Here are the different fixes, mostly Gemma2 context length, nits here and there, and generation issues

is_torchdynamo_compiling -- cast a wide exception net (#32476) by @gante
Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (#32276)" (#32477) by @gante and @matthewdouglas
Gemma2: fix FA2 generation (#32553) by @zucchini-nlp
Fix: FA2 with packed training (#32487) by @zucchini-nlp
Fix sliding window attention used in Gemma2FlashAttention2 (#32522) by @brcps12
Automatically add transformers tag to the modelcard (#32623) by @LysandreJik
add back the position ids (#32554) by @ArthurZucker
Use head_dim if in config for RoPE (#32495) @suiyoubi @ArthurZucker
Revert PR 32299, flag users when Zero-3 was missed (#32851) by @muellerzr
fix multi-gpu with static cache (#32543) by @SunMarc
Reduce the error log when using core models that need their weights r… (#32656) by @muellerzr
Fix VLM generation issues (#32836) by @zucchini-nlp
Fix generate with inputs_embeds as input (#32493) (this PR has some cherry-pick)

Full Changelog: v4.44.0...v4.44.1

Contributors

muellerzr, gante, and 7 other contributors

Assets 2

06 Aug 18:39

ArthurZucker

v4.44.0

984bc11

Release v4.44.0

Release v4.44.0: End to end compile generation!!! Gemma2 (with assisted decoding), Codestral (Mistral for code), Nemotron, Efficient SFT training, CPU Offloaded KVCache, torch export for static cache

This release comes a bit early in our cycle because we wanted to ship important and requested models along with improved performances for everyone!

All of these are included with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 We tried to share examples of what is now possible with all the shipped features! Kudos to @gante, @sanchit-gandhi and @xenova

💥 End-to-end generation compile

Generate: end-to-end compilation #30788 by @gante: model.generate now supports compiling! There are a few limitations, but here is a small snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")

# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id

model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)
output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)

⚡ 3 to 5x compile speedup (compilation time 👀 not runtime)

3-5x faster torch.compile forward compilation for autoregressive decoder models #32227* by @fxmarty .
As documented on the PR, this makes the whole generation a lot faster when you re-use the cache!
You can see this when you run model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀

Offloaded KV Cache #31325* by @n17s : you just have to set cache_implementation="offloaded" when calling from_pretrained or using this:

from transformers import GenerationConfig
gen_config = GenerationConfig(cache_implementation="offloaded", # other generation options such as num_beams=4,num_beam_groups=2,num_return_sequences=4,diversity_penalty=1.0,max_new_tokens=50,early_stopping=True)
outputs = model.generate(inputs["input_ids"],generation_config=gen_config)

📦 Torch export for static cache

pytorch team gave us a great gift: you can now use torch.export directly compatible with Executorch! Find examples here.

Make static cache compatible with torch.export #32168 by @guangy10

This also unlocks support for prompt reuse:

import os, torch, copy
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"

INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs, past_key_values = prompt_cache).past_key_values

prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values,max_new_tokens=20) 
response = tokenizer.batch_decode(outputs)[0]
print(response)

prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache),max_new_tokens=20) 
response = tokenizer.batch_decode(outputs)[0]

Gemma2: assisted decoding

Gemma 2: support assisted generation #32357 by @gante

We now have a 2B Gemma 2 model -- a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for gemma 2), so you might observe some output mismatch on long sequences. Read more about it here.

# transformers assisted generation reference: 
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding 
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# we DON’T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'

tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
   reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
   assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)

model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
   "assistant_model": assistant_model,
   "do_sample": True,
   "temperature": 0.7,
   "max_new_tokens": 64,
}

outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Nemotron support

Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.

The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to @suiyoubi. See:

Add Nemotron HF Support #31699

Codestral support

Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.

Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.

It's mamba2 architecture, was a bit of a pain to remove all einops but hope we made it better for everyone!

Add codestral mamba2 #32080 by @molbap and @vasqu

Breaking changes:

We removed the chat template in the code, they should all be on the hub!

🚨 No more default chat templates #31733 by @Rocketknight1

Long-form decoding for whisper, even faster:

Our great @sanchit-gandhi worked on porting the recent compile upgrades to long form decoding in

[whisper] compile compatibility with long-form decoding #31772

What's Changed

Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by @RhuiDih in #31629
Updated ruff to the latest version by @Sai-Suraj-27 in #31926
fix by @gante in #32162
fix: Fixed an if condition that is always evaluating to true by @Sai-Suraj-27 in #32160
[docs] change temperature to a positive value by @faaany in #32077
adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer by @rohitdwivedula in #32171
fix: default value reflects the runtime environment variables rather than the ones present at import time. by @junrae6454 in #32153
Update qwen2.md by @ArtificialZeng in #32108
Remove conversational pipeline tests by @amyeroberts in #32099
RoPE: relaxed rope validation by @gante in #32182
let's not warn when someone is running a forward by @ArthurZucker in #32176
Fix resize embedding with Deepspeed by @zucchini-nlp in #32192
Fix float8_e4m3fn in modeling_utils by @SunMarc in #32193
Support dequantizing GGUF FP16 format by @PenutChen in #31783
🚨 No more default chat templates by @Rocketknight1 in #31733
fix: Replaced deprecated unittest method with the correct one by @Sai-Suraj-27 in #32198
[whisper] fix short-form output type by @sanchit-gandhi in https://github....

Contributors

kashif, n17s, and 55 other contributors

Assets 2

05 Aug 10:57

ArthurZucker

v4.43.4

868d36d

v4.43.4 Patch Release

Patch Release v4.43.4

There was a mick mack, now deepseep issues are properly pushed with:

Resize embeds with DeepSpeed #32214

🤗 Enjoy holidays

Assets 2

26 Jul 15:30

ArthurZucker

v4.43.3

47c29cc

v4.43.3 Patch deepspeed

Patch release v4.43.3:
We still saw some bugs so @zucchini-nlp added:
~~- Resize embeds with DeepSpeed #32214~~

don't log base model architecture in wandb if log model is false #32143

Other fixes:

[whisper] fix short-form output type #32178, by @sanchit-gandhi which fixes the short audio temperature fallback!
[BigBird Pegasus] set _supports_param_buffer_assignment to False #32222 by @kashif, mostly related to the new super fast init, some models have to get this set to False. If you see a weird behavior look for that 😉

Contributors

kashif, sanchit-gandhi, and zucchini-nlp

Assets 2

24 Jul 15:50

LysandreJik

v4.43.2

38d94bf

v4.43.2: Patch release

Fix float8_e4m3fn in modeling_utils (#32193)
Fix resize embedding with Deepspeed (#32192)
let's not warn when someone is running a forward (#32176)
RoPE: relaxed rope validation (#32182)

Assets 2

23 Jul 15:55

LysandreJik

v4.43.1

782bfff

v4.43.1: Patch release

fix (#32162)

Assets 2

Releases: huggingface/transformers

Release v4.45.2

Patch release v4.45.2

Contributors

Patch Release v4.45.1

Patches for v4.45.1

Contributors

Llama 3.2, mllama, Qwen2-Audio, Qwen2-VL, OLMoE, Llava Onevision, Pixtral, FalconMamba, Modular Transformers

New model additions

mllama

Qwen2-VL

Qwen2-Audio

OLMoE

Llava Onevision

FalconMamba

Granite Language Models

Granite MOE

Descript-Audio-Codec

Pixtral

Mimi

OmDet-Turbo

Quantization

GGUF

Torch AO

Liger Kernel

Modular Transformers

Contributors

Release v4.44.2

Contributors

Patch release v4.44.1

Here are the different fixes, mostly Gemma2 context length, nits here and there, and generation issues

Contributors

Release v4.44.0

Release v4.44.0: End to end compile generation!!! Gemma2 (with assisted decoding), Codestral (Mistral for code), Nemotron, Efficient SFT training, CPU Offloaded KVCache, torch export for static cache

💥 End-to-end generation compile

⚡ 3 to 5x compile speedup (compilation time 👀 not runtime)

🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀

📦 Torch export for static cache

Gemma2: assisted decoding

Nemotron support

Codestral support

Breaking changes:

Long-form decoding for whisper, even faster:

What's Changed

Contributors

v4.43.4 Patch Release

Patch Release v4.43.4

v4.43.3 Patch deepspeed

Contributors

v4.43.2: Patch release

v4.43.1: Patch release