🌐 [i18n-KO] Translated ko-llm_tutorial_optimization.md to Korean #32372
Conversation
영렬님!
Even though there is a lot of content, you translated it so that it reads really clearly, so reviewing it was no trouble at all! 🥺
I mostly looked at minor points and left a few suggestions haha.
The main things I paid attention to were:
- translating into natural Korean phrasing and word choices
- fixing grammar issues (typos, spacing, etc.)
- inserting missing special characters
The details are in the inline comments! 😉
- 인간과 비슷한 텍스트 이해 및 생성 능력을 보이기 위해, 현재 LLM은 수십억 개의 매개변수로 구성되어야 합니다 (참조: [Kaplan et al](https://arxiv.org/abs/2001.08361), [Wei et. al](https://arxiv.org/abs/2206.07682)). 이는 추론을 위한 메모리 요구를 크게 증가시킵니다.
- 많은 실제 과제에서 LLM은 방대한 맥락 정보를 제공받아야 합니다. 이는 모델이 추론 중에 매우 긴 입력 시퀀스를 처리할 수 있어야 한다는 것을 뜻합니다.
이러한 과제의 핵심은 LLM의 계산 및 메모리 역량을 증대시키는 데 있습니다. 특히 방대한 입력 시퀀스를 처리할 때 그렇습니다.
이러한 과제의 핵심은 LLM의 계산 및 메모리 역량을 증대시키는 데 있습니다. 특히 방대한 입력 시퀀스를 처리할 때 그렇습니다.
이러한 과제의 핵심은 LLM의 계산 및 메모리의 능력을 증대시키는 데 있습니다. 특히 방대한 입력 시퀀스를 처리할 때 이러한 능력이 중요합니다.
I thought "능력" might read a bit more naturally than "역량", so I tried changing it. Please just take it as a suggestion haha
🤗 Transformers는 텐서 병렬 처리를 바로 지원하지 않습니다. 이는 모델 아키텍처가 특정 방식으로 작성되어야 하기 때문입니다. 텐서 병렬 처리를 지원하는 방식으로 모델을 작성하는 데 관심이 있다면 [the text-generation-inference library](https://github.com/huggingface/text-generation-inference/tree/main/server/text_generation_server/models/custom_modeling)를 참조해 보시기 바랍니다.
기본적인 파이프라인 병렬 처리는 바로 지원됩니다. 이를 위해 단순히 모델을 `device="auto"`로 로드하면 [여기](https://huggingface.co/docs/accelerate/v0.22.0/en/concept_guides/big_model_inference)에 설명된 대로 사용 가능한 GPU에 모델의 서로다른 레이어를 자동으로 배치합니다. 이것은 매우 효과적이긴 하지만 이러한 기본 파이프라인 병렬 처리는 GPU 유휴 문제를 해결하지 못한다는 점을 유의해야 합니다. 더 고급 파이프라인 병렬 처리가 필요하며, 이에 대한 설명은 [여기](https://huggingface.co/docs/transformers/en/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism)에서 확인할 수 있습니다.
기본적인 파이프라인 병렬 처리는 바로 지원됩니다. 이를 위해 단순히 모델을 `device="auto"`로 로드하면 [여기](https://huggingface.co/docs/accelerate/v0.22.0/en/concept_guides/big_model_inference)에 설명된 대로 사용 가능한 GPU에 모델의 서로다른 레이어를 자동으로 배치합니다. 이것은 매우 효과적이긴 하지만 이러한 기본 파이프라인 병렬 처리는 GPU 유휴 문제를 해결하지 못한다는 점을 유의해야 합니다. 더 고급 파이프라인 병렬 처리가 필요하며, 이에 대한 설명은 [여기](https://huggingface.co/docs/transformers/en/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism)에서 확인할 수 있습니다.
기본적인 파이프라인 병렬 처리는 바로 지원됩니다. 이를 위해 단순히 모델을 `device="auto"`로 로드하면 [여기](https://huggingface.co/docs/accelerate/v0.22.0/en/concept_guides/big_model_inference)에 설명된 대로 사용 가능한 GPU에 모델의 서로 다른 레이어를 자동으로 배치합니다. 이것은 매우 효과적이지만 이러한 기본 파이프라인 병렬 처리는 GPU 유휴 문제를 해결하지 못한다는 점을 유의해야 합니다. 더 고급 파이프라인 병렬 처리가 필요하며, 이에 대한 설명은 [여기](https://huggingface.co/docs/transformers/en/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism)에서 확인할 수 있습니다.
I fixed the spacing and, although it's minor, trimmed the wording a little!
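Since this passage is about loading a model with naive pipeline parallelism, here is a minimal sketch of what the described loading looks like in practice. It is only a sketch under the assumption of a recent Transformers release, where the keyword actually passed to `from_pretrained` is `device_map="auto"`; the checkpoint and dtype below are placeholders rather than something prescribed by the document.

```python
# Minimal sketch of the loading pattern described above (naive pipeline parallelism).
# Assumption: recent Transformers, where the relevant kwarg is `device_map="auto"`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/octocoder"  # example checkpoint used throughout the tutorial

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # places consecutive layers on the available GPUs
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Shows which layers ended up on which device.
print(model.hf_device_map)
```

Printing `hf_device_map` makes the layer placement the paragraph describes directly visible.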
훌륭합니다. 어텐션 층의 동일한 키와 값을 다시 계산하는 데 추가 시간이 소요되지 않습니다! 그러나 한 가지 문제가 있습니다. \\( \mathbf{QK}^T \\) 행렬에 필요한 최대 메모리는 크게 줄어들지만, 긴 입력 시퀀스나 다회차 채팅의 경우 키-값 캐시를 메모리에 보관하는 것이 매우 메모리 집약적이 될 수 있습니다. 키-값 캐시는 모든 자기 어텐션 층과 모든 어텐션 헤드에 대해 이전 입력 벡터 \\( \mathbf{x}_i \text{, for } i \in {1, \ldots, c - 1} \\)의 키-값 벡터를 저장해야 한다는 점을 기억하세요.
이전에 사용한 LLM `bigcode/octocoder`에 대해 키-값 캐시에 저장해야 하는 플로트 값의 수를 계산해 봅시다.
플로트 값의 수는 시퀀스 길이의 두 배에 어텐션 헤드 수, 어텐션 헤드 차원, 레이어 수를 곱한 값입니다.
플로트 값의 수는 시퀀스 길이의 두 배에 어텐션 헤드 수, 어텐션 헤드 차원, 레이어 수를 곱한 값입니다.
부동 소수점 값의 수는 시퀀스 길이의 두 배에 어텐션 헤드 수, 어텐션 헤드 차원, 레이어 수를 곱한 값입니다.
7864320000
```
대략 80억 개의 플로트 값입니다! `float16` 정밀도로 80억 개의 플로트 값을 저장하는 데는 약 15GB의 RAM이 필요하며, 이는 모델 가중치 자체의 절반 정도입니다.
대략 80억 개의 플로트 값입니다! `float16` 정밀도로 80억 개의 플로트 값을 저장하는 데는 약 15GB의 RAM이 필요하며, 이는 모델 가중치 자체의 절반 정도입니다.
대략 80억 개의 부동 소수점 값입니다! `float16` 정밀도로 80억 개의 부동 소수점 값을 저장하는 데는 약 15GB의 RAM이 필요하며, 이는 모델 가중치 자체의 절반 정도입니다.
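Side note for anyone double-checking the quoted numbers: the float count can be reproduced with a few lines of Python. The config values below (40 layers, 48 attention heads, head dimension 128) are assumptions for a StarCoder-style checkpoint such as `bigcode/octocoder`, not values read from the model here.

```python
# Back-of-the-envelope check of the numbers quoted above.
# Assumed StarCoder-style config: 40 layers, 48 attention heads, head dim 128.
seq_len = 16_000
n_layers, n_heads, head_dim = 40, 48, 128

# Factor 2: one key vector and one value vector are cached per position.
num_floats = 2 * seq_len * n_layers * n_heads * head_dim
print(num_floats)  # 7864320000, matching the quoted output

# float16 stores each value in 2 bytes.
print(f"{num_floats * 2 / 1024**3:.1f} GiB")  # ≈ 14.7 GiB, i.e. the "about 15GB" above
```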
[멀티 쿼리 어텐션 (MQA)](https://arxiv.org/abs/1911.02150)은 Noam Shazeer의 *Fast Transformer Decoding: One Write-Head is All You Need* 논문에서 제안되었습니다. 제목에서 알 수 있듯이, Noam은 `n_head` 키-값 프로젝션 가중치 대신, 모든 어텐션 헤드에서 공유되는 단일 헤드-값 프로젝션 가중치를 사용할 수 있으며, 이를 통해 모델 성능이 크게 저하되지 않는다는 것을 발견했습니다.
> 단일 헤드-값 프로젝션 가중치를 사용함으로써, 키-값 벡터 \\( \mathbf{k}_i, \mathbf{v}_i \\)는 모든 어텐션 헤드에서 동일해야 하며, 이는 캐시에 `n_head` 개 대신 하나의 키-값 프로젝션 쌍만 저장하면 된다는 것을 의미합니다.
> 단일 헤드-값 프로젝션 가중치를 사용함으로써, 키-값 벡터 \\( \mathbf{k}_i, \mathbf{v}_i \\)는 모든 어텐션 헤드에서 동일해야 하며, 이는 캐시에 `n_head` 개 대신 하나의 키-값 프로젝션 쌍만 저장하면 된다는 것을 의미합니다.
> "단일 헤드-값 프로젝션 가중치를 사용함으로써, 키-값 벡터 \\( \mathbf{k}_i, \mathbf{v}_i \\)는 모든 어텐션 헤드에서 동일해야 하며, 이는 캐시에 `n_head` 개 대신 하나의 키-값 프로젝션 쌍만 저장하면 된다는 것을 의미합니다."
GQA는 최근에 제안되었기 때문에 이 노트북을 작성할 당시에는 채택이 덜 되었습니다.
GQA의 가장 주목할 만한 적용 사례는 [Llama-v2](https://huggingface.co/meta-llama/Llama-2-70b-hf)입니다.
> 결론적으로, LLM이 자가회귀 디코딩으로 배포되고 예를 들어 채팅과 같이 큰 입력 시퀀스를 처리해야 하는 경우 GQA 또는 MQA를 사용하는 것이 강력히 권장됩니다.
> 결론적으로, LLM이 자가회귀 디코딩으로 배포되고 예를 들어 채팅과 같이 큰 입력 시퀀스를 처리해야 하는 경우 GQA 또는 MQA를 사용하는 것이 강력히 권장됩니다.
> "결론적으로, LLM이 자가회귀 디코딩으로 배포되고 예를 들어 채팅과 같이 큰 입력 시퀀스를 처리해야 하는 경우 GQA 또는 MQA를 사용하는 것이 강력히 권장됩니다."
A bit late, but I've finished my review! It's a really long document, you worked very hard on it...!!
I fixed the parts that differ from the glossary and polished the sentences a little :)
[[open-in-colab]]
GPT3/4, [Falcon](https://huggingface.co/tiiuae/falcon-40b), [Llama](https://huggingface.co/meta-llama/Llama-2-70b-hf)와 같은 대형 언어 모델(LLM)의 인간 중심 과제를 해결하는 능력이 빠르게 발전하고 있으며, 현대 지식 기반 산업에서 필수 도구로 자리잡고 있습니다. 그러나 이러한 모델을 실제 과제에 배포하는 것은 여전히 어려운 과제입니다:
GPT3/4, [Falcon](https://huggingface.co/tiiuae/falcon-40b), [Llama](https://huggingface.co/meta-llama/Llama-2-70b-hf)와 같은 대형 언어 모델(LLM)의 인간 중심 과제를 해결하는 능력이 빠르게 발전하고 있으며, 현대 지식 기반 산업에서 필수 도구로 자리잡고 있습니다. 그러나 이러한 모델을 실제 과제에 배포하는 것은 여전히 어려운 과제입니다:
GPT3/4, [Falcon](https://huggingface.co/tiiuae/falcon-40b), [Llama](https://huggingface.co/meta-llama/Llama-2-70b-hf)와 같은 대규모 언어 모델(LLM)의 인간 중심 과제를 해결하는 능력이 빠르게 발전하고 있으며, 현대 지식 기반 산업에서 필수 도구로 자리잡고 있습니다. 그러나 이러한 모델을 실제 과제에 배포하는 것은 여전히 어려운 과제입니다:
The glossary has LLM as "대규모 언어 모델", so I changed it!
- 인간과 비슷한 텍스트 이해 및 생성 능력을 보이기 위해, 현재 LLM은 수십억 개의 매개변수로 구성되어야 합니다 (참조: [Kaplan et al](https://arxiv.org/abs/2001.08361), [Wei et. al](https://arxiv.org/abs/2206.07682)). 이는 추론을 위한 메모리 요구를 크게 증가시킵니다.
- 많은 실제 과제에서 LLM은 방대한 맥락 정보를 제공받아야 합니다. 이는 모델이 추론 중에 매우 긴 입력 시퀀스를 처리할 수 있어야 한다는 것을 뜻합니다.
이러한 과제의 핵심은 LLM의 계산 및 메모리 역량을 증대시키는 데 있습니다. 특히 방대한 입력 시퀀스를 처리할 때 그렇습니다.
이러한 과제의 핵심은 LLM의 계산 및 메모리 역량을 증대시키는 데 있습니다. 특히 방대한 입력 시퀀스를 처리할 때 그렇습니다.
이러한 과제의 핵심은 LLM의 계산 및 메모리 활용 능력을 증대시키는 데 있습니다. 특히 방대한 입력 시퀀스를 처리할 때 이러한 능력이 중요합니다.
Building on 정인님's review, I paraphrased it as "메모리 활용 능력"!
2. **플래시 어텐션:** 플래시 어텐션은 메모리 효율성을 높일 뿐만 아니라 최적화된 GPU 메모리 활용을 통해 효율성을 향상시키는 어텐션 알고리즘의 변형입니다.
3. **아키텍처 혁신:** 추론 시 LLM은 주로 동일한 방식으로 배포되는데(예시로 긴 입력 맥락을 가진 자회귀 텍스트 생성 방식), 더 효율적인 추론을 가능하게 하는 특화된 모델 아키텍처가 제안되었습니다. 이와 관련한 가장 중요한 모델 아키텍처의 발전은 [Alibi](https://arxiv.org/abs/2108.12409), [Rotary embeddings](https://arxiv.org/abs/2104.09864), [Multi-Query Attention (MQA)](https://arxiv.org/abs/1911.02150), [Grouped-Query-Attention (GQA)]((https://arxiv.org/abs/2305.13245))입니다.
3. **아키텍처 혁신:** 추론 시 LLM은 주로 동일한 방식으로 배포되는데(예시로 긴 입력 맥락을 가진 자회귀 텍스트 생성 방식), 더 효율적인 추론을 가능하게 하는 특화된 모델 아키텍처가 제안되었습니다. 이와 관련한 가장 중요한 모델 아키텍처의 발전은 [Alibi](https://arxiv.org/abs/2108.12409), [Rotary embeddings](https://arxiv.org/abs/2104.09864), [Multi-Query Attention (MQA)](https://arxiv.org/abs/1911.02150), [Grouped-Query-Attention (GQA)]((https://arxiv.org/abs/2305.13245))입니다.
3. **아키텍처 혁신:** 추론 시 LLM은 주로 동일한 방식(긴 입력 맥락을 가진 자기회귀 텍스트 생성 방식)으로 배포되는데, 더 효율적인 추론을 가능하게 하는 특화된 모델 아키텍처가 제안되었습니다. 이와 관련한 가장 중요한 모델 아키텍처의 발전은 [Alibi](https://arxiv.org/abs/2108.12409), [Rotary embeddings](https://arxiv.org/abs/2104.09864), [Multi-Query Attention (MQA)](https://arxiv.org/abs/1911.02150), [Grouped-Query-Attention (GQA)]((https://arxiv.org/abs/2305.13245))입니다.
The glossary has "autoregressive" as "자기회귀", so I fixed it.
I also polished the sentence to read a little more naturally!
3. **아키텍처 혁신:** 추론 시 LLM은 주로 동일한 방식으로 배포되는데(예시로 긴 입력 맥락을 가진 자회귀 텍스트 생성 방식), 더 효율적인 추론을 가능하게 하는 특화된 모델 아키텍처가 제안되었습니다. 이와 관련한 가장 중요한 모델 아키텍처의 발전은 [Alibi](https://arxiv.org/abs/2108.12409), [Rotary embeddings](https://arxiv.org/abs/2104.09864), [Multi-Query Attention (MQA)](https://arxiv.org/abs/1911.02150), [Grouped-Query-Attention (GQA)]((https://arxiv.org/abs/2305.13245))입니다.
이 가이드에서는 텐서의 관점에서 자회귀 생성에 대한 분석을 제공합니다. 낮은 정밀도를 채택하는데 장단점을 논의하고, 최신 어텐션 알고리즘을 포괄적으로 탐구하며, 향상된 LLM 아키텍처에 대해 논합니다. 이 과정에서 각 기능의 개선 사항을 보여주는 실용적인 예제를 확인합니다.
이 가이드에서는 텐서의 관점에서 자회귀 생성에 대한 분석을 제공합니다. 낮은 정밀도를 채택하는데 장단점을 논의하고, 최신 어텐션 알고리즘을 포괄적으로 탐구하며, 향상된 LLM 아키텍처에 대해 논합니다. 이 과정에서 각 기능의 개선 사항을 보여주는 실용적인 예제를 확인합니다.
이 가이드에서는 텐서의 관점에서 자기회귀 생성에 대한 분석을 제공합니다. 낮은 정밀도를 채택하는 것의 장단점을 논의하고, 최신 어텐션 알고리즘을 포괄적으로 탐구하며, 향상된 LLM 아키텍처에 대해 논합니다. 이 과정에서 각 기능의 개선 사항을 보여주는 실용적인 예제를 확인합니다.
훌륭합니다. 어텐션 층의 동일한 키와 값을 다시 계산하는 데 추가 시간이 소요되지 않습니다! 그러나 한 가지 문제가 있습니다. \\( \mathbf{QK}^T \\) 행렬에 필요한 최대 메모리는 크게 줄어들지만, 긴 입력 시퀀스나 다회차 채팅의 경우 키-값 캐시를 메모리에 보관하는 것이 매우 메모리 집약적이 될 수 있습니다. 키-값 캐시는 모든 자기 어텐션 층과 모든 어텐션 헤드에 대해 이전 입력 벡터 \\( \mathbf{x}_i \text{, for } i \in {1, \ldots, c - 1} \\)의 키-값 벡터를 저장해야 한다는 점을 기억하세요.
이전에 사용한 LLM `bigcode/octocoder`에 대해 키-값 캐시에 저장해야 하는 플로트 값의 수를 계산해 봅시다.
플로트 값의 수는 시퀀스 길이의 두 배에 어텐션 헤드 수, 어텐션 헤드 차원, 레이어 수를 곱한 값입니다.
플로트 값의 수는 시퀀스 길이의 두 배에 어텐션 헤드 수, 어텐션 헤드 차원, 레이어 수를 곱한 값입니다.
부동 소수점 값의 수는 시퀀스 길이의 두 배의 어텐션 헤드 수, 어텐션 헤드 차원, 레이어 수를 곱한 값입니다.
대부분의 LLM이 20에서 100 사이의 어텐션 헤드를 사용하기 때문에, MQA는 키-값 캐시의 메모리 소비를 크게 줄입니다. 이 노트북에서 사용된 LLM의 경우, 입력 시퀀스 길이 16000에서 필요한 메모리 소비를 15GB에서 400MB 미만으로 줄일 수 있습니다.
메모리 절감 외에도, MQA는 계산 효율성도 향상시킵니다. 다음과 같이 설명합니다.
자가회귀 디코딩에서는 큰 키-값 벡터를 다시 로드하고, 현재 키-값 벡터 쌍과 연결한 후 \\( \mathbf{q}_c\mathbf{K}^T \\) 계산에 매 단계마다 입력해야 합니다. 자가회귀 디코딩의 경우, 지속적인 재로드에 필요한 메모리 대역폭이 심각한 시간 병목 현상이 될 수 있습니다. 키-값 벡터의 크기를 줄이면 접근해야 하는 메모리 양이 줄어들어 메모리 대역폭 병목 현상이 감소합니다. 자세한 내용은 [Noam의 논문](https://arxiv.org/abs/1911.02150)을 참조하세요.
자가회귀 디코딩에서는 큰 키-값 벡터를 다시 로드하고, 현재 키-값 벡터 쌍과 연결한 후 \\( \mathbf{q}_c\mathbf{K}^T \\) 계산에 매 단계마다 입력해야 합니다. 자가회귀 디코딩의 경우, 지속적인 재로드에 필요한 메모리 대역폭이 심각한 시간 병목 현상이 될 수 있습니다. 키-값 벡터의 크기를 줄이면 접근해야 하는 메모리 양이 줄어들어 메모리 대역폭 병목 현상이 감소합니다. 자세한 내용은 [Noam의 논문](https://arxiv.org/abs/1911.02150)을 참조하세요.
자기회귀 디코딩에서는 큰 키-값 벡터를 다시 로드하고, 현재 키-값 벡터 쌍과 연결한 후 \\( \mathbf{q}_c\mathbf{K}^T \\) 계산에 매 단계마다 입력해야 합니다. 자기회귀 디코딩의 경우, 지속적인 재로드에 필요한 메모리 대역폭이 심각한 시간 병목 현상을 가져올 수 있습니다. 키-값 벡터의 크기를 줄이면 접근해야 하는 메모리 양이 줄어들어 메모리 대역폭 병목 현상이 감소합니다. 자세한 내용은 [Noam의 논문](https://arxiv.org/abs/1911.02150)을 참조하세요.
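To make the "15GB versus under 400MB" comparison in this passage concrete, here is the same back-of-the-envelope calculation with a single shared key-value head, using the same assumed config values as in the earlier sketch (40 layers, 48 heads, head dimension 128).

```python
# Sketch: KV-cache memory with per-head key-values vs. a single shared KV head (MQA).
# Config values are assumptions for a StarCoder-style model, as in the earlier sketch.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # 2x for keys and values; bytes_per_value=2 corresponds to float16
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

full_mha = kv_cache_bytes(16_000, 40, n_kv_heads=48, head_dim=128)  # each head keeps its own KV
mqa = kv_cache_bytes(16_000, 40, n_kv_heads=1, head_dim=128)        # one KV head shared by all heads

print(f"multi-head attention:  {full_mha / 1024**3:.1f} GiB")  # ≈ 14.7 GiB ("15GB")
print(f"multi-query attention: {mqa / 1024**2:.0f} MiB")       # ≈ 313 MiB, well under 400MB
```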
- [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
- [**BLOOM**](https://huggingface.co/bigscience/bloom)
또한, 이 노트북에서 사용된 체크포인트 - `bigcode/octocoder` -는 MQA를 사용합니다.
또한, 이 노트북에서 사용된 체크포인트 - `bigcode/octocoder` -는 MQA를 사용합니다.
또한, 이 노트북에서 사용된 체크포인트 `bigcode/octocoder`는 MQA를 사용합니다.
#### 3.2.3 그룹 쿼리 어텐션 (GQA) [[323-grouped-query-attention-gqa]]
[Grouped-Query-Attention (GQA)](https://arxiv.org/abs/2305.13245)는 Google의 Ainslie 등 연구진에 의해 제안되었습니다. 그들은 MQA를 사용하는 것이 종종 일반적인 멀티 키-값 헤드 프로젝션을 사용하는 것보다 품질 저하를 초래할 수 있다는 것을 발견했습니다. 이 논문은 쿼리 헤드 프로젝션 가중치의 수를 너무 극단적으로 줄이는 대신, 더 많은 모델 성능을 유지할 수 있다고 주장합니다. 단일 키-값 프로젝션 가중치 대신, `n < n_head` 키-값 프로젝션 가중치를 사용해야 합니다. `n_head`보다 훨씬 작은 `n`값, 예를 들어 2, 4 또는 8을 선택하면, MQA의 거의 모든 메모리 및 속도 이점을 유지하면서 모델 용량을 덜 희생하고 따라서 성능 저하를 줄일 수 있습니다.
[Grouped-Query-Attention (GQA)](https://arxiv.org/abs/2305.13245)는 Google의 Ainslie 등 연구진에 의해 제안되었습니다. 그들은 MQA를 사용하는 것이 종종 일반적인 멀티 키-값 헤드 프로젝션을 사용하는 것보다 품질 저하를 초래할 수 있다는 것을 발견했습니다. 이 논문은 쿼리 헤드 프로젝션 가중치의 수를 너무 극단적으로 줄이는 대신, 더 많은 모델 성능을 유지할 수 있다고 주장합니다. 단일 키-값 프로젝션 가중치 대신, `n < n_head` 키-값 프로젝션 가중치를 사용해야 합니다. `n_head`보다 훨씬 작은 `n`값, 예를 들어 2, 4 또는 8을 선택하면, MQA의 거의 모든 메모리 및 속도 이점을 유지하면서 모델 용량을 덜 희생하고 따라서 성능 저하를 줄일 수 있습니다.
[그룹 쿼리 어텐션 (GQA)](https://arxiv.org/abs/2305.13245)은 Google의 Ainslie 등의 연구진들에 의해 제안되었습니다. 그들은 MQA를 사용하는 것이 종종 일반적인 멀티 키-값 헤드 프로젝션을 사용하는 것보다 품질 저하를 가져올 수 있다는 것을 발견했습니다. 이 논문은 쿼리 헤드 프로젝션 가중치의 수를 너무 극단적으로 줄이는 대신, 더 많은 모델 성능을 유지할 수 있다고 주장합니다. 단일 키-값 프로젝션 가중치 대신, `n < n_head` 키-값 프로젝션 가중치를 사용해야 합니다. `n_head`보다 훨씬 작은 `n`값, 예를 들어 2, 4 또는 8을 선택하면, MQA의 거의 모든 메모리 및 속도 이점을 유지하면서 모델 용량을 덜 희생하고 따라서 성능 저하를 줄일 수 있습니다.
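As a rough illustration of the trade-off this paragraph describes, the same cache-size arithmetic can be repeated for intermediate numbers of key-value heads (`n < n_head`); the config values remain the assumptions used in the earlier sketches.

```python
# Sketch: cache size as the number of key-value heads is reduced.
# 48 = full multi-head attention, 1 = MQA, values in between correspond to GQA.
seq_len, n_layers, head_dim = 16_000, 40, 128  # assumed StarCoder-style config

for n_kv_heads in (48, 8, 4, 2, 1):
    gib = 2 * seq_len * n_layers * n_kv_heads * head_dim * 2 / 1024**3  # float16 bytes
    print(f"n_kv_heads={n_kv_heads:2d}: {gib:5.2f} GiB of key-value cache")
```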
GQA는 최근에 제안되었기 때문에 이 노트북을 작성할 당시에는 채택이 덜 되었습니다.
GQA의 가장 주목할 만한 적용 사례는 [Llama-v2](https://huggingface.co/meta-llama/Llama-2-70b-hf)입니다.
> 결론적으로, LLM이 자가회귀 디코딩으로 배포되고 예를 들어 채팅과 같이 큰 입력 시퀀스를 처리해야 하는 경우 GQA 또는 MQA를 사용하는 것이 강력히 권장됩니다.
> 결론적으로, LLM이 자가회귀 디코딩으로 배포되고 예를 들어 채팅과 같이 큰 입력 시퀀스를 처리해야 하는 경우 GQA 또는 MQA를 사용하는 것이 강력히 권장됩니다.
> 결론적으로, LLM이 자기회귀 디코딩으로 배포되면서 채팅과 같이 큰 입력 시퀀스를 가진 작업을 처리해야 하는 경우 GQA 또는 MQA를 사용하는 것이 강력히 권장됩니다.
You worked really hard translating such a long document, and you translated the difficult terms so cleanly that I learned a lot while reading! 😊
- Based on the glossary, I tried to unify LLM as "대규모 언어 모델" wherever possible, but I may have missed a spot, so please double-check!
- I replaced uses of the colon (:) with a period (.).
- I suggested a few small translation tweaks for more natural sentences.
I reviewed with the points above in mind! Thank you 🙇‍♀️
이전에 사용한 LLM `bigcode/octocoder`에 대해 키-값 캐시에 저장해야 하는 플로트 값의 수를 계산해 봅시다.
플로트 값의 수는 시퀀스 길이의 두 배에 어텐션 헤드 수, 어텐션 헤드 차원, 레이어 수를 곱한 값입니다.
가상의 입력 시퀀스 길이 16000에 대해 우리 LLM에 대해 이를 계산하면 다음과 같습니다:
가상의 입력 시퀀스 길이 16000에 대해 우리 LLM에 대해 이를 계산하면 다음과 같습니다:
가상의 입력 시퀀스 길이 16000에서 대규모 언어 모델에 대해 이를 계산하면 다음과 같습니다.
> 단일 헤드-값 프로젝션 가중치를 사용함으로써, 키-값 벡터 \\( \mathbf{k}_i, \mathbf{v}_i \\)는 모든 어텐션 헤드에서 동일해야 하며, 이는 캐시에 `n_head` 개 대신 하나의 키-값 프로젝션 쌍만 저장하면 된다는 것을 의미합니다.
대부분의 LLM이 20에서 100 사이의 어텐션 헤드를 사용하기 때문에, MQA는 키-값 캐시의 메모리 소비를 크게 줄입니다. 이 노트북에서 사용된 LLM의 경우, 입력 시퀀스 길이 16000에서 필요한 메모리 소비를 15GB에서 400MB 미만으로 줄일 수 있습니다.
대부분의 LLM이 20에서 100 사이의 어텐션 헤드를 사용하기 때문에, MQA는 키-값 캐시의 메모리 소비를 크게 줄입니다. 이 노트북에서 사용된 LLM의 경우, 입력 시퀀스 길이 16000에서 필요한 메모리 소비를 15GB에서 400MB 미만으로 줄일 수 있습니다.
대부분의 대규모 언어 모델이 20에서 100 사이의 어텐션 헤드를 사용하기 때문에, MQA는 키-값 캐시의 메모리 소비를 크게 줄입니다. 이 노트북에서 사용된 대규모 언어 모델의 경우, 입력 시퀀스 길이 16000에서 필요한 메모리 소비를 15GB에서 400MB 미만으로 줄일 수 있습니다.
여기서 이해해야 할 중요한 부분은 키-값 어텐션 헤드 수를 1로 줄이는 것이 키-값 캐시를 사용할 때만 의미가 있다는 것입니다. 키-값 캐시 없이 단일 포워드 패스에 대한 모델의 최대 메모리 소비는 변경되지 않으며, 각 어텐션 헤드는 여전히 고유한 쿼리 벡터를 가지므로 각 어텐션 헤드는 여전히 다른 \\( \mathbf{QK}^T \\) 행렬을 가집니다.
MQA는 커뮤니티에서 널리 채택되어 현재 가장 인기 있는 많은 LLM에서 사용되고 있습니다:
MQA는 커뮤니티에서 널리 채택되어 현재 가장 인기 있는 많은 LLM에서 사용되고 있습니다:
MQA는 커뮤니티에서 널리 채택되어 현재 가장 인기 있는 많은 대규모 언어 모델에서 사용되고 있습니다.
GQA는 최근에 제안되었기 때문에 이 노트북을 작성할 당시에는 채택이 덜 되었습니다.
GQA의 가장 주목할 만한 적용 사례는 [Llama-v2](https://huggingface.co/meta-llama/Llama-2-70b-hf)입니다.
> 결론적으로, LLM이 자가회귀 디코딩으로 배포되고 예를 들어 채팅과 같이 큰 입력 시퀀스를 처리해야 하는 경우 GQA 또는 MQA를 사용하는 것이 강력히 권장됩니다.
> 결론적으로, LLM이 자가회귀 디코딩으로 배포되고 예를 들어 채팅과 같이 큰 입력 시퀀스를 처리해야 하는 경우 GQA 또는 MQA를 사용하는 것이 강력히 권장됩니다.
> 결론적으로, 대규모 언어 모델이 자기회귀 디코딩으로 배포되면서 채팅과 같이 큰 입력 시퀀스를 가진 작업을 처리해야 하는 경우 GQA 또는 MQA를 사용하는 것이 강력히 권장됩니다.
연구 커뮤니티는 점점 더 큰 LLM의 추론 시간을 가속화하기 위한 새로운 기발한 방법들을 끊임없이 찾아내고 있습니다. 예를 들어, [speculative decoding](https://arxiv.org/abs/2211.17192)이라는 유망한 연구 방향이 있습니다. 여기서 "쉬운 토큰"은 더 작고 빠른 언어 모델에 의해 생성되고, "어려운 토큰"만 LLM 자체에 의해 생성됩니다. 자세한 내용은 이 노트북의 범위를 벗어나지만, [이 멋진 블로그 포스트](https://huggingface.co/blog/assisted-generation)에서 읽어볼 수 있습니다.
GPT3/4, Llama-2-70b, Claude, PaLM과 같은 거대한 LLM이 [Hugging Face Chat](https://huggingface.co/chat/) 또는 ChatGPT와 같은 채팅 인터페이스에서 빠르게 실행될 수 있는 이유는 위에서 언급한 정밀도, 알고리즘, 아키텍처의 개선 덕분입니다. 앞으로 GPU, TPU 등과 같은 가속기는 점점 더 빨라지고 더 많은 메모리를 사용할 것입니다. 따라서 가장 좋은 알고리즘과 아키텍처를 사용하여 최고의 효율을 얻는 것이 중요합니다 🤗
연구 커뮤니티는 점점 더 큰 LLM의 추론 시간을 가속화하기 위한 새로운 기발한 방법들을 끊임없이 찾아내고 있습니다. 예를 들어, [speculative decoding](https://arxiv.org/abs/2211.17192)이라는 유망한 연구 방향이 있습니다. 여기서 "쉬운 토큰"은 더 작고 빠른 언어 모델에 의해 생성되고, "어려운 토큰"만 LLM 자체에 의해 생성됩니다. 자세한 내용은 이 노트북의 범위를 벗어나지만, [이 멋진 블로그 포스트](https://huggingface.co/blog/assisted-generation)에서 읽어볼 수 있습니다.
GPT3/4, Llama-2-70b, Claude, PaLM과 같은 거대한 LLM이 [Hugging Face Chat](https://huggingface.co/chat/) 또는 ChatGPT와 같은 채팅 인터페이스에서 빠르게 실행될 수 있는 이유는 위에서 언급한 정밀도, 알고리즘, 아키텍처의 개선 덕분입니다. 앞으로 GPU, TPU 등과 같은 가속기는 점점 더 빨라지고 더 많은 메모리를 사용할 것입니다. 따라서 가장 좋은 알고리즘과 아키텍처를 사용하여 최고의 효율을 얻는 것이 중요합니다 🤗
연구 커뮤니티는 점점 더 큰 대규모 언어 모델의 추론 시간을 가속화하기 위한 새로운 기발한 방법들을 끊임없이 찾아내고 있습니다. 예를 들어, [추측 디코딩](https://arxiv.org/abs/2211.17192)이라는 유망한 연구 방향이 있습니다. 여기서 "쉬운 토큰"은 더 작고 빠른 언어 모델에 의해 생성되고, "어려운 토큰"만 대규모 언어 모델 자체에 의해 생성됩니다. 자세한 내용은 이 노트북의 범위를 벗어나지만, [멋진 블로그 포스트](https://huggingface.co/blog/assisted-generation)에서 읽어볼 수 있습니다.
GPT3/4, Llama-2-70b, Claude, PaLM과 같은 거대한 대규모 언어 모델이 [Hugging Face Chat](https://huggingface.co/chat/) 또는 ChatGPT와 같은 채팅 인터페이스에서 빠르게 실행될 수 있는 이유는 위에서 언급한 정밀도, 알고리즘, 아키텍처의 개선 덕분입니다. 앞으로 GPU, TPU 등과 같은 가속기는 점점 더 빨라지고 더 많은 메모리를 사용할 것입니다. 따라서 가장 좋은 알고리즘과 아키텍처를 사용하여 최고의 효율을 얻는 것이 중요합니다 🤗
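For reference on the speculative-decoding passage above: in Transformers this idea is exposed as assisted generation through the `assistant_model` argument of `generate`. Below is a minimal sketch; the checkpoint pair is an illustrative assumption, and any large/small pair of models sharing a tokenizer should work.

```python
# Sketch of assisted generation: a small draft model proposes the "easy" tokens,
# and the large model only verifies/corrects them. Checkpoints are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

large_ckpt = "bigcode/starcoder"          # assumed large model
draft_ckpt = "bigcode/tiny_starcoder_py"  # assumed small draft model with a matching tokenizer

tokenizer = AutoTokenizer.from_pretrained(large_ckpt)
model = AutoModelForCausalLM.from_pretrained(large_ckpt, device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(draft_ckpt).to(model.device)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```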
Co-authored-by: Chaewon Song <[email protected]> Co-authored-by: timdalxx <[email protected]> Co-authored-by: boyunJang <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Wow great job! Thanks for your time and effort translating this big doc 🤗
@stevhliu No worries! I'm quite comfortable with English. Thank you for the approval.
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Cleanup tool calling documentation and rename doc (#32337) * Rename "Templates for Chat Models" doc to "Chat Templates" * Small formatting fix * Small formatting fix * Small formatting fix * Cleanup tool calling docs as well * Remove unneeded 'revision' * Move tip to below main code example * Little bonus section on template editing * 🌐 [i18n-KO] Translated `deepspeed.md` to Korean (#32431) * Update _toctree.yml * docs: ko: deepspeed.md * Apply suggestions from code review Co-authored-by: wony617 <[email protected]> * Apply suggestions from code review Co-authored-by: wony617 <[email protected]> * Update docs/source/ko/_toctree.yml Co-authored-by: Steven Liu <[email protected]> * Update docs/source/ko/deepspeed.md * Update docs/source/ko/deepspeed.md Co-authored-by: SeungAhSon <[email protected]> * Apply suggestions from code review Co-authored-by: wony617 <[email protected]> * Update docs/source/ko/_toctree.yml --------- Co-authored-by: wony617 <[email protected]> Co-authored-by: Steven Liu <[email protected]> Co-authored-by: SeungAhSon <[email protected]> * 🌐 [i18n-KO] Translated `awq.md`to Korean (#32324) * fix: manual edits * Apply suggestions from code review Co-authored-by: SeongWooChoi <[email protected]> Co-authored-by: Chulhwa (Evan) Han <[email protected]> * fix:manual edits - 잘못된 경로에 번역본 파일을 생성해서 옮김 * Delete docs/source/ko/tasks/awq.md * Update docs/source/ko/_toctree.yml Co-authored-by: Steven Liu <[email protected]> --------- Co-authored-by: SeongWooChoi <[email protected]> Co-authored-by: Chulhwa (Evan) Han <[email protected]> Co-authored-by: Steven Liu <[email protected]> * fix: Fixed failing `test_find_base_model_checkpoint` (#32638) Fixed failing test_find_base_model_checkpoint. * Bump tensorflow from 2.11.1 to 2.12.1 in /examples/research_projects/decision_transformer (#32341) Bump tensorflow in /examples/research_projects/decision_transformer Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.11.1 to 2.12.1. - [Release notes](https://github.com/tensorflow/tensorflow/releases) - [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md) - [Commits](https://github.com/tensorflow/tensorflow/compare/v2.11.1...v2.12.1) --- updated-dependencies: - dependency-name: tensorflow dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * "to be not" -> "not to be" (#32636) * "to be not" -> "not to be" * Update sam.md * Update trainer.py * Update modeling_utils.py * Update test_modeling_utils.py * Update test_modeling_utils.py * fix: Updated the `is_torch_mps_available()` function to include `min_version` argument (#32545) * Fixed wrong argument in is_torch_mps_available() function call. * Fixed wrong argument in is_torch_mps_available() function call. * sorted the import. * Fixed wrong argument in is_torch_mps_available() function call. * Fixed wrong argument in is_torch_mps_available() function call. * Update src/transformers/utils/import_utils.py Co-authored-by: Arthur <[email protected]> * removed extra space. * Added type hint for the min_version parameter. * Added missing import. 
--------- Co-authored-by: Arthur <[email protected]> * Expand inputs in processors for VLMs (#30962) * let it be * draft * should not have changed * add warnings * fix & add tests * fix tests * ipnuts embeds cannot be passed with pixels * more updates * paligemma ready! * minor typos * update blip-2 * fix tests & raise error * docstring * add blip2 test * tmp * add image seq length to config * update docstring * delete * fix tests * fix blip * fix paligemma * out-of-place scatter * add llava-next-video * Update src/transformers/models/blip_2/modeling_blip_2.py Co-authored-by: Pablo Montalvo <[email protected]> * remove tmp * codestyle * nits * more nits * remove overriding in tests * comprehension when merging video * fix-copies * revert changes for embeds test * fix tests after making comprehension * Update src/transformers/models/blip_2/processing_blip_2.py Co-authored-by: Pablo Montalvo <[email protected]> * Update src/transformers/models/blip_2/processing_blip_2.py Co-authored-by: Pablo Montalvo <[email protected]> * more updates * fix tests --------- Co-authored-by: Pablo Montalvo <[email protected]> * Automatically add `transformers` tag to the modelcard (#32623) * Automatically add `transformers` tag to the modelcard * Specify library_name and test * Fix tests (#32649) * skip failing tests * [no-filter] * [no-filter] * fix wording catch in FA2 test * [no-filter] * trigger normal CI without filtering * fix tensors on different devices in `WhisperGenerationMixin` (#32316) * fix * enable on xpu * no manual remove * move to device * remove to * add move to * Add support for GrokAdamW optimizer (#32521) * add grokadamw * reformat * code review feedback, unit test * reformat * reformat * Add Depth Anything V2 Metric models (#32126) * add checkpoint and repo names * adapt head to support metric depth estimation * add max_depth output scaling * add expected logits * improve docs * fix docstring * add checkpoint and repo names * adapt head to support metric depth estimation * add max_depth output scaling * add expected logits * improve docs * fix docstring * rename depth_estimation to depth_estimation_type * add integration test * Refactored tests to include metric depth model inference test * Integration test pass when the timm backbone lines are commented (L220-L227) * address feedback * replace model path to use organization path * formatting * delete deprecated TODO * address feedback * [run_slow] depth_anything * Fix: Fixed directory path for utils folder in `test_tokenization_utils.py` (#32601) * Removed un-necessary expressions. 
* Fixed directory path for utils folder in test_tokenization_utils.py * Modify ProcessorTesterMixin for better generalization (#32637) * Add padding="max_length" to tokenizer kwargs and change crop_size to size for image_processor kwargs * remove crop_size argument in align processor tests to be coherent with base tests * Add pad_token when loading tokenizer if needed, change test override tokenizer kwargs, remove unnecessary test overwrites in grounding dino * TF_Deberta supporting mixed precision (#32618) * Update modeling_tf_deberta.py Corrected some codes which do not support mixed precision * Update modeling_tf_deberta_v2.py Corrected some codes which do not support mixed precision * Update modeling_tf_deberta_v2.py * Update modeling_tf_deberta.py * Add files via upload * Add files via upload * Fix tests recurrent (#32651) * add fix for recurrentgemma * [no-filter] * trigger-ci * [no-filter] * [no-filter] * attempt to fix mysterious zip error * [no-filter] * fix lookup error * [no-filter] * remove summarization hack * [no-filter] * Support MUSA (Moore Threads GPU) backend in transformers (#31913) Add accelerate version check, needs accelerate>=0.33.0 * fix: Fixed failing tests in `tests/utils/test_add_new_model_like.py` (#32678) * Fixed failing tests in tests/utils/test_add_new_model_like.py * Fixed formatting using ruff. * Small nit. * Update translation docs review (#32662) update list of people to tag * Add TorchAOHfQuantizer (#32306) * Add TorchAOHfQuantizer Summary: Enable loading torchao quantized model in huggingface. Test Plan: local test Reviewers: Subscribers: Tasks: Tags: * Fix a few issues * style * Added tests and addressed some comments about dtype conversion * fix torch_dtype warning message * fix tests * style * TorchAOConfig -> TorchAoConfig * enable offload + fix memory with multi-gpu * update torchao version requirement to 0.4.0 * better comments * add torch.compile to torchao README, add perf number link --------- Co-authored-by: Marc Sun <[email protected]> * Fix `JetMoeIntegrationTest` (#32332) JetMoeIntegrationTest Co-authored-by: ydshieh <[email protected]> * Update the distributed CPU training on Kubernetes documentation (#32669) * Update the Kubernetes CPU training example * Add namespace arg Signed-off-by: Dina Suehiro Jones <[email protected]> --------- Signed-off-by: Dina Suehiro Jones <[email protected]> * fix: Fixed unknown pytest config option `doctest_glob` (#32475) Fixed unknown config option doctest_glob. * Unpin deepspeed in Docker image/tests (#32572) Unpin deepspeed * Updated workflows to the latest versions (#32405) Updated few workflows to the latest versions. * reopen: llava-next fails to consider padding_side during Training (#32679) restore #32386 * fix: Corrected ` falcon-mamba-7b` model checkpoint name (#32837) Corrected the model checkpoint. 
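For readers skimming the `TorchAOHfQuantizer` entry earlier in this list (#32306), the sketch below shows roughly what loading a torchao-quantized model is expected to look like; the checkpoint id and the `"int4_weight_only"` quantization type string are illustrative assumptions, and that entry notes torchao>=0.4.0 is required.

```python
# Minimal sketch (assumed API) of the torchao integration referenced above:
# quantize a causal LM to int4 weights while loading it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder checkpoint
quant_config = TorchAoConfig("int4_weight_only", group_size=128)  # quant type string is an assumption

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quantized inference test:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```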
* fix: update doc link for runhouse in README.md (#32664) * VLMs: small clean-up for cache class (#32417) * fix beam search in video llava * [run-slow] video_llava * add back the position ids (#32554) * add back the position ids * fix failing test * Use head_dim if in config for RoPE (#32495) * use head_dim if in config for RoPE * typo * simplify with getattr * Generate: unify `LogitsWarper` and `LogitsProcessor` (#32626) * [tests] make test_sdpa_equivalence device-agnostic (#32520) * fix on xpu * [run_all] * Cache: use `batch_size` instead of `max_batch_size` (#32657) * more precise name * better docstrings * Update src/transformers/cache_utils.py Co-authored-by: Arthur <[email protected]> --------- Co-authored-by: Arthur <[email protected]> * Fix AutoConfig and AutoModel support for Llava-Next-Video (#32844) * Fix: fix all model_type of Llava-Next-Video to llava_next_video * Fix doc for llava_next_video * * Fix formatting issues * Change llava-next-video.md file name into llava_next_video.md to make it compatible with implementation * Fix docs TOC for llava-next-video * improve _get_is_as_tensor_fns (#32596) * improve _get_is_as_tensor_fns * format * Revert PR 32299, flag users when Zero-3 was missed (#32851) Revert PR 32299 * fix multi-gpu with static cache (#32543) * Reduce the error log when using core models that need their weights renamed, and provide a step forward (#32656) * Fin * Modify msg * Finish up nits * Make beam_constraints.Constraint.advance() docstring more accurate (#32674) * Fix beam_constraints.Constraint.advance() docstring * Update src/transformers/generation/beam_constraints.py Co-authored-by: Steven Liu <[email protected]> --------- Co-authored-by: Joao Gante <[email protected]> Co-authored-by: Steven Liu <[email protected]> * generate: missing `to` in DoLa body, causing exceptions in multi-gpu generation (#32856) * Add Flax Dinov2 (#31960) * tfmsenv restored in main * installed flax * forward pass done and all tests passed * make fix-copies and cleaning the scripts * fixup attempt 1 * fixup attempt 2 * fixup third attempt * fixup attempt 4 * fixup attempt 5 * dinov2 doc fixed * FlaxDinov2Model + ForImageClassification added to OBJECTS_TO_IGNORE * external pos_encoding layer removed * fixup attempt 6 * fixed integration test values * fixup attempt 7 * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/tran…
* Added mamba.py backend (#30139) * Update README.md * tests: forward ok * backward test done * done testing * removed check. scripts * Update README.md * added use_mambapy arg * fixed typo in warning * protected imports w/ mambapy package * delete pscan.py + raise rather than assert * Update import_utils.py * fix whitespaces and unused import * trailing whitespace + import block unformatted * Update modeling_mamba.py * transpose before pscan * shape comment * ran make style * use_mambapy=False by default Co-authored-by: Arthur <[email protected]> * ran make fix-copies --------- Co-authored-by: Arthur <[email protected]> * Rename Phi-3 rope scaling type (#31436) * renamed phi3 rope_scaling type * fixed trailing whitespaces * fixed test * added warning * fixed format * Revert "Incorrect Whisper long-form decoding timestamps " (#32148) Revert "Incorrect Whisper long-form decoding timestamps (#32003)" This reverts commit cd48553fc8375e1a28d4d82cfe231dedf6a23af8. * Fix typing to be compatible with later py versions (#32155) * feat(cache): StaticCache uses index_copy_ to avoid useless copy (#31857) * feat(cache): StaticCache uses index_copy_ to avoid useless copy Using index_copy_ allows for explicit in-place change of the tensor. Some backends (XLA) will otherwise copy the tensor, making the code slower and using more memory. Proposed implementation will end up using less memory and on XLA will result in less compilation, but the change is also quite generic, making no change whatsoever on CUDA or CPU backend. * feat(cache): SlidingWindowCache uses index_copy_ to avoid useless copy Applying the same change done in StaticCache. * fix(cache): fallback of index_copy_ when not implemented * fix(cache): in index_copy_ ensure tensors are on same device * [run slow] llama * fix(cache): add move of cache_position to same device in SlidingWindowCache * Revert "[run slow] llama" This reverts commit 02608dd14253ccd464e31c108e0cd94364f0e8b9. * Added additional kwarg for successful running of optuna hyperparameter search (#31924) Update integration_utils.py Added additional kwarg * Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs (#31629) * add DataCollatorBatchFlattening * Update data_collator.py * change name * new FA2 flow if position_ids is provided * add comments * minor fix * minor fix data collator * add test cases for models * add test case for data collator * remove extra code * formating for ruff check and check_repo.py * ruff format ruff format tests src utils * custom_init_isort.py * Updated `ruff` to the latest version (#31926) * Updated ruff version and fixed the required code accorindg to the latest version. * Updated ruff version and fixed the required code accorindg to the latest version. * Added noqa directive to ignore 1 error shown by ruff * Dev version: v4.44.0.dev0 * Llama 3.1 conversion Co-authored-by: Arthur Zucker <[email protected]> * fix (#32162) * fix: Fixed an if condition that is always evaluating to true (#32160) Fixed an if condition always evaluating to true. * [docs] change temperature to a positive value (#32077) fix * adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer (#32171) * adds: extra_repr() to MambaRMSNorm to include the hidden size of the layer * style fix with ruff: * fix: default value reflects the runtime environment variables rather than the ones present at import time. (#32153) * fix: default value reflects the runtime environment variables rather than the ones present at import time. 
* Fix: Change `deterministic` to None by default; use env var if None * Update qwen2.md (#32108) * Update qwen2.md outdated description * Update qwen2.md amended * Update qwen2.md Update * Update qwen2.md fix wrong version code, now good to go * Remove conversational pipeline tests (#32099) Remove conversation pipeline tests * RoPE: relaxed rope validation (#32182) * relaxed rope check * lets also accept rope_type=None, defaulting to the original implementation * type and rope_type can coexist * let's not warn when someone is running a forward (#32176) * let's not warn when someone is running a foward without cache + self.training * more models * fixup * Fix resize embedding with Deepspeed (#32192) fix resize when deepspeed * Fix float8_e4m3fn in modeling_utils (#32193) * Fix float8_e4m3fn in modeling_utils * style * fix * comment * Support dequantizing GGUF FP16 format (#31783) * support gguf fp16 * support gguf bf16 with pytorch * add gguf f16 test * remove bf16 * :rotating_light: No more default chat templates (#31733) * No more default chat templates * Add the template to the GPT-SW3 tests since it's not available by default now * Fix GPT2 test * Fix Bloom test * Fix Bloom test * Remove default templates again * fix: Replaced deprecated `unittest method` with the correct one (#32198) Replaced deprecated unittest method with the correct one. * [whisper] fix short-form output type (#32178) * [whisper] fix short-form output type * add test * make style * update long-form tests * fixes * last fix * finalise test * remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 (#32210) remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 * Update question_answering.py (#32208) * [BigBird Pegasus] set _supports_param_buffer_assignment to False (#32222) set _supports_param_buffer_assignment to False * [warnings] fix E721 warnings (#32223) fix E721 warnings * Follow up for #31973 (#32025) * fix * [test_all] trigger full CI --------- Co-authored-by: ydshieh <[email protected]> * translate philosophy.md to chinese (#32177) * translate philosophy.md to chinese * add the missing link * Allow a specific microphone to be used by the ffmpeg audio pipeline utility functions. 
Default to using the currently active microphone on Mac (#31846) * use currently active microphone on mac for ffmpeg_microphone * Allow ffmpeg_microphone device to be specified Co-authored-by: amyeroberts <[email protected]> --------- Co-authored-by: amyeroberts <[email protected]> * Fix code snippet for Grounding DINO (#32229) Fix code snippet for grounding-dino * Generation: stop at `eos` for assisted decoding (#31301) * fix * move changes to prompt lookup * add test * set eos in assistant model * style * fix flakiness * changes for new `main` * Update tests/generation/test_utils.py Co-authored-by: amyeroberts <[email protected]> * Update tests/generation/test_utils.py Co-authored-by: amyeroberts <[email protected]> * add comment to explain --------- Co-authored-by: amyeroberts <[email protected]> * Llava: generate without images (#32183) * llava w/o images * tests * Resize embeds with DeepSpeed (#32214) * fix resize when deepspeed * deepsped uses new embeds * we needed this * don't log base model architecture in wandb if log model is false (#32143) * don't log base model architecture in wandb is log model is false * Update src/transformers/integrations/integration_utils.py Co-authored-by: amyeroberts <[email protected]> * convert log model setting into an enum * fix formatting --------- Co-authored-by: amyeroberts <[email protected]> * Refactor: Removed un-necessary `object` base class (#32230) * Refactored to remove un-necessary object base class. * small fix. * Adds: extra_repr for RMSNorm layers in most models (#32204) * adds: extra_repr() to RMSNorm layers in multiple models * adds: extra_repr for deprecated models as well * formatting as per style guide * Add check for `target_sizes is None` in `post_process_image_guided_detection` for owlv2 (#31934) * Add check for target_sizes is None in post_process_image_guided_detection * Make sure Owlvit and Owlv2 in sync * Fix incorrect indentation; add check for correct size of target_sizes * [tests] fix `static` cache implementation is not compatible with `attn_implementation==flash_attention_2` (#32039) * add flash attention check * fix * fix * Flash-Attn: fix generation when no attention mask or no pading (#32241) * fix * fix prev test (half of failures) * [run-slow] llama, gemma2 * [run-slow] llama, gemma2 * More flexible trigger condition (#32251) update Co-authored-by: ydshieh <[email protected]> * Llama 3.1: replace for loop by tensor ops at inv_freq initialization (#32244) * replace for loop by tensor ops * rm assert; readability * 🚨 Bloom support for cache class (#31445) * bloom dynamic cache * bloom follows standard cache format * no skips for bloom anymore * use cache position when possible * clean up * codestyle * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: amyeroberts <[email protected]> * pr comments * isinstance fix * address comments * make musicgen test happy * [run-slow] bloom --------- Co-authored-by: amyeroberts <[email protected]> * Upload new model failure report to Hub (#32264) upload Co-authored-by: ydshieh <[email protected]> * Optimize t5 tokenize logic to avoid redundant calls (#32270) * Optimize t5 tokenize logic to avoid redundant calls * fix and overwrite copies * fix: Fixed wrong argument passed to `convert_blip_checkpoint` function call (#32262) Removed one wrong argument 
passed to convert_blip_checkpoint function call. * Repo: remove exceptions in `check_docstrings` (#32259) remove exceptions * make `p_mask` a numpy array before passing to `select_starts_ends` (#32076) * fix * bug fix * refine * fix * fix(docs): Fixed a link in docs (#32274) Fixed a link in docs. * Generate: end-to-end compilation (#30788) * mvp * added test (a few models need fixes) * fix a few test cases * test nits * harder test 😈 * revert changes in stablelm * test with improved condition * add todo * tmp commit * merged with main * nits * add todo * final corrections * add docs for generation compilation * docs nits * add tip * PR suggestions * add more details to the compilation docs * fix cache positions * cache is now init in generate; update docs * tag test as flaky * docs * post rebase make fixup and other nits * remove unintended changes * whisper (encoder-decoder) not supported * move token default updates to ; add tests for token defaults * push changes * manual rebase * chameleon doesn't support this * fix test_static_cache_mha_mqa_gqa (broken in another PR) * docs: dynamic is better with end-to-end compilation * Whisper tokenizer word level timestamps (#32197) * fix _fix_key in PreTrainedModel * fix _find_longest_common_sequence * add test * remove result.json * nit * update test * [pipeline] fix padding for 1-d tensors (#31776) * [pipeline] fix padding for 1-d tensors * add test * make style * Update tests/pipelines/test_pipelines_automatic_speech_recognition.py Co-authored-by: Kamil Akesbi <[email protected]> * Update tests/pipelines/test_pipelines_automatic_speech_recognition.py --------- Co-authored-by: Kamil Akesbi <[email protected]> * Make static cache compatible with torch.export (#32168) * Add stream messages from agent run for gradio chatbot (#32142) * Add stream_to_gradio method for running agent in gradio demo * use torch 2.4 in 2 CI jobs (#32302) Co-authored-by: ydshieh <[email protected]> * Docs: fix GaLore optimizer code example (#32249) Docs: fix GaLore optimizer example Fix incorrect usage of GaLore optimizer in Transformers trainer code example. The GaLore optimizer uses low-rank gradient updates to reduce memory usage. GaLore is quite popular and is implemented by the authors in [https://github.com/jiaweizzhao/GaLore](https://github.com/jiaweizzhao/GaLore). A few months ago GaLore was added to the HuggingFace Transformers library in https://github.com/huggingface/transformers/pull/29588. Documentation of the Trainer module includes a few code examples of how to use GaLore. However, the `optim_targe_modules` argument to the `TrainingArguments` function is incorrect, as discussed in https://github.com/huggingface/transformers/pull/29588#issuecomment-2006289512. This pull request fixes this issue. 
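The GaLore entry just above describes a documentation fix around a misspelled `TrainingArguments` argument; a minimal sketch of the corrected usage, assuming the `galore-torch` package is installed for actual training, might look like this (output directory, step count, and module patterns are placeholders):

```python
# Minimal sketch (assumed usage): enabling the GaLore low-rank optimizer via TrainingArguments.
# Note the corrected argument name `optim_target_modules` (the doc example had `optim_targe_modules`).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./galore-test",            # placeholder
    optim="galore_adamw",                  # GaLore variant of AdamW (low-rank gradient projections)
    optim_target_modules=["attn", "mlp"],  # patterns selecting which modules GaLore is applied to
    per_device_train_batch_size=1,
    max_steps=100,
)
# `args` is then passed to Trainer(...) as usual; training itself requires the galore-torch package.
```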
* Fix GGUF dequantize for `gguf==0.9.1` (#32298) * fix gguf dequantize for gguf==0.9.1 * fix old version * make style * Cast epochs_trained to int when resuming training (#32286) * fix epochs_trained as int when resuming training * refactor --------- Co-authored-by: teddyferdinan <[email protected]> * feat(ci): set `fetch-depth: 0` in trufflehog checkout step (#31663) * Fix M4T for ASR pipeline (#32296) * tentative fix * do the same for M4T * Docs: formatting nits (#32247) * doc formatting nits * ignore non-autodocs * Apply suggestions from code review Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/esm/modeling_esm.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/esm/modeling_esm.py Co-authored-by: amyeroberts <[email protected]> * make fixup --------- Co-authored-by: amyeroberts <[email protected]> * Alternative agent plan (#32295) * new agent plan * plan type assertion * style corrections * better prompt naming * make fixup * fix: Added missing raise keyword for few exceptions (#32333) Fixed raising of few exceptions. * fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit (#32276) * fixes #32329 : The Torch code is correct - to get an average of 10% o… (#32335) fixes #32329 : The Torch code is correct - to get an average of 10% of the total, we want to take 50% of the remainder after we've already masked 80% with [MASK] in the previous step. * Repo checks: skip docstring checks if not in the diff (#32328) * tmp * skip files not in the diff * use git.Repo instead of an external subprocess * add tiny change to confirm that the diff is working on pushed changes * add make quality task * more profesh main commit reference * Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process (#32191) * Remove user-defined tokens which can be obtained through merges * Remove debug line * formatting * Refactor spm slow -> fast converter * revert unnecessary refactor * set comprehension * remove test files * Use `vocab_scores` * Always replace spiece underline with space in decode * we no longer need token filtering * Add save fast load slow unit test * Remove tokenizers version check * Remove duplicate code * Make `<start_of_turn>` and `<end_of_turn>` special tokens * Bias merge priority with length if score is the same * Add unit test for merge priority * CI * LLaVA-NeXT: fix anyres shapes (#32314) fix * Gemma2 and flash-attention (#32188) * enable flash-attn & static cache * this works, not the prev * fix for sliding window layers * not needed anymore * Llama 3.1: Fix incorrect `inv_freq` assignment (#32330) fix 💩 * [Idefics2] - Fix FA2 call for Perceiver layer (#32275) * Fix FA2 call for Perciever layer * [run_slow] idefics2 * [run_slow] idefics2 * [run_slow] idefics2 * Fix up * [run_slow] idefics2 * [run_slow] idefics2 * [run_slow] idefics2 * Gemma 2: support assisted generation (#32357) * Fix error when streaming to gradio with non-string tool arguments (#32360) Fix error when streaming agent run to gradio with non-string tool arguments * >3-5x faster torch.compile forward compilation for autoregressive decoder models (#32227) * draft * apply changes to all relevant archs * rerun ci - check_docstrings.py failing? 
* fix docstring * move 2D->4D mask creation to modeling file * repo consistency * fix the batch size = 1 case - calling contiguous is not enough * nit * style * propagate to gemma/gemma-2 * prepare inputs for gemma generation * implement test and tiny fix in gemma2 * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: Arthur <[email protected]> * fix copies * ci pass * fix gemma's test_compile_static_cache tests * flacky * retrigger ci --------- Co-authored-by: sanchit-gandhi <[email protected]> Co-authored-by: Arthur <[email protected]> * fix: Removed unnecessary `@staticmethod` decorator (#32361) * Fixed staticmethods with self as first argument. * Fixed staticmethods with self as first argument. * Fixed staticmethods with self as first argument. * Fixed staticmethods with self as first argument. * fix: warmup_steps check for training_args (#32236) * LLaVa: add cache class attribute (#32278) cache class flag * [enc-dec cache] fix bug in indexing (#32370) * [whisper] compile compatibility with long-form decoding (#31772) * [whisper] compile compatibility with long-form decoding * clarify comment * fix after rebase * finalise * fix bsz * fix cache split * remove contiguous * style * finish * update doc * prevent cuda graph trace * Remove size check between attn_weights and kv_seq_len for phi3 (#32339) * Remove size check between attn_weights and kv_seq_len * add unit tests * add missing attribute _supports_param_buffer_assignment for gpt-j. (#32359) Co-authored-by: Guoming Zhang <[email protected]> * Check device map for saving tokenizer config on TPU (fix for issue #31971) (#32043) * Remove TPU device map for saving tokenizer config * Update tokenization_utils_base.py * Fix error msg when passing non-string device into tokenizer * Fix error message for non-string tokenizer device * Print out tokenizer device type in error msg * Update tokenization_utils_base.py * update clean_up_tokenization_spaces warning (#32371) * Empty list in defaults for LLaMA special tokens during weights conversion (#32342) empty list in defaults * Fix conflicting key in init kwargs in PreTrainedTokenizerBase (#31233) * Fix conflicting key in init kwargs in PreTrainedTokenizerBase * Update code to check for callable key in save_pretrained * Apply PR suggestions * Invoke CI * Updates based on PR suggestion * Offloaded KV Cache (#31325) * Initial implementation of OffloadedCache * enable usage via cache_implementation * Address feedback, add tests, remove legacy methods. * Remove flash-attn, discover synchronization bugs, fix bugs * Prevent usage in CPU only mode * Add a section about offloaded KV cache to the docs * Fix typos in docs * Clarifications and better explanation of streams * Docker: add `speech` dep to the consistency docker image (#32374) * Fixed Hybrid Cache Shape Initialization. (#32163) * fixed hybrid cache init, added test * Fix Test Typo --------- Co-authored-by: Aaron Haag <[email protected]> * Yell at the user if zero-3 init wasn't performed, but expected to have been done (#32299) * Test this zach * Test for improper init w/o zero3 * Move back * Apply suggestions from code review Co-authored-by: amyeroberts <[email protected]> * Get rid of stars in warning * Make private * Make clear --------- Co-authored-by: amyeroberts <[email protected]> * Update docs (#32368) nits * RoPE: Add numerical tests ✨ (#32380) tests! 
:D * [generate] only require an attention mask for mps with torch<2.4 (#32367) * up * style * stopping * fix: (issue #32124) Exception raised when running `transformers/examples/flax/language-modeling/t5_tokenizer_model.py`. (#32157) fix: Exception raised when running . * MixtralFlashAttention2: put "plus 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. (#31500) * Mixtral: remove unnecessary plus 1 when calculating rotary_seq_len, allowing position_ids=None (no auto position_ids generation could be unsafe) * fix typo [:-1] to [:, -1] * to meet formatting requirement * to meet formatting requirement * remove white space * MixtralFlashAttention2: put "+ 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. Fix format/style issue. * propagate to startcoder2, phi3, mixtral and qwen2 * update qwen2_moe * Bump keras from 2.8.0 to 2.13.1 in /examples/research_projects/decision_transformer (#32393) Bump keras in /examples/research_projects/decision_transformer Bumps [keras](https://github.com/keras-team/keras) from 2.8.0 to 2.13.1. - [Release notes](https://github.com/keras-team/keras/releases) - [Commits](https://github.com/keras-team/keras/compare/v2.8.0...v2.13.1) --- updated-dependencies: - dependency-name: keras dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * fix: SeamlessM4TFeatureExtractor stride remainder (#32088) * fix: SeamlessM4TFeatureExtractor stride remainder * Added attention mask size test * Reran ruff for style correction * Phi3 tests: fix typing for Python 3.8 (#32388) fix phi * #32184 save total_vocab_size (#32240) * save total_vocab_size = vocab_size + user added tokens to speed up operation * updating length when added_tokens_decoder is set * add test len(tokenizer) * add values for neftune (#32399) I always forget what typical values are, and I have to look at the paper everytime. This will be a helpful reminder. * Fix documentation references to google/bit-50 model (#32407) * Persist embedding type of BART and mBART models after resize (#32242) * fix: persist embedding type of MBartConditonalGeneration after resize * fix: persist embedding type of BartConditonalGeneration after resize * fix: Updated `test_embeded_special_tokens` for luke and mluke models (#32413) Fixed tokenizertests for luke, mluke models. * Respect the config's attn_implementation if set (#32383) * Respect the config's attn if set * Update test - can override in from_config * Fix * Fix documentation links and code reference to model llava-next (#32434) * Cache: create docs (#32150) * draft * updates * works? * try adding python example in hidden section * another try * hwo do i render python * format as html code? * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <[email protected]> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <[email protected]> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <[email protected]> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <[email protected]> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <[email protected]> * one more small update * should render hidden secrtion now * add outputs * fix links * check links * update all links * update with offloaded cache * all cache is importable, so they appear in docs * fix copies * docstring... 
--------- Co-authored-by: Joao Gante <[email protected]> * Llava: fix checkpoint_doc (#32458) fix: add new llava like model bug * add the missing flash attention test marker (#32419) * add flash attention check * fix * fix * add the missing marker * bug fix * add one more * remove order * add one more * Update kwargs validation for `preprocess` with decorator (#32024) * BLIP preprocess * BIT preprocess * BRIDGETOWER preprocess * CHAMELEON preprocess * CHINESE_CLIP preprocess * CONVNEXT preprocess * DEIT preprocess * DONUT preprocess * DPT preprocess * FLAVA preprocess * EFFICIENTNET preprocess * FUYU preprocess * GLPN preprocess * IMAGEGPT preprocess * INTRUCTBLIPVIDEO preprocess * VIVIT preprocess * ZOEDEPTH preprocess * VITMATTE preprocess * VIT preprocess * VILT preprocess * VIDEOMAE preprocess * VIDEOLLAVA * TVP processing * TVP fixup * SWIN2SR preprocess * SIGLIP preprocess * SAM preprocess * RT-DETR preprocess * PVT preprocess * POOLFORMER preprocess * PERCEIVER preprocess * OWLVIT preprocess * OWLV2 preprocess * NOUGAT preprocess * MOBILEVIT preprocess * MOBILENETV2 preprocess * MOBILENETV1 preprocess * LEVIT preprocess * LAYOUTLMV2 preprocess * LAYOUTLMV3 preprocess * Add test * Update tests * Fix get large model config for Switch Transformer encoder only tester (#32438) * Dependencies: fix typo (#32389) deps_2 * Add Nemotron HF Support (#31699) * Add nemotron support * fix inference * add unit test * add layernorm1p as a class to avoid meta device mismatch * test fixed * Add copied_from statements * remove pretraining_tp args * remove nemotronlayernorm * force LN computation done in FP32 * remove nemotrontokenizer and use llamatokenizer * license update * add option for kv_channels for minitron8b * remove assert * o_proj fixed * o_proj reshape * add gated_proj option * typo * remove todos * fix broken test after merging latest main * remove nezha/nat after meging main * chnage default config to 15b model * add nemo conversion script * rename conversion script * remove gate_proj option * pr comment resolved * fix unit test * rename kv_channels to head_dim * resolve PR issue * add nemotron md * fix broken tests * refactor rope for nemotron * test fix * remove linearscaling * whitespace and import * fix some copied-from * code style fix * reformatted * add position_embedding to nemotronattention * rope refactor to only use config, copied-from fix * format * Run make fix-copies * nemotron md with autodoc * doc fix * fix order * pass check_config_docstrings.py * fix config_attributes * remove all llama BC related code * Use PreTrainedTokenizerFast * ruff check examples * conversion script update * add nemotron to toctree * Generate: fix end to end compilation (#32465) * Add codestral mamba2 (#32080) * add new model like * draft cuda forward - mismatched keys (sharding on conv1) * match keys successfully * fix split * get generation/forward running (wrong gens, norm?) * :update * some refactoring * fixes * works up until copy to cache * fix * update * NON WORKING VERSION * version that work? 
* nit * fix config * fix conversion script * working cuda forward * nit * update * simplifcation * make mamba slow simple work * no einops * todo * fix style * no einops * update fix no einsum * nit * remove einops * bug: scan_output differs strongly * add rms norm option * fix fast + slow generation with and w/o cache :heavy_check_mark: * draft integration tests * remove a big chunk of the einsum * fix slow, fast generations, without any einsum * fix copies * fix structure * fix up modeling and tests * fix tests * clamping is indeed worse * recover mamba2 cache test * fix copies * no cache position (yet) * fix tf tests * fix matmul for generate * fixup * skip cache tests for now * [run-slow]mamba2 * tune out hidden states for padding * test batched generation * propagate attention mask changes * fix past length * fix integration test * style * address comments * update readme * add mamba2 version check * fix tests * [run-slow]mamba2 * skip edge tests * [run-slow]mamba2 * last fixup * [run-slow]mamba2 * update README --------- Co-authored-by: Arthur Zucker <[email protected]> * Migrate import checks not need accelerate, and be more clear on min versions (#32292) * Migrate import checks to secondary accelerate calls * better errs too * Revert, just keep the import checks + remove accelerate-specific things * Rm extra' * Empty commit for ci * Small nits * Final * Documentation: BOS token_id deprecation change for NLLB (#32443) Update nllb.md * dev version 4.45.0 * `is_torchdynamo_compiling` -- cast a wide exception net (#32476) * cast a wide net * make fix-copies with a few manual changes * add copied from * Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (#32276)" (#32477) * Revert "fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit (#32276)" This reverts commit 62c60a30181a65e1a3a7f19c3055a240a6a21335. We uncovered an issue with this change that caused our training runs to hang. 
* `is_torchdynamo_compiling` -- cast a wide exception net (#32476) * cast a wide net * make fix-copies with a few manual changes * add copied from --------- Co-authored-by: Joao Gante <[email protected]> * 🌐 [i18n-KO] Translated `mask_generation.md` to Korean (#32257) * docs: ko: tasks/mask_generation.md * feat: nmt draft * fix : toc local * fix : manual edits * fix : ko-toctree * fix: resolve suggestions Co-authored-by: boyunJang <[email protected]> Co-authored-by: Chaewon Song <[email protected]> * fix: resolve suggestions Co-authored-by: boyunJang <[email protected]> Co-authored-by: Chaewon Song <[email protected]> * fix: resolve suggestions * fix: resolve suggestions * fix: resolve suggestions --------- Co-authored-by: boyunJang <[email protected]> Co-authored-by: Chaewon Song <[email protected]> * 🌐 [i18n-KO] Translated `idefics.md` to Korean (#32258) * docs: ko: tasks/idefics.md * feat: nmt draft * fix: manual edits * fix: resolve suggestions Co-authored-by: Chaewon Song <[email protected]> Co-authored-by: Harheem Kim <[email protected]> Co-authored-by: timdalxx <[email protected]> --------- Co-authored-by: Chaewon Song <[email protected]> Co-authored-by: Harheem Kim <[email protected]> Co-authored-by: timdalxx <[email protected]> * 🌐 [i18n-KO] Translated `image_to_image.md` to Korean (#32327) * docs: ko: tasks/image_to_image.md * feat: nmt draft * fix: manual edits * fix: resolve suggestions Co-authored-by: Jihun Lim <[email protected]> Co-authored-by: Jiwook Han <[email protected]> * fix: handle remaining suggestions Co-authored-by: Jiwook Han <[email protected]> --------- Co-authored-by: Jihun Lim <[email protected]> Co-authored-by: Jiwook Han <[email protected]> * Cache: new Cache format in decoder-only models (#31421) * draft bart with new cache * add cache for decoder-only models * revert utils * modify docstring * revert bart * minor fixes * fix copies (not related) * revert tests * remove enc-dec related code * remove bloom * remove opt (enc-dec) * update docstring * git, codegen, gpt_neo, gpt_neox, gpj * clean up * copied from statements * revert * tmp * update warning msg * forgot git * add more flags * run-slow git,codegen,gpt_neo,gpt_neox,gpj * add cache flag to VLMs * remove files * style * video LLMs also need a flag * style * llava will go in another PR * style * [run-slow] codegen, falcon, git, gpt_neo, gpt_neox, gptj, idefics * Update src/transformers/models/gpt_neo/modeling_gpt_neo.py Co-authored-by: Arthur <[email protected]> * copy from * deprecate until v4.45 and warn if not training * nit * fix test * test static cache * add more tests and fix models * fix copies * return sliding window mask * run slow tests & fix + codestyle * one more falcon fix for alibi --------- Co-authored-by: Arthur <[email protected]> * Gemma2: add cache warning (#32279) * gemma2 fallback to dynamic cache * Update src/transformers/models/gemma2/modeling_gemma2.py Co-authored-by: Joao Gante <[email protected]> * Update src/transformers/models/gemma2/modeling_gemma2.py Co-authored-by: Arthur <[email protected]> * raise error and dont fallback to dynamic cache * prev will break most forward calls/tests * Update src/transformers/models/gemma2/modeling_gemma2.py Co-authored-by: Arthur <[email protected]> * update * fix copies --------- Co-authored-by: Joao Gante <[email protected]> Co-authored-by: Arthur <[email protected]> * enable xla fsdp (#32048) * enable xla fsdp * add acceleration version check for xla fsdp * Fix typo in tokenization_utils_base.py (#32484) * Agents use grammar (#31735) 
* Allow optional use of grammars to constrain generation * fix broken link in docs (#32491) `https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.TextGenerationPipeline.__call__` `generate_kwargs (dict, optional) — Additional keyword arguments to pass along to the generate method of the model (see the generate method corresponding to your framework here).` link in "here" doesnt work * Docs: alert for the possibility of manipulating logits (#32467) * logits * words * 🌐 [i18n-KO] Translated `gptq.md` to Korean (#32293) * fix: manual edits * fix: manual edits2 * fix: delete files * fix: resolve suggestions Co-authored-by: Sungmin Oh <[email protected]> Co-authored-by: SeungYoun Lee <[email protected]> Co-authored-by: 김준재 <[email protected]> * fix: resolve suggestions Co-authored-by: Steven Liu <[email protected]> --------- Co-authored-by: Sungmin Oh <[email protected]> Co-authored-by: SeungYoun Lee <[email protected]> Co-authored-by: 김준재 <[email protected]> Co-authored-by: Steven Liu <[email protected]> * 🌐 [i18n-KO] Translated `prompting.md` to Korean (#32294) * docs: ko: tasks/prompting.md * feat: nmt-draft * fix: update translation in prompting.md * fix: update toctree.yml * fix: manual edits * fix: toctree edits * fix: resolve suggestions Co-authored-by: boyunJang <[email protected]> Co-authored-by: Harheem Kim <[email protected]> Co-authored-by: timdalxx <[email protected]> --------- Co-authored-by: boyunJang <[email protected]> Co-authored-by: Harheem Kim <[email protected]> Co-authored-by: timdalxx <[email protected]> * 🌐 [i18n-KO] Translated `quantization/quanto.md` to Korean (#32281) * docs: ko: quantization/quanto.md * feat: nmt draft * fix: resolve suggestions Co-authored-by: SeungYoun Lee <[email protected]> Co-authored-by: Minki Kim <[email protected]> Co-authored-by: 김준재 <[email protected]> * fix: resolve suggestions Co-authored-by: SeungYoun Lee <[email protected]> --------- Co-authored-by: SeungYoun Lee <[email protected]> Co-authored-by: Minki Kim <[email protected]> Co-authored-by: 김준재 <[email protected]> * 🌐 [i18n-KO] Translated `image_feature_extraction.md` to Korean (#32239) * docs: ko: tasks/images_feature_extraction.md * feat: nmt draft * fix: manual edits * fix: manual edits * fix: manual edits * fix: manual edits * feat: manual edits * Update docs/source/ko/tasks/image_feature_extraction.md Co-authored-by: Jihun Lim <[email protected]> * Update docs/source/ko/tasks/image_feature_extraction.md Co-authored-by: Jihun Lim <[email protected]> * fix: manual edits --------- Co-authored-by: Jihun Lim <[email protected]> * Fix references to model google mt5 small (#32497) * Docs: Fixed WhisperModel.forward’s docstring link (#32498) Fixed WhisperModel.forward’s docstring link. 
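The broken-link entry above quotes the `generate_kwargs` description from `TextGenerationPipeline.__call__`; as a rough illustration (model name and sampling values are placeholders, not taken from the entry), extra keyword arguments passed to the pipeline call are forwarded to the model's `generate` method:

```python
# Minimal sketch (assumed behaviour): generation kwargs given to the text-generation
# pipeline call are passed along to model.generate().
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

outputs = generator(
    "The KV cache stores",
    max_new_tokens=20,   # forwarded to generate()
    do_sample=True,
    temperature=0.7,
)
print(outputs[0]["generated_text"])
```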
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Cleanup tool calling documentation and rename doc (#32337) * Rename "Templates for Chat Models" doc to "Chat Templates" * Small formatting fix * Small formatting fix * Small formatting fix * Cleanup tool calling docs as well * Remove unneeded 'revision' * Move tip to below main code example * Little bonus section on template editing * 🌐 [i18n-KO] Translated `deepspeed.md` to Korean (#32431) * Update _toctree.yml * docs: ko: deepspeed.md * Apply suggestions from code review Co-authored-by: wony617 <[email protected]> * Apply suggestions from code review Co-authored-by: wony617 <[email protected]> * Update docs/source/ko/_toctree.yml Co-authored-by: Steven Liu <[email protected]> * Update docs/source/ko/deepspeed.md * Update docs/source/ko/deepspeed.md Co-authored-by: SeungAhSon <[email protected]> * Apply suggestions from code review Co-authored-by: wony617 <[email protected]> * Update docs/source/ko/_toctree.yml --------- Co-authored-by: wony617 <[email protected]> Co-authored-by: Steven Liu <[email protected]> Co-authored-by: SeungAhSon <[email protected]> * 🌐 [i18n-KO] Translated `awq.md`to Korean (#32324) * fix: manual edits * Apply suggestions from code review Co-authored-by: SeongWooChoi <[email protected]> Co-authored-by: Chulhwa (Evan) Han <[email protected]> * fix:manual edits - 잘못된 경로에 번역본 파일을 생성해서 옮김 * Delete docs/source/ko/tasks/awq.md * Update docs/source/ko/_toctree.yml Co-authored-by: Steven Liu <[email protected]> --------- Co-authored-by: SeongWooChoi <[email protected]> Co-authored-by: Chulhwa (Evan) Han <[email protected]> Co-authored-by: Steven Liu <[email protected]> * fix: Fixed failing `test_find_base_model_checkpoint` (#32638) Fixed failing test_find_base_model_checkpoint. * Bump tensorflow from 2.11.1 to 2.12.1 in /examples/research_projects/decision_transformer (#32341) Bump tensorflow in /examples/research_projects/decision_transformer Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.11.1 to 2.12.1. - [Release notes](https://github.com/tensorflow/tensorflow/releases) - [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md) - [Commits](https://github.com/tensorflow/tensorflow/compare/v2.11.1...v2.12.1) --- updated-dependencies: - dependency-name: tensorflow dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * "to be not" -> "not to be" (#32636) * "to be not" -> "not to be" * Update sam.md * Update trainer.py * Update modeling_utils.py * Update test_modeling_utils.py * Update test_modeling_utils.py * fix: Updated the `is_torch_mps_available()` function to include `min_version` argument (#32545) * Fixed wrong argument in is_torch_mps_available() function call. * Fixed wrong argument in is_torch_mps_available() function call. * sorted the import. * Fixed wrong argument in is_torch_mps_available() function call. * Fixed wrong argument in is_torch_mps_available() function call. * Update src/transformers/utils/import_utils.py Co-authored-by: Arthur <[email protected]> * removed extra space. * Added type hint for the min_version parameter. * Added missing import. 
--------- Co-authored-by: Arthur <[email protected]> * Expand inputs in processors for VLMs (#30962) * let it be * draft * should not have changed * add warnings * fix & add tests * fix tests * ipnuts embeds cannot be passed with pixels * more updates * paligemma ready! * minor typos * update blip-2 * fix tests & raise error * docstring * add blip2 test * tmp * add image seq length to config * update docstring * delete * fix tests * fix blip * fix paligemma * out-of-place scatter * add llava-next-video * Update src/transformers/models/blip_2/modeling_blip_2.py Co-authored-by: Pablo Montalvo <[email protected]> * remove tmp * codestyle * nits * more nits * remove overriding in tests * comprehension when merging video * fix-copies * revert changes for embeds test * fix tests after making comprehension * Update src/transformers/models/blip_2/processing_blip_2.py Co-authored-by: Pablo Montalvo <[email protected]> * Update src/transformers/models/blip_2/processing_blip_2.py Co-authored-by: Pablo Montalvo <[email protected]> * more updates * fix tests --------- Co-authored-by: Pablo Montalvo <[email protected]> * Automatically add `transformers` tag to the modelcard (#32623) * Automatically add `transformers` tag to the modelcard * Specify library_name and test * Fix tests (#32649) * skip failing tests * [no-filter] * [no-filter] * fix wording catch in FA2 test * [no-filter] * trigger normal CI without filtering * fix tensors on different devices in `WhisperGenerationMixin` (#32316) * fix * enable on xpu * no manual remove * move to device * remove to * add move to * Add support for GrokAdamW optimizer (#32521) * add grokadamw * reformat * code review feedback, unit test * reformat * reformat * Add Depth Anything V2 Metric models (#32126) * add checkpoint and repo names * adapt head to support metric depth estimation * add max_depth output scaling * add expected logits * improve docs * fix docstring * add checkpoint and repo names * adapt head to support metric depth estimation * add max_depth output scaling * add expected logits * improve docs * fix docstring * rename depth_estimation to depth_estimation_type * add integration test * Refactored tests to include metric depth model inference test * Integration test pass when the timm backbone lines are commented (L220-L227) * address feedback * replace model path to use organization path * formatting * delete deprecated TODO * address feedback * [run_slow] depth_anything * Fix: Fixed directory path for utils folder in `test_tokenization_utils.py` (#32601) * Removed un-necessary expressions. 
* Fixed directory path for utils folder in test_tokenization_utils.py * Modify ProcessorTesterMixin for better generalization (#32637) * Add padding="max_length" to tokenizer kwargs and change crop_size to size for image_processor kwargs * remove crop_size argument in align processor tests to be coherent with base tests * Add pad_token when loading tokenizer if needed, change test override tokenizer kwargs, remove unnecessary test overwrites in grounding dino * TF_Deberta supporting mixed precision (#32618) * Update modeling_tf_deberta.py Corrected some codes which do not support mixed precision * Update modeling_tf_deberta_v2.py Corrected some codes which do not support mixed precision * Update modeling_tf_deberta_v2.py * Update modeling_tf_deberta.py * Add files via upload * Add files via upload * Fix tests recurrent (#32651) * add fix for recurrentgemma * [no-filter] * trigger-ci * [no-filter] * [no-filter] * attempt to fix mysterious zip error * [no-filter] * fix lookup error * [no-filter] * remove summarization hack * [no-filter] * Support MUSA (Moore Threads GPU) backend in transformers (#31913) Add accelerate version check, needs accelerate>=0.33.0 * fix: Fixed failing tests in `tests/utils/test_add_new_model_like.py` (#32678) * Fixed failing tests in tests/utils/test_add_new_model_like.py * Fixed formatting using ruff. * Small nit. * Update translation docs review (#32662) update list of people to tag * Add TorchAOHfQuantizer (#32306) * Add TorchAOHfQuantizer Summary: Enable loading torchao quantized model in huggingface. Test Plan: local test Reviewers: Subscribers: Tasks: Tags: * Fix a few issues * style * Added tests and addressed some comments about dtype conversion * fix torch_dtype warning message * fix tests * style * TorchAOConfig -> TorchAoConfig * enable offload + fix memory with multi-gpu * update torchao version requirement to 0.4.0 * better comments * add torch.compile to torchao README, add perf number link --------- Co-authored-by: Marc Sun <[email protected]> * Fix `JetMoeIntegrationTest` (#32332) JetMoeIntegrationTest Co-authored-by: ydshieh <[email protected]> * Update the distributed CPU training on Kubernetes documentation (#32669) * Update the Kubernetes CPU training example * Add namespace arg Signed-off-by: Dina Suehiro Jones <[email protected]> --------- Signed-off-by: Dina Suehiro Jones <[email protected]> * fix: Fixed unknown pytest config option `doctest_glob` (#32475) Fixed unknown config option doctest_glob. * Unpin deepspeed in Docker image/tests (#32572) Unpin deepspeed * Updated workflows to the latest versions (#32405) Updated few workflows to the latest versions. * reopen: llava-next fails to consider padding_side during Training (#32679) restore #32386 * fix: Corrected ` falcon-mamba-7b` model checkpoint name (#32837) Corrected the model checkpoint. 
* fix: update doc link for runhouse in README.md (#32664) * VLMs: small clean-up for cache class (#32417) * fix beam search in video llava * [run-slow] video_llava * add back the position ids (#32554) * add back the position ids * fix failing test * Use head_dim if in config for RoPE (#32495) * use head_dim if in config for RoPE * typo * simplify with getattr * Generate: unify `LogitsWarper` and `LogitsProcessor` (#32626) * [tests] make test_sdpa_equivalence device-agnostic (#32520) * fix on xpu * [run_all] * Cache: use `batch_size` instead of `max_batch_size` (#32657) * more precise name * better docstrings * Update src/transformers/cache_utils.py Co-authored-by: Arthur <[email protected]> --------- Co-authored-by: Arthur <[email protected]> * Fix AutoConfig and AutoModel support for Llava-Next-Video (#32844) * Fix: fix all model_type of Llava-Next-Video to llava_next_video * Fix doc for llava_next_video * * Fix formatting issues * Change llava-next-video.md file name into llava_next_video.md to make it compatible with implementation * Fix docs TOC for llava-next-video * improve _get_is_as_tensor_fns (#32596) * improve _get_is_as_tensor_fns * format * Revert PR 32299, flag users when Zero-3 was missed (#32851) Revert PR 32299 * fix multi-gpu with static cache (#32543) * Reduce the error log when using core models that need their weights renamed, and provide a step forward (#32656) * Fin * Modify msg * Finish up nits * Make beam_constraints.Constraint.advance() docstring more accurate (#32674) * Fix beam_constraints.Constraint.advance() docstring * Update src/transformers/generation/beam_constraints.py Co-authored-by: Steven Liu <[email protected]> --------- Co-authored-by: Joao Gante <[email protected]> Co-authored-by: Steven Liu <[email protected]> * generate: missing `to` in DoLa body, causing exceptions in multi-gpu generation (#32856) * Add Flax Dinov2 (#31960) * tfmsenv restored in main * installed flax * forward pass done and all tests passed * make fix-copies and cleaning the scripts * fixup attempt 1 * fixup attempt 2 * fixup third attempt * fixup attempt 4 * fixup attempt 5 * dinov2 doc fixed * FlaxDinov2Model + ForImageClassification added to OBJECTS_TO_IGNORE * external pos_encoding layer removed * fixup attempt 6 * fixed integration test values * fixup attempt 7 * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/tran…
What does this PR do?
Translated the `ko-llm_tutorial_optimization.md` file of the documentation to Korean. Thank you in advance for your review. Part of #20179
Before reviewing
Check Inline TOC (e.g. `[[lowercased-header]]`)
Who can review? (Initial)
@010kim, @chhaewxn, @boyunJang, @jeongiin, @harheem
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review? (Final)
@stevhliu Could you please review this PR?