Merge branch 'master' into feature/nodejs-bindings
vishniakov-nikolai authored Dec 23, 2024
2 parents 13c0558 + 2fb56d4 commit 432e3de
Showing 57 changed files with 1,369 additions and 850 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/llm_bench-python.yml
@@ -151,7 +151,7 @@ jobs:
rm -rf ./ov_models/internvl2-1B
- name: WWB Tests
run: |
pip install git+https://github.com/huggingface/optimum-intel.git
pip install git+https://github.com/huggingface/optimum-intel.git@420fa87d039425a906b7f755e4562b65947f016a
GIT_CLONE_PROTECTION_ACTIVE=false PIP_PRE=1 PIP_EXTRA_INDEX_URL=https://storage.openvinotoolkit.org/simple/wheels/nightly pip install ${{ env.WWB_PATH }}
python -m pytest -v ${{ env.WWB_PATH }}/tests
stateful:
@@ -190,7 +190,7 @@ jobs:
- name: WWB Tests
run: |
pip install pytest
pip install git+https://github.com/huggingface/optimum-intel.git
pip install git+https://github.com/huggingface/optimum-intel.git@420fa87d039425a906b7f755e4562b65947f016a
GIT_CLONE_PROTECTION_ACTIVE=false PIP_PRE=1 PIP_EXTRA_INDEX_URL=https://storage.openvinotoolkit.org/simple/wheels/nightly pip install ${{ env.WWB_PATH }}
python -m pytest -v ${{ env.WWB_PATH }}/tests
6 changes: 5 additions & 1 deletion README.md
@@ -331,10 +331,14 @@ For more examples check out our [Generative AI workflow](https://docs.openvino.a
NOTE: Whisper Pipeline requires preprocessing of audio input (to adjust sampling rate and normalize)
### Converting and compressing image generation model from Hugging Face library
### Converting and quantizing speech-to-text model from Hugging Face library
```sh
# Download the whisper-base model and convert it to OpenVINO format
optimum-cli export openvino --trust-remote-code --model openai/whisper-base whisper-base
# Download, convert, and apply int8 static quantization to the whisper-base model
optimum-cli export openvino --trust-remote-code --model openai/whisper-base \
--quant-mode int8 --dataset librispeech --num-samples 32 whisper-base-int8
```

### Run generation using Whisper Pipeline API in Python
85 changes: 85 additions & 0 deletions samples/cpp/whisper_speech_recognition/README.md
@@ -33,6 +33,91 @@ timestamps: [0, 2] text: How are you doing today?

See [SUPPORTED_MODELS.md](../../../src/docs/SUPPORTED_MODELS.md#whisper-models) for the list of supported models.

# Whisper pipeline usage

```c++
#include "openvino/genai/whisper_pipeline.hpp"

ov::genai::WhisperPipeline pipeline(model_dir, "CPU");
// The pipeline expects normalized audio with a sample rate of 16 kHz
ov::genai::RawSpeechInput raw_speech = read_wav("how_are_you_doing_today.wav");
auto result = pipeline.generate(raw_speech);
// How are you doing today?
```
### Transcription
Whisper pipeline predicts the language of the source audio automatically.
```c++
ov::genai::RawSpeechInput raw_speech = read_wav("how_are_you_doing_today.wav");
auto result = pipeline.generate(raw_speech);
// How are you doing today?
raw_speech = read_wav("fr_sample.wav");
result = pipeline.generate(raw_speech);
// Il s'agit d'une entité très complexe qui consiste...
```

If the source audio language is known in advance, it can be specified as an argument to the `generate` method:

```c++
ov::genai::RawSpeechInput raw_speech = read_wav("how_are_you_doing_today.wav");
auto result = pipeline.generate(raw_speech, ov::genai::language("<|en|>"));
// How are you doing today?

raw_speech = read_wav("fr_sample.wav");
result = pipeline.generate(raw_speech, ov::genai::language("<|fr|>"));
// Il s'agit d'une entité très complexe qui consiste...
```

### Translation

By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":

```c++
ov::genai::RawSpeechInput raw_speech = read_wav("fr_sample.wav");
auto result = pipeline.generate(raw_speech, ov::genai::task("translate"));
// It is a very complex entity that consists...
```

### Timestamps prediction

The model can predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument:

```c++
ov::genai::RawSpeechInput raw_speech = read_wav("how_are_you_doing_today.wav");
auto result = pipeline.generate(raw_speech, ov::genai::return_timestamps(true));

std::cout << std::setprecision(2);
for (auto& chunk : *result.chunks) {
    std::cout << "timestamps: [" << chunk.start_ts << ", " << chunk.end_ts << "] text: " << chunk.text << "\n";
}
// timestamps: [0, 2] text: How are you doing today?
```

### Long-Form audio Transcription

The Whisper model is designed to work on audio samples of up to 30 seconds in duration. To transcribe audio of arbitrary length, the Whisper pipeline uses a sequential chunking algorithm.
The sequential chunking algorithm uses a "sliding window", transcribing 30-second slices one after the other.

### Initial prompt and hotwords

The `generate` method of the Whisper pipeline accepts `initial_prompt` and `hotwords` arguments:
* `initial_prompt`: initial prompt tokens passed as a previous transcription (after the `<|startofprev|>` token) to the first processing window
* `hotwords`: hotwords tokens passed as a previous transcription (after the `<|startofprev|>` token) to all processing windows

The Whisper model can use that context to better understand the speech and maintain a consistent writing style. However, prompts do not need to be genuine transcripts from prior audio segments. Such prompts can be used to steer the model to use particular spellings or styles:

```c++
auto result = pipeline.generate(raw_speech);
// He has gone and gone for good answered Paul Icrom who...

result = pipeline.generate(raw_speech, ov::genai::initial_prompt("Polychrome"));
// He has gone and gone for good answered Polychrome who...
```


### Troubleshooting

#### Empty or rubbish output
@@ -28,6 +28,7 @@ int main(int argc, char* argv[]) try {

std::cout << result << "\n";

std::cout << std::setprecision(2);
for (auto& chunk : *result.chunks) {
    std::cout << "timestamps: [" << chunk.start_ts << ", " << chunk.end_ts << "] text: " << chunk.text << "\n";
}
2 changes: 1 addition & 1 deletion samples/export-requirements.txt
@@ -2,7 +2,7 @@
--extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release
--extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
openvino-tokenizers~=2025.0.0.0.dev
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@420fa87d039425a906b7f755e4562b65947f016a
numpy<2.0.0; sys_platform == 'darwin'
einops==0.8.0 # For Qwen
transformers_stream_generator==0.0.5 # For Qwen
87 changes: 87 additions & 0 deletions samples/python/whisper_speech_recognition/README.md
@@ -40,6 +40,93 @@ timestamps: [0, 2] text: How are you doing today?

See [SUPPORTED_MODELS.md](../../../src/docs/SUPPORTED_MODELS.md#whisper-models) for the list of supported models.

# Whisper pipeline usage

```python
import openvino_genai
import librosa

def read_wav(filepath):
    raw_speech, samplerate = librosa.load(filepath, sr=16000)
    return raw_speech.tolist()

pipe = openvino_genai.WhisperPipeline(model_dir, "CPU")
# The pipeline expects normalized audio with a sample rate of 16 kHz
raw_speech = read_wav('how_are_you_doing_today.wav')
result = pipe.generate(raw_speech)
# How are you doing today?
```
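
The `read_wav` helper above depends on `librosa`. Where that dependency is not wanted, a standard-library reader is one possible alternative. The sketch below is only an illustration: the name `read_wav_pcm16` is hypothetical, it assumes the input is already mono 16-bit PCM at 16 kHz (it does not resample), and it assumes that "normalized" means float samples scaled into the range [-1.0, 1.0).

```python
import struct
import wave

TARGET_SAMPLE_RATE = 16000  # sample rate the README says the pipeline expects


def read_wav_pcm16(filepath):
    """Read a mono 16-bit PCM WAV file and scale samples to floats in [-1.0, 1.0).

    Hypothetical helper: it rejects anything that is not already
    mono, 16-bit, 16 kHz instead of converting it.
    """
    with wave.open(filepath, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        if wav.getframerate() != TARGET_SAMPLE_RATE:
            raise ValueError("expected 16 kHz audio; resample it first")
        frames = wav.readframes(wav.getnframes())

    # WAV PCM data is little-endian signed 16-bit integers.
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    # Assumption: dividing by 32768 is the normalization the pipeline needs.
    return [sample / 32768.0 for sample in samples]
```

The returned list of floats has the same shape as the output of `read_wav` above, so it could be passed to `pipe.generate` in the same way.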

### Transcription

Whisper pipeline predicts the language of the source audio automatically.

```python
raw_speech = read_wav('how_are_you_doing_today.wav')
result = pipe.generate(raw_speech)
# How are you doing today?

raw_speech = read_wav('fr_sample.wav')
result = pipe.generate(raw_speech)
# Il s'agit d'une entité très complexe qui consiste...
```

If the source audio language is known in advance, it can be specified as an argument to the `generate` method:

```python
raw_speech = read_wav("how_are_you_doing_today.wav")
result = pipe.generate(raw_speech, language="<|en|>")
# How are you doing today?

raw_speech = read_wav("fr_sample.wav")
result = pipe.generate(raw_speech, language="<|fr|>")
# Il s'agit d'une entité très complexe qui consiste...
```

### Translation

By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":

```python
raw_speech = read_wav("fr_sample.wav")
result = pipe.generate(raw_speech, task="translate")
# It is a very complex entity that consists...
```

### Timestamps prediction

The model can predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument:

```python
raw_speech = read_wav("how_are_you_doing_today.wav")
result = pipe.generate(raw_speech, return_timestamps=True)

for chunk in result.chunks:
    print(f"timestamps: [{chunk.start_ts:.2f}, {chunk.end_ts:.2f}] text: {chunk.text}")
# timestamps: [0.00, 2.00] text: How are you doing today?
```
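
Sentence-level timestamps can be turned into subtitles. The following is a minimal sketch, not part of the pipeline API: `Chunk` is a stand-in dataclass carrying only the three fields used above (`start_ts`, `end_ts`, `text`), and `to_srt_time` and `chunks_to_srt` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Stand-in for the items of result.chunks (assumption: only these fields are needed).
    start_ts: float
    end_ts: float
    text: str


def to_srt_time(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (SubRip style)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks):
    """Render chunks as numbered SubRip blocks separated by blank lines."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        time_range = f"{to_srt_time(chunk.start_ts)} --> {to_srt_time(chunk.end_ts)}"
        blocks.append(f"{index}\n{time_range}\n{chunk.text.strip()}\n")
    return "\n".join(blocks)


print(chunks_to_srt([Chunk(0.0, 2.0, " How are you doing today?")]))
```

With real output, `result.chunks` would take the place of the hand-built list, provided its items expose the same three attributes.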

### Long-Form audio Transcription

The Whisper model is designed to work on audio samples of up to 30 seconds in duration. To transcribe audio of arbitrary length, the Whisper pipeline uses a sequential chunking algorithm.
The sequential chunking algorithm uses a "sliding window", transcribing 30-second slices one after the other.
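
The sliding-window idea can be sketched in a few lines. This is a simplified illustration, not the pipeline's implementation: it cuts the audio at a fixed 30-second stride, whereas the real algorithm may choose where the next window starts differently. `split_into_windows` is a hypothetical helper name.

```python
SAMPLE_RATE = 16000   # samples per second expected by the pipeline
WINDOW_SECONDS = 30   # longest audio span the model handles at once


def split_into_windows(raw_speech, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Yield consecutive slices of raw_speech, each at most window_seconds long."""
    window = sample_rate * window_seconds
    for start in range(0, len(raw_speech), window):
        yield raw_speech[start:start + window]


audio = [0.0] * (SAMPLE_RATE * 70)  # 70 seconds of silence as dummy input
lengths = [len(w) / SAMPLE_RATE for w in split_into_windows(audio)]
print(lengths)  # [30.0, 30.0, 10.0]
```

No manual splitting is needed when calling `pipe.generate`; the sketch only shows why a 70-second recording results in three processing windows.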

### Initial prompt and hotwords

The `generate` method of the Whisper pipeline accepts `initial_prompt` and `hotwords` arguments:
* `initial_prompt`: initial prompt tokens passed as a previous transcription (after the `<|startofprev|>` token) to the first processing window
* `hotwords`: hotwords tokens passed as a previous transcription (after the `<|startofprev|>` token) to all processing windows

The Whisper model can use that context to better understand the speech and maintain a consistent writing style. However, prompts do not need to be genuine transcripts from prior audio segments. Such prompts can be used to steer the model to use particular spellings or styles:

```python
result = pipe.generate(raw_speech)
# He has gone and gone for good answered Paul Icrom who...

result = pipe.generate(raw_speech, initial_prompt="Polychrome")
# He has gone and gone for good answered Polychrome who...
```
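
The difference between the two arguments is which processing windows receive the extra context. The sketch below illustrates that rule at the string level only. It is an assumption-laden model, not the pipeline's code: the real pipeline works with token ids, and the helper name `build_previous_context`, the order in which the two texts are combined, and the space used to join them are all hypothetical.

```python
START_OF_PREV = "<|startofprev|>"


def build_previous_context(window_index, initial_prompt=None, hotwords=None):
    """Return the 'previous transcription' text prepended for a given window."""
    parts = []
    if initial_prompt and window_index == 0:
        parts.append(initial_prompt)  # only the first window sees the initial prompt
    if hotwords:
        parts.append(hotwords)        # every window sees the hotwords
    if not parts:
        return ""                     # nothing to prepend for this window
    return START_OF_PREV + " ".join(parts)


for index in range(3):
    print(index, repr(build_previous_context(index, initial_prompt="Polychrome", hotwords="Oz")))
```

Under these assumptions, `initial_prompt` only influences the first window, so a spelling hint that must hold across a long recording is a better fit for `hotwords`.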

### Troubleshooting

#### Empty or rubbish output
@@ -18,7 +18,7 @@ def main():
parser.add_argument("wav_file_path")
args = parser.parse_args()

device = "CPU" # GPU can be used as well
device = "CPU" # GPU, NPU can be used as well
pipe = openvino_genai.WhisperPipeline(args.model_dir, device)

config = pipe.get_generation_config()
@@ -34,8 +34,9 @@ def main():

print(result)

for chunk in result.chunks:
    print(f"timestamps: [{chunk.start_ts}, {chunk.end_ts}] text: {chunk.text}")
if result.chunks:
    for chunk in result.chunks:
        print(f"timestamps: [{chunk.start_ts:.2f}, {chunk.end_ts:.2f}] text: {chunk.text}")


if "__main__" == __name__:
@@ -19,7 +19,8 @@ class OPENVINO_GENAI_EXPORTS Scheduler {
DDIM,
EULER_DISCRETE,
FLOW_MATCH_EULER_DISCRETE,
PNDM
PNDM,
EULER_ANCESTRAL_DISCRETE
};

static std::shared_ptr<Scheduler> from_config(const std::filesystem::path& scheduler_config_path,
34 changes: 33 additions & 1 deletion src/cpp/include/openvino/genai/whisper_generation_config.hpp
@@ -3,8 +3,8 @@

#pragma once

#include <optional>
#include <filesystem>
#include <optional>

#include "openvino/genai/tokenizer.hpp"
#include "openvino/runtime/compiled_model.hpp"
@@ -46,6 +46,9 @@ class OPENVINO_GENAI_EXPORTS WhisperGenerationConfig {
// Transcribe token id.
int64_t transcribe_token_id = 50359;

// Corresponds to the "<|startofprev|>" token.
int64_t prev_sot_token_id = 50361;

// No timestamps token id.
int64_t no_timestamps_token_id = 50363;

@@ -75,6 +78,32 @@ class OPENVINO_GENAI_EXPORTS WhisperGenerationConfig {
// Note that a segment of text refers to a sequence of one or more words, rather than individual words.
bool return_timestamps = false;

/*
* Initial prompt tokens passed as a previous transcription (after `<|startofprev|>` token) to the first processing
* window. Can be used to steer the model to use particular spellings or styles.
*
* Example:
* auto result = pipeline.generate(raw_speech);
* // He has gone and gone for good answered Paul Icrom who...
*
* auto result = pipeline.generate(raw_speech, ov::genai::initial_prompt("Polychrome"));
* // He has gone and gone for good answered Polychrome who...
*/
std::optional<std::string> initial_prompt = std::nullopt;

/*
* Hotwords tokens passed as a previous transcription (after `<|startofprev|>` token) to all processing windows.
* Can be used to steer the model to use particular spellings or styles.
*
* Example:
* auto result = pipeline.generate(raw_speech);
* // He has gone and gone for good answered Paul Icrom who...
*
* auto result = pipeline.generate(raw_speech, ov::genai::hotwords("Polychrome"));
* // He has gone and gone for good answered Polychrome who...
*/
std::optional<std::string> hotwords = std::nullopt;

// A list containing tokens that will be suppressed at the beginning of the sampling process.
std::vector<int64_t> begin_suppress_tokens;

@@ -111,9 +140,12 @@ static constexpr ov::Property<int64_t> pad_token_id{"pad_token_id"};
static constexpr ov::Property<int64_t> transcribe_token_id{"transcribe_token_id"};
static constexpr ov::Property<int64_t> translate_token_id{"translate_token_id"};
static constexpr ov::Property<int64_t> no_timestamps_token_id{"no_timestamps_token_id"};
static constexpr ov::Property<int64_t> prev_sot_token_id{"prev_sot_token_id"};
static constexpr ov::Property<std::string> language{"language"};
static constexpr ov::Property<std::string> task{"task"};
static constexpr ov::Property<bool> return_timestamps{"return_timestamps"};
static constexpr ov::Property<std::string> initial_prompt{"initial_prompt"};
static constexpr ov::Property<std::string> hotwords{"hotwords"};
static constexpr ov::Property<std::map<std::string, int64_t>> lang_to_id{"lang_to_id"};

} // namespace genai