
Whisper pipeline: add perf metrics #971

Merged

Conversation

as-suvorov
Contributor

@as-suvorov as-suvorov commented Oct 15, 2024

This PR adds:

  • performance metrics support
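The metrics themselves are computed inside the pipeline; as a rough sketch of what such performance metrics typically capture (the function and field names below are illustrative, not the actual ov::genai API), given per-token generation timestamps:

```python
def compute_perf_metrics(start_time, token_times):
    """Derive common generation metrics from per-token timestamps.

    start_time: wall-clock time when generation started.
    token_times: wall-clock time each output token became available.
    Names here are illustrative, not the actual ov::genai API.
    """
    ttft = token_times[0] - start_time  # time to first token
    if len(token_times) > 1:
        # average time per output token, excluding the first
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    total = token_times[-1] - start_time
    throughput = len(token_times) / total  # tokens per second
    return {"ttft": ttft, "tpot": tpot, "throughput": throughput}
```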

Common Todos for Whisper support:

  • Long-form audio support with parallel chunking.
  • update documentation
  • add C++ and Python sample tests
  • support timestamp streaming
  • expose only meaningful parameters in GenerationConfig (task, language, return_timestamps, etc.)
  • Move all whisper pipeline files to dedicated subfolder
  • The Whisper pipeline doesn't need a tokenizer; it only uses a detokenizer. Implement detokenizer-only initialization for ov::genai::Tokenizer
  • Check discrete GPU. Integrated GPU works as expected.
  • Investigate use of RemoteTensor for GPU
  • Add batch
  • Add sampler, inherit WhisperGenerationConfig from GenerationConfig
  • Investigate language autodetection with a single decoder call (without past)
  • Update python bindings cmake to include whole directory instead of explicit list of files
  • Add samples with audio preparation examples
  • Add links to audio files so users can download them in samples
  • Move supported models list from samples README to common supported models section
  • Avoid building GenAI in each test job, as it takes a lot of time
  • Double check FP32 support
  • Fix sporadic test failures. Sometimes the whisper model cannot be downloaded from HF due to network issues
  • Fix stop criteria. The current approach stops on eos_token, which is the no-speech token, but there could be further speech tokens that are now wrongly skipped
  • Fix distil-whisper accuracy to match HF
  • Fix en model accuracy with timestamps to match HF
  • Try trimming the input_ids cache between chunks for long-form audio to match HF

Completed:

  • support different languages, language autodetection
  • support translation
  • support timestamps
  • Long-form audio support with sequential chunking.
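Sequential chunking splits long-form audio into consecutive fixed-length windows, since Whisper models consume at most 30 seconds of audio per inference (parallel chunking is still on the todo list above). A minimal illustrative helper, not the pipeline's actual implementation:

```python
import numpy as np

def sequential_chunks(speech, sample_rate=16000, chunk_seconds=30):
    """Split a 1-D audio array into consecutive windows of at most
    chunk_seconds each; the last chunk may be shorter.

    Illustrative sketch of sequential chunking, not the actual
    ov::genai pipeline code.
    """
    chunk_len = sample_rate * chunk_seconds
    return [speech[i:i + chunk_len] for i in range(0, len(speech), chunk_len)]
```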

Current limitations:

  • No resampling during preprocessing. Input raw speech must have a 16 kHz sampling rate
  • No normalization during preprocessing. Input raw speech must be normalized to approximately the [-1, 1] range
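Because the pipeline performs neither step, callers must resample and normalize before invoking it. A minimal sketch of such preprocessing; the linear-interpolation resampling here is a crude stand-in for a proper resampler (e.g. librosa or scipy), and the function name is hypothetical:

```python
import numpy as np

def prepare_speech(speech, orig_rate, target_rate=16000):
    """Resample raw speech to 16 kHz and peak-normalize it to [-1, 1].

    Hypothetical helper: linear interpolation is a crude substitute
    for a real polyphase resampler, used here only to illustrate the
    preprocessing the pipeline expects the caller to do.
    """
    speech = np.asarray(speech, dtype=np.float32)
    if orig_rate != target_rate:
        duration = len(speech) / orig_rate
        n_out = int(duration * target_rate)
        old_t = np.arange(len(speech)) / orig_rate
        new_t = np.arange(n_out) / target_rate
        speech = np.interp(new_t, old_t, speech).astype(np.float32)
    peak = np.abs(speech).max()
    if peak > 0:
        speech = speech / peak  # scale so the loudest sample is +/-1
    return speech
```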

Tickets: CVS-147994, CVS-146010, CVS-152523

@ilya-lavrenov ilya-lavrenov added this to the 2024.5 milestone Oct 15, 2024
@as-suvorov as-suvorov marked this pull request as ready for review October 15, 2024 08:48
@ilya-lavrenov ilya-lavrenov added this pull request to the merge queue Oct 15, 2024
Merged via the queue into openvinotoolkit:master with commit a907b5f Oct 15, 2024
48 checks passed