feat: SpeechToDocument #2676

Closed
wants to merge 29 commits into from

ZanSara
Contributor

@ZanSara ZanSara commented Jun 17, 2022

Proposed changes:

  • Introduces SpeechToDocument, a node that takes audio files as input and outputs a Document.
    • What works:
      • Can deal with arbitrarily long audio files by chunking them internally into fragments
      • Transcription runs about 2x faster than the audio's playback speed, so in theory it could be used with streaming inputs (in practice this is very hard right now because of how the rest of Haystack works 😄).
      • The audio is aligned with its transcription at the word level using aeneas ("traditional" forced alignment method, so very fast)
    • Still to do, probably in the next PR:
      • Support input files other than .wav 😅 (easy)
      • Denoising the input audio (easy to remove most of the noise, very hard to remove all of it)
      • Adding punctuation, or testing models that also predict punctuation (to investigate)
      • Fragmenting the input on voice pauses (it is currently split into chunks of arbitrary length) (to investigate)
      • Silence detection (should be relatively easy)
      • Run a spellchecker on the output to improve the transcription quality (should be relatively easy)
  • Modifies AnswerToSpeech to check if the source document contains alignment data: if it does, the new AnswerToSpeech will extract the audio from the original source instead of generating it.
  • Introduces a new primitive, AudioAlignment, that contains alignment data
  • Modifies the SpeechDocument/SpeechAnswer primitives to accommodate optional alignment data
  • Modifies Span to support the in operator. Now 10 in Span(5, 15) evaluates to True 😁
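
To make the last three points concrete, here is a minimal sketch of how word-level alignment data and a containment check on Span could be combined to find the slice of the source audio that matches an answer. It is not the code from this PR: the half-open interval convention, the WordAlignment record, and the audio_range_for_answer helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    """Character range. Treating it as half-open [start, end) is an assumption."""
    start: int
    end: int

    def __contains__(self, value: int) -> bool:
        # Python calls this for `value in span`.
        return self.start <= value < self.end


@dataclass
class WordAlignment:
    """Hypothetical record: one transcribed word and its position in the audio."""
    chars: Span     # character offsets of the word in the transcript
    start_ms: int   # where the word starts in the audio
    end_ms: int     # where the word ends in the audio


def audio_range_for_answer(
    answer: Span, words: List[WordAlignment]
) -> Optional[Tuple[int, int]]:
    """Return the (start_ms, end_ms) range covering all words that overlap the answer."""
    # Two non-empty half-open ranges overlap when either one contains the other's start.
    hits = [w for w in words if w.chars.start in answer or answer.start in w.chars]
    if not hits:
        return None
    return min(w.start_ms for w in hits), max(w.end_ms for w in hits)


# Transcript "hello big world" with made-up timings.
words = [
    WordAlignment(Span(0, 5), 0, 400),       # "hello"
    WordAlignment(Span(6, 9), 450, 700),     # "big"
    WordAlignment(Span(10, 15), 750, 1200),  # "world"
]
print(10 in Span(5, 15))                            # True
print(audio_range_for_answer(Span(6, 15), words))   # (450, 1200)
```

With a range like this in hand, a node could cut the matching segment out of the original recording instead of synthesizing new speech, which is the behavior described for AnswerToSpeech above.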

To Dos:

  • Tutorial
  • Tests

@ZanSara ZanSara added type:feature New feature or request topic:audio labels Jun 17, 2022
@ZanSara ZanSara requested review from masci and julian-risch June 17, 2022 13:46
@ZanSara ZanSara removed request for masci and julian-risch June 20, 2022 09:34
@ZanSara ZanSara mentioned this pull request Jul 21, 2022
8 tasks
@ZanSara ZanSara mentioned this pull request Jul 21, 2022
@ZanSara ZanSara changed the title SpeechToDocument feat: SpeechToDocument Aug 11, 2022
@ZanSara
Contributor Author

ZanSara commented Oct 21, 2022

These nodes will be added to Haystack as external nodes. Closing.

@ZanSara ZanSara closed this Oct 21, 2022
@masci masci deleted the speech2text branch September 13, 2023 08:55
2 participants