Add Text-to-Speech Implementations & CLI App #57

Merged: 33 commits merged from amaln/tts into main on Aug 7, 2024
Conversation

@hello-amal (Contributor) commented Jul 29, 2024

Description

This PR adds a generic text-to-speech (TTS) abstract class, TextToSpeechEngine, as well as two implementations of that abstract class, one using gTTS (preferred) and the other using pyttsx3 (worse voice quality, but can be used offline). It also adds test cases for each of the engines, using ground-truth saved files. Finally, it adds a command-line interface (CLI) to allow users to easily use text-to-speech (with convenient features like storing history, loading pre-saved utterances, and tab completion).
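
For orientation, the interface such an abstract class might expose looks roughly like the sketch below; the method names and the gTTS-backed example are illustrative assumptions, not necessarily the PR's exact API.

import abc
import tempfile

from gtts import gTTS  # pip install gTTS


class TextToSpeechEngine(abc.ABC):
    """Abstract interface that concrete TTS engines implement."""

    @abc.abstractmethod
    def say(self, text: str) -> None:
        """Speak the text, blocking until it finishes or is stopped."""

    @abc.abstractmethod
    def stop(self) -> None:
        """Interrupt the current utterance, if any."""


class GTTSTextToSpeech(TextToSpeechEngine):
    """gTTS-backed engine: synthesizes speech to an MP3 file (playback omitted here)."""

    def __init__(self, lang: str = "en", tld: str = "com") -> None:
        self.lang = lang  # language code
        self.tld = tld    # accent, e.g. "com" (American) vs. "co.uk" (British)

    def say(self, text: str) -> None:
        tts = gTTS(text=text, lang=self.lang, tld=self.tld)
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
            tts.save(f.name)  # playing the file would require an audio backend, omitted here

    def stop(self) -> None:
        pass  # a real implementation would interrupt playback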

Testing

  • Install it: cd src; pip3 install .
  • Run the tests, verify they all pass: python3 test/audio/test_text_to_speech.py
  • Run the manual test: python3 test/audio/manual_test_text_to_speech.py. Verify the first utterance is in an American accent and completes, and the second is in a British accent and gets interrupted.
  • Run the CLI: python3 -m stretch.app.text_to_speech. Verify the following:
    • Type an utterance, verify the robot says it.
    • Type an utterance, and then type S, verify the robot stops.
    • Press the up arrow key to go up in history, verify it works.
    • Start typing one of the utterances in history, press tab, and verify tab-complete works (if multiple utterances in history match, it should display them); a readline-based sketch of this behavior follows this list.
    • Terminate the CLI. Re-run the CLI command (python3 -m stretch.app.text_to_speech), appending --history_file <path/to/new/file>.txt. Type a few utterances. Quit the CLI by typing Q. Verify the history was saved in the history file.
    • Re-run the CLI with the same argument as above. Verify the history from that file is pre-loaded into the CLI's history.
  • Re-run the OVMM app (Open vocabulary mobile manipulation #73) and verify TTS still works as expected.
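
The history and tab-completion behaviors exercised above map naturally onto Python's standard readline module; below is a minimal sketch of that wiring (illustrative only, not the PR's exact implementation).

import readline


def make_completer(utterances):
    """Complete the current input against previously seen utterances."""
    def complete(text, state):
        matches = [u for u in utterances if u.startswith(text)]
        return matches[state] if state < len(matches) else None
    return complete


def run_cli(history_file="history.txt"):
    try:
        readline.read_history_file(history_file)  # pre-load saved history
    except FileNotFoundError:
        pass
    utterances = [
        readline.get_history_item(i + 1) for i in range(readline.get_current_history_length())
    ]
    readline.set_completer_delims("\n")  # complete whole utterances, not just the last word
    readline.set_completer(make_completer(utterances))
    readline.parse_and_bind("tab: complete")

    while True:
        line = input("> ").strip()
        if line.upper() == "Q":
            break  # quit
        if line.upper() == "S":
            print("(stop the current utterance)")
            continue
        utterances.append(line)
        print(f"(speak: {line})")

    readline.write_history_file(history_file)  # persist history on quit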

Checklist

  • I have performed a self-review of my code
  • If it is a core feature, I have added thorough tests
  • I have added documentation for the changes
  • [N/A] I have updated the README file if necessary
  • I have run on hardware if necessary

Additional context

This is a copy of stretchpy#61, so that stretch_ai also has TTS capabilities. Eventually, we should store this code in only one place.

import logging
from enum import Enum

DEFAULT_LOGGER = logging.getLogger(__name__)


class TextToSpeechEngineType(Enum):
    ...  # enum members elided in this excerpt

Collaborator:

I would usually prefer enums in a separate file, and I'd split these two as well; that makes it easier to make them optional dependencies.

Contributor (Author):

I removed this enum and put pyttsx3 and gtts in separate files. I don't agree that enums should be in separate files as a rule; for example, in text_to_speech/executor.py I feel it's appropriate to keep the enum TextToSpeechOverrideBehavior in the same file as TextToSpeechExecutor. LMK if you feel otherwise (for that specific case).
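
(For illustration, the colocation being defended here looks roughly like the following; the member names are assumptions, not the PR's actual values.)

from enum import Enum


class TextToSpeechOverrideBehavior(Enum):
    QUEUE = 1      # wait for any in-progress utterance to finish
    INTERRUPT = 2  # cut off the in-progress utterance


class TextToSpeechExecutor:
    def say(self, text: str, override: TextToSpeechOverrideBehavior) -> None:
        if override == TextToSpeechOverrideBehavior.INTERRUPT:
            ...  # stop any in-progress utterance first
        ...  # then speak `text`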


# Adapted from https://github.com/markstent/audio-similarity/blob/main/audio_similarity/audio_similarity.py
# Note that that script has other audio similarity metrics as well
def spectral_contrast_similarity(ground_truth_filepath, comparison_filepath, sample_rate=16000):

Collaborator:

I could see this being really useful for, e.g., wake words. Would it make sense in audio/utils or utils/audio or something?

Collaborator:

Maybe that would not be using filepaths, though.

Contributor (Author):

Done. Filepaths were the most convenient way to load into librosa, and IMO we can generalize it if/when we find another use for the function.
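
(For readers outside the diff: a sketch of one way such a spectral-contrast check can be written with librosa; this is illustrative, not necessarily the PR's exact implementation.)

import librosa
import numpy as np


def spectral_contrast_similarity(ground_truth_filepath, comparison_filepath, sample_rate=16000):
    """Return a similarity score in roughly [0, 1]; higher means more similar."""
    ground_truth, _ = librosa.load(ground_truth_filepath, sr=sample_rate)
    comparison, _ = librosa.load(comparison_filepath, sr=sample_rate)

    # Spectral contrast features have shape (n_bands + 1, n_frames).
    gt_contrast = librosa.feature.spectral_contrast(y=ground_truth, sr=sample_rate)
    cmp_contrast = librosa.feature.spectral_contrast(y=comparison, sr=sample_rate)

    # Truncate to the shorter clip so the frame counts match.
    n_frames = min(gt_contrast.shape[1], cmp_contrast.shape[1])
    gt_contrast = gt_contrast[:, :n_frames]
    cmp_contrast = cmp_contrast[:, :n_frames]

    # Turn the mean absolute difference into a similarity score.
    diff = np.mean(np.abs(gt_contrast - cmp_contrast))
    scale = max(np.max(np.abs(gt_contrast)), np.max(np.abs(cmp_contrast)), 1e-8)
    return 1.0 - diff / scale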

@hello-atharva (Collaborator) left a comment:

Should we add an application for TTS inside stretch.apps? Currently there is no main entry point inside <path/to/stretch_ai>/src/stretch/audio/text_to_speech_cli.py

@hello-cpaxton (Collaborator):

Can you add the mp3 files to .gitattributes and make sure they are under git-lfs, @hello-amal?

@hello-cpaxton (Collaborator):

We want to make sure large files are never added to git history!

@hello-amal (Contributor, Author):

@hello-cpaxton @hello-atharva Done with all suggested changes from this PR and stretchpy#61, except for:

  • @hello-cpaxton suggested putting tests/audio/manual_test_text_to_speech.py into examples/demos. I agree that an "example" better describes what that file serves as, but the examples folder seems to be gone in stretch_ai. Any suggested location to put it?

I did put the mp3 files in LFS, but FWIW the largest of them was just 60 kB, which is far lower than the default 500 kB limit in the check-added-large-files pre-commit hook. Anyway, for good measure I went ahead and added that hook.
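
For reference, tracking MP3s with Git LFS (e.g. via git lfs track "*.mp3") adds a line like this to .gitattributes:

*.mp3 filter=lfs diff=lfs merge=lfs -text

and the check-added-large-files hook comes from the standard pre-commit/pre-commit-hooks repo, e.g. in .pre-commit-config.yaml (the revision pin below is an example, not necessarily what this repo uses):

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-added-large-files  # fails on files larger than 500 kB by default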

@hello-amal (Contributor, Author) commented Aug 6, 2024:

@hello-cpaxton could you add espeak to the Docker image? It is required for pyttsx3, and is therefore needed for the automated tests; hence, CI is failing.
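
(For context, a typical Dockerfile line for this is shown below; the exact packages in the repo's image may differ.)

RUN apt-get update && apt-get install -y --no-install-recommends espeak && rm -rf /var/lib/apt/lists/*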

"openai-whisper",
"overrides", # better inheritance of docstrings

Collaborator:

This one may have caused issues during installation; did you try it out?

Contributor (Author):

Yup, I uninstalled stretchpy and overrides, and then re-installed from src and it worked on my machine.

@hello-cpaxton (Collaborator) left a comment:

LGTM pending minor comments

@hello-amal changed the title from "Added Text-to-Speech" to "Add Text-to-Speech Implementations & CLI App" on Aug 7, 2024.
@hello-atharva (Collaborator) left a comment:

LGTM

@hello-amal merged commit ac4323c into main on Aug 7, 2024 (1 check passed).
@hello-amal deleted the amaln/tts branch on August 7, 2024 at 14:42.
peiqi-liu pushed a commit to peiqi-liu/stretch_ai that referenced this pull request on Sep 25, 2024:
* Added TTS

* Update setup.py

* Install `libasound2-dev` in workflow

* Github workflows require `sudo apt-get` installs

* Add portaudio to the `apt-get` installs

* Fixes from pre-commits

* Add espeak to github actions installation

* Remove mp3s

* Configure LFS to track MP3s

* Added a check for large files in the pre-commit

* Changes from PR review

* Update github actions dep to fake audio capabilities

* Update the apt install

* updates to docker

* workflow updates

* Add espeak to README audio deps

* Add ffmpeg

* [WIP] list audioread backends in github actions

* Refactor available formats

* Implemented GoogleCloudTTS

* [WIP] list audioread backends in github actions

* [WIP] add verbose logs to failing test case

* Remove GoogleCloudTTS on GithubActions

* [WIP] verify the named temp file has size > 0

* [WIP] check if FFMPeg gets a decoder error on the mp3s

* Pull LFS files in Github Action

* Add Git LFS to the action workflow

* Mark the git directory as safe before pulling LFS files

* Move git-lfs from action workflow to docker file

* Re-trigger github actions

---------

Co-authored-by: Amal Nanavati <[email protected]>
Co-authored-by: Chris Paxton <[email protected]>
Co-authored-by: Chris Paxton <[email protected]>