
feat: Add voice APIs to communicate with stt and tts through the wyoming protocol #4637

Draft
wants to merge 4 commits into base: mealie-next

Conversation


@rosebeats commented Dec 1, 2024

What type of PR is this?

  • feature

What this PR does / why we need it:

This pull request adds 2 new API endpoints:

  • /voice/tts - for converting text to speech
  • /voice/stt - for converting speech to text

Both APIs call out to a configured server using the Wyoming protocol, so they are not tied to any specific speech-to-text or text-to-speech model.
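For context, Wyoming frames each message as a JSON header line, optionally followed by binary payload bytes (e.g. audio) whose length the header declares. The snippet below is a rough, hypothetical illustration of that framing idea, not the exact wire format used by the wyoming package:

```python
import io
import json


def encode_event(event_type: str, data: dict, payload: bytes = b"") -> bytes:
    """Frame an event: one JSON header line, then raw payload bytes."""
    header = {"type": event_type, "data": data, "payload_length": len(payload)}
    return json.dumps(header).encode("utf-8") + b"\n" + payload


def decode_event(stream):
    """Read one framed event back from a binary stream."""
    header = json.loads(stream.readline())
    payload = stream.read(header["payload_length"])
    return header["type"], header["data"], payload


# Round-trip a fake "audio-chunk" event carrying a few PCM bytes.
frame = encode_event("audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, b"\x00\x01" * 4)
etype, data, payload = decode_event(io.BytesIO(frame))
```

Because the header carries the payload length, a reader can interleave small JSON control events with large binary audio frames on one stream.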

I'm adding this API in order to support the use of a chatbot within Mealie. In particular, I'm aiming to support a "Cook Along" service that lets users ask a chatbot questions about a recipe as they're cooking, so they don't need to search around for what they're looking for. Speech-to-text and text-to-speech are important for implementing such a feature.

Changes in detail

New library

I'm adding the wyoming protocol library for communicating with TTS and STT services.

Dev container

I've added the docker run argument --add-host=host.docker.internal:host-gateway to the dev container so that Mealie can communicate with TCP services on the host (specifically the Wyoming protocol servers).
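As an illustration (the image name and Wyoming port below are hypothetical), this flag makes host-side services resolvable from inside the container:

```shell
# Map host.docker.internal to the host's gateway IP inside the container.
docker run --add-host=host.docker.internal:host-gateway mealie

# A Wyoming server listening on the host (port shown is illustrative)
# is then reachable from inside the container at:
#   tcp://host.docker.internal:10300
```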

Services

Adds TTSService and STTService for text-to-speech and speech-to-text. Initialization sets up the connection to the configured Wyoming protocol server. TTSService exposes a synthesize method for generating speech; STTService exposes a transcribe method for extracting text from audio.
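A minimal sketch of what such a service shape might look like. The class and helper names here are hypothetical and stdlib-only; the actual connection logic lives in the wyoming package and may differ from this:

```python
from urllib.parse import urlparse


def parse_service_uri(uri: str):
    """Validate and split a service URI of the kinds the settings accept."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("tcp", "unix", "stdio"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    return parsed.scheme, parsed.hostname or parsed.path, parsed.port


class STTService:
    """Sketch: validate the URI on init, expose an async transcribe() method."""

    def __init__(self, uri: str, model: str, language: str):
        self.scheme, self.host, self.port = parse_service_uri(uri)
        self.model = model
        self.language = language
        # Real code would open the Wyoming connection here.

    async def transcribe(self, audio: bytes) -> str:
        # Real code streams audio chunks to the server and awaits the
        # transcript event; elided in this sketch.
        raise NotImplementedError


svc = STTService("tcp://host.docker.internal:10300", "tiny-int8", "en")
```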

New Routes

Adds 2 new routes

  • /voice/tts - takes a "text" query parameter and returns a WAV file of the text converted to speech
  • /voice/stt - takes an audio file as the "audio" field of a multipart form and returns an object containing the text extracted from the file
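Hypothetical calls against these routes once a Wyoming backend is configured (the host and port are illustrative; the parameter names come from the description above):

```shell
# Text to speech: the "text" query parameter comes back as a WAV file.
curl -G "http://localhost:9000/api/voice/tts" \
  --data-urlencode "text=Preheat the oven" \
  -o speech.wav

# Speech to text: the audio file goes in the "audio" multipart field.
curl -F "audio=@question.wav" "http://localhost:9000/api/voice/stt"
```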

New Settings

  • SPEECH_TO_TEXT_URI: The URI to connect to the speech to text service. Accepts tcp://, unix://, or stdio://.
  • SPEECH_TO_TEXT_MODEL: The name of the speech to text model to run
  • SPEECH_TO_TEXT_LANGUAGE: The language that speech is expected to be in
  • TEXT_TO_SPEECH_URI: The URI to connect to the text to speech service. Accepts tcp://, unix://, or stdio://.
  • TEXT_TO_SPEECH_VOICE: The voice to use when synthesizing speech
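Putting these together, a hypothetical environment configuration might look like the following. The ports, model, and voice names are illustrative values for wyoming-faster-whisper and wyoming-piper, not defaults from this PR:

```shell
SPEECH_TO_TEXT_URI=tcp://host.docker.internal:10300
SPEECH_TO_TEXT_MODEL=tiny-int8
SPEECH_TO_TEXT_LANGUAGE=en
TEXT_TO_SPEECH_URI=tcp://host.docker.internal:10200
TEXT_TO_SPEECH_VOICE=en_US-lessac-medium
```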

Which issue(s) this PR fixes:

Partially implements discussion 4636, specifically the TTS and STT portions.

Special notes for your reviewer:

I also have a pull request I'm working on to implement an LLM recipe assistant API which this is meant to support. See the linked discussion for full details.

I haven't worked on any UI for this. I may be working with @miah120 on a chat interface for this in the future.

I also haven't made pull requests to mealie before, so please let me know if there's any improvements I can make to help fit with the existing code.

In order to actually use these APIs, you need Wyoming TTS and STT services running, and you must pass the appropriate environment variables to Mealie so it can connect to them.

Testing

Manual testing

  • Tested /api/voice/stt using Postman, passing a WAV file. I used the wyoming-faster-whisper STT service.
  • Tested /api/voice/tts using Postman, passing some basic text. I used the wyoming-piper TTS service.
  • Tested running both APIs while the relevant service was not running, to ensure an appropriate error response from the API.

@michael-genson
Collaborator

Hey there! This is pretty cool, I like the use of Wyoming, it seems to be the self-host standard (I've been toying with it in Home Assistant lately).

I think we'd want to see how this actually gets used in Mealie before merging (e.g. with the chat interface). Right now I'm hesitant to merge this since it doesn't actually enable anything for most users. I can definitely see the value in it; however, we want to avoid merging a feature before it's ready and then having to make significant changes later down the line, and/or remove it if it doesn't make sense. Since you have this in draft you may feel the same, but it's worth calling out.

Along those lines, the only broad feedback I have on the implementation is that it's a general-use endpoint rather than a purposeful one. For example, if the goal is to have a chatbot, I'd rather see APIs that handle the chatbot interaction and are called by the frontend, rather than a complex chatbot on the frontend calling basic APIs on the backend. This is similar to the current OpenAI integration: we don't have an "ask OpenAI" endpoint, we have endpoints for creating a recipe, parsing ingredients, etc. Internally we have one service which does all the work, but the interface is specific to the use case.
