
feat: Add voice APIs to communicate with stt and tts through the wyoming protocol #4637

Draft
wants to merge 4 commits into base: mealie-next

Conversation


@rosebeats commented Dec 1, 2024

What type of PR is this?

  • feature

What this PR does / why we need it:

This pull request adds 2 new API endpoints:

  • /voice/tts - for converting text to speech
  • /voice/stt - for converting speech to text

Both APIs call out to a configured server using the Wyoming protocol, so they are not tied to any specific speech-to-text or text-to-speech model.
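For context, Wyoming frames each message as a JSON header line, optionally followed by binary payload bytes (e.g. audio) whose length the header declares. The snippet below is a rough, hypothetical illustration of that framing idea, not the exact wire format used by the wyoming package:

```python
import io
import json


def encode_event(event_type: str, data: dict, payload: bytes = b"") -> bytes:
    """Frame an event: one JSON header line, then raw payload bytes."""
    header = {"type": event_type, "data": data, "payload_length": len(payload)}
    return json.dumps(header).encode("utf-8") + b"\n" + payload


def decode_event(stream):
    """Read one framed event back from a binary stream."""
    header = json.loads(stream.readline())
    payload = stream.read(header["payload_length"])
    return header["type"], header["data"], payload


# Round-trip a fake "audio-chunk" event carrying a few PCM bytes.
frame = encode_event("audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, b"\x00\x01" * 4)
etype, data, payload = decode_event(io.BytesIO(frame))
```

Because the header carries the payload length, a reader can interleave small JSON control events with large binary audio frames on one stream.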

I'm adding this API in order to support the use of a chatbot within Mealie. In particular, I'm aiming to support a "Cook Along" service that lets users ask a chatbot questions about a recipe as they're cooking, so they don't need to search around for what they're looking for. Speech-to-text and text-to-speech are important for implementing such a feature.

Changes in detail

New library

I'm adding the wyoming protocol library for communicating with TTS and STT services.

Dev container

I've added the docker run argument --add-host=host.docker.internal:host-gateway to the dev container so that Mealie can communicate with TCP services on the host (specifically the Wyoming protocol servers).
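As an illustration (the image name and Wyoming port below are hypothetical), this flag makes host-side services resolvable from inside the container:

```shell
# Map host.docker.internal to the host's gateway IP inside the container.
docker run --add-host=host.docker.internal:host-gateway mealie

# A Wyoming server listening on the host (port shown is illustrative)
# is then reachable from inside the container at:
#   tcp://host.docker.internal:10300
```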

Services

Adds TTSService and STTService for text-to-speech and speech-to-text. Initialization sets up the connection to the configured Wyoming protocol server. TTSService exposes a synthesize method for generating speech; STTService exposes a transcribe method for extracting text from audio.
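A minimal sketch of what such a service shape might look like. The class and helper names here are hypothetical and stdlib-only; the actual connection logic lives in the wyoming package and may differ from this:

```python
from urllib.parse import urlparse


def parse_service_uri(uri: str):
    """Validate and split a service URI of the kinds the settings accept."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("tcp", "unix", "stdio"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    return parsed.scheme, parsed.hostname or parsed.path, parsed.port


class STTService:
    """Sketch: validate the URI on init, expose an async transcribe() method."""

    def __init__(self, uri: str, model: str, language: str):
        self.scheme, self.host, self.port = parse_service_uri(uri)
        self.model = model
        self.language = language
        # Real code would open the Wyoming connection here.

    async def transcribe(self, audio: bytes) -> str:
        # Real code streams audio chunks to the server and awaits the
        # transcript event; elided in this sketch.
        raise NotImplementedError


svc = STTService("tcp://host.docker.internal:10300", "tiny-int8", "en")
```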

New Routes

Adds 2 new routes

  • /voice/tts - takes a "text" query parameter and returns a WAV file of the text converted to speech
  • /voice/stt - takes an audio file as the "audio" field of a multipart form and returns an object containing the text extracted from the file
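Hypothetical calls against these routes once a Wyoming backend is configured (the host and port are illustrative; the parameter names come from the description above):

```shell
# Text to speech: the "text" query parameter comes back as a WAV file.
curl -G "http://localhost:9000/api/voice/tts" \
  --data-urlencode "text=Preheat the oven" \
  -o speech.wav

# Speech to text: the audio file goes in the "audio" multipart field.
curl -F "audio=@question.wav" "http://localhost:9000/api/voice/stt"
```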

New Settings

  • SPEECH_TO_TEXT_URI: The URI to connect to the speech to text service. Accepts tcp://, unix://, or stdio://.
  • SPEECH_TO_TEXT_MODEL: The name of the speech to text model to run
  • SPEECH_TO_TEXT_LANGUAGE: The language that speech is expected to be in
  • TEXT_TO_SPEECH_URI: The URI to connect to the text to speech service. Accepts tcp://, unix://, or stdio://.
  • TEXT_TO_SPEECH_VOICE: The voice to use when synthesizing speech
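Putting these together, a hypothetical environment configuration might look like the following. The ports, model, and voice names are illustrative values for wyoming-faster-whisper and wyoming-piper, not defaults from this PR:

```shell
SPEECH_TO_TEXT_URI=tcp://host.docker.internal:10300
SPEECH_TO_TEXT_MODEL=tiny-int8
SPEECH_TO_TEXT_LANGUAGE=en
TEXT_TO_SPEECH_URI=tcp://host.docker.internal:10200
TEXT_TO_SPEECH_VOICE=en_US-lessac-medium
```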

Which issue(s) this PR fixes:

Partially implements discussion 4636, specifically the TTS and STT portions.

Special notes for your reviewer:

I also have a pull request I'm working on to implement an LLM recipe assistant API which this is meant to support. See the linked discussion for full details.

I haven't worked on any UI for this. I may be working with @miah120 on a chat interface for this in the future.

I also haven't made pull requests to mealie before, so please let me know if there's any improvements I can make to help fit with the existing code.

In order to actually use these APIs, you need Wyoming TTS and STT services running, and you must pass the appropriate environment variables to Mealie so it can connect to them.

Testing

Manual testing

  • Tested /api/voice/stt using Postman, passing a WAV file. I used the wyoming-faster-whisper STT service.
  • Tested /api/voice/tts using Postman, passing some basic text. I used the wyoming-piper TTS service.
  • Tested running both APIs while the relevant service was not running, to ensure an appropriate error response from the API.

@michael-genson
Collaborator

Hey there! This is pretty cool, I like the use of Wyoming, it seems to be the self-host standard (I've been toying with it in Home Assistant lately).

I think we'd want to see how this actually gets used in Mealie before merging (e.g. with the chat interface). Right now I'm hesitant to merge this since it doesn't actually enable anything for most users. I can definitely see the value in it; however, we want to avoid merging a feature before it's ready and then having to make significant changes later down the line, and/or remove it if it doesn't make sense. Since you have this in draft you may feel the same, but it's worth calling out.

Along those lines, the only broad feedback I have on the implementation is that it's a general-use endpoint rather than a purposeful one. For example, if the goal is to have a chatbot, I'd rather see APIs that handle the chatbot interaction and are called by the frontend, rather than a complex chatbot on the frontend calling basic APIs on the backend. This is similar to the current OpenAI integration: we don't have an "ask OpenAI" endpoint, we have endpoints for creating a recipe, parsing ingredients, etc. Internally we have one service which does all the work, but the interface is specific to the use case.
