Pipeline API update #1172
-
First of all, I'd start with a problem statement or something that indicates a clear goal. As it is, it looks like you're adding complexity for the sake of adding complexity. I feel like a better definition of the concepts involved is required: each needs to state what properties and methods are used and what the actual informational architecture should be. Who/what emits the events? And as a very specific question: why is only a single tool call permitted per message? Also, you haven't finished the conversation agent description.
-
My main concern with this approach is that it's message based, but a video/audio stream can only be message based if a) you send tiny messages with every 10 ms of audio, or b) you buffer for a long time and send in chunks. Neither is desirable, I believe. That's why I think that at the very core of our pipeline API we should just have audio in/out. I think we can still use a similar structure as described to connect multiple AIs together, although I am not sure if that can be done per "interaction" or whether it would be better split out as an effort for a multi-agent pipeline with shared history support.
-
Hi everyone, it's really great that this discussion started. The HA community is striving to get better offline & private voice assistants :) @balloob hits the mark here; I totally agree that multimodal pipelines should be based not only on Messages (text, image, audio, etc.) - we can split all that data into Frames/Chunks, which can be used for all modalities, including text, to improve the conversation UX and future-proof for planned features.

Do you also plan any interrupt features, for when the user wants to stop the current generation or change context? When I read about streaming - does that mean the TTS-generated responses would be streamed to all client devices, including the Companion App? Do you plan to support a continuous conversation mode, where the wake word only launches the Convo Mode and allows for a more natural experience with follow-ups and interruptions? Would there be an option to use externally hosted STT/TTS systems over any protocol? Wyoming?
-
The problem
The current pipeline API is limited to text-based conversation agents; however, there are voice-to-voice and multi-modal conversation agents that don't fit our current model. I suggest an approach that would be more extensible.
Status
This is a first draft, a request for comments. What should I change? A separate event type for updating each message field? Wyoming-based? Please share your comments.
Core concepts
Conversation agent entities (e.g. google_generative_ai_conversation), tts, stt, and vad entities, and the Pipeline, listen to events from the respective Pipeline and take actions if necessary. Each conversation is identified by a conversation_id. This could be implemented in conversation, or conversation could depend on assist_pipeline, or the two could be merged into a single integration.
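To make the event-driven flow concrete, here is a minimal sketch of how an entity could subscribe to pipeline events. The delivery mechanism is not fixed by this proposal; the bus-based event types below are assumptions for the sake of the example.

```python
from homeassistant.core import Event, HomeAssistant, callback

# Hypothetical event type names: the proposal defines the event payloads,
# not how they are delivered, so bus-based delivery here is an assumption.
EVENT_MESSAGE_ADD = "assist_pipeline.message_add"
EVENT_MESSAGE_UPDATE = "assist_pipeline.message_update"


@callback
def setup_pipeline_listener(hass: HomeAssistant, pipeline_id: str) -> None:
    """Subscribe a (hypothetical) agent/stt/tts/vad entity to pipeline events."""

    @callback
    def handle_message(event: Event) -> None:
        # Each entity only reacts to events of the pipeline it is assigned to.
        if event.data.get("pipeline_id") != pipeline_id:
            return
        # ... inspect text/voice fields here and emit follow-up events if needed.

    hass.bus.async_listen(EVENT_MESSAGE_ADD, handle_message)
    hass.bus.async_listen(EVENT_MESSAGE_UPDATE, handle_message)
```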
Events
conversation_start
This event initiates a new conversation.
Data:
pipeline_id
conversation_id
source
input_modalities
output_modalities
audio_formats
If an old conversation with the same conversation_id and pipeline_id is found, it is considered replaced and all the history should be discarded.
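As an illustration, a conversation_start payload could look like the dict below. Only the field names come from the list above; all values, including the modality constants and the source naming, are illustrative assumptions.

```python
# Hypothetical conversation_start payload; values are illustrative only.
conversation_start = {
    "pipeline_id": "01hxyzexamplepipeline",
    "conversation_id": "01hxyzexampleconversation",
    "source": "esphome.living_room_satellite",  # assumed source naming
    "input_modalities": ["CONVERSATION_MODALITY_VOICE"],
    "output_modalities": [
        "CONVERSATION_MODALITY_TEXT",
        "CONVERSATION_MODALITY_VOICE",
    ],
    "audio_formats": [
        {"codec": "pcm", "sample_rate": 16000, "sample_format": "s16le", "channels": 1},
    ],
}
```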
conversation_end
This event marks the end of the conversation. It can be sent by any party. There should be no new messages for the conversation after this event, but updates are allowed.
Data:
pipeline_id
conversation_id - conversation_id of the conversation start event.
source
reason
error_text
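A matching conversation_end payload might look like this; the reason values shown are made up for illustration and are not defined by this proposal.

```python
# Hypothetical conversation_end payload; "reason" values are illustrative.
conversation_end = {
    "pipeline_id": "01hxyzexamplepipeline",
    "conversation_id": "01hxyzexampleconversation",
    "source": "esphome.living_room_satellite",
    "reason": "timeout",   # e.g. "finished", "timeout", "error" (illustrative)
    "error_text": None,
}
```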
message_add
This event adds a new message to the conversation, or notifies that a new message is being recorded. The message can be added without any text or voice data, or with only partial text or voice data, and be updated later with a message_update event. This approach allows processing or displaying of the message to start earlier and minimizes latency.
Data:
pipeline_id
conversation_id
message_id
parent_id
source
agent_id
text
text_status
voice
voice_status
audio_format - if voice is present.
voice_start
voice_end - if voice_status is CONVERSATION_MESSAGE_STATUS_COMPLETE. Must be greater than voice_start.
image
file
tool_call
tool_args
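The sketch below shows what a message_add event for a user voice message might look like while recording is still in progress. Only the field names come from the list above; CONVERSATION_MESSAGE_STATUS_STREAMING is an assumed counterpart to CONVERSATION_MESSAGE_STATUS_COMPLETE (the only status constant named in this proposal), and all values are illustrative.

```python
# Hypothetical message_add payload for a user message still being recorded.
message_add = {
    "pipeline_id": "01hxyzexamplepipeline",
    "conversation_id": "01hxyzexampleconversation",
    "message_id": "msg-0001",
    "parent_id": None,   # first message in the conversation
    "source": "user",
    "agent_id": None,
    "text": None,
    "text_status": None,
    "voice": b"...",     # first chunk of audio data
    "voice_status": "CONVERSATION_MESSAGE_STATUS_STREAMING",  # assumed constant
    "audio_format": {
        "codec": "pcm",
        "sample_rate": 16000,
        "sample_format": "s16le",
        "channels": 1,
    },
}
```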
message_update
Updates the message.
Data:
pipeline_id
conversation_id
message_id
text
text_status
voice
voice_status
voice_start
voice_end - if voice_status is CONVERSATION_MESSAGE_STATUS_COMPLETE. Must be less than the previous value if one was present, and greater than voice_start. The difference is considered discarded and should not be used.
tool_result
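To illustrate the low-latency flow, an STT entity (or a streaming conversation agent) could fill in the text of an existing message with a series of message_update events like these. Only the field names come from the proposal; the streaming status constant is assumed, and the example assumes text carries the full text so far rather than only the new chunk, which is also an assumption.

```python
# Hypothetical incremental updates to the message added above.
updates = [
    {
        "pipeline_id": "01hxyzexamplepipeline",
        "conversation_id": "01hxyzexampleconversation",
        "message_id": "msg-0001",
        "text": "Turn on the ",
        "text_status": "CONVERSATION_MESSAGE_STATUS_STREAMING",  # assumed constant
    },
    {
        "pipeline_id": "01hxyzexamplepipeline",
        "conversation_id": "01hxyzexampleconversation",
        "message_id": "msg-0001",
        "text": "Turn on the kitchen lights",
        "text_status": "CONVERSATION_MESSAGE_STATUS_COMPLETE",
    },
]
```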
Audio formats
An audio format is determined by codec, sample rate, sample format, and the number of channels.
codec - only pcm is supported.
sample_rate - one of homeassistant.components.stt.AudioSampleRates.
sample_format - u8, s16le, s24le, or s32le for backwards compatibility.
channels
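A concrete audio format under this scheme could then be expressed as a small dict; the example below assumes 16 kHz mono signed 16-bit PCM, a common STT input format, purely for illustration.

```python
# Example audio format: 16 kHz, mono, signed 16-bit little-endian PCM.
audio_format = {
    "codec": "pcm",
    "sample_rate": 16000,  # must be one of homeassistant.components.stt.AudioSampleRates
    "sample_format": "s16le",
    "channels": 1,
}
```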
Speech to Text
When a new message is added, it may contain only text, only audio, or both. If the user message is audio-only but the conversation agent only accepts text, or when the conversation agent is audio-only but the output_modalities for the current conversation contains CONVERSATION_MODALITY_TEXT, and STT is configured for the pipeline, an STT entity will update the message with the text by sending message_update events. The STT entity should send the produced text in chunks, as soon as it is available, if supported.
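As a rough sketch of the STT side, the function below shows the intended chunked behaviour. The transcribe_stream and send_message_update callables are hypothetical placeholders for the engine-specific streaming STT call and for emitting a message_update event, and the "full text so far" convention is an assumption.

```python
from collections.abc import AsyncIterable, Awaitable, Callable


async def stream_stt_into_message(
    message_id: str,
    audio: AsyncIterable[bytes],
    transcribe_stream: Callable[[AsyncIterable[bytes]], AsyncIterable[str]],
    send_message_update: Callable[..., Awaitable[None]],
) -> None:
    """Transcribe audio and push partial text into the message (hypothetical helpers)."""
    text = ""
    async for partial in transcribe_stream(audio):
        text += partial
        # Send the text produced so far as soon as it is available.
        await send_message_update(
            message_id=message_id,
            text=text,
            text_status="CONVERSATION_MESSAGE_STATUS_STREAMING",  # assumed constant
        )
    await send_message_update(
        message_id=message_id,
        text=text,
        text_status="CONVERSATION_MESSAGE_STATUS_COMPLETE",
    )
```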
Text to Speech
Likewise, if the user message is text-only but the conversation agent only accepts voice, or if the conversation agent added a text-only message for the user but the output_modalities for the current conversation contains CONVERSATION_MODALITY_VOICE, then a TTS entity will update the message with the voice by sending message_update events. The TTS entity should send the produced audio in chunks, as soon as it is available, if supported, with one exception: if the TTS entity produces the result slower than realtime, it should send the result all at once.
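The "slower than realtime" exception boils down to a simple comparison; the sketch below just makes that comparison explicit. How synthesis_seconds and audio_seconds are measured is left to the TTS entity and is not defined here.

```python
def should_send_at_once(synthesis_seconds: float, audio_seconds: float) -> bool:
    """Return True if the TTS result should be sent as one piece.

    If producing the audio took longer than the audio itself lasts (slower than
    realtime), streaming it chunk by chunk would cause gaps in playback, so the
    complete result should be sent at once instead.
    """
    return synthesis_seconds > audio_seconds
```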
Voice activity detection
If a VAD engine is configured for the pipeline, it listens for the message_add and message_update events, and for each message from the user with audio it sends a message_update event with voice_start, voice_end, and voice_status. The exception: if it is the first message in the conversation and wake_word is configured for the pipeline, the VAD engine does not update voice_start.
Wake word detection
If a wake_word engine is configured for the pipeline, it listens for the message_add and message_update events, and for the first message in the conversation, if it is from the user and contains audio data, it searches for the wake word. Once the wake word is detected, it updates voice_start with a message_update event.
Conversation agent
A conversation agent listens to the message_add and message_update events, and once the modality it expects gets the completed status, it processes the message and adds a response.
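A conversation agent's event handling could then look roughly like this. The generate_reply and add_message callables are hypothetical stand-ins for the LLM call and for emitting a message_add event, and the handler assumes a text-only agent.

```python
from collections.abc import Awaitable, Callable
import uuid


async def handle_pipeline_message(
    message: dict,
    generate_reply: Callable[[str], Awaitable[str]],
    add_message: Callable[..., Awaitable[None]],
) -> None:
    """Sketch of a text-only conversation agent reacting to pipeline events."""
    # Only act on user messages whose text is complete.
    if message.get("source") != "user":
        return
    if message.get("text_status") != "CONVERSATION_MESSAGE_STATUS_COMPLETE":
        return

    reply = await generate_reply(message["text"])
    await add_message(
        conversation_id=message["conversation_id"],
        message_id=uuid.uuid4().hex,
        parent_id=message["message_id"],
        source="agent",
        text=reply,
        text_status="CONVERSATION_MESSAGE_STATUS_COMPLETE",
    )
```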
Markdown support
LLMs are usually trained on markdown-formatted text, and they occasionally produce it even when explicitly asked not to.
We will assume that the text may contain markdown formatting and will provide helper functions to remove it for devices and TTSs that don't support it.
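The helper functions are not defined yet; a naive regex-based sketch of what "remove markdown" could look like is shown below. A real implementation would likely use a proper markdown parser rather than regular expressions.

```python
import re


def strip_markdown(text: str) -> str:
    """Naive sketch: remove common markdown syntax for plain-text devices and TTS."""
    # Code fences and inline code.
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Bold, italics, strikethrough.
    text = re.sub(r"(\*\*|__|\*|_|~~)(.*?)\1", r"\2", text)
    # Links and images: keep only the link text.
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Headings, blockquotes, and list markers at the start of a line.
    text = re.sub(r"^\s{0,3}(#{1,6}\s+|>\s?|[-*+]\s+|\d+\.\s+)", "", text, flags=re.MULTILINE)
    return text.strip()
```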
History management
The Pipeline will save the messages in centralized storage. There will be conditions under which storage is guaranteed, such as a limit on the number of messages per conversation, the size of audio data, and the time since the conversation started. A conversation agent is encouraged to use it instead of keeping its own history.
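The retention conditions could be expressed as a small configuration object; the limit names and default values below are purely illustrative, the proposal only says that such limits will exist.

```python
from dataclasses import dataclass


@dataclass
class HistoryRetentionLimits:
    """Illustrative retention limits for centrally stored conversation history."""

    max_messages_per_conversation: int = 100            # illustrative value
    max_audio_bytes_per_conversation: int = 10 * 1024 * 1024
    max_conversation_age_seconds: int = 24 * 60 * 60
```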
LLM Tools workflow
When a conversation agent wants to call an LLM tool, it adds a new message to the conversation with tool_call and tool_args parameters.
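For illustration, a tool-call exchange could look like the two payloads below: the agent adds a message with tool_call and tool_args, and whichever component executes the tool answers with a message_update carrying tool_result. The tool name, the arguments, and which component sends the result are assumptions, not part of this proposal.

```python
# Hypothetical agent message requesting a tool call.
tool_call_message = {
    "conversation_id": "01hxyzexampleconversation",
    "message_id": "msg-0002",
    "parent_id": "msg-0001",
    "source": "agent",
    "tool_call": "HassTurnOn",                  # illustrative tool name
    "tool_args": {"name": "kitchen lights"},
}

# Hypothetical update carrying the result of the tool execution.
tool_result_update = {
    "conversation_id": "01hxyzexampleconversation",
    "message_id": "msg-0002",
    "tool_result": {"success": True},
}
```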
Backwards compatibility
The conversation.process action will be updated to add a message to the conversation and return the first response message. A pipeline will be auto-created to handle the conversation. ConversationEntity will be updated so that if the conversation agent does not implement the updated API, a default implementation will be provided that calls internal_async_process.
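For existing callers, something like the call below (today's service-call API) should keep working: under the hood it would add a message to an auto-created pipeline's conversation and return the first response message. The exact return shape is not defined here.

```python
from homeassistant.core import HomeAssistant


async def process_text(hass: HomeAssistant, text: str):
    """Call the existing conversation.process action; intended to keep working unchanged."""
    return await hass.services.async_call(
        "conversation",
        "process",
        {"text": text},
        blocking=True,
        return_response=True,
    )
```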
Future enhancements