Pipeline API update #1172
-
First of all, I'd start with a problem statement or something that indicates a clear goal. As it is, it looks like you're adding complexity for the sake of adding complexity. I feel like a better definition of the concepts involved is required: each needs to state what properties and methods are used and what the actual informational architecture should be. Who/what emits the events? And as a very specific question: why is only a single tool call permitted per message? Also, you haven't finished the conversation agent description.
-
My main concern with this approach is that it's message based, but a video/audio stream can only be message based if a) you send tiny messages with every 10 ms of audio, or b) you buffer for a long time and send in chunks. Neither is desirable, I believe. That's why I think that at the very core of our pipeline API we should just have audio in/out. I think we can still use a similar structure as described to connect multiple AIs together, although I am not sure if that can be done per "interaction" or whether it would be better split out as an effort for a multi-agent pipeline with shared history support.
-
Hi everyone, it's really great that this discussion started. The HA community is striving to get better offline & private voice assistants :) @balloob hits the mark here; I totally agree that multimodal pipelines should be based not only on Messages (text, image, audio, etc.) - we can split all that data into Frames/Chunks, which can be used for all modalities, including text, to improve the conversation UX and future-proof for planned features.

Do you also plan any interrupt features, for when the user wants to stop the current generation or change context? When I read about streaming - does that mean the TTS-generated responses would be streamed to all client devices, including the Companion App? Do you plan to support a continuous conversation mode, where the wake word only launches the Convo Mode and allows for a more natural experience with follow-ups and interruptions? Would there be an option to use externally hosted STT/TTS systems over any protocol? Wyoming?
-
The problem
The current pipeline API is limited to text-based conversation agents; however, there are voice-to-voice and multi-modal conversation agents that don't fit our current model. I suggest an approach that would be more extensible.
Status
This is a first draft, a request for comments. What should I change? A separate event type for updating each message field? Wyoming-based? Please share your comments.
Core concepts
Conversation agent entities (e.g. google_generative_ai_conversation), tts, stt, and vad entities, and the Pipeline, listen to events from the respective Pipeline and take actions if necessary. Each conversation is identified by a conversation_id. This could be implemented in conversation, or conversation could depend on assist_pipeline, or the two could be merged into a single integration.
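To make the event-driven flow concrete, here is a minimal sketch of how an entity could subscribe to pipeline events. The delivery mechanism is not fixed by this proposal; the bus-based event types below are assumptions for the sake of the example.

```python
from homeassistant.core import Event, HomeAssistant, callback

# Hypothetical event type names: the proposal defines the event payloads,
# not how they are delivered, so bus-based delivery here is an assumption.
EVENT_MESSAGE_ADD = "assist_pipeline.message_add"
EVENT_MESSAGE_UPDATE = "assist_pipeline.message_update"


@callback
def setup_pipeline_listener(hass: HomeAssistant, pipeline_id: str) -> None:
    """Subscribe a (hypothetical) agent/stt/tts/vad entity to pipeline events."""

    @callback
    def handle_message(event: Event) -> None:
        # Each entity only reacts to events of the pipeline it is assigned to.
        if event.data.get("pipeline_id") != pipeline_id:
            return
        # ... inspect text/voice fields here and emit follow-up events if needed.

    hass.bus.async_listen(EVENT_MESSAGE_ADD, handle_message)
    hass.bus.async_listen(EVENT_MESSAGE_UPDATE, handle_message)
```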
Events
conversation_start
This event initiates a new conversation.
Data:
pipeline_id
conversation_id
source
input_modalities
output_modalities
audio_formats
If an old conversation with the same conversation_id and pipeline_id is found, it is considered replaced and all the history should be discarded.
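As an illustration, a conversation_start payload could look like the dict below. Only the field names come from the list above; all values, including the modality constants and the source naming, are illustrative assumptions.

```python
# Hypothetical conversation_start payload; values are illustrative only.
conversation_start = {
    "pipeline_id": "01hxyzexamplepipeline",
    "conversation_id": "01hxyzexampleconversation",
    "source": "esphome.living_room_satellite",  # assumed source naming
    "input_modalities": ["CONVERSATION_MODALITY_VOICE"],
    "output_modalities": [
        "CONVERSATION_MODALITY_TEXT",
        "CONVERSATION_MODALITY_VOICE",
    ],
    "audio_formats": [
        {"codec": "pcm", "sample_rate": 16000, "sample_format": "s16le", "channels": 1},
    ],
}
```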
conversation_end
This event marks the end of the conversation. It can be sent by any party. There should be no new messages for the conversation after this event, but updates are allowed.
Data:
pipeline_id
conversation_id - conversation_id of the conversation start event.
source
reason
error_text
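A matching conversation_end payload might look like this; the reason values shown are made up for illustration and are not defined by this proposal.

```python
# Hypothetical conversation_end payload; "reason" values are illustrative.
conversation_end = {
    "pipeline_id": "01hxyzexamplepipeline",
    "conversation_id": "01hxyzexampleconversation",
    "source": "esphome.living_room_satellite",
    "reason": "timeout",   # e.g. "finished", "timeout", "error" (illustrative)
    "error_text": None,
}
```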
message_add
This event adds a new message to the conversation, or notifies that a new message is being recorded. The message can be added without any text or voice data, or with only partial text or voice data, and be updated later with a message_update event. This approach allows processing or displaying of the message to start earlier and minimizes latency.
Data:
pipeline_id
conversation_id
message_id
parent_id
source
agent_id
text
text_status
voice
voice_status
audio_format - if voice is present.
voice_start
voice_end - if voice_status is CONVERSATION_MESSAGE_STATUS_COMPLETE. Must be greater than voice_start.
image
file
tool_call
tool_args
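The sketch below shows what a message_add event for a user voice message might look like while recording is still in progress. Only the field names come from the list above; CONVERSATION_MESSAGE_STATUS_STREAMING is an assumed counterpart to CONVERSATION_MESSAGE_STATUS_COMPLETE (the only status constant named in this proposal), and all values are illustrative.

```python
# Hypothetical message_add payload for a user message still being recorded.
message_add = {
    "pipeline_id": "01hxyzexamplepipeline",
    "conversation_id": "01hxyzexampleconversation",
    "message_id": "msg-0001",
    "parent_id": None,   # first message in the conversation
    "source": "user",
    "agent_id": None,
    "text": None,
    "text_status": None,
    "voice": b"...",     # first chunk of audio data
    "voice_status": "CONVERSATION_MESSAGE_STATUS_STREAMING",  # assumed constant
    "audio_format": {
        "codec": "pcm",
        "sample_rate": 16000,
        "sample_format": "s16le",
        "channels": 1,
    },
}
```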
message_update
Updates the message.
Data:
pipeline_id
conversation_id
message_id
text
text_status
voice
voice_status
voice_start
voice_end - if voice_status is CONVERSATION_MESSAGE_STATUS_COMPLETE. Must be less than the previous value if one was present, and greater than voice_start. The difference is considered discarded and should not be used.
tool_result
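To illustrate the low-latency flow, an STT entity (or a streaming conversation agent) could fill in the text of an existing message with a series of message_update events like these. Only the field names come from the proposal; the streaming status constant is assumed, and the example assumes text carries the full text so far rather than only the new chunk, which is also an assumption.

```python
# Hypothetical incremental updates to the message added above.
updates = [
    {
        "pipeline_id": "01hxyzexamplepipeline",
        "conversation_id": "01hxyzexampleconversation",
        "message_id": "msg-0001",
        "text": "Turn on the ",
        "text_status": "CONVERSATION_MESSAGE_STATUS_STREAMING",  # assumed constant
    },
    {
        "pipeline_id": "01hxyzexamplepipeline",
        "conversation_id": "01hxyzexampleconversation",
        "message_id": "msg-0001",
        "text": "Turn on the kitchen lights",
        "text_status": "CONVERSATION_MESSAGE_STATUS_COMPLETE",
    },
]
```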
Audio formats
An audio format is determined by codec, sample rate, sample format, and the number of channels.
codec - only pcm is supported.
sample_rate - one of homeassistant.components.stt.AudioSampleRates.
sample_format - u8, s16le, s24le, or s32le for backwards compatibility.
channels
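A concrete audio format under this scheme could then be expressed as a small dict; the example below assumes 16 kHz mono signed 16-bit PCM, a common STT input format, purely for illustration.

```python
# Example audio format: 16 kHz, mono, signed 16-bit little-endian PCM.
audio_format = {
    "codec": "pcm",
    "sample_rate": 16000,  # must be one of homeassistant.components.stt.AudioSampleRates
    "sample_format": "s16le",
    "channels": 1,
}
```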
Speech to Text
When a new message is added, it may contain only text, only audio, or both. If the user message is audio-only but the conversation agent only accepts text, or when the conversation agent is audio-only but the output_modalities for the current conversation contains CONVERSATION_MODALITY_TEXT, and STT is configured for the pipeline, an STT entity will update the message with the text by sending message_update events. The STT entity should send the produced text in chunks, as soon as it is available, if supported.
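As a rough sketch of the STT side, the function below shows the intended chunked behaviour. The transcribe_stream and send_message_update callables are hypothetical placeholders for the engine-specific streaming STT call and for emitting a message_update event, and the "full text so far" convention is an assumption.

```python
from collections.abc import AsyncIterable, Awaitable, Callable


async def stream_stt_into_message(
    message_id: str,
    audio: AsyncIterable[bytes],
    transcribe_stream: Callable[[AsyncIterable[bytes]], AsyncIterable[str]],
    send_message_update: Callable[..., Awaitable[None]],
) -> None:
    """Transcribe audio and push partial text into the message (hypothetical helpers)."""
    text = ""
    async for partial in transcribe_stream(audio):
        text += partial
        # Send the text produced so far as soon as it is available.
        await send_message_update(
            message_id=message_id,
            text=text,
            text_status="CONVERSATION_MESSAGE_STATUS_STREAMING",  # assumed constant
        )
    await send_message_update(
        message_id=message_id,
        text=text,
        text_status="CONVERSATION_MESSAGE_STATUS_COMPLETE",
    )
```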
Text to Speech
Likewise, if the user message is text-only but the conversation agent only accepts voice, or if the conversation agent added a text-only message for the user but the output_modalities for the current conversation contains CONVERSATION_MODALITY_VOICE, then a TTS entity will update the message with the voice by sending message_update events. The TTS entity should send the produced audio in chunks, as soon as it is available, if supported, with one exception: if the TTS entity produces the result slower than realtime, it should send the result all at once.
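The "slower than realtime" exception boils down to a simple comparison; the sketch below just makes that comparison explicit. How synthesis_seconds and audio_seconds are measured is left to the TTS entity and is not defined here.

```python
def should_send_at_once(synthesis_seconds: float, audio_seconds: float) -> bool:
    """Return True if the TTS result should be sent as one piece.

    If producing the audio took longer than the audio itself lasts (slower than
    realtime), streaming it chunk by chunk would cause gaps in playback, so the
    complete result should be sent at once instead.
    """
    return synthesis_seconds > audio_seconds
```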
Voice activity detection
If a VAD engine is configured for the pipeline, it listens for the message_add and message_update events, and for each message from the user with audio it sends a message_update event with voice_start, voice_end, and voice_status. The exception: if it is the first message in the conversation and wake_word is configured for the pipeline, the VAD engine does not update voice_start.
Wake word detection
If a wake_word engine is configured for the pipeline, it listens for the message_add and message_update events, and for the first message in the conversation, if it is from the user and contains audio data, it searches for the wake word. Once the wake word is detected, it updates voice_start with a message_update event.
Conversation agent
A conversation agent listens to the message_add and message_update events, and once the modality it expects gets the completed status, it processes the message and adds a response.
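A conversation agent's event handling could then look roughly like this. The generate_reply and add_message callables are hypothetical stand-ins for the LLM call and for emitting a message_add event, and the handler assumes a text-only agent.

```python
from collections.abc import Awaitable, Callable
import uuid


async def handle_pipeline_message(
    message: dict,
    generate_reply: Callable[[str], Awaitable[str]],
    add_message: Callable[..., Awaitable[None]],
) -> None:
    """Sketch of a text-only conversation agent reacting to pipeline events."""
    # Only act on user messages whose text is complete.
    if message.get("source") != "user":
        return
    if message.get("text_status") != "CONVERSATION_MESSAGE_STATUS_COMPLETE":
        return

    reply = await generate_reply(message["text"])
    await add_message(
        conversation_id=message["conversation_id"],
        message_id=uuid.uuid4().hex,
        parent_id=message["message_id"],
        source="agent",
        text=reply,
        text_status="CONVERSATION_MESSAGE_STATUS_COMPLETE",
    )
```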
Markdown support
LLMs are usually trained on markdown-formatted text, and they occasionally produce it even when explicitly asked not to.
We will assume that the text may contain markdown formatting and will provide helper functions to remove it for devices and TTSs that don't support it.
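The helper functions are not defined yet; a naive regex-based sketch of what "remove markdown" could look like is shown below. A real implementation would likely use a proper markdown parser rather than regular expressions.

```python
import re


def strip_markdown(text: str) -> str:
    """Naive sketch: remove common markdown syntax for plain-text devices and TTS."""
    # Code fences and inline code.
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Bold, italics, strikethrough.
    text = re.sub(r"(\*\*|__|\*|_|~~)(.*?)\1", r"\2", text)
    # Links and images: keep only the link text.
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Headings, blockquotes, and list markers at the start of a line.
    text = re.sub(r"^\s{0,3}(#{1,6}\s+|>\s?|[-*+]\s+|\d+\.\s+)", "", text, flags=re.MULTILINE)
    return text.strip()
```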
History management
The Pipeline will save the messages in centralized storage. There will be conditions under which storage is guaranteed, such as a limit on the number of messages per conversation, the size of audio data, and the time since the conversation started. A conversation agent is encouraged to use it instead of keeping its own history.
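The retention conditions could be expressed as a small configuration object; the limit names and default values below are purely illustrative, the proposal only says that such limits will exist.

```python
from dataclasses import dataclass


@dataclass
class HistoryRetentionLimits:
    """Illustrative retention limits for centrally stored conversation history."""

    max_messages_per_conversation: int = 100            # illustrative value
    max_audio_bytes_per_conversation: int = 10 * 1024 * 1024
    max_conversation_age_seconds: int = 24 * 60 * 60
```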
LLM Tools workflow
When a conversation agent wants to call an LLM tool, it adds a new message to the conversation with tool_call and tool_args parameters.
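For illustration, a tool-call exchange could look like the two payloads below: the agent adds a message with tool_call and tool_args, and whichever component executes the tool answers with a message_update carrying tool_result. The tool name, the arguments, and which component sends the result are assumptions, not part of this proposal.

```python
# Hypothetical agent message requesting a tool call.
tool_call_message = {
    "conversation_id": "01hxyzexampleconversation",
    "message_id": "msg-0002",
    "parent_id": "msg-0001",
    "source": "agent",
    "tool_call": "HassTurnOn",                  # illustrative tool name
    "tool_args": {"name": "kitchen lights"},
}

# Hypothetical update carrying the result of the tool execution.
tool_result_update = {
    "conversation_id": "01hxyzexampleconversation",
    "message_id": "msg-0002",
    "tool_result": {"success": True},
}
```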
Backwards compatibility
The conversation.process action will be updated to add a message to the conversation and return the first response message. A pipeline will be auto-created to handle the conversation. ConversationEntity will be updated so that if the conversation agent does not implement the updated API, a default implementation will be provided that calls internal_async_process.
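For existing callers, something like the call below (today's service-call API) should keep working: under the hood it would add a message to an auto-created pipeline's conversation and return the first response message. The exact return shape is not defined here.

```python
from homeassistant.core import HomeAssistant


async def process_text(hass: HomeAssistant, text: str):
    """Call the existing conversation.process action; intended to keep working unchanged."""
    return await hass.services.async_call(
        "conversation",
        "process",
        {"text": text},
        blocking=True,
        return_response=True,
    )
```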
Future enhancements