From 8a4aed175480f188de00e0674a7f1a5b7ef75684 Mon Sep 17 00:00:00 2001 From: David Zhao Date: Sat, 3 Aug 2024 12:23:50 -0700 Subject: [PATCH] 0.8 migration guide (#572) --- 0.8-migration-guide.md | 186 +++++++++++++++++++++++++++++++++++++++++ README.md | 13 ++- 2 files changed, 197 insertions(+), 2 deletions(-) create mode 100644 0.8-migration-guide.md diff --git a/0.8-migration-guide.md b/0.8-migration-guide.md new file mode 100644 index 000000000..f00b70a06 --- /dev/null +++ b/0.8-migration-guide.md @@ -0,0 +1,186 @@ +# Migrating to 0.8.x + +v0.8 is a major release of the framework, featuring significant reliability improvements to VoiceAssistant. This update includes a few breaking API changes that will impact the way you build your agents. We strive to minimize breaking changes and will stabilize the API as we approach version 1.0. + +## Job and Worker API + +### Specifying your entrypoint function + +`entrypoint_fnc` is now a parameter in WorkerOptions. Previously, you were required to explicitly accept the job. + +### Namespace had been removed + +We've removed the namespace option in order to simplify the registration process. In future versions, it'll be possible to provide an explicit `agent_name` to spin up multiple kinds of agents for each room. + +### Connecting to room is explicit + +You now need to call ctx.connect() to initiate the connection to the room. This allows for pre-connect setup (such as callback registrations) to avoid race conditions. + +### Example + +The above changes are reflected in the following minimal example: + +```python +from livekit.agents import JobContext, JobRequest, WorkerOptions, cli + +async def job_entrypoint(ctx: JobContext): + await ctx.connect() + # your logic here + ... + +if __name__ == "__main__": + cli.run_app( + WorkerOptions(entrypoint_fnc=job_entrypoint) + ) +``` + +## VoiceAssistant + +VoiceAssistant API remains mostly unchanged, despite significant improvements to functionality and internals. However, there have been changes to the configuration. + +### Initialization args + +- Removed + - base_volume + - debug + - sentence_tokenizer, word_tokenizer, hyphenate_word +- Changed + - transcription related options are grouped within `transcription` param + +```python +class VoiceAssistant(utils.EventEmitter[EventTypes]): + def __init__( + self, + *, + vad: vad.VAD, + stt: stt.STT, + llm: LLM, + tts: tts.TTS, + chat_ctx: ChatContext | None = None, + fnc_ctx: FunctionContext | None = None, + allow_interruptions: bool = True, + interrupt_speech_duration: float = 0.6, + interrupt_min_words: int = 0, + preemptive_synthesis: bool = True, + transcription: AssistantTranscriptionOptions = AssistantTranscriptionOptions(), + will_synthesize_assistant_reply: WillSynthesizeAssistantReply = _default_will_synthesize_assistant_reply, + plotting: bool = False, + loop: asyncio.AbstractEventLoop | None = None, + ) -> None: + ... +``` + +## LLM + +The LLM class has been restructured to enhance ergonomics and improve the function calling support. + +### Function/tool calling + +Function calling has gotten a complete overhaul in v0.8.0. The primary breaking change is that function calls are now NOT automatically invoked when iterating the LLM stream. `LLMStream.execute_functions` needs to be called instead. (VoiceAssistant handles this automatically) + +### LLM.chat is no longer an async method + +Previously, LLM.chat() was an async method that returned an LLMStream (which itself was an AsyncIterable). + +We found it easier and less-confusing for LLM.chat() to be synchronous, while still returning the same AsyncIterable LLMStream. + +### LLM.chat `history` has been renamed to `chat_ctx` + +In order to improve consistency and reduce confusion. + +```python +chat_ctx = llm.ChatContext() +chat_ctx.append(role="user", text="user message") +stream = llm_plugin.chat(chat_ctx=chat_ctx) +``` + +## STT + +### SpeechStream.flush + +Previously, to communicate to a STT provider that you have sent enough input to generate a response - you could push_frame(None) to coax the TTS into synthesizing a response. + +In v0.8.0 that API has been removed and replaced with flush() + +### SpeechStream.end_input + +`end_input` signals to the STT provider that the input is complete and no additional input will follow. Previously, this was done using aclose(wait=True). + +### SpeechStream.aclose + +The `wait` arg of aclose has been removed in favor of SpeechStream.end_input (see above). Now, if you call `TTS.aclose()` without first calling STT.end_input, the behavior will be that the request is cancelled. + +```python +stt_stream = my_stt_instance.stream() +async for ev in audio_stream: + stt_stream.push_frame(ev.frame) + # optionally flush when enough frames have been pushed + stt_stream.flush() + +stt_stream.end_input() +await stt_stream.aclose() +``` + +## TTS + +### SynthesizedAudio changed and SynthesisEvent removed + +SynthesizedAudio dataclass had gone through a major change + +```python +# New SynthesizedAudio dataclass +@dataclass +class SynthesizedAudio: + request_id: str + """Request ID (one segment could be made up of multiple requests)""" + segment_id: str + """Segment ID, each segment is separated by a flush""" + frame: rtc.AudioFrame + """Synthesized audio frame""" + delta_text: str = "" + """Current segment of the synthesized audio""" + +#Old SynthesizedAudio dataclass +@dataclass +class SynthesizedAudio: + text: str + data: rtc.AudioFrame +``` + +The SynthesisEvent has been removed entirely. All occurrences of it have been replaced with SynthesizedAudio + +### SynthesizeStream.flush + +Similar to the STT changes, this coaxes the TTS provider into generating a response. The SynthesizedAudio response will have a new segment_id after calls to flush(). + +### SynthesizeStream.end_input + +Similar to the STT changes, this replaces aclose(wait=True). + +### SynthesizeStream.aclose + +Similar to the STT changes, the wait arg has been removed. + +```python +tts_stream = my_tts_instance.stream() +tts_stream.push_text("This is the first sentence") +tts_stream.flush() +tts_stream.push_text("This is the second sentence") +tts_stream.end_input() +await tts_stream.aclose() +``` + +## VAD + +The same changes made to STT and TTS have also been made to VAD + +```python +vad_stream = my_vad_instance.stream() +async for ev in audio_stream: + vad_stream.push_frame(ev.frame) + # optionally flush when enough frames have been pushed + vad_stream.flush() + +vad_stream.end_input() +await vad_stream.aclose() +``` diff --git a/README.md b/README.md index a497f63c4..dd08f9d8d 100644 --- a/README.md +++ b/README.md @@ -20,7 +20,7 @@ audio, video, and data streams. The framework includes plugins for common workflows, such as voice activity detection and speech-to-text. -Agents integrates seamlessly with [LiveKit server](https://github.com/livekit/livekit), offloading job queuing and scheduling responsibilities to it. This eliminates the need for additional queuing infrastructure. Agent code developed on your local machine can scale to support thousands of concurrent sessions when deployed to a server in production. +Agents integrates seamlessly with Cloud or self-hosted [LiveKit](https://livekit.io/) server, offloading job queuing and scheduling responsibilities to it. This eliminates the need for additional queuing infrastructure. Agent code developed on your local machine can scale to support thousands of concurrent sessions when deployed to a server in production. > This SDK is currently in Developer Preview. During this period, you may encounter bugs and the APIs may change. > @@ -33,6 +33,9 @@ Agents integrates seamlessly with [LiveKit server](https://github.com/livekit/li - [Working with plugins](https://docs.livekit.io/agents/plugins) - [Deploying agents](https://docs.livekit.io/agents/deployment) +> [!NOTE] +> There are breaking API changes between versions 0.7.x and 0.8.x. Please refer to the [0.8 migration guide](0.8-migration-guide.md) for a detailed overview of the changes. + ## Examples - [Voice assistant](https://github.com/livekit/agents/tree/main/examples/voice-assistant): A voice assistant with STT, LLM, and TTS. [Demo](https://kitt.livekit.io) @@ -81,7 +84,7 @@ The framework exposes a CLI interface to run your agent. To get started, you'll - LIVEKIT_API_KEY - LIVEKIT_API_SECRET -### Running the worker +### Starting the worker This will start the worker and wait for users to connect to your LiveKit server: @@ -89,6 +92,12 @@ This will start the worker and wait for users to connect to your LiveKit server: python my_agent.py start ``` +To run the worker in dev-mode (with hot code reloading), you can use the dev command: + +```bash +python my_agent.py dev +``` + ### Using playground for your agent UI To ease the process of building and testing an agent, we've developed a versatile web frontend called "playground". You can use or modify this app to suit your specific requirements. It can also serve as a starting point for a completely custom agent application.