This project is a server-side rendered Node application that serves multiple versions of a voice-to-voice functionality, responding to clients in an audio format. Responses are generated with the OpenAI chat completion API using response streaming. The following shows the breakdown of the repository's directory structure:
```
.
├── files                               # (Not for official use) collection of server rendered file formats
├── public                              # Server rendered page is served through this directory
│   ├── templates
│   │   ├── file
│   │   │   └── index.html
│   │   ├── sockets
│   │   │   └── index.html
│   │   └── default
│   │       └── index.html
│   └── uploads                         # Used to store the user voice files
├── scripts
│   └── test                            # (For testing purpose only) scripts to test third party libraries
├── package.json
├── servers
│   ├── file-based-response-server.js   # 1st variation of the server, using file written audio responses
│   └── sockets-server.js               # 2nd variation of the server, using sockets for audio communication
└── README.md
```
The `servers` folder contains multiple variations of approaches to a voice-to-voice bot; for each approach, the average time over several conversations has also been recorded to allow a comparison. This repository uses OpenAI tools for transcription, response generation, and text-to-speech synthesis, and the RecordRTC utility library to record audio both as a complete file and as audio chunks. The first iteration of the voice-to-voice communication is depicted in the diagram below.
The numbered circles show the four stages of processing required to achieve the result. Going through each stage sequentially can cause significant latency (roughly 15-20 seconds on average). To introduce parallelism across these four stages, multiple experiments were conducted with the approaches listed below. Follow the steps below to run the application with these approaches.
- Create a `.env` file comprising the following details

  `OPENAI_API_KEY="XXXXXXXXXXXXXX"`

- Install dependencies

  `npm install`

- Run one of the following commands

  - To run the file based response server

    `npm run start:file`

  - To run the sockets based response server

    `npm run start:socket`

  - To run the event based response server

    `npm run start:sse`

- The server runs on `http://localhost:8000`
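All three variants record audio in the browser with the RecordRTC library mentioned above, either as one complete file or as timed chunks. The snippet below is a hypothetical sketch of those two recording modes; the upload route, socket event names, and overall wiring are assumptions and not taken from the repository's client code.

```js
// Hypothetical client-side recording sketch using RecordRTC.

// Variant A: record one complete file and post it as form data
// (used by the file based and event based servers).
async function recordAsFile() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new RecordRTC(stream, { type: 'audio', mimeType: 'audio/webm' });
  recorder.startRecording();
  // Returns a stop function to call when the user finishes speaking.
  return () =>
    recorder.stopRecording(() => {
      const form = new FormData();
      form.append('audio', recorder.getBlob(), 'query.webm');
      fetch('/upload', { method: 'POST', body: form }); // route name is an assumption
    });
}

// Variant B: flush a blob every 3 seconds and push it over a socket
// (used by the sockets based server).
async function recordAsChunks(socket) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new RecordRTC(stream, {
    type: 'audio',
    mimeType: 'audio/webm',
    timeSlice: 3000,                                             // emit a chunk every 3 seconds
    ondataavailable: (blob) => socket.emit('audio-chunk', blob), // event name is an assumption
  });
  recorder.startRecording();
  return () => recorder.stopRecording(() => socket.emit('audio-end'));
}
```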
The OpenAI API can create an audio buffer and also store that buffer in a file. The idea of this approach is to transfer the user's audio as form data or a file. The transcription step processes the file and converts it into text, and a streamed response is then generated through OpenAI. The streamed chunks are gathered into sentences and sent to the speak method to be spoken out; the generated audio is saved in an audio file on the server, which the client plays as the audio data continues to stream in. The following diagram depicts the workflow; a condensed code sketch of the pipeline follows the trade-offs listed below.
- No audio buffer management on the client or server side.
- The audio plays seamlessly as it is served through a file on the server.
- Relatively fast.
- I/O overhead at scale.
- The infrastructure would require an encrypted file system to ensure secure storage and communication.
- An archiving process would be required to manage the files.
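A condensed sketch of this pipeline, assuming the official `openai` Node SDK; the model names, sentence-splitting rule, and `speak` helper are illustrative rather than the exact contents of `servers/file-based-response-server.js`:

```js
import fs from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI(); // expects OPENAI_API_KEY in the environment (.env)

// 1. Transcribe the uploaded audio file into text.
async function transcribe(path) {
  const result = await openai.audio.transcriptions.create({
    file: fs.createReadStream(path),
    model: 'whisper-1',
  });
  return result.text;
}

// 3 + 4. Synthesize a sentence and append it to the audio file the client is playing.
async function speak(text, outPath) {
  const audio = await openai.audio.speech.create({ model: 'tts-1', voice: 'alloy', input: text });
  fs.appendFileSync(outPath, Buffer.from(await audio.arrayBuffer()));
}

// 2. Stream a chat completion and flush each completed sentence to speech.
async function respond(userText, outPath) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: userText }],
    stream: true,
  });

  let sentence = '';
  for await (const chunk of stream) {
    sentence += chunk.choices[0]?.delta?.content ?? '';
    if (/[.!?]\s*$/.test(sentence)) { // a full sentence has arrived
      await speak(sentence, outPath);
      sentence = '';
    }
  }
  if (sentence) await speak(sentence, outPath); // flush any trailing text
}
```

In this sketch the client would start playing the response file while the server keeps appending synthesized sentences to it, which is the overlap between generation and playback that the approach relies on.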
This approach uses full duplex communication between the client and the server through sockets. Instead of a complete recording file being sent to the server, recording chunks are collected on the server at 3-second intervals. As the chunks are received, they are transcribed and consolidated on the server, and once the last chunk is received the consolidated text is immediately sent for response generation; this saves the time spent transcribing one large utterance in a single pass. The generated response is streamed just as in the previous section and passed to the speaking utility, which generates the audio response buffers that are sent back through the socket. The following diagram depicts the workflow; a rough code sketch of the socket flow follows the trade-offs listed below.
- Audio buffer management required on the client and server side
- Saves time on transcribing large audio files
- Full duplex communication between server and client, with less API calling overhead
- Responses are not saved to files
- Fast
- Uses sockets, which can cause scaling problems
- Audio buffer management on the client side can get tricky
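The rough sketch referenced above, assuming socket.io on the server; the event names and the `transcribe`, `streamSentences`, and `speak` helpers are hypothetical stand-ins for the logic described in this section:

```js
import { Server } from 'socket.io';

const io = new Server(8000, { cors: { origin: '*' } });

io.on('connection', (socket) => {
  let transcript = '';

  // Chunks arrive roughly every 3 seconds; transcribing each one on arrival
  // means only a small final piece is left to process when the user stops.
  socket.on('audio-chunk', async (chunk) => {
    transcript += await transcribe(chunk); // hypothetical helper wrapping Whisper
  });

  // The client signals the end of the utterance; generate and stream back audio.
  socket.on('audio-end', async () => {
    for await (const sentence of streamSentences(transcript)) { // hypothetical streamed chat completion, split into sentences
      const buffer = await speak(sentence);  // hypothetical TTS helper returning a Buffer
      socket.emit('audio-response', buffer); // the client queues and plays these buffers
    }
    socket.emit('audio-done');
    transcript = '';
  });
});
```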
The voice bot is an audio streaming solution, hence using server-sent events instead of sockets to transfer the audio buffers emerges as a more scalable option. The audio data is received as form data at the server, where it is transcribed; the chunks of transcribed text are brought together to generate audio buffers, which are transferred to the client through server-sent events. The workflow of this solution is depicted in the diagram below; a code sketch of the event-stream endpoints follows the trade-offs listed below.
- Audio buffer management required on the client side
- Solution can scale with the underlying instances
- Responses are not saved to files
- Audio buffer management on the client side
- The audio buffer sequence is often compromised
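The sketch referenced above, assuming Express with multer for the form-data upload; the route names, base64 framing, and the `transcribe`, `streamSentences`, and `speak` helpers are assumptions rather than the repository's exact code:

```js
import express from 'express';
import multer from 'multer';

const app = express();
const upload = multer({ dest: 'public/uploads/' });
const clients = new Map(); // sessionId -> open SSE response

// Each client first opens an SSE connection on which audio buffers will arrive.
app.get('/events/:sessionId', (req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  clients.set(req.params.sessionId, res);
  req.on('close', () => clients.delete(req.params.sessionId));
});

// The recorded audio arrives as form data; the reply is pushed over the SSE channel.
app.post('/query/:sessionId', upload.single('audio'), async (req, res) => {
  const sse = clients.get(req.params.sessionId);
  const text = await transcribe(req.file.path);          // hypothetical Whisper helper
  for await (const sentence of streamSentences(text)) {  // hypothetical streamed chat completion
    const buffer = await speak(sentence);                // hypothetical TTS helper returning a Buffer
    sse.write(`data: ${buffer.toString('base64')}\n\n`); // SSE is text-based, so binary audio is base64 encoded
  }
  sse.write('event: done\ndata: end\n\n');
  res.sendStatus(202);
});

app.listen(8000);
```

Because the transport is plain HTTP, this variant can sit behind standard load balancing, which is the scaling advantage noted above.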
All these formats have been tested over 50 conversations, and their times were recorded to get a rough estimate of the elapsed time for variable length user queries. The following table shows the readings.
| Server Format | Average Time (seconds) |
| --- | --- |
| File based | 5.9 |
| Sockets based | 4.4 |
| Event based | 6.4 |
The following is a screenshot of the server rendered web page in action.