This project is a server-side rendered Node application that serves multiple versions of a voice-to-voice functionality, responding to clients in an audio format. Responses are generated with the OpenAI chat completion API using response streaming. The following shows the breakdown of the repository's directory structure:
```
.
├── files                               # (Not for official use) collection of server rendered file formats
├── public                              # Server rendered page is served through this directory
│   ├── templates
│   │   ├── file
│   │   │   └── index.html
│   │   ├── sockets
│   │   │   └── index.html
│   │   └── default
│   │       └── index.html
│   └── uploads                         # Used to store the user voice files
├── scripts
│   └── test                            # (For testing purpose only) scripts to test third party libraries
├── package.json
├── servers
│   ├── file-based-response-server.js   # 1st variation of the server, using file written audio responses
│   └── sockets-server.js               # 2nd variation of the server, using sockets for audio communication
└── README.md
```
The `servers` folder contains multiple variations of approaches to a voice-to-voice bot; for each approach, the average time over several conversations has also been recorded to allow a comparison. This repository uses OpenAI tools for transcription, response generation, and text-to-speech synthesis, and the RecordRTC utility library to record audio both as a complete file and as audio chunks. The first iteration of the voice-to-voice communication is depicted in the diagram below.
The numbered circles show the four stages of processing required to achieve the result. Going through each stage sequentially can cause significant latency (roughly 15-20 seconds on average). To introduce parallelism across these four stages, multiple experiments were conducted with the approaches listed below. Follow the steps below to run the application with these approaches.
- Create a `.env` file comprising the following details

  `OPENAI_API_KEY="XXXXXXXXXXXXXX"`

- Install dependencies

  `npm install`

- Run one of the following commands

  - To run the file based response server

    `npm run start:file`

  - To run the sockets based response server

    `npm run start:socket`

  - To run the event based response server

    `npm run start:sse`

- The server runs on `http://localhost:8000`
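All three variants record audio in the browser with the RecordRTC library mentioned above, either as one complete file or as timed chunks. The snippet below is a hypothetical sketch of those two recording modes; the upload route, socket event names, and overall wiring are assumptions and not taken from the repository's client code.

```js
// Hypothetical client-side recording sketch using RecordRTC.

// Variant A: record one complete file and post it as form data
// (used by the file based and event based servers).
async function recordAsFile() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new RecordRTC(stream, { type: 'audio', mimeType: 'audio/webm' });
  recorder.startRecording();
  // Returns a stop function to call when the user finishes speaking.
  return () =>
    recorder.stopRecording(() => {
      const form = new FormData();
      form.append('audio', recorder.getBlob(), 'query.webm');
      fetch('/upload', { method: 'POST', body: form }); // route name is an assumption
    });
}

// Variant B: flush a blob every 3 seconds and push it over a socket
// (used by the sockets based server).
async function recordAsChunks(socket) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new RecordRTC(stream, {
    type: 'audio',
    mimeType: 'audio/webm',
    timeSlice: 3000,                                             // emit a chunk every 3 seconds
    ondataavailable: (blob) => socket.emit('audio-chunk', blob), // event name is an assumption
  });
  recorder.startRecording();
  return () => recorder.stopRecording(() => socket.emit('audio-end'));
}
```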
The OpenAI API can create an audio buffer and also store that buffer in a file. The idea of this approach is to transfer the user's audio as form data or a file. The transcription step processes the file and converts it into text, and a streamed response is then generated through OpenAI. The streamed chunks are gathered into sentences and sent to the speak method to be spoken out; the generated audio is saved in an audio file on the server, which the client plays as the audio data continues to stream in. The following diagram depicts the workflow; a condensed code sketch of the pipeline follows the trade-offs listed below.
- No audio buffer management on the client or server side.
- The audio plays seamlessly as it is served through a file on the server.
- Relatively fast.
- I/O overhead at scale.
- The infrastructure would require an encrypted file system to ensure secure storage and communication.
- An archiving process would be required to manage the files.
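A condensed sketch of this pipeline, assuming the official `openai` Node SDK; the model names, sentence-splitting rule, and `speak` helper are illustrative rather than the exact contents of `servers/file-based-response-server.js`:

```js
import fs from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI(); // expects OPENAI_API_KEY in the environment (.env)

// 1. Transcribe the uploaded audio file into text.
async function transcribe(path) {
  const result = await openai.audio.transcriptions.create({
    file: fs.createReadStream(path),
    model: 'whisper-1',
  });
  return result.text;
}

// 3 + 4. Synthesize a sentence and append it to the audio file the client is playing.
async function speak(text, outPath) {
  const audio = await openai.audio.speech.create({ model: 'tts-1', voice: 'alloy', input: text });
  fs.appendFileSync(outPath, Buffer.from(await audio.arrayBuffer()));
}

// 2. Stream a chat completion and flush each completed sentence to speech.
async function respond(userText, outPath) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: userText }],
    stream: true,
  });

  let sentence = '';
  for await (const chunk of stream) {
    sentence += chunk.choices[0]?.delta?.content ?? '';
    if (/[.!?]\s*$/.test(sentence)) { // a full sentence has arrived
      await speak(sentence, outPath);
      sentence = '';
    }
  }
  if (sentence) await speak(sentence, outPath); // flush any trailing text
}
```

In this sketch the client would start playing the response file while the server keeps appending synthesized sentences to it, which is the overlap between generation and playback that the approach relies on.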
This approach uses full duplex communication between the client and the server through sockets. Instead of a complete recording file being sent to the server, recording chunks are collected on the server at 3-second intervals. As the chunks are received, they are transcribed and consolidated on the server, and once the last chunk is received the consolidated text is immediately sent for response generation; this saves the time spent transcribing one large utterance in a single pass. The generated response is streamed just as in the previous section and passed to the speaking utility, which generates the audio response buffers that are sent back through the socket. The following diagram depicts the workflow; a rough code sketch of the socket flow follows the trade-offs listed below.
- Audio buffer management required on the client and server side
- Saves time on transcribing large audio files
- Full duplex communication between server and client, with less API calling overhead
- Responses are not saved to files
- Fast
- Uses sockets, which can cause scaling problems
- Audio buffer management on the client side can get tricky
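The rough sketch referenced above, assuming socket.io on the server; the event names and the `transcribe`, `streamSentences`, and `speak` helpers are hypothetical stand-ins for the logic described in this section:

```js
import { Server } from 'socket.io';

const io = new Server(8000, { cors: { origin: '*' } });

io.on('connection', (socket) => {
  let transcript = '';

  // Chunks arrive roughly every 3 seconds; transcribing each one on arrival
  // means only a small final piece is left to process when the user stops.
  socket.on('audio-chunk', async (chunk) => {
    transcript += await transcribe(chunk); // hypothetical helper wrapping Whisper
  });

  // The client signals the end of the utterance; generate and stream back audio.
  socket.on('audio-end', async () => {
    for await (const sentence of streamSentences(transcript)) { // hypothetical streamed chat completion, split into sentences
      const buffer = await speak(sentence);  // hypothetical TTS helper returning a Buffer
      socket.emit('audio-response', buffer); // the client queues and plays these buffers
    }
    socket.emit('audio-done');
    transcript = '';
  });
});
```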
The voice bot is an audio streaming solution, hence using server-sent events instead of sockets to transfer the audio buffers emerges as a more scalable option. The audio data is received as form data at the server, where it is transcribed; the chunks of transcribed text are brought together to generate audio buffers, which are transferred to the client through server-sent events. The workflow of this solution is depicted in the diagram below; a code sketch of the event-stream endpoints follows the trade-offs listed below.
- Audio buffer management required on the client side
- Solution can scale with the underlying instances
- Responses are not saved to files
- Audio buffer management on the client side
- The audio buffer sequence is often compromised
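The sketch referenced above, assuming Express with multer for the form-data upload; the route names, base64 framing, and the `transcribe`, `streamSentences`, and `speak` helpers are assumptions rather than the repository's exact code:

```js
import express from 'express';
import multer from 'multer';

const app = express();
const upload = multer({ dest: 'public/uploads/' });
const clients = new Map(); // sessionId -> open SSE response

// Each client first opens an SSE connection on which audio buffers will arrive.
app.get('/events/:sessionId', (req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  clients.set(req.params.sessionId, res);
  req.on('close', () => clients.delete(req.params.sessionId));
});

// The recorded audio arrives as form data; the reply is pushed over the SSE channel.
app.post('/query/:sessionId', upload.single('audio'), async (req, res) => {
  const sse = clients.get(req.params.sessionId);
  const text = await transcribe(req.file.path);          // hypothetical Whisper helper
  for await (const sentence of streamSentences(text)) {  // hypothetical streamed chat completion
    const buffer = await speak(sentence);                // hypothetical TTS helper returning a Buffer
    sse.write(`data: ${buffer.toString('base64')}\n\n`); // SSE is text-based, so binary audio is base64 encoded
  }
  sse.write('event: done\ndata: end\n\n');
  res.sendStatus(202);
});

app.listen(8000);
```

Because the transport is plain HTTP, this variant can sit behind standard load balancing, which is the scaling advantage noted above.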
All these formats have been tested over 50 conversations, and their times were recorded to get a rough estimate of the elapsed time for variable length user queries. The following table shows the readings.
| Server Format | Average Time (seconds) |
| --- | --- |
| File based | 5.9 |
| Sockets based | 4.4 |
| Event based | 6.4 |
The following is a screenshot of the server rendered web page in action.