Commit 61aa3ca
chore: update documentation format and some key errors
MayorX500 committed Dec 20, 2024 (1 parent: 6ab8edd)
Showing 22 changed files with 405 additions and 280 deletions.
80 changes: 56 additions & 24 deletions README.md
<img width="50" src="./docs/src/images/EEUM_logo_EN.jpg">
</div>

# PI24 - Real-Time Speech Synthesis

**Authors:**

- [Beatriz Monteiro](https://github.com/5ditto)
- [Daniel Du](https://github.com/ddu72)
- [Daniel Furtado](https://github.com/danielfurtado11)
- [Miguel Gomes](https://github.com/MayorX500)
- [Moisés Antunes](https://github.com/MoisesA14)
- [Telmo Maciel](https://github.com/telmomaciel9)

## A more readable version of the documentation can be found [here](https://ddu72.github.io/PI24/)

## Description

This project aims to create a text-to-speech system that can be used in real time. The system is divided into three main components: the client, the proxy, and the server. The client is responsible for sending the text to be synthesized to the proxy. The proxy is responsible for redirecting the requests to the available servers. The server is responsible for normalizing the text, synthesizing it, and sending the audio back to the client.
The project was proposed by the company Agentifai and was developed by a group of students from the University of Minho.

The Text-to-Speech (TTS) system follows a modular pipeline to convert input text or Speech Synthesis Markup Language (SSML) into audio files or real-time streams. Below is a description of the key components and their roles in the process:

1. User Input or SSML File:

   Users can provide natural text or structured SSML files. However, due to limitations in supporting SSML files, this functionality has not been implemented. Currently, the system only processes natural text inputs.

2. SSML Parser:

   (Not implemented) This component was intended to extract relevant text data from SSML files and prepare it for further processing.

3. Normalizer:

   The normalizer standardizes the input text (e.g., expanding abbreviations, handling numbers) to ensure it is ready for phonemization.

4. Phonemizer:

   The phonemizer converts normalized text into phonetic representations, enabling accurate pronunciation during synthesis. This component is also subject to training and evaluation cycles to improve accuracy and performance.

5. TTS Model:

   The text's phonetic representation is passed to the TTS model, which generates audio output. The model has been optimized for both file-based outputs and real-time streaming scenarios.

6. Audio Output:

   The generated audio can be delivered as a downloadable file or streamed directly, depending on user requirements.

The system's modular architecture ensures flexibility, enabling enhancements to individual components without disrupting the entire workflow. While SSML support was initially planned, the current implementation focuses solely on natural text processing.
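
The sketch below is a rough Python illustration of how these stages chain together; all class and method names are placeholders chosen for this example, not the project's actual interfaces.

```python
# Minimal sketch of the pipeline described above. Every name here is an
# illustrative placeholder, not an interface defined in this repository.

class Normalizer:
    def normalize(self, text: str) -> str:
        # Expand abbreviations, spell out numbers, etc.
        return text.replace("Dr.", "Doutor")


class Phonemizer:
    def to_phonemes(self, text: str) -> list[str]:
        # Convert normalized text into a phonetic representation.
        return text.split()  # placeholder: one "phoneme" per word


class TTSModel:
    def synthesize(self, phonemes: list[str]) -> bytes:
        # Generate audio (e.g., WAV bytes) from the phonetic representation.
        return b"\x00" * len(phonemes)  # placeholder audio payload


def text_to_speech(text: str) -> bytes:
    normalizer, phonemizer, tts = Normalizer(), Phonemizer(), TTSModel()
    normalized = normalizer.normalize(text)
    phonemes = phonemizer.to_phonemes(normalized)
    return tts.synthesize(phonemes)


if __name__ == "__main__":
    audio = text_to_speech("O Dr. Silva chega amanhã.")
    print(f"Generated {len(audio)} bytes of placeholder audio")
```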

## Components

### Client

The client is a simple TUI program that sends the text to be synthesized to the proxy. It is implemented in Python.

### Proxy

The proxy receives the text to be synthesized from the client and redirects it to the available servers. It is implemented in Python.

### Server

The server receives the text to be synthesized from the proxy, sends it to be normalized, receives the normalized text, synthesizes it and sends the audio back to the client. It is implemented in Python.

### Normalizer

The normalizer receives the text to be synthesized from the server, normalizes it and sends it back to the server. It is implemented in Python.

### API & Frontend

The API is responsible for receiving the text to be synthesized and returning the audio. The frontend is a simple web interface that allows the user to interact with the API.

## Architecture

The system was implemented using a microservices architecture. Each component is a separate service that communicates with the others using gRPC. Each component is implemented in Python and is dockerized.

![Architecture](./docs/src/images/architecture.png)

The black arrows represent the flow of the text to be synthesized. The blue arrows represent the flow of the audio.

## Requirements

- Python 3.12
- Docker
- Docker Compose

## Installation

### Standalone Program

This allows the user to synthesize text using the Intlex Module. This version is a standalone (single-service) version of the implementation.

#### Steps

1. Install the requirements: `pip install -r enviroments/server_requirements.txt`
2. Run the program: `python intlex.py [TEXT] [CONFIG] --output [OUTPUT] --lang [LANG] --kwargs [KWARGS]`
3. The output will be saved in the output file if provided, otherwise it will be stored in the default output file.

##### Arguments

- `TEXT`: Text to be synthesized
- `CONFIG`: Configuration file
- `OUTPUT`: Output file (optional)
- `LANG`: Language [pt, en] (optional)
- `KWARGS`: Additional arguments (optional)
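
For example (the text, configuration file, and output file names are illustrative): `python intlex.py "Olá, bom dia" config.json --output audio.wav --lang pt`.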

### Docker

This allows the user to synthesize text using the Intlex Program. This version is a dockerized version of the implementation. It uses a microservices architecture.

#### Steps

1. Build the docker images:

   `docker compose build`

2. Initialize proxy and required services:

   `docker compose up proxy -d`

3. Initialize the client:

   `docker compose run -e PROXY_SERVER_PORT={PROXY_SERVER_PORT} -e PROXY_SERVER_ADDRESS={PROXY_SERVER_ADDRESS} client`

4. The output will be displayed in the terminal.
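
For example, assuming the proxy service is reachable inside the compose network at address `proxy` on port `50051` (both values are illustrative), the client step above could be run as: `docker compose run -e PROXY_SERVER_PORT=50051 -e PROXY_SERVER_ADDRESS=proxy client`.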

- To stop the services:

  `docker compose down`

## Improvements

##### General

- [TODO] Tests
- [TODO] Documentation
- [TODO] More languages

##### Client

- [TODO] Voice option in client
- [TODO] Better user interface

##### Proxy

- [FIX] Prints in proxy
- [TODO] Logfile

##### Server

- [TODO] Logfile

##### Normalizer

- [TODO] Logfile

##### API & Frontend

- [FIX] Not connecting using Docker
Binary file removed docs/README.pdf
16 changes: 11 additions & 5 deletions docs/src/app_flow.md
# Flow

The system, as designed, is composed of several components, each responsible for a specific task.

## App Flow

Based on the [architecture diagram](architecture.md#architecture-image), the flow of the system is as follows:

#### Frontend API flow:

- The user interacts with the system through the [Frontend](components/app.md), which sends requests to the `API`.

- The [API](components/app_api.md) processes the requests and sends them to the `Proxy`.

- The [API](components/app_api.md) sends the results back to the `Frontend`, which displays the results to the user.

#### Client flow:

- The user interacts with the system through the [Client](components/app_client.md), which sends requests directly to the `Proxy`.

- The [Proxy](components/app_proxy.md) routes the requests to an available `Server`.

## Data Flow

The flow of data between these components is crucial for the system to function correctly. The following diagram illustrates the flow of data between the components of the system:

![Data Flow](images/data_flow.png)

#### 2. Normalizer:

The input text is sent to the Normalizer, which standardizes it for further processing. For example:

- Expanding abbreviations.
- Converting numbers into words.

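A toy Python illustration of this kind of normalization (the rules and vocabulary below are invented for the example and are far smaller than the real component's):

```python
import re

# Toy normalizer covering only the two examples above; the real component's
# rules and vocabulary are much more extensive.
ABBREVIATIONS = {"Dr.": "Doutor", "Sr.": "Senhor"}
NUMBERS = {"9": "nove", "10": "dez"}

def normalize(text: str) -> str:
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Spell out standalone digits; a real normalizer also handles arbitrary
    # numbers, dates, times, and so on.
    return re.sub(r"\b\d+\b", lambda m: NUMBERS.get(m.group(0), m.group(0)), text)

print(normalize("O Dr. Silva chega às 10"))  # -> "O Doutor Silva chega às dez"
```
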
#### 3. TTS Model:

The normalized text is then processed by the TTS Model, which converts the text into audio data. This includes:

- Generating phonetic representations.
- Applying prosody to ensure naturalness.

The audio data is finalized and saved as an output file (e.g., .wav or .mp3).
**Streaming**: The system generates complete audio files and sends them directly to the user, but streaming capabilities could be added in future iterations.

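If streaming were added, one simple approach would be to send the synthesized audio in fixed-size chunks rather than as a single file; the sketch below only illustrates that idea and is not an existing feature.

```python
from typing import Iterator

CHUNK_SIZE = 4096  # bytes per chunk; illustrative value

def stream_audio(audio: bytes) -> Iterator[bytes]:
    """Yield already-synthesized audio in fixed-size chunks instead of one file."""
    for offset in range(0, len(audio), CHUNK_SIZE):
        yield audio[offset:offset + CHUNK_SIZE]
```
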
## Communication

### Main Components Communication

To handle the communication between the main components, the system uses `gRPC` as the communication protocol. This allows for fast and efficient communication between the components, ensuring that the system can handle the real-time requirements of the audio synthesis process.

The use of `gRPC` also allows for a technology-agnostic approach to the system, as it can be used with a wide variety of programming languages and platforms.

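As a rough sketch of what a call between two components might look like in Python, assuming hypothetical generated modules `tts_pb2`/`tts_pb2_grpc` and a `Synthesize` RPC (none of these names are taken from the project's actual `.proto` files):

```python
import grpc

# Hypothetical generated modules; the real proto definitions, service names,
# and message fields used in this project may differ.
import tts_pb2
import tts_pb2_grpc

def request_synthesis(text: str, address: str = "proxy:50051") -> bytes:
    # Open a channel to the proxy and invoke the (hypothetical) Synthesize RPC.
    with grpc.insecure_channel(address) as channel:
        stub = tts_pb2_grpc.TTSStub(channel)
        reply = stub.Synthesize(tts_pb2.SynthesisRequest(text=text, language="pt"))
        return reply.audio
```
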
### Frontend API Communication

To handle the communication between the **Frontend** and the **API**, the system uses `HTTP` as the communication protocol. This allows for easy integration with web-based applications and ensures that the system can be easily accessed by a wide variety of devices.

16 changes: 9 additions & 7 deletions docs/src/architecture.md
The system was designed to be simple, modular, and scalable.

![Architecture](images/architecture.png)

In the diagram above, it is possible to see the main components of the system:

- **Black Arrows**: Represent the flow of "text" between the components.
- **Blue Arrows**: Represent the flow of audio between the components.
- **Components**:
  - [**Frontend**](components/app.md): The user interface, responsible for sending requests to the API and displaying the results.
  - [**API**](components/app_api.md): One of the possible interfaces of the system, responsible for processing requests from the frontend and sending them to the server.
  - [**Client**](components/app_client.md): The client is a Terminal-based interface that allows users to interact with the system without being dependent on the API.
  - [**Proxy**](components/app_proxy.md): The proxy is responsible for routing requests to the server and returning the results to the client and/or API.
  - [**Server**](components/app_server.md): The server is responsible for processing requests from the proxy and generating the audio output.
  - [**Normalizer**](components/app_normalizer.md): The normalizer is responsible for processing the input text and preparing it for synthesis.

## Docker

Each component is encapsulated in a Docker container, allowing for easy deployment and scaling. The provided docker-compose file allows for easy deployment of the system on a single machine.

3 changes: 2 additions & 1 deletion docs/src/closing.md
# Closing Notes

## To whom it may concern

This project is open-source and welcomes contributions from the community.
This project was developed by a team of students from the University of Minho, as part of the curricular unit: 14602 - Informatics Project for the course [Master's in Informatics Engineering](https://www.uminho.pt/EN/education/educational-offer/Cursos-Conferentes-a-Grau/_layouts/15/UMinho.PortalUM.UI/Pages/CatalogoCursoDetail.aspx?itemId=5067&catId=15).

The project was developed under the supervision of [Professor João Miguel Fernandes](https://www.di.uminho.pt/~jmf/) and [Professor Victor Manuel Rodrigues Alves](https://www.di.uminho.pt/~vma/), with a proposal from [Agentifai](https://agentifai.com/) to create a high-quality, low-latency, and modular TTS system for their virtual assistant.

We would also like to thank our Agentifai mentor João Cunha for his guidance and support throughout the project.
13 changes: 7 additions & 6 deletions docs/src/components/README.md
In this section, we will discuss the system components, how they function, and how they interact with each other.

## Components

- [Intlex Module](./app_standalone.md)
- [Frontend](./app.md)
- [API](./app_api.md)
- [Client](./app_client.md)
The components communicate with each other in the following ways:

- With the use of gRPC:
  - `API` <---> `Proxy`
  - `Client` <---> `Proxy`
  - `Proxy` <---> `Server`
  - `Server` <---> `Normalizer`
- With the use of HTTP Requests and Responses:
  - `Frontend` <---> `API`
45 changes: 22 additions & 23 deletions docs/src/components/app.md
The frontend is the user interface of the application.

The frontend requires the following:

- ENV: The environment variables must be defined in the `.env` file or as environment variables. The following variables must be defined:
  - `PORT`: The port on which the frontend will run.
  - `REACT_APP_API_IP_PORT`: The port of the API server.
  - `REACT_APP_API_IP_ADDRESS`: The address of the API server.

- Dependencies: The frontend requires the following dependencies:
  - [Node.js](https://nodejs.org/en/)
  - [npm](https://www.npmjs.com/get-npm)
  - [axios](https://www.npmjs.com/package/axios)
  - [react-audio-player](https://www.npmjs.com/package/react-audio-player)

### Usage

To run the frontend, you can do the following:

1. Define the API address and port in the `.env` file or as environment variables. The port on which the frontend is served can also be defined. For example:

   ```bash
   export PORT=3000
   export REACT_APP_API_IP_PORT=5000
   export REACT_APP_API_IP_ADDRESS={api_address}
   ```

2. Install the required dependencies by running:

   ```bash
   cd app
   npm install
   ```

3. Start the frontend by running:

   ```bash
   cd app
   npm start
   ```

   The frontend will be available at `http://localhost:3000`.

### Communication

The frontend sends a POST request to the API with the text to be synthesized.

```json
{
"text":"O exame de Época Especial realiza-se no dia 10 de Julho, às 9:00, na sala Ed.2 1.03.",
"language":"pt"
"text": "O exame de Época Especial realiza-se no dia 10 de Julho, às 9:00, na sala Ed.2 1.03.",
"language": "pt"
}
```
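
For reference, the same request could be issued directly, e.g. from Python; the port, endpoint path, and response handling below are assumptions for illustration, not documented behaviour of the API:

```python
import requests

# Assumed values: the API port, the "/synthesize" path, and a raw-audio
# response body are illustrative, not confirmed by this documentation.
response = requests.post(
    "http://localhost:5000/synthesize",
    json={
        "text": "O exame de Época Especial realiza-se no dia 10 de Julho, às 9:00, na sala Ed.2 1.03.",
        "language": "pt",
    },
)
response.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(response.content)
```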

The frontend is composed of only one page, which contains the text input and the audio player.
![Frontend Waiting](../images/frontend/frontend_sent.png)

![Frontend With Audio Player](../images/frontend/frontend_audio.png)

