The system generates singing voice from a given text and MIDI in an end-to-end manner.
Overview of the proposed system
- A Windows/Linux system with a minimum of 16GB RAM.
- A GPU with at least 12GB of VRAM.
- Python >= 3.8
- Anaconda installed.
- PyTorch installed.
- CUDA 11.8 installed.
PyTorch install command:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
CUDA 11.8 install:
https://developer.nvidia.com/cuda-11-8-0-download-archive
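After installing the prerequisites, a quick sanity check can save a failed run later. The following is a minimal sketch, not part of the repository: `check_environment` is a hypothetical helper name, and it only reports the Python version, whether PyTorch can be found, and whether PyTorch sees a CUDA device.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 8)):
    """Return a dict summarizing whether the basic prerequisites look satisfied."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": None,
    }
    if report["torch_installed"]:
        # Imported lazily so the check still runs when PyTorch is missing.
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False`, the installed PyTorch build and the installed CUDA version most likely do not match.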
- Create an Anaconda environment:
conda create -n begansing python=3.9
- Activate the environment:
conda activate begansing
- Clone this repository to your local machine:
git clone https://github.com/ORI-Muchim/BEGANSing.git
- Navigate to the cloned directory:
cd BEGANSing
- Install the necessary dependencies:
pip install -r requirements.txt
Inside the cloned folder, there is a folder called ./test_datasets. Put the MIDI file and the text file in it, following the expected format. The MIDI and text must always match in number. As an example, GFRIEND's "Rough" MIDI and text are provided. To change the voice of the generated vocals, create a folder named after the speaker inside the ./datasets folder and put the voice data for Retrieval-based Voice Conversion (RVC) in it. The ./datasets layout looks like this:
BEGANSing
├────datasets
│ ├───kss
│ │ ├────1_0000.wav
│ │ ├────1_0001.wav
│ │ └────...
│ └───{speaker_name}
│   ├───1.wav
│   └───2.wav
This is just an example, and it's okay to add more speakers.
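The folder layout described above can be checked before running anything. This is a rough sketch under stated assumptions, not code from the repository: `check_layout` is a hypothetical helper, the `.mid`/`.midi` and `.txt` extensions are assumed, and "the same number" is read here as equal file counts in ./test_datasets.

```python
from pathlib import Path


def check_layout(repo_root, speaker_name, midi_exts=(".mid", ".midi"), text_ext=".txt"):
    """Collect layout problems; an empty list means nothing obvious is wrong."""
    root = Path(repo_root)
    problems = []

    # ./test_datasets should hold the MIDI and text inputs (extensions are assumed).
    test_dir = root / "test_datasets"
    if not test_dir.is_dir():
        problems.append(f"missing folder: {test_dir}")
    else:
        files = [p for p in test_dir.rglob("*") if p.is_file()]
        midi_count = sum(p.suffix.lower() in midi_exts for p in files)
        text_count = sum(p.suffix.lower() == text_ext for p in files)
        if midi_count == 0:
            problems.append(f"no MIDI files found in {test_dir}")
        elif midi_count != text_count:
            problems.append(f"{midi_count} MIDI file(s) but {text_count} text file(s)")

    # ./datasets/{speaker_name} should hold the .wav files used for voice conversion.
    speaker_dir = root / "datasets" / speaker_name
    if not speaker_dir.is_dir():
        problems.append(f"missing folder: {speaker_dir}")
    elif not any(p.suffix.lower() == ".wav" for p in speaker_dir.iterdir()):
        problems.append(f"no .wav files in {speaker_dir}")

    return problems


if __name__ == "__main__":
    for problem in check_layout(".", "kss"):
        print(problem)
```

Run it from inside the cloned BEGANSing folder; no output means the basic layout looks plausible.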
The pre-trained model provided here was trained for an additional 100 epochs. For preprocessing and training, see the Preprocessing and Training sections in the original repository.
python main.py {speaker_name} {song} {pitch_shift} --audiosr
If the speaker is male, it is recommended to set the {pitch_shift} value to -12; if the speaker is female, set it to 0.
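To make the recommended values concrete, here is a small worked example. It assumes {pitch_shift} is measured in semitones, which is the usual convention in RVC-style tools but is not stated in this README; `semitones_to_ratio` is a hypothetical helper, not a function from the repository.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


if __name__ == "__main__":
    # -12 semitones is one octave down (half the frequency); 0 leaves pitch unchanged.
    for shift in (-12, 0, 12):
        print(shift, semitones_to_ratio(shift))
```

Under that assumption, -12 lowers the generated vocal by exactly one octave, which brings it into a typical male range.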
The --audiosr option up-samples the generated voice from 22,050 Hz to 48,000 Hz. Use it if you have a powerful graphics card or do not mind a long generation time; otherwise, remove it.
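When generating several songs or speakers in a row, it can help to build the command programmatically. The sketch below only assembles the arguments documented above; `build_command` and `run_inference` are hypothetical helper names, and "rough" is a placeholder song name that may not match the actual file name in ./test_datasets.

```python
import subprocess
import sys


def build_command(speaker_name, song, pitch_shift, audiosr=False):
    """Assemble the argument list for main.py as described in this README."""
    cmd = [sys.executable, "main.py", speaker_name, song, str(pitch_shift)]
    if audiosr:
        # Optional up-sampling step; omit it for faster generation.
        cmd.append("--audiosr")
    return cmd


def run_inference(repo_dir, speaker_name, song, pitch_shift, audiosr=False):
    """Run main.py from inside the cloned repository folder."""
    cmd = build_command(speaker_name, song, pitch_shift, audiosr)
    return subprocess.run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Prints the command only; call run_inference(...) to actually execute it.
    print(" ".join(build_command("kss", "rough", 0, audiosr=True)))
```

Passing the arguments as a list avoids shell quoting issues with speaker or song names that contain spaces.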
Audio samples are available at https://soonbeomchoi.github.io/saebyulgan-blog/. The model was trained on an RTX 3090 (24GB) with a batch size of 32 for 2 days.
- Changed the vocoder from Griffin-Lim to HiFi-GAN
- g2p/korean_g2p.py from https://github.com/scarletcho/KoG2P
- utils/midi_utils.py from Madmom, https://madmom.readthedocs.io/en/latest/