add speed option to inference #6

mrciolino · 2024-01-12T17:36:51Z

Add a speed option to the inference function.

Lines 186 to 195 in 350b888

    def inference(self,
                  text: str,
                  target_voice_path=None,
                  output_wav_file=None,
                  output_sample_rate=24000,
                  alpha=0.3,
                  beta=0.7,
                  diffusion_steps=5,
                  embedding_scale=1,
                  ref_s=None):

I have tested this by dividing the predicted duration by a speed value:

StyleTTS2/src/styletts2/tts.py

Line 267 in 350b888

duration = torch.sigmoid(duration).sum(axis=-1)

Adding speed=1 to the function signature and dividing the predicted duration by speed gives the following .wav durations for various speeds. Speeds between 0.75 and 1.75 sound good; outside that range the output is rough.

duration = torch.sigmoid(duration).sum(axis=-1) / speed

    def inference(self,
                  text: str,
                  target_voice_path=None,
                  output_wav_file=None,
                  output_sample_rate=24000,
                  alpha=0.3,
                  beta=0.7,
                  diffusion_steps=5,
                  embedding_scale=1,
                  speed=1,
                  ref_s=None):
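To make the effect of the division concrete, here is a minimal pure-Python sketch of the same arithmetic. It is not the package's code: `scaled_frame_counts` is a hypothetical helper, the logits are made up, and the step that rounds each duration to a whole number of frames with a minimum of one is an assumption about what happens after the duration line.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def scaled_frame_counts(duration_logits, speed=1.0):
    """Hypothetical helper: per-token frame counts after speed scaling.

    duration_logits holds one list of raw logits per token (made-up shape).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    frames = []
    for token_logits in duration_logits:
        # Same arithmetic as torch.sigmoid(duration).sum(axis=-1) / speed.
        duration = sum(sigmoid(v) for v in token_logits) / speed
        # Assumption: durations are rounded to whole frames, at least one each.
        frames.append(max(1, round(duration)))
    return frames


# Made-up logits for three tokens, only for illustration.
logits = [[2.0, 1.0, 0.5, -1.0], [3.0, 3.0, 2.0, 1.0], [0.0, -2.0, -3.0, -4.0]]
for speed in (0.5, 1.0, 2.0):
    frames = scaled_frame_counts(logits, speed)
    print(f"speed={speed}: frames per token={frames}, total={sum(frames)}")
```

If the rounding assumption holds, short tokens cannot shrink below one frame, which is one possible reason the measured duration drifts from the ideal curve at high speeds.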

In the plot, the orange line is the expected duration: the duration of the original (speed=1) clip divided by the speed parameter.
The blue line is the actual duration of the clip produced with that speed parameter.
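The comparison between the two lines can be sketched as follows. Both helper names are hypothetical, and the 3.0 s baseline and 1.8 s "actual" value are illustrative placeholders, not measurements from this issue.

```python
def expected_duration(baseline_seconds: float, speed: float) -> float:
    """Ideal duration if the clip scaled perfectly with speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return baseline_seconds / speed


def relative_error(actual_seconds: float, expected_seconds: float) -> float:
    """Signed deviation of the actual duration from the expected one."""
    return (actual_seconds - expected_seconds) / expected_seconds


baseline = 3.0  # placeholder duration of the speed=1 clip, in seconds
for speed in (0.5, 1.0, 2.0):
    print(f"speed={speed}: expected {expected_duration(baseline, speed):.2f}s")

# Placeholder actual duration at speed=2.0, to show how the gap is quantified.
print(f"relative error: {relative_error(1.8, expected_duration(baseline, 2.0)):+.0%}")
```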

Had to convert to mp4 to play on here:

test_0.50.1.mp4

test_0.67.mp4

test_0.83.mp4

test_1.00.mp4

test_1.17.mp4

test_1.33.mp4

test_1.50.mp4

test_1.67.mp4

test_1.83.mp4

test_2.00.mp4

And here is the code I ran to test that after adding in those changes:

import matplotlib.pyplot as plt
from styletts2 import tts
import numpy as np
import librosa

# No paths provided means default checkpoints/configs will be downloaded/cached.
my_tts = tts.StyleTTS2()

# Synthesize the same sentence at ten speeds, writing one WAV file per speed.
speed_range = np.linspace(0.5, 2, 10)
for speed in speed_range:
    my_tts.inference(
        "Hello there, I am now a python package.",
        output_wav_file=f"test_{speed:.2f}.wav",
        speed=speed,
    )

# Measure the duration of each generated file.
durations = {}
for speed in speed_range:
    duration = librosa.get_duration(path=f"test_{speed:.2f}.wav")
    print(f"test_{speed:.2f}.wav: {duration:.2f}s")
    durations[speed] = duration

# Expected curve: the speed=1 duration divided by each speed.
# Use the sampled speed closest to 1 instead of relying on an exact float key.
baseline = durations[min(durations, key=lambda s: abs(s - 1))]
expected_durations = [baseline / speed for speed in speed_range]
plt.plot(speed_range, list(durations.values()), label="Actual")
plt.plot(speed_range, expected_durations, label="Expected")
plt.xlabel("Speed")
plt.ylabel("Duration (s)")
plt.legend()  # without this call the line labels are never shown
plt.show()


RahulBhalley · 2024-09-08T12:05:06Z

This looks interesting! I might use it. Thanks.

mrciolino mentioned this issue Jan 12, 2024

Adding Speed Option #8

Closed

quernd added a commit to dialohq/StyleTTS2-pkg that referenced this issue Jan 23, 2024

Add speed parameter for faster/slower speech

454709b

See sidharthrajaram#6

Nik-Kras mentioned this issue Oct 7, 2024

Add Speed & Enable HuggingFace model loading #27

Open
