severe metallic sound #3

GuangChen2016 · 2022-05-18T17:52:55Z

Hi, thanks for your nice work. I used your code on my own dataset, and the synthesized voices do not sound normal at 160K steps. Although we can still make out what is being said, the spectrum is abnormal (especially the high-frequency part, as you can see in the following figures), with a severe metallic sound. I have double-checked the feature extraction and training processes, and both look normal. Do you know what might be causing this? BTW, how many steps are required to train the LJSpeech model?

Thanks again.

keonlee9420 · 2022-05-27T01:45:44Z

Hi @GuangChen2016, thanks for your interest. Yes, I ran into the same issue and I'm still figuring out what's going on. In my opinion, it may come from the feature matching loss, since its value becomes extremely low as training goes on, which is not normal for GAN training. Of course, some configurations or model architectures may cause the issue too. I'll update the project if I manage to fix it, but it would be even better if you could contribute from your side! Thanks.
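
For readers unfamiliar with the term, here is a minimal sketch of how a feature matching loss is commonly computed: the mean L1 distance between the discriminator's intermediate features for real and generated audio, summed over layers. It is an illustration only, not this repository's code; plain Python lists stand in for tensors, and every name in it (`l1_mean`, `feature_matching_loss`, `real`, `fake`) is hypothetical.

```python
# Illustrative sketch only: plain Python lists stand in for discriminator
# feature maps. None of these names come from the repository.

def l1_mean(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(real_feats, fake_feats):
    """Sum of per-layer mean L1 distances between discriminator features.

    real_feats / fake_feats: one flattened feature vector per discriminator
    layer, computed on real and generated audio respectively.
    """
    return sum(l1_mean(r, f) for r, f in zip(real_feats, fake_feats))


# Two hypothetical discriminator layers with made-up activations.
real = [[0.5, 1.0, -0.5], [2.0, 0.0]]
fake = [[0.0, 1.0, 0.5], [1.0, 1.0]]
loss = feature_matching_loss(real, fake)
```

By construction, a value that collapses toward zero means the discriminator's intermediate activations barely differ between real and generated inputs, so logging this term per layer may help narrow down where that happens.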

BridgetteSong · 2022-06-09T05:02:00Z

@keonlee9420 In your code, the GAN's input is always the predicted mels, which I think makes training unstable. I think that in the initial training stage we should use the true mels and compute the loss between mel_true and mel_predict, and only after some steps switch to the predicted mels. Maybe this will help.
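
A rough sketch of that schedule, under stated assumptions: the helper names, the `warmup_steps` value, and the L1 form of the mel loss are all hypothetical and not taken from the repository. The idea is only that the vocoder sees ground-truth mels until a step threshold, while the acoustic model is still supervised by a loss between mel_true and mel_predict.

```python
# Hypothetical warm-up schedule: names and the threshold are assumptions.

def pick_vocoder_input(step, mel_true, mel_predict, warmup_steps=10_000):
    """Feed ground-truth mels during warm-up, predicted mels afterwards."""
    if step < warmup_steps:
        return mel_true
    return mel_predict


def mel_l1_loss(mel_true, mel_predict):
    """Mean absolute error between two equal-length flattened mels."""
    assert len(mel_true) == len(mel_predict)
    diffs = [abs(t - p) for t, p in zip(mel_true, mel_predict)]
    return sum(diffs) / len(diffs)


mel_true = [0.0, 1.0, 2.0, 3.0]
mel_predict = [0.5, 1.0, 1.5, 3.0]

early_input = pick_vocoder_input(100, mel_true, mel_predict)
late_input = pick_vocoder_input(20_000, mel_true, mel_predict)
acoustic_loss = mel_l1_loss(mel_true, mel_predict)
```

A hard switch is the simplest form; gradually mixing in predicted mels would be another option, though nothing in this thread says which works better.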

mayfool · 2022-07-06T09:35:00Z

@BridgetteSong I tried training the generator first for 50k steps, but it didn't work. I will try training as you suggested. Hope it solves the problem.

BridgetteSong · 2022-07-06T10:01:29Z

In my recent experiments, I found that what I suggested earlier doesn't solve the problem. During training, I found that the vocoder's mel_loss is very large; I think using the acoustic model's outputs as the vocoder's inputs makes training harder. So I have now added a normalizing flow, the same as in VITS, and I get amazing results.
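
For context, a normalizing flow is a stack of invertible transforms. Below is a toy additive (shift-only) coupling layer, the kind of building block such flows are made of. It is a sketch under assumptions, not the VITS implementation and not the change described above: `shift_net` is a made-up stand-in for a learned network, and plain lists stand in for tensors.

```python
# Toy additive coupling layer. Illustration only: shift_net is a
# hypothetical stand-in for a learned network.

def shift_net(x0):
    """Predict a shift for the second half from the first half."""
    return [0.5 * v + 1.0 for v in x0]


def coupling_forward(x):
    """Keep the first half unchanged and shift the second half."""
    half = len(x) // 2
    x0, x1 = x[:half], x[half:]
    shift = shift_net(x0)
    return x0 + [a + s for a, s in zip(x1, shift)]


def coupling_inverse(y):
    """Undo coupling_forward exactly, reusing the untouched first half."""
    half = len(y) // 2
    y0, y1 = y[:half], y[half:]
    shift = shift_net(y0)
    return y0 + [a - s for a, s in zip(y1, shift)]


x = [1.0, 2.0, 3.0, 4.0]
y = coupling_forward(x)
x_back = coupling_inverse(y)
```

Because the first half passes through unchanged, the inverse can recompute the same shift and subtract it, which is what makes the layer exactly invertible.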

mayfool · 2022-08-26T08:42:59Z

@BridgetteSong Did you only add the normalizing flow, like a postnet? Or did you also add a posterior encoder as in VITS?

skyler14 · 2022-10-19T12:22:52Z

@BridgetteSong Can you post your checkpoint so we can see what the amazing results look like?

keonlee9420 · 2022-10-22T06:28:15Z

Hey guys, thank you all for your great efforts and discussion.

I've been working on that issue and finally made it work! I'm currently building a new open-source, general-purpose TTS project, which is much improved and much easier to use. I will share it soon, including what @BridgetteSong suggested. Please stay tuned!

skyler14 · 2022-10-22T09:27:05Z

@keonlee9420 I was wondering if you have some general advice for RADTTS. It seems you implemented it in your code base, but getting it to work, even with well-trained models, has been a daunting task.

15755841658 · 2022-11-07T03:27:20Z

@keonlee9420 How did you solve this problem? I'm getting the same synthesis result.
