[Ready to merge] Pruned transducer stateless5 recipe for tal_csasr (mix Chinese chars and English BPE) #428
Conversation
The lengths of the samples' texts differ a lot.
Since it’s a larger dataset, you can try increasing
This function can help you tune the settings to minimize the overall padding: https://github.com/lhotse-speech/lhotse/blob/94e9ed9c67bcb4b4329e907ae335947dbbce99e9/lhotse/dataset/sampling/utils.py#L89
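A rough sketch of how that helper could be called, assuming the `find_pessimistic_batches` import path and its `(batches, criterion values)` return layout; the `train_dl` variable is illustrative, not a name from the recipe:

```python
# Probe the sampler for its most "pessimistic" batches so that
# max_duration / num_buckets can be tuned before a full training run.
from lhotse.dataset import find_pessimistic_batches

# `train_dl` is assumed to be the training DataLoader built by the recipe's
# asr_datamodule; its sampler is a (Dynamic)BucketingSampler.
batches, crit_values = find_pessimistic_batches(train_dl.sampler)
for criterion, cuts in batches.items():
    durations = [c.duration for c in cuts]
    print(
        f"{criterion}: {len(cuts)} cuts, "
        f"min={min(durations):.1f}s, max={max(durations):.1f}s, "
        f"total={sum(durations):.1f}s"
    )
```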
OK, thanks. I will try your suggestions.
Oh, after I increase some related parameters (
num_buckets=800 might be excessive; what about something like 50? In the batches where the texts differ a lot in length, do you also observe a large difference in audio durations?
Or maybe I can try using BucketingSampler (instead of DynamicBucketingSampler)?
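For illustration, a sketch of the two sampler configurations being discussed; the parameter values are examples, not the recipe's defaults:

```python
from lhotse.dataset import BucketingSampler, DynamicBucketingSampler

# Dynamic bucketing: buckets are estimated on the fly from a buffer of cuts;
# a moderate num_buckets (e.g. 30-50) is usually enough.
dynamic_sampler = DynamicBucketingSampler(
    train_cuts,
    max_duration=90.0,  # total seconds of audio per batch (before padding)
    num_buckets=50,
    shuffle=True,
    drop_last=True,
)

# Classic bucketing: buckets are precomputed from the whole manifest, which
# can group durations more tightly at the cost of startup time and memory.
static_sampler = BucketingSampler(
    train_cuts,
    max_duration=90.0,
    num_buckets=50,
    bucket_method="equal_duration",
    shuffle=True,
    drop_last=True,
)
```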
99% 18.0
99.5% 18.8
99.9% 20.8
max 36.5
These are the duration statistics for the train cuts:
train_set:
Duration statistics (seconds):
mean 5.8
std 4.1
min 0.3
25% 2.8
50% 4.4
75% 7.3
99% 18.0
99.5% 18.8
99.9% 20.8
max 36.5
What about within-batch statistics? I am trying to understand if:
- a) the bucketing sampler is doing a poor job by putting cuts with very different durations together (so we can tune the sampler settings to do better), or
- b) the mini-batch cut durations are already close to each other (in which case there is nothing we can do).
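A quick sketch for telling (a) from (b), assuming `sampler` is the bucketing sampler used for training and that iterating it yields CutSet batches:

```python
# Inspect the duration spread within each mini-batch the sampler produces.
for i, cuts in enumerate(sampler):
    durs = [c.duration for c in cuts]
    alloc = max(durs) * len(durs)  # seconds the padded batch tensor covers
    used = sum(durs)               # seconds that actually contain audio
    print(
        f"batch {i}: n={len(durs)}, min={min(durs):.1f}s, max={max(durs):.1f}s, "
        f"padding ratio={(alloc - used) / alloc:.1%}"
    )
    if i == 20:  # a handful of batches is enough for a quick look
        break
```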
# You should use ../local/display_manifest_statistics.py to get
# an utterance duration distribution for your dataset to select
# the threshold
return 1.0 <= c.duration <= 18.0
From #428 (comment)
You will drop 1% of the training data, which may be too much.
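A small sketch for quantifying that, assuming `train_cuts` is the CutSet loaded from the training manifest and using the 1.0-18.0 s threshold from the snippet above:

```python
# Measure how much training data a given duration threshold drops.
kept = train_cuts.filter(lambda c: 1.0 <= c.duration <= 18.0)
total_hours = sum(c.duration for c in train_cuts) / 3600
kept_hours = sum(c.duration for c in kept) / 3600
print(
    f"kept {kept_hours:.1f} h of {total_hours:.1f} h "
    f"({100 * (1 - kept_hours / total_hours):.2f}% dropped)"
)
```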
When I fix the above error, I will consider increasing the max duration.
I tried to use the code
It looks like these batches don't have any padding at all. I think you don't have any option other than decreasing the max duration.
OK, I will decrease the max duration to 90 and train. Let's wait and see.
How many tokens are in your vocabulary? Maybe the vocab size is too large?
The number of modeling tokens (including Chinese chars and English BPEs) is
@luomingshuang Does it still raise an OOM error after you sort by duration?
At present, it runs normally for 5 epochs with max-duration=90 and sort_by_duration applied when computing the fbank features. But if I only sort the cuts and do not decrease the max-duration, it still raises an OOM error, according to my attempts yesterday.
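A sketch of what that sorting step could look like in the compute-fbank script; the file paths and feature settings here are illustrative, not necessarily the recipe's:

```python
from lhotse import CutSet, Fbank, FbankConfig, LilcomChunkyWriter

# Sort the training cuts by duration before extracting fbank features,
# as mentioned above.
cut_set = CutSet.from_file("data/manifests/tal_csasr_cuts_train.jsonl.gz")  # hypothetical path
cut_set = cut_set.sort_by_duration(ascending=False)
cut_set = cut_set.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),
    storage_path="data/fbank/tal_csasr_feats_train",  # hypothetical path
    num_jobs=15,
    storage_type=LilcomChunkyWriter,
)
```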
The results (WER (%)) for pruned_transducer_stateless5 trained with TAL_CSASR:
Do you have any baseline to compare with?
Em... it seems that I can't find any baseline for this dataset.
OK, you are creating the baseline for others.
Do you have the results for Chinese CER and English WER respectively?
Maybe I can evaluate the Chinese and English decoding results separately.
The results (CER (%) and WER (%)) for pruned_transducer_stateless5 trained with TAL_CSASR (zh: Chinese, en: English), including Chinese CER and English WER respectively:
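For reference, an illustrative helper (not part of the recipe) that splits a code-switched transcript into Chinese characters and English words so that CER and WER can be scored separately:

```python
import re

def split_zh_en(text: str):
    """Return (Chinese char tokens, English word tokens) from a mixed transcript."""
    zh = re.findall(r"[\u4e00-\u9fff]", text)      # one token per Chinese character
    en = re.findall(r"[A-Za-z][A-Za-z']*", text)   # one token per English word
    return zh, en

ref_zh, ref_en = split_zh_en("我们 今天 learn English 好 吗")
hyp_zh, hyp_en = split_zh_en("我们 今天 learn England 好 吗")
# The zh/en token lists can then go through the usual edit-distance scoring
# (e.g. kaldialign.align) to obtain CER on zh and WER on en separately.
```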
In this PR, I build a pruned_transducer_stateless5 recipe for the TAL_CSASR dataset. I mix Chinese chars and English BPE tokens (for the English text parts) as the modeling units.
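A rough sketch of that token scheme (illustrative only; the BPE model path and the exact splitting logic in the recipe may differ):

```python
import re
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("data/lang_char_bpe/bpe.model")  # hypothetical path to the English BPE model

def tokenize_mixed(text: str):
    """Split a code-switched sentence into Chinese chars plus English BPE pieces."""
    tokens = []
    # Alternate between runs of Chinese characters and runs of everything else.
    for piece in re.findall(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+", text):
        if re.match(r"[\u4e00-\u9fff]", piece):
            tokens.extend(list(piece))  # one token per Chinese character
        elif piece.strip():
            tokens.extend(sp.encode(piece.strip(), out_type=str))  # English BPE pieces
    return tokens

print(tokenize_mixed("我们今天 learn English 好吗"))
```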