
Begin to use multiple datasets in training #213

Merged 10 commits into k2-fsa:master on Feb 21, 2022

Conversation

@csukuangfj (Collaborator) commented Feb 15, 2022

See details at lhotse-speech/lhotse#554 (comment)

TODOs

  • Dataset preparation. Will use on-the-fly feature extraction (see the sketch after this list).
  • Build separate decoder+joiner for LibriSpeech and GigaSpeech.
  • Train on LibriSpeech 100 hours.
  • Decoding.
  • Train on LibriSpeech 960 hours if using GigaSpeech in training turns out to be helpful.
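
For the on-the-fly feature extraction item, here is a minimal sketch of what that can look like with lhotse; the manifest path, feature configuration, and sampler choice are assumptions for illustration, not the recipe's actual code.

```python
from torch.utils.data import DataLoader

from lhotse import CutSet, Fbank, FbankConfig
from lhotse.dataset import K2SpeechRecognitionDataset, OnTheFlyFeatures, SingleCutSampler

# Illustrative manifest path; the real recipe uses its own data layout.
cuts = CutSet.from_file("data/manifests/librispeech_cuts_train-clean-100.jsonl.gz")

# No precomputed features: fbank is computed from the audio on the fly
# inside the dataset's collation step.
dataset = K2SpeechRecognitionDataset(
    input_strategy=OnTheFlyFeatures(Fbank(FbankConfig(num_mel_bins=80))),
)
sampler = SingleCutSampler(cuts, max_duration=300, shuffle=True)

# lhotse samplers yield whole CutSet mini-batches, so batch_size must be None.
dataloader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)
```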

"with training dataset. ",
)

group.add_argument(
Collaborator:

I wonder whether we should standardize the name? It was asr_dataloader.py in another recipe.

@csukuangfj (Author):

Yes, reverted to the previous name.

@csukuangfj (Author):

Note: I am not going to use the changes in lhotse-speech/lhotse#565, which added support for multiplexing among CutSets, because with that method a single batch can contain utterances from different datasets.

@csukuangfj (Author) commented Feb 16, 2022

Here is the tensorboard log
https://tensorboard.dev/experiment/HRlmSpNCRhKd5NgpqerkNg/#scalars&_smoothingWeight=0

for the following training command

export CUDA_VISIBLE_DEVICES="2,3"


./transducer_stateless_multi_datasets/train.py \
  --world-size 2 \
  --num-epochs 40 \
  --start-epoch 0 \
  --exp-dir transducer_stateless_multi_datasets/exp-100-2 \
  --full-libri 0 \
  --max-duration 300 \
  --lr-factor 1 \
  --bpe-model data/lang_bpe_500/bpe.model \
  --modified-transducer-prob 0.25

[Screenshots: tensorboard training loss curves]

It uses the S subset of GigaSpeech, which has 250 hours of data.
80% of the time it selects a batch from LibriSpeech and 20% of the time a batch from GigaSpeech.

You can see that the model starts to converge.

The transducer loss for GigaSpeech is higher than that for LibriSpeech. One possible reason is that training sees less data from GigaSpeech.
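
As a rough sketch, the 80%/20% whole-batch selection above can be implemented along these lines (the names and loop structure here are illustrative, not the actual code in this PR):

```python
import random

# libri_dl and giga_dl are two independent PyTorch DataLoaders, one per corpus.
# Every training step draws a whole batch from exactly one of them, so a batch
# never mixes LibriSpeech and GigaSpeech utterances.
def batch_iterator(libri_dl, giga_dl, giga_prob=0.2, seed=0):
    rng = random.Random(seed)
    libri_iter, giga_iter = iter(libri_dl), iter(giga_dl)
    while True:
        use_giga = rng.random() < giga_prob
        try:
            batch = next(giga_iter) if use_giga else next(libri_iter)
        except StopIteration:
            break  # stop the epoch once the chosen loader is exhausted
        # The flag tells the model which decoder/joiner pair to run for this batch.
        yield batch, use_giga
```

Keeping each batch single-corpus is what makes it possible to dispatch the batch to the matching decoder/joiner, as described next.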


The following shows the model architecture.

The encoder is shared between LibriSpeech and GigaSpeech, but they have separate decoder/joiner networks.
During training, a batch can come from either LibriSpeech or GigaSpeech. When it comes from LibriSpeech, only the LibriSpeech decoder/joiner are run; the GigaSpeech decoder/joiner are left untouched for that step (and vice versa).

[Diagram: shared encoder with separate decoder/joiner per dataset]
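
A rough PyTorch sketch of this dispatch, assuming hypothetical encoder/decoder/joiner modules; the class and argument names are illustrative, not the actual icefall ones:

```python
import torch.nn as nn

class MultiDatasetTransducer(nn.Module):
    """One shared encoder; a separate (decoder, joiner) pair per dataset."""

    def __init__(self, encoder, libri_decoder, libri_joiner, giga_decoder, giga_joiner):
        super().__init__()
        self.encoder = encoder
        self.decoders = nn.ModuleDict({"libri": libri_decoder, "giga": giga_decoder})
        self.joiners = nn.ModuleDict({"libri": libri_joiner, "giga": giga_joiner})

    def forward(self, features, feature_lens, tokens, dataset: str):
        # The encoder is shared no matter which corpus the batch comes from.
        encoder_out, encoder_out_lens = self.encoder(features, feature_lens)
        # Only the decoder/joiner belonging to this batch's corpus are run;
        # the other pair is left untouched for this step.
        decoder_out = self.decoders[dataset](tokens)
        logits = self.joiners[dataset](encoder_out, decoder_out)
        return logits, encoder_out_lens
```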

@csukuangfj (Author):

Here are the results for this PR so far:

| Decoding method | test-clean | test-other | Comment |
|---|---|---|---|
| this PR - greedy search (--max-sym-per-frame=1) | 7.19 | 18.89 | --epoch 20 --avg 7 |
| this PR - greedy search (--max-sym-per-frame=1) | 6.79 | 17.81 | --epoch 30 --avg 10 |
| baseline - greedy search (--max-sym-per-frame=1) | 7.65 | 20.69 | --epoch 39 --avg 17 |

You can see that integrating the GigaSpeech dataset into the training pipeline helps to reduce the WER and results in faster convergence.


The training command for this PR is given in #213 (comment), which is repeated below:

./transducer_stateless_multi_datasets/train.py \
  --world-size 2 \
  --num-epochs 40 \
  --start-epoch 0 \
  --exp-dir transducer_stateless_multi_datasets/exp-100-2 \
  --full-libri 0 \
  --max-duration 300 \
  --lr-factor 1 \
  --bpe-model data/lang_bpe_500/bpe.model \
  --modified-transducer-prob 0.25

The training command for the baseline is given below.
(The baseline code is from #200, which is equivalent to the code in master when --apply-frame-shift=0 --ctc-weight=0.0 is used.)

export CUDA_VISIBLE_DEVICES="0,1"

./transducer_stateless/train.py \
  --world-size 2 \
  --num-epochs 40 \
  --start-epoch 0 \
  --exp-dir transducer_stateless/exp-100-no-shift \
  --full-libri 0 \
  --max-duration 300 \
  --lr-factor 1 \
  --bpe-model data/lang_bpe_500/bpe.model \
  --apply-frame-shift 0 \
  --modified-transducer-prob 0.25 \
  --ctc-weight 0.0

@danpovey (Collaborator):

Cool!!

@csukuangfj (Author) commented Feb 21, 2022

Here are the results for using train-clean-100 + S subset of GigaSpeech (250 hours):

| Decoding method | test-clean | test-other | Comment |
|---|---|---|---|
| greedy search (max sym per frame 1) | 6.34 | 16.7 | --epoch 57, --avg 17, --max-duration 100 |
| greedy search (max sym per frame 2) | 6.34 | 16.7 | --epoch 57, --avg 17, --max-duration 100 |
| greedy search (max sym per frame 3) | 6.34 | 16.7 | --epoch 57, --avg 17, --max-duration 100 |
| modified beam search (beam size 4) | 6.31 | 16.3 | --epoch 57, --avg 17, --max-duration 100 |
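
For context, --max-sym-per-frame caps how many non-blank symbols greedy search may emit per encoder frame; the identical WERs for caps of 1, 2, and 3 suggest the model rarely emits more than one symbol per frame. A simplified, hypothetical sketch of the idea (not the icefall implementation; decoder/joiner are assumed callables):

```python
import torch

def greedy_search(decoder, joiner, encoder_out, blank_id=0, context_size=2,
                  max_sym_per_frame=1):
    """encoder_out: (T, C) encoder output of a single utterance."""
    hyp = [blank_id] * context_size  # the stateless decoder only sees the last few tokens
    for t in range(encoder_out.size(0)):
        emitted = 0
        while emitted < max_sym_per_frame:
            context = torch.tensor([hyp[-context_size:]])  # (1, context_size)
            decoder_out = decoder(context)                 # (1, 1, C)
            logits = joiner(encoder_out[t:t + 1].unsqueeze(0), decoder_out)
            y = int(logits.argmax(dim=-1))
            if y == blank_id:
                break  # no more symbols for this frame; advance to the next one
            hyp.append(y)
            emitted += 1
    return hyp[context_size:]
```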

The training with --full-libri + the L subset of GigaSpeech (2.5k hours) is still running, and it may take some time to get the results.

A pre-trained model with train-clean-100 is available at https://huggingface.co/csukuangfj/icefall-asr-librispeech-100h-transducer-stateless-multi-datasets-bpe-500-2022-02-21

The tensorboard log can be found at https://tensorboard.dev/experiment/qUEKzMnrTZmOz1EXPda9RA/#scalars&_smoothingWeight=0


[EDITED]:
The results are competitive compared with the ones listed in

@csukuangfj changed the title from "WIP: Begin to use multiple datasets in training" to "Begin to use multiple datasets in training" on Feb 21, 2022
@danpovey (Collaborator):

Cool!!

@csukuangfj (Author) commented Feb 21, 2022

I will merge it and do some experiments based on it.

The results for the full LibriSpeech will be posted later.

@csukuangfj merged commit 2332ba3 into k2-fsa:master on Feb 21, 2022
@csukuangfj deleted the multiple-datasets branch on Feb 21, 2022 09:42
@pzelasko (Collaborator):

nice!
