Low WER training pipeline in torchaudio with wav2letter #913
Comments
Hey @vincentqb these planned additions look great and useful! Can you clarify on the below points please?
|
Hey, thanks for commenting :)
Thoughts? |
@vincentqb Wouldn't it be interesting to create a torchaudio_ASR repository? It could support several things: an independent preprocessing step that saves features as .pt files, so that any torchaudio feature can be extracted, and easy support for features from fairseq (such as wav2vec). I feel that ASR lacks a repository that is built on torchaudio and modular enough to accept new features, and where it is simple for people to add new models beyond those implemented in torchaudio (currently only wav2letter). I believe the community would like this idea and contribute to the repository. |
My first step is to understand what would be missing in torchaudio to serve the ASR community best :) Can you provide some examples?
Are there models you would like to contribute to torchaudio? :) |
@vincentqb I would support the Jasper model. Right now I'm out of time, so maybe I can send a PR soon :). If that list of things is added to torchaudio, it will be very good for ASR :). It will be even simpler to build a great pipeline using only torchaudio :). My initial suggestion is to keep external notebooks. Why not make a torchaudio_ASR repository? Then some things would not need to be implemented in torchaudio itself, but in this new repository. There we can extract features ahead of time, save them with torch.save, and write a generic dataloader that reads them back, which makes it easier to support new features like wav2vec. To support a new extraction method independent of torchaudio, you would just write a new preprocessing class. |
Great! Feel free to ping me when you do :)
We currently offer training examples such as wav2letter using torchaudio. The example I link shows one way of doing preprocessing. Does that help?
Our goal with torchaudio is to provide flexible building blocks for audio-related fields, such as ASR. As such, we want to make sure we capture what would be useful to the community, and to ASR. Can you provide an example of your suggested workflow? |
I had not seen this very good example :)
Now that I've seen the example: have you thought about supporting Wav2vec? The easiest way I see to do this is to change the example so that feature extraction (MFCC / waveform) is separated from model training. Basically, a preprocess.py would extract the features and save them with torch.save, so the main script only reads the files saved by torch.save. Do you know a simpler way to integrate Wav2vec with torchaudio? |
Adding an example workflow with wav2vec would be a great addition! I see you have already mentioned jasper in comment so let's move the discussion there :)
We currently don't have a pipeline with wav2vec included, but this would be a great addition. torchaudio is made to be modular and uses standard pytorch operations, so using the pre-trained tensors from fairseq can be done using standard pytorch operations. Is that what you meant? |
Do you have an idea of how this support would work? Or do you want to make it independent of the fairseq structure, and just create a class with the Wav2vec architecture and load the checkpoint into it? |
@Edresson -- those are great questions, and thanks for sharing your thoughts :)
We do not want torchaudio to depend on fairseq, no. For the example implementation, we also aim to avoid such dependencies as much as possible. I would aim instead for fairseq to use torchaudio building blocks in some places.
What I meant above about the checkpoint was really just that torchaudio uses standard pytorch, so a user can interact with it through standard pytorch means. For instance, someone could pre-process with torchaudio on a torchaudio dataset, then import a model from somewhere else, and then follow a torchaudio example for the training loop. Is this what you meant? |
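As an illustration of the "standard pytorch means" mentioned above, here is a minimal sketch of copying weights from an externally produced checkpoint into a locally defined module, with no dependency on the library that trained it. Everything named here is an assumption made for illustration: `TinyEncoder` is a hypothetical stand-in architecture, and the `weights_key` and `prefix` arguments describe an assumed checkpoint layout, not a documented fairseq or torchaudio format.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for an architecture re-implemented locally."""

    def __init__(self, in_dim: int = 40, hidden: int = 64):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.out = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(torch.relu(self.proj(x)))


def load_external_weights(model, checkpoint, weights_key="model", prefix=""):
    """Copy tensors from an external checkpoint dict into `model`.

    `checkpoint` is a plain dict, e.g. the result of
    torch.load(path, map_location="cpu"). The `weights_key` and `prefix`
    defaults are assumptions about how the checkpoint is laid out.
    """
    state = checkpoint.get(weights_key, checkpoint)
    # Keep only the entries under `prefix` and strip it from the key names.
    state = {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}
    # strict=False reports mismatched keys instead of raising on them.
    return model.load_state_dict(state, strict=False)


if __name__ == "__main__":
    source = TinyEncoder()
    # Simulate a checkpoint written by another code base.
    ckpt = {"model": {"encoder." + k: v for k, v in source.state_dict().items()}}
    target = TinyEncoder()
    report = load_external_weights(target, ckpt, prefix="encoder.")
    print("missing:", report.missing_keys, "unexpected:", report.unexpected_keys)
```

Checking the returned `missing_keys` and `unexpected_keys` is a cheap way to confirm that the local class really matches the checkpoint before training on top of it.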
I believe so :). We could, for example, preprocess the dataset with torchaudio, extract features from the audio with wav2vec, and save them with torch.save. After that, we go back to torchaudio and use the torchaudio ASR models. I suggest saving with torch.save because extracting features with wav2vec every time can be too slow, so this way the extraction is done only once. Does that make sense? |
great!
yup, does to me :) |
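The extract-once, train-many workflow discussed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the torchaudio example's actual code: `log_power_spectrum` is a hypothetical stand-in extractor (a wav2vec or MFCC extractor would be passed in its place), and `preprocess` and `CachedFeatureDataset` are names invented here.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import Dataset


def log_power_spectrum(waveform: torch.Tensor, n_fft: int = 400, hop: int = 160) -> torch.Tensor:
    """Hypothetical stand-in extractor; returns a (freq, time) tensor."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft, hop_length=hop, window=window, return_complex=True)
    return torch.log(spec.abs().pow(2) + 1e-10)


def preprocess(samples, extract, out_dir: Path) -> None:
    """Run `extract` once per utterance and cache the result with torch.save.

    `samples` yields (utterance_id, waveform, transcript) tuples.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    for utt_id, waveform, transcript in samples:
        with torch.no_grad():
            features = extract(waveform)
        torch.save({"features": features, "transcript": transcript}, out_dir / f"{utt_id}.pt")


class CachedFeatureDataset(Dataset):
    """Generic dataset that only reads files written by `preprocess`."""

    def __init__(self, feature_dir: Path):
        self.paths = sorted(feature_dir.glob("*.pt"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int):
        item = torch.load(self.paths[index])
        return item["features"], item["transcript"]


if __name__ == "__main__":
    # Random waveforms stand in for real audio in this demo.
    fake = [("utt0", torch.randn(16000), "hello world"), ("utt1", torch.randn(8000), "foo")]
    with tempfile.TemporaryDirectory() as tmp:
        preprocess(fake, log_power_spectrum, Path(tmp))
        dataset = CachedFeatureDataset(Path(tmp))
        print(len(dataset), dataset[0][0].shape)
```

Because the dataset only knows about the saved files, swapping in a different extractor means re-running `preprocess` with another callable, while the training script stays unchanged.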
torchaudio is targeting speech recognition as a full audio application (internal). Along this line, we implemented a wav2letter pipeline to obtain a low character error rate (CER). We want to expand on this and showcase a new pipeline which also has a low word error rate (WER). To achieve this, we consider the following additions to torchaudio, listed from higher to lower priority.
Token Decoder: Add a lexicon-constrained beam search algorithm, based on fairseq (search class, sequence generator) since it is torchscriptable.
Acoustic Model: Add a transformer-based acoustic model, e.g. speech-transformer, comparison.
Language Model: Add KenLM to use a 4-gram language model based on LibriSpeech Language Model, as done in paper.
Training Loss: Add the RNN Transducer loss to replace the CTC loss in the pipeline.
Transformations: SpecAugment is already available in the wav2letter pipeline.
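For reference, the two metrics this plan targets, CER and WER, are both a Levenshtein edit distance normalized by reference length; they differ only in whether the units are characters or words. The sketch below is a plain-Python illustration of that definition with function names chosen here; it is not the scoring code used by the wav2letter pipeline.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, tokenize) -> float:
    """Total edit distance divided by total reference length over a corpus."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units, hyp_units = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    if total == 0:
        raise ValueError("references contain no units to score against")
    return errors / total


def wer(references, hypotheses) -> float:
    """Word error rate: units are whitespace-separated words."""
    return error_rate(references, hypotheses, str.split)


def cer(references, hypotheses) -> float:
    """Character error rate: units are characters, spaces included."""
    return error_rate(references, hypotheses, list)


if __name__ == "__main__":
    refs, hyps = ["the cat sat"], ["the cat sit"]
    print(f"WER={wer(refs, hyps):.3f} CER={cer(refs, hyps):.3f}")
```

A single wrong character in the example costs one whole word out of three for WER but only one character out of eleven for CER, which is one reason a pipeline with a low CER can still show a noticeably higher WER without a lexicon or language model constraining the decoder.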
See also internal
cc @astaff @dongreenberg @cpuhrsch