Add new speech command recognition tutorial #1204
Conversation
Deploy preview for pytorch-tutorials-preview ready! Built with commit d07e292: https://deploy-preview-1204--pytorch-tutorials-preview.netlify.app
Accuracy experiments (the resampling setup is sketched below):

- 8 kHz with M5: accuracy 58%, 77%
- 4 kHz with M5: accuracy 60%, 70%
- 3 kHz with M5: accuracy 59%, 64%
- 8 kHz, n_channel=64: accuracy 72%, 67%
- 8 kHz, stride=8: accuracy 69%, 79%
- 8 kHz, stride=16: accuracy 68%, 73%
- 8 kHz, n_channel=32: accuracy 63%, 71%
- 8 kHz, n_channel=32, stride=16: accuracy 62%, 71%
- 8 kHz, n_channel=16, stride=16: accuracy 54%, 64%
- 8 kHz, simple linear, MFCC: accuracy 13%, 15% :)
- Follow-up to this one with more epochs :) 8 kHz, stride=16, epoch=21: accuracy 86%
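For reference, a minimal sketch of the downsampling behind these runs, assuming the 16 kHz rate of SpeechCommands; the stand-in waveform is an assumption for illustration:

```python
import torch
import torchaudio

# SpeechCommands clips are recorded at 16 kHz; the runs above downsample
# them before feeding them to the network. Stand-in waveform here.
waveform = torch.randn(1, 16000)  # one second of fake audio

transform = torchaudio.transforms.Resample(orig_freq=16000, new_freq=8000)
downsampled = transform(waveform)
print(downsampled.shape)  # torch.Size([1, 8000])
```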
Timing experiments (the DataLoader settings are sketched below):

- Loop only with batch resample on GPU, CPU, num_workers=1, pin_memory=True: 1:22
- Loop only, CPU, num_workers=1, pin_memory=True: 5:50
- Linear, CPU, num_workers=0, pin_memory=False: 5:44
- Linear, GPU, num_workers=0, pin_memory=False: 5:18
- Linear, CPU, num_workers=1, pin_memory=True: 5:55
- Linear, GPU, num_workers=1, pin_memory=True: 4:59
- Linear, CPU, num_workers=0, pin_memory=False: 5:44
- Linear, CPU, SGD, num_workers=0, pin_memory=False: 5:58
- Linear, GPU, SGD, num_workers=0, pin_memory=False: 5:58
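A sketch of the two DataLoader knobs varied in these timings; the batch size is an arbitrary assumption, and the padding collate_fn the tutorial needs is omitted:

```python
import torchaudio
from torch.utils.data import DataLoader

# train_set: the SPEECHCOMMANDS dataset ("./data" is assumed to exist).
train_set = torchaudio.datasets.SPEECHCOMMANDS("./data", download=True)

# A collate_fn that pads the variable-length clips (omitted here)
# would be needed to actually iterate over this loader.
train_loader = DataLoader(
    train_set,
    batch_size=256,
    shuffle=True,
    num_workers=1,    # worker subprocesses loading batches in parallel
    pin_memory=True,  # page-locked host memory speeds host-to-GPU copies
)
```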
Resample placement (sketched below):

- CPU model, CPU resample: 3:37
- GPU model, CPU resample: 2:10
- GPU model, GPU resample: 1:26
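The fastest row above (GPU model, GPU resample) corresponds to moving the transform to the device and resampling whole batches there instead of per item on the CPU; a sketch, with the batch as a stand-in:

```python
import torch
import torchaudio

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Resample once per batch on the device rather than per item on the CPU.
transform = torchaudio.transforms.Resample(orig_freq=16000, new_freq=8000).to(device)

batch = torch.randn(256, 1, 16000)   # stand-in batch of 16 kHz waveforms
batch = transform(batch.to(device))  # resampled on the GPU when available
```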
Learning rate scheduler:

- Scheduler step at 20: accuracy 86% after 21 epochs, 25.44 s/it
- No scheduler: accuracy 81% after 21 epochs
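The "scheduler step at 20" run matches a StepLR that cuts the learning rate after 20 epochs; a sketch, where the model, learning rate, and gamma are assumptions:

```python
import torch
from torch import nn

model = nn.Linear(8000, 35)  # stand-in for the tutorial's network
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Reduce the learning rate tenfold after epoch 20, matching the
# "scheduler step at 20" run above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(1, 22):
    # ... train/test for one epoch, with optimizer.step() inside ...
    scheduler.step()
```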
Force-pushed from 2568d5e to 0b882f1.
information from executed cells disappear).

First, let’s import the common torch packages such as
``torchaudio <https://github.com/pytorch/audio>``\ \_ that can be
Looks like the link format should be markdown, not RST format.
https://colab.research.google.com/notebooks/markdown_guide.ipynb#scrollTo=70pYkR9LiOV0
Indeed, the conversion from colab to rst didn't render correctly here. Thanks for pointing it out! The tutorial is rendered first on pytorch.org, so I see other tutorials using rst links, e.g. here.
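For what it's worth, since sphinx-gallery renders the tutorial's comment blocks as RST, the working hyperlink form uses single backticks with a trailing underscore; a sketch of the corrected line (only the link syntax differs from the diff above):

```python
######################################################################
# First, let's import the common torch packages such as
# `torchaudio <https://github.com/pytorch/audio>`_ that can be
```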
# Let’s find the list of labels available in the dataset.
#

labels = sorted(list(set(datapoint[2] for datapoint in train_set)))
[not PR review comment] I wonder if the Dataset implementations should have this kind of attribute.
(added mention in #910)
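As context for the line under review, a sketch of where labels comes from and how it might be used; the download path and the helper function are hypothetical:

```python
import torch
import torchaudio

# Each SPEECHCOMMANDS datapoint is a tuple:
# (waveform, sample_rate, label, speaker_id, utterance_number)
train_set = torchaudio.datasets.SPEECHCOMMANDS("./data", download=True)

labels = sorted(list(set(datapoint[2] for datapoint in train_set)))

# Hypothetical helper: encode a label as the index used as a training target.
def label_to_index(word):
    return torch.tensor(labels.index(word))
```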
#
# The actual loading and formatting steps happen in the access function
# ``__getitem__``. In ``__getitem__``, we use ``torchaudio.load()`` to
# convert the audio files to tensors. ``torchaudio.load()`` returns a
Is it worth mentioning the internal mechanism of torchaudio.load, which can be changed anytime in the future?
My impression was that some details would help the user understand. I agree though that this might be too detailed, especially since those are internal details. I've replaced this with the following, to instead emphasize that a torchaudio.load function exists in case someone just wants to load a file instead of using a dataset:

"The actual loading and formatting steps happen when a data point is being accessed, and torchaudio takes care of converting the audio files to tensors. If one wants to load an audio file directly instead, torchaudio.load() can be used. It returns a tuple containing the newly created tensor along with the sampling frequency of the audio file (16 kHz for SpeechCommands)."
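A sketch of that direct-loading path; the file path is hypothetical:

```python
import torchaudio

# Load one audio file directly instead of going through a dataset.
# torchaudio.load returns (waveform_tensor, sample_rate).
waveform, sample_rate = torchaudio.load(
    "SpeechCommands/speech_commands_v0.02/yes/0a7c2a8d_nohash_0.wav"
)
print(waveform.shape, sample_rate)  # e.g. torch.Size([1, 16000]) 16000
```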
Force-pushed from e9bc491 to d07e292.
* add new speech tutorial.
* update with a few parameters tuned; model takes less than 10 min to run now.
* feedback.
* improve GPU performance; add interactive demo at the end.
* feedback.
Add command recognition tutorial, see colab.

- ~~1h10 min using this~~ 5 min.
- SpeechCommands as it is currently in torchaudio doesn't have the train/valid/test split. We could decide to add that before the release to simplify the tutorial. Add SpeechCommands train/valid/test split audio#966.
- Experiment with MFCC instead of resample only.
- Replace SPEECHCOMMANDS by the YESNO dataset, and use torchaudio.transforms.vad to segment audio into utterances (see the sketch after this list).
- This pull request complements #572; see also the deprecated tutorial.
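For the VAD idea above, torchaudio exposes this as torchaudio.transforms.Vad (capitalized class name); a minimal sketch with a stand-in waveform, where the flip trick for trailing silence is one common pattern rather than the tutorial's method:

```python
import torch
import torchaudio

# Stand-in waveform: silence, then a burst of noise, then silence again.
sample_rate = 16000
waveform = torch.cat(
    [torch.zeros(1, 8000), torch.randn(1, 4000) * 0.5, torch.zeros(1, 8000)],
    dim=1,
)

vad = torchaudio.transforms.Vad(sample_rate=sample_rate)
trimmed = vad(waveform)                   # trims leading silence
trimmed = vad(trimmed.flip(-1)).flip(-1)  # flip to also trim the tail
print(waveform.shape, trimmed.shape)
```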
cc @brianjo