RFC: Applying codecs as data augmentation #1146
Comments
I have heard people say they do this (encode+decode for augmentation), and that it was helpful when applied to data that had been through codecs, but I don't recall that any of them were published. |
I agree with Dan’s point that mobile phone codecs would probably be the most useful. I also know of some cases where the companies stored telephone recordings in mp3... As a side note, mp3 has been used as a defense method against adversarial attacks (not necessarily a very strong one though). The point is there could be other applications for codecs beyond training data augmentation. |
Hi Moto,
I think this functionality is useful. There are already some papers using
codec augmentation successfully (e.g., https://arxiv.org/pdf/2005.07143.pdf).
Of course, more diversity in terms of different codecs would be a plus.
Personally, I like the idea of having this functionality
"on-the-fly" without the need of saving audio files on disk.
…On Fri, 1 Jan 2021 at 14:15, Piotr Żelasko ***@***.***> wrote:
I agree with Dan’s point that mobile phone codecs would probably be the
most useful. I also know of some cases where the companies stored telephone
recordings in mp3...
As a side note, mp3 has been used as a defense method against adversarial
attacks (not necessarily a very strong one though). The point is there
could be other applications for codecs beyond training data augmentation.
|
Thanks for your comments. In the past, I used https://github.com/idiap/acoustic-simulator to train a telephony model from a clean dataset. It worked great. The codecs used there [list] include telephone codecs from the ITU, which are available at https://github.com/openitu/STL. We might be able to add those as additional codecs. The next major codec I was thinking of adding is AAC, and maybe a binding to ffmpeg. That would cover https://arxiv.org/abs/2005.07143. If you know of anything that needs attention when applying codecs, let me know. I will proceed with API design. |
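To make the "encode then decode" idea concrete for a telephony-style codec, here is a minimal pure-Python sketch. It assumes μ-law companding (the idea behind ITU G.711) as a stand-in lossy codec; it uses the continuous μ-law formula with 8-bit quantization, not the bit-exact ITU reference code, and the names `mulaw_encode`, `mulaw_decode`, and `codec_augment` are hypothetical, not part of torchaudio or acoustic-simulator.

```python
import math
import random

MU = 255  # companding parameter commonly used for 8-bit mu-law


def mulaw_encode(x: float) -> int:
    """Compress a sample in [-1, 1] and quantize it to an 8-bit code (0..255)."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range samples
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255))


def mulaw_decode(code: int) -> float:
    """Expand an 8-bit code back to an approximate sample in [-1, 1]."""
    y = code / 255 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def codec_augment(samples, prob=0.5, rng=random):
    """With probability `prob`, pass the samples through encode + decode.

    Assumption for illustration: augmentation is applied per utterance,
    and untouched utterances are returned as an unchanged copy.
    """
    if rng.random() >= prob:
        return list(samples)
    return [mulaw_decode(mulaw_encode(s)) for s in samples]
```

Calling `codec_augment(waveform, prob=0.5)` inside a data loader would apply the degradation on the fly to roughly half of the utterances, without writing any files to disk.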
Sorry for my late response. My colleagues also told me that Opus seems to achieve a very high compression ratio, and that it is used in https://openslr.org/94/ |
Note: According to https://en.wikipedia.org/wiki/FFmpeg, |
@mravanelli @sw005320 @pzelasko @faroit @mpariente @danpovey I read the previous comments and your discussion, and I think my project, SpecAugment vs CodecAugment, may be useful to you. I am working on similar research to decide whether audio compression can be used as data augmentation. I used the Opus audio codec because it is open source and royalty-free, and checked which bitrate (8, 16, 32, 48, 64, or 128 kbps) gives the best results. My model is for speech emotion recognition, but this methodology can be applied to any audio-related problem. You can check the pinned repo on my GitHub. Happy to help! |
Cool. Would be nice to see effects on ASR accuracy if someone does the experiment. |
Yes, we are also working on ASR with CodecAugment, and it shows better results there too. However, the research paper is not published yet and that project is not mine, so I can't share many details. There may be a publication in the near future. |
From this conversation: https://groups.google.com/d/msgid/kaldi-help/0814d308-9150-475d-9163-398f54c6140en%40googlegroups.com The decoder tools for Opus, which is often used nowadays in VoIP (e.g. WebRTC) transmission, have options for simulating packet loss as well as bit rates (https://opus-codec.org/docs/opus-tools/opusdec.html). You can pipe your audio through the opusenc/opusdec binaries in your wav.scp. With Adaptive Multi-Rate (AMR) codecs, it's not as straightforward to simulate loss, but you can vary the mode (4.75-12.20 kbps for AMR-NB) over time. The reference implementation can take a "mode file" as input that lists the modes to use for each frame (https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1387). Some link (QoS) adaptation schemes in use allow changes in mode every 2 frames (40ms). |
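The opusenc/opusdec piping described above could be sketched as follows. This only builds the command string for a Kaldi `wav.scp` entry; it does not run the binaries. The flags (`--bitrate`, `--packet-loss`, `--rate`, `--force-wav`, `--quiet`) are taken from the opus-tools documentation as I understand it, so please check them against your installed version. The helper names `opus_pipe_command` and `wav_scp_line` are hypothetical.

```python
import shlex


def opus_pipe_command(wav_path, bitrate_kbps=16, packet_loss_pct=0, rate_hz=16000):
    """Build a shell pipeline that encodes a WAV file to Opus and decodes it back.

    "-" means stdout for opusenc and stdin/stdout for opusdec. The trailing
    "|" follows the Kaldi convention for piped wav.scp entries.
    """
    enc = f"opusenc --quiet --bitrate {bitrate_kbps} {shlex.quote(wav_path)} -"
    dec = (
        f"opusdec --quiet --force-wav --rate {rate_hz} "
        f"--packet-loss {packet_loss_pct} - -"
    )
    return f"{enc} | {dec} |"


def wav_scp_line(utt_id, wav_path, **kwargs):
    """Format one wav.scp line: '<utterance-id> <command> |'."""
    return f"{utt_id} {opus_pipe_command(wav_path, **kwargs)}"
```

For example, `wav_scp_line("utt1", "/data/utt1.wav", bitrate_kbps=8, packet_loss_pct=5)` produces an entry that simulates an 8 kbps stream with 5% random packet loss. Sampling the bitrate and loss rate per utterance would give a range of conditions.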
Thanks for the input. In the upcoming release, we are releasing Opus is not supported yet as the underlying libsox does not support writing the Opus format. Together with the packet loss simulation mentioned above, I will think about what I can do about it. |
Please check the below research paper. N. Hailu, I. Siegert and A. Nürnberger, "Improving Automatic Speech Recognition Utilizing Audio-codecs for Data Augmentation," 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 2020, pp. 1-5, doi: 10.1109/MMSP48831.2020.9287127. |
MM, you don't have a pdf link do you?
…On Tue, Mar 2, 2021 at 7:51 PM Aditya ***@***.***> wrote:
Cool. Would be nice to see effects on ASR accuracy if someone does the
experiment.
Please check.
N. Hailu, I. Siegert and A. Nürnberger, "Improving Automatic Speech
Recognition Utilizing Audio-codecs for Data Augmentation," 2020 IEEE 22nd
International Workshop on Multimedia Signal Processing (MMSP), Tampere,
Finland, 2020, pp. 1-5, doi: 10.1109/MMSP48831.2020.9287127.
@danpovey <https://github.com/danpovey>
|
Hmm..it's published on IEEE. |
In #1108 and #1141, I am adding in-memory decoding and encoding. This allows us to apply codecs to an audio Tensor in the following way.
Which practically gives the same result as
I am thinking of adding this codec application as a torchaudio feature. Before I start working on the API specification and engineering roadmap, I would like to hear from the community what kind of feature would be helpful for your use case.
If you would like to use codecs as data augmentation, or if you know of papers that use this kind of technique, please leave a comment.
cc @mravanelli @sw005320 @pzelasko @faroit @mpariente @danpovey