RFC: Applying codecs as data augmentation #1146

mthrok · 2021-01-01T04:33:03Z

In #1108 and #1141, I am adding in-memory decoding and encoding. This allows us to apply codecs to an audio Tensor, as follows.

import io
import torchaudio

fileobj = io.BytesIO()
torchaudio.save(fileobj, waveform, …, format="mp3", compression=9)
fileobj.seek(0)
waveform, _ = torchaudio.load(fileobj)
# Note: depending on the format, the size of the tensor could be different,
# so some post processing might be necessary

This gives practically the same result as

sox input.wav -C 9 temp.mp3
sox temp.mp3 output.wav
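The note in the snippet above says the decoded tensor may differ in size from the input. A minimal sketch of one possible post-processing step follows, using plain Python lists instead of Tensors so it runs without torchaudio. The helper name `match_length` is hypothetical and not part of torchaudio. It only restores the sample count; it does not compensate for any delay the codec may introduce at the start of the signal.

```python
from typing import List, Sequence


def match_length(decoded: Sequence[float], target_len: int,
                 pad_value: float = 0.0) -> List[float]:
    """Trim or right-pad `decoded` so it has exactly `target_len` samples."""
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    # Drop any extra samples the codec appended at the end.
    fixed = list(decoded[:target_len])
    # Pad with silence if the codec returned fewer samples than the input had.
    fixed.extend([pad_value] * (target_len - len(fixed)))
    return fixed


original = [0.1, -0.2, 0.3, -0.4]
longer = original + [0.0, 0.0]   # e.g. padding appended by the codec
shorter = original[:3]           # e.g. a truncated final frame

fixed_longer = match_length(longer, len(original))
fixed_shorter = match_length(shorter, len(original))
```

With Tensors, the same idea would apply along the time dimension.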

I am thinking of adding this codec application as a torchaudio feature. Before I start working on the API specification and engineering roadmap, I would like to hear from the community what kind of feature would be helpful for your use cases.

If you would like to use codecs as data augmentation, or if you know of papers that use this kind of technique, please leave a comment.

cc @mravanelli @sw005320 @pzelasko @faroit @mpariente @danpovey

danpovey · 2021-01-01T05:04:59Z

I have heard people say they do this (encode+decode for augmentation), and that it was helpful when applied to data that had been through codecs, but I don't recall that any of them were published.
My suspicion is that mp3 coding, being relatively high quality, probably preserves the short-time spectrum well enough that it wouldn't have much effect. Typically these would be mobile-phone codecs with the bit-rate turned down far enough that the quality is affected. Unfortunately there can be IP issues with these... it's possible the IP holders would allow their use royalty-free under some caveat, but IDK.
Also, the popularity of these codecs changes with time...

pzelasko · 2021-01-01T19:15:25Z

I agree with Dan’s point that mobile phone codecs would probably be the most useful. I also know of some cases where the companies stored telephone recordings in mp3...

As a side note, mp3 has been used as a defense method against adversarial attacks (not necessarily a very strong one though). The point is there could be other applications for codecs beyond training data augmentation.

mravanelli · 2021-01-01T19:30:59Z

Hi Moto, I think this functionality is useful. There are already some papers using codec augmentation successfully (e.g., https://arxiv.org/pdf/2005.07143.pdf). Of course, more diversity in terms of different codecs would be a plus. Personally, I like the idea of having this functionality "on-the-fly" without needing to save audio files to disk.


mthrok · 2021-01-03T20:55:40Z

Thanks for your comments.

In the past, I used https://github.com/idiap/acoustic-simulator to train a telephony model with a clean dataset. It worked great. The codecs used there [list] include telephone codecs from ITU, which are available at https://github.com/openitu/STL. We might be able to add those as additional codecs.

The next major codec I was thinking of adding is AAC, and maybe binding ffmpeg. That would cover https://arxiv.org/abs/2005.07143.

If you know of anything that needs attention when applying codecs, let me know. I will proceed with the API design.

sw005320 · 2021-01-04T19:07:44Z

Sorry for my late response.
I have not tried such techniques myself, but it should be a very practical technique.
I'm looking forward to this implementation!

My colleagues also told me that Opus seems to achieve very high compression, and that it is used in https://openslr.org/94/

mthrok · 2021-01-04T21:15:08Z

Note: According to https://en.wikipedia.org/wiki/FFmpeg, ffmpeg contains codecs from ITU-T, so binding ffmpeg like torchvision does will be the easiest from both a legal and a technical perspective.

Aditya3107 · 2021-01-18T15:56:44Z

@mravanelli @sw005320 @pzelasko @faroit @mpariente @danpovey
Dear all,

I read the previous comments and your discussion, and I think my project, SpecAugment vs CodecAugment, may be useful to you. I am working on similar research to decide whether audio compression can be used as data augmentation. I used the Opus audio codec, as it is open source and royalty-free, and checked at which bitrate (8, 16, 32, 48, 64, or 128 kbps) we get the best output. My model is for speech emotion recognition, but this methodology can be used on any audio-related problem.
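As a rough illustration of how the bitrates listed above could drive an augmentation pipeline, the sketch below picks a bitrate per utterance at random, or skips the codec entirely. The function name `sample_codec_bitrate`, the uniform choice, and the default `apply_prob=0.5` are assumptions for illustration only and are not taken from the project mentioned here. The actual encode and decode step is left out.

```python
import random
from typing import Optional

# Bitrates (kbps) mentioned in the comment above.
OPUS_BITRATES_KBPS = (8, 16, 32, 48, 64, 128)


def sample_codec_bitrate(rng: random.Random,
                         apply_prob: float = 0.5) -> Optional[int]:
    """Return a bitrate in kbps, or None to leave the utterance uncompressed."""
    if rng.random() >= apply_prob:
        return None
    # Uniform choice over the candidate bitrates (an arbitrary assumption).
    return rng.choice(OPUS_BITRATES_KBPS)


rng = random.Random(0)  # fixed seed so the draw is reproducible
choices = [sample_codec_bitrate(rng) for _ in range(1000)]
num_augmented = sum(c is not None for c in choices)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs, which makes comparisons between bitrates easier.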

You can check the pinned repo on my GitHub. Happy to help !!!

danpovey · 2021-01-19T04:13:41Z

Cool. Would be nice to see effects on ASR accuracy if someone does the experiment.

Aditya3107 · 2021-01-19T06:37:23Z

Yes, we are also working on ASR with CodecAugment, and it shows better results too, but the research paper is not published yet and that project is not mine, so I can't share many details. There may be a publication in the near future.

danpovey · 2021-01-29T10:13:32Z

From this conversation: https://groups.google.com/d/msgid/kaldi-help/0814d308-9150-475d-9163-398f54c6140en%40googlegroups.com

The decoder tools for Opus, which is often used nowadays in VoIP (e.g. WebRTC) transmission, have options for simulating packet loss as well as bit rates (https://opus-codec.org/docs/opus-tools/opusdec.html). You can pipe your audio through the opusenc/opusdec binaries in your wav.scp.
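A minimal sketch of how such a wav.scp entry could be generated is shown below. The helper `opus_wav_scp_line` is hypothetical. The flags used (`--quiet` and `--bitrate` for opusenc; `--quiet`, `--force-wav`, `--rate`, and `--packet-loss` for opusdec, with `-` for stdin/stdout) are assumed from the opus-tools manual pages and should be checked against the installed version. The code only builds the text line; it does not run the binaries.

```python
import shlex


def opus_wav_scp_line(utt_id: str, wav_path: str, bitrate_kbps: int,
                      packet_loss_pct: int = 0,
                      sample_rate: int = 16000) -> str:
    """Build one Kaldi wav.scp line that pipes audio through opusenc/opusdec."""
    # Encode the input file to Opus and write the stream to stdout.
    encode = ["opusenc", "--quiet", "--bitrate", str(bitrate_kbps),
              wav_path, "-"]
    # Decode from stdin; --force-wav requests a WAV header on stdout.
    decode = ["opusdec", "--quiet", "--force-wav", "--rate", str(sample_rate)]
    if packet_loss_pct > 0:
        # Simulate random packet loss (in percent) at decode time.
        decode += ["--packet-loss", str(packet_loss_pct)]
    decode += ["-", "-"]
    pipeline = " | ".join(shlex.join(cmd) for cmd in (encode, decode))
    # A trailing pipe tells Kaldi to read the audio from the command's stdout.
    return f"{utt_id} {pipeline} |"


line = opus_wav_scp_line("utt1", "/data/utt1.wav", bitrate_kbps=16,
                         packet_loss_pct=5)
```

`shlex.join` quotes paths containing spaces or shell metacharacters, so the generated line stays valid for unusual file names.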

With Adaptive Multi-Rate (AMR) codecs, it's not as straightforward to simulate loss, but you can vary the mode (4.75-12.20 kbps for AMR-NB) over time. The reference implementation can take a "mode file" as input that lists the modes to use for each frame (https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1387). Some link (QoS) adaptation schemes in use allow changes in mode every 2 frames (40ms).
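The per-frame mode variation described above can be sketched as a random schedule in which each chosen mode is held for at least two 20 ms frames. The function `random_mode_schedule` is hypothetical, and the uniform choice of mode is an assumption. The sketch produces mode indices 0 to 7 only; writing them out in the reference implementation's mode-file format is not shown, since that format is not described here.

```python
import random
from typing import List, Optional

# AMR-NB bit rates in kbps, indexed by mode number 0..7.
AMR_NB_MODES_KBPS = (4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2)


def random_mode_schedule(num_frames: int, hold_frames: int = 2,
                         rng: Optional[random.Random] = None) -> List[int]:
    """Return one mode index per frame, held for `hold_frames` frames each."""
    if hold_frames < 1:
        raise ValueError("hold_frames must be at least 1")
    rng = rng or random.Random()
    schedule: List[int] = []
    while len(schedule) < num_frames:
        mode = rng.randrange(len(AMR_NB_MODES_KBPS))
        # Repeat the mode so it can only change on hold_frames boundaries.
        schedule.extend([mode] * hold_frames)
    return schedule[:num_frames]


schedule = random_mode_schedule(num_frames=10, hold_frames=2,
                                rng=random.Random(0))
rates_kbps = [AMR_NB_MODES_KBPS[m] for m in schedule]
```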

mthrok · 2021-03-01T15:52:43Z

Thanks for the input.

In the upcoming release, we are releasing the apply_codec function as a beta feature.

Opus is not supported yet, as the underlying libsox does not support writing the Opus format. Together with the packet loss simulation mentioned above, I will think about what I can do about it.

Aditya3107 · 2021-03-02T11:51:10Z

> Cool. Would be nice to see effects on ASR accuracy if someone does the experiment.

Please check the research paper below.

N. Hailu, I. Siegert and A. Nürnberger, "Improving Automatic Speech Recognition Utilizing Audio-codecs for Data Augmentation," 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 2020, pp. 1-5, doi: 10.1109/MMSP48831.2020.9287127.

@danpovey

danpovey · 2021-03-02T11:59:35Z

Mm, you don't have a PDF link, do you?


Aditya3107 · 2021-03-02T12:06:34Z

Hmm... it's published on IEEE.
https://ieeexplore.ieee.org/abstract/document/9287127
Here is the link. I have university access to IEEE.

mthrok added RFC augmentation labels Jan 1, 2021

mthrok mentioned this issue Jan 5, 2021

Sharing My Projects 2021 H1 #1154

Closed

vincentqb mentioned this issue Jan 25, 2021

Roadmap ahead for torchaudio #1196

Closed

mthrok closed this as completed Oct 19, 2023
