RFC: Applying codecs as data augmentation #1146

Closed
mthrok opened this issue Jan 1, 2021 · 14 comments


@mthrok
Collaborator

mthrok commented Jan 1, 2021

In #1108 and #1141, I am adding in-memory decoding and encoding. This allows us to apply codecs to an audio Tensor, as follows.

import io
import torchaudio

fileobj = io.BytesIO()
torchaudio.save(fileobj, waveform, …, format="mp3", compression=9)
fileobj.seek(0)
waveform, _ = torchaudio.load(fileobj)
# Note: depending on the format, the size of the tensor could be different,
# so some post-processing might be necessary

This gives practically the same result as

sox input.wav -C 9 temp.mp3
sox temp.mp3 output.wav
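
As a rough illustration of the post-processing mentioned in the note above, here is a minimal sketch of aligning the decoded length with the original length. The helper name `match_length` is hypothetical and not part of torchaudio. It works on a plain Python list to stay dependency-free; the same trim-or-pad idea would apply along the time axis of a Tensor. It only fixes the length and does not compensate for any delay the codec may introduce at the start of the signal.

```python
def match_length(samples, target_len, pad_value=0.0):
    """Trim or right-pad a 1-D sequence so it has exactly target_len items.

    Hypothetical helper for illustration only (not a torchaudio function).
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    if len(samples) >= target_len:
        # Decoded signal is longer than the original: drop the tail.
        return list(samples[:target_len])
    # Decoded signal is shorter: append padding at the end.
    return list(samples) + [pad_value] * (target_len - len(samples))


original = [0.1, 0.2, 0.3, 0.4]
decoded = [0.1, 0.2, 0.3, 0.4, 0.0, 0.0]  # e.g. the codec appended padding
aligned = match_length(decoded, len(original))
```

Trimming from the end is the simplest choice; if the codec shifts the signal, an alignment step (for example, cross-correlation) would be needed before trimming.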

I am thinking of adding this codec application as a torchaudio feature. Before I start working on the API specification and engineering road map, I would like to hear from the community what kind of feature would be helpful for your use case.

If you would like to use codecs as data augmentation, or if you know of papers that use this kind of technique, please leave a comment.

cc @mravanelli @sw005320 @pzelasko @faroit @mpariente @danpovey

@danpovey

danpovey commented Jan 1, 2021

I have heard people say they do this (encode+decode for augmentation), and that it was helpful when applied to data that had been through codecs, but I don't recall that any of them were published.
My suspicion is that mp3 coding, being relatively high quality, probably preserves the short-time spectrum well enough that it wouldn't have much effect. Typically these would be mobile-phone codecs with the bit-rate turned down far enough that the quality is affected. Unfortunately there can be IP issues with these... it's possible the IP holders would allow their use royalty-free under some caveat, but IDK.
Also, the popularity of these codecs changes with time...

@pzelasko

pzelasko commented Jan 1, 2021

I agree with Dan’s point that mobile phone codecs would probably be the most useful. I also know of some cases where the companies stored telephone recordings in mp3...

As a side note, mp3 has been used as a defense method against adversarial attacks (not necessarily a very strong one though). The point is there could be other applications for codecs beyond training data augmentation.

@mravanelli

mravanelli commented Jan 1, 2021 via email

@mthrok
Collaborator Author

mthrok commented Jan 3, 2021

Thanks for your comments.

In the past, I used https://github.com/idiap/acoustic-simulator to train a telephony model with a clean dataset. It worked great. The codecs used there [list] include telephone codecs from ITU, which are available at https://github.com/openitu/STL. We might be able to add those as additional codecs.

The next major codec I am thinking of adding is AAC, and maybe an ffmpeg binding as well. That can cover https://arxiv.org/abs/2005.07143.

If you know of anything that needs attention when applying codecs, let me know. I will proceed with the API design.

@sw005320

sw005320 commented Jan 4, 2021

Sorry for my late response.
I have not tried such techniques by myself, but it should be a very practical technique.
I'm looking forward to this implementation!

My colleagues also told me that Opus seems to achieve very high compression, and that it is used in https://openslr.org/94/

@mthrok
Collaborator Author

mthrok commented Jan 4, 2021

Note: According to https://en.wikipedia.org/wiki/FFmpeg, ffmpeg contains codecs from ITU-T, so binding ffmpeg like torchvision does will be the easiest from both a legal and a technical perspective.

@Aditya3107

@mravanelli @sw005320 @pzelasko @faroit @mpariente @danpovey
Dear all,

I read the previous comments and your discussion, and I think my project, SpecAugment vs CodecAugment, can be useful to you. I am working on similar research to decide whether audio compression can be used as data augmentation. I used the OPUS audio codec, as it is open source and royalty-free, and checked which bitrate (8, 16, 32, 48, 64, or 128 kbps) gives the best output. My model is for speech emotion recognition, but this methodology can be used on any audio-related problem.

You can check the pinned repo on my GitHub. Happy to help!

@danpovey

Cool. Would be nice to see effects on ASR accuracy if someone does the experiment.

@Aditya3107

Yes, we are also working on ASR with CodecAugment, and it shows better results there too. However, the research paper is not published yet and that project is not mine, so I can't share many details. There may be a publication in the near future.

@danpovey

danpovey commented Jan 29, 2021

From this conversation: https://groups.google.com/d/msgid/kaldi-help/0814d308-9150-475d-9163-398f54c6140en%40googlegroups.com

The decoder tools for Opus, which is often used nowadays in VoIP (e.g. WebRTC) transmission, have options for simulating packet loss as well as bit rates (https://opus-codec.org/docs/opus-tools/opusdec.html). You can pipe your audio through the opusenc/opusdec binaries in your wav.scp.
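
A minimal sketch of what such a wav.scp entry could look like, generated from Python. The helper name `opus_pipe_entry` is hypothetical. The flags used (`--bitrate` for opusenc, `--packet-loss`, `--rate` and `--force-wav` for opusdec, `--quiet` for both) are taken from the opus-tools documentation as I understand it, so please check them against your installed version. The code only builds the text line; it does not run the binaries.

```python
import shlex


def opus_pipe_entry(utt_id, wav_path, bitrate_kbps, packet_loss_pct, rate_hz=None):
    """Build one Kaldi wav.scp line that pipes audio through opusenc/opusdec.

    Hypothetical helper for illustration; flag names should be verified
    against the opus-tools version in use.
    """
    # Encode the WAV file to Opus at the requested bitrate, writing to stdout.
    enc = ["opusenc", "--quiet", "--bitrate", str(bitrate_kbps), wav_path, "-"]
    # Decode from stdin to WAV on stdout, simulating random packet loss.
    dec = ["opusdec", "--quiet", "--force-wav", "--packet-loss", str(packet_loss_pct)]
    if rate_hz is not None:
        # Ask the decoder for a specific output sampling rate.
        dec += ["--rate", str(rate_hz)]
    dec += ["-", "-"]
    pipeline = " | ".join(
        " ".join(shlex.quote(arg) for arg in cmd) for cmd in (enc, dec)
    )
    # A trailing pipe tells Kaldi to read the audio from the command's stdout.
    return f"{utt_id} {pipeline} |"


line = opus_pipe_entry("utt1", "/data/utt1.wav", bitrate_kbps=16, packet_loss_pct=5)
```

Varying `bitrate_kbps` and `packet_loss_pct` per utterance would give a range of degradation levels within one data directory.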

With Adaptive Multi-Rate (AMR) codecs, it's not as straightforward to simulate loss, but you can vary the mode (4.75-12.20 kbps for AMR-NB) over time. The reference implementation can take a "mode file" as input that lists the modes to use for each frame (https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1387). Some link (QoS) adaptation schemes in use allow changes in mode every 2 frames (40ms).
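
To illustrate the idea of varying the mode over time, here is a minimal sketch that draws a random per-frame mode schedule in which the mode is held for at least 2 frames (40 ms at 20 ms per frame). The function name `amr_mode_schedule` and the use of mode indices 0 to 7 are my own choices for illustration. The exact syntax the reference encoder expects in its mode file is not covered here and would need to be taken from the 3GPP documentation.

```python
import random

# The eight AMR-NB modes, in kbps, indexed 0..7.
AMR_NB_BITRATES_KBPS = (4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2)


def amr_mode_schedule(num_frames, hold_frames=2, seed=None):
    """Return a list of mode indices, one per 20 ms frame.

    A new mode is drawn every hold_frames frames, so mode changes happen
    no more often than every hold_frames * 20 ms. Hypothetical helper.
    """
    if num_frames < 0 or hold_frames < 1:
        raise ValueError("num_frames must be >= 0 and hold_frames >= 1")
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < num_frames:
        mode = rng.randrange(len(AMR_NB_BITRATES_KBPS))
        schedule.extend([mode] * hold_frames)
    # The last block may overshoot; cut to the requested length.
    return schedule[:num_frames]


schedule = amr_mode_schedule(num_frames=10, seed=0)
bitrates = [AMR_NB_BITRATES_KBPS[m] for m in schedule]
```

Uniform random sampling is only one option; a schedule that drifts between neighbouring modes might resemble real link adaptation more closely.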

@mthrok
Collaborator Author

mthrok commented Mar 1, 2021

Thanks for the input.

In the upcoming release, we are releasing the apply_codec function as a beta feature.

Opus is not supported yet, as the underlying libsox does not support writing the Opus format. I will think about what I can do about that, together with the packet loss simulation mentioned above.

@Aditya3107

Aditya3107 commented Mar 2, 2021

> Cool. Would be nice to see effects on ASR accuracy if someone does the experiment.

Please check the research paper below.

N. Hailu, I. Siegert and A. Nürnberger, "Improving Automatic Speech Recognition Utilizing Audio-codecs for Data Augmentation," 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 2020, pp. 1-5, doi: 10.1109/MMSP48831.2020.9287127.

@danpovey

@danpovey

danpovey commented Mar 2, 2021 via email

@Aditya3107

Hmm, it's published by IEEE.
https://ieeexplore.ieee.org/abstract/document/9287127
Here is the link. I have university access to IEEE.

@mthrok mthrok closed this as completed Oct 19, 2023