torchaudio.load to optionally accept a target sample_rate (and maybe backend=) #2586

Open
vadimkantorov opened this issue Jul 27, 2022 · 17 comments
@vadimkantorov

vadimkantorov commented Jul 27, 2022

🚀 The feature

E.g. the OPUS format supports resampling as part of decoding, but there is no standard, uniform way of setting the sample rate at decoding time.

E.g. sox always sets it to 48 kHz: https://github.com/dmkrepo/libsox/blob/master/src/opus.c#L114 (unofficial repo)
while the original opusdec first tries to copy it from the original source sample rate stored in the stream header: https://github.com/xiph/opus-tools/blob/master/src/opusdec.c#L897

Making sox do what opusdec does should probably be a feature request to sox and to ffmpeg. But torchaudio should probably support passing a forced sample_rate, with built-in resampling if the decoder supports it.

It may also be a good idea to directly accept a backend= argument as well. This would avoid maintaining it as a global variable and eliminate the need for dataloader worker init code for setting the backend. (Personally, I even think that the global variable should be phased out in favor of an explicit argument with a default value.)

Motivation, pitch

N/A

Alternatives

No response

Additional context

No response

@vadimkantorov
Author

vadimkantorov commented Oct 12, 2022

It seems to me that libsndfile (which underlies soundfile) also does not support passing a custom sample rate to configure the decoder (e.g. the OPUS decoder). So to support this feature (primarily for OPUS or similar codecs that resample to a target sample rate as part of decoding), custom bindings to libopus (e.g. as in https://github.com/jlaine/opuslib/blob/master/opuslib/api/decoder.py) would probably be useful. If no per-frame calls are needed and a shared library is assumed, even ctypes bindings like the ones at that link would probably do.

Created a question about this in libsndfile: libsndfile/libsndfile#886

@vadimkantorov
Author

Also, it is unclear whether global backend selection currently has to be done in worker_init_fn...

@vadimkantorov
Author

vadimkantorov commented Oct 19, 2022

It also seems that pysoundfile has issues reading opus, so the problem of correct resampling is a bit more pressing: bastibe/python-soundfile#252. Is torchaudio using pysoundfile, or libsndfile directly?

@vadimkantorov changed the title from "torchaudio.load to optionally accept sample_rate (and maybe backend=)" to "torchaudio.load to optionally accept a target sample_rate (and maybe backend=)" on Oct 19, 2022

@vadimkantorov
Author

vadimkantorov commented Jun 27, 2023

It also appears that ffmpeg doesn't let the user directly downsample to the target sample_rate during opus decoding: https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/libopusdec.c#L65 - it always sets sample_rate to 48 kHz. Ideally, we should be able to set it directly to the required sample_rate (and be able to read the original sample_rate from the header if the target sample_rate is unset). So maybe it would make sense to link torchaudio directly to libopus to support this mode?

Also, the builtin ffmpeg decoder seems to have severe perf problems when resampling is needed: https://video.stackexchange.com/questions/36610/opus-decoding-in-ffmpeg-how-to-pass-target-sample-rate-and-ensure-libopus-decod

@mthrok
Collaborator

mthrok commented Jun 28, 2023

Hi - Just wanted to let you know that I read the messages, but I don't have the time to craft a proper reply to all the details.

Regarding the resampling, looking at their CLI code, they seem to use FFT-based downsampling. I am not an expert here, but according to https://signalsprocessed.blogspot.com/2016/08/audio-resampling-in-python.html, this downsampling method is considered unsuitable for general audio processing.

@vadimkantorov
Author

I don't think the opus_compare program is relevant to this. Is it? I think it's just some test utility, and the relevant bits are found in the decoding codebase.

My question on opus is about letting the user directly set the target sample_rate of the opus decoder structure.

And my point on the general torchaudio.load API is that it should accept sample_rate, either to pass it directly to the decoder (as in the opus case, or for other decoders which support it) or to resample inside torchaudio.load if it's specified.

It's a very common need to force (by resampling if needed) some sample_rate out of the audio loading function...

@mthrok
Collaborator

mthrok commented Jun 28, 2023

> I don't think opus_compare program is relevant to this.

Would you point me to the part about direct downsampling within the libopus library that you are talking about? Source, CLI help message, or whatever. Decoders are generally only responsible for decoding, and resampling should not be part of it. If opus does it, that's something special, and I first need to understand what it is.

@vadimkantorov
Author

vadimkantorov commented Jun 28, 2023

Yes, opus is special about this. By default it decodes to 48 kHz (which is what the ffmpeg bindings do) or to whatever sample_rate is stored in the opus file header (which is what opusdec does), but actually it can decode to any sample rate at decoding time directly:

Here's how ffmpeg asks for 48 kHz: https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/libopusdec.c#L65

And similar code in opusdec does the same (e.g. see the --rate option of opusdec, though one should really check the code); essentially it just sets the .sample_rate field of the decoder structure.

@vadimkantorov
Author

Also note that libopus is extremely easy to interface with, as demonstrated by the ffmpeg bindings above or by https://github.com/jlaine/opuslib/blob/master/opuslib/api/decoder.py. Opus is quite widespread now and the library is compact, so it might make sense to compile against it directly as well.
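
For illustration, here is a minimal ctypes sketch of creating a libopus decoder at a chosen output rate. It assumes a shared libopus is installed and discoverable via `ctypes.util.find_library`; the helper name `create_opus_decoder` is hypothetical and not part of opuslib or torchaudio. Note that `opus_decoder_create` only accepts 8000, 12000, 16000, 24000, or 48000 Hz, so any other target would still need a separate resampler.

```python
import ctypes
import ctypes.util

# Output rates that opus_decoder_create() accepts, per the libopus API docs.
SUPPORTED_RATES = (8000, 12000, 16000, 24000, 48000)


def create_opus_decoder(sample_rate: int, channels: int = 1):
    """Hypothetical helper: return (lib, decoder_handle) decoding at sample_rate."""
    if sample_rate not in SUPPORTED_RATES:
        raise ValueError(f"libopus cannot decode directly to {sample_rate} Hz")
    if channels not in (1, 2):
        raise ValueError("channels must be 1 or 2")

    # Assumption: a shared libopus is installed on this machine.
    path = ctypes.util.find_library("opus")
    if path is None:
        raise OSError("libopus shared library not found")
    lib = ctypes.CDLL(path)

    # OpusDecoder *opus_decoder_create(opus_int32 Fs, int channels, int *error)
    lib.opus_decoder_create.restype = ctypes.c_void_p
    lib.opus_decoder_create.argtypes = [
        ctypes.c_int32,
        ctypes.c_int,
        ctypes.POINTER(ctypes.c_int),
    ]
    # void opus_decoder_destroy(OpusDecoder *st)
    lib.opus_decoder_destroy.restype = None
    lib.opus_decoder_destroy.argtypes = [ctypes.c_void_p]

    error = ctypes.c_int(0)
    decoder = lib.opus_decoder_create(sample_rate, channels, ctypes.byref(error))
    if error.value != 0 or not decoder:  # 0 is OPUS_OK
        raise RuntimeError(f"opus_decoder_create failed with code {error.value}")
    return lib, decoder
```

The caller owns the handle and should release it with `lib.opus_decoder_destroy(decoder)` when done. Actual decoding would additionally need `opus_decode` and an Ogg demuxer to extract the packets, which this sketch leaves out.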

@mthrok
Collaborator

mthrok commented Jun 29, 2023

> Yes, opus is special about this. By default it decodes to 48 kHz (which is what the ffmpeg bindings do) or to whatever sample_rate is stored in the opus file header (which is what opusdec does), but actually it can decode to any sample rate at decoding time directly:
>
> Here's how ffmpeg asks for 48 kHz: https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/libopusdec.c#L65

I already knew this. My question is: where in libopus does the resampling happen? I am asking because

> And similar code of opusdec does the same (e.g. see --rate option of opusdec

If the resampling is implemented in the opusdec CLI, then binding libopus won't help.

> essentially it just sets the .sample_rate field of the decoder structure

And this sounds more like overriding than resampling.

@mthrok
Collaborator

mthrok commented Jun 29, 2023

Okay, reading through libopusdec code it seems that the following function indeed does decoding and resampling.

https://github.com/xiph/opus/blob/9fc8fc4cf432640f284113ba502ee027268b0d9f/src/opus_decoder.c#L751

However, the structure of the opus_decode_native function looks strange: it recursively calls itself over the buffer with a different frame size parameter. This structure, and the fact that resampling only works for divisors of 48 kHz, suggests that the so-called downsampling is actually decimation, which would explain why it is fast.
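
To make the distinction concrete, here is a small self-contained sketch of what plain sample dropping (decimation with no low-pass filter) does to content above the new Nyquist limit. This only illustrates the general signal-processing point; it is not a description of what libopus does internally, which is not established in this thread. All names in the snippet are illustrative.

```python
import math

SRC_RATE = 48_000
DST_RATE = 8_000
FACTOR = SRC_RATE // DST_RATE  # keep one sample out of every 6


def tone(freq_hz, rate_hz, n_samples):
    """Sine wave of the given frequency sampled at rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]


# A 7 kHz tone lies above the 4 kHz Nyquist limit of an 8 kHz signal.
source = tone(7_000, SRC_RATE, 480)
dropped = source[::FACTOR]  # unfiltered decimation: every 6th sample

# Without filtering, the result is sample-for-sample a sign-inverted 1 kHz tone.
alias = [-x for x in tone(1_000, DST_RATE, len(dropped))]
max_error = max(abs(a - b) for a, b in zip(dropped, alias))
```

Here `max_error` comes out at floating-point noise level, i.e. the 7 kHz tone has aliased down to 1 kHz. A resampler suitable for general audio has to remove such content with a low-pass filter first, and that filtering is where most of the cost goes.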

@vadimkantorov
Author

Also, there are some mentions of the speex resampler in opusdec.c: https://github.com/xiph/opus-tools/blob/master/src/opusdec.c#L1157 ...

@mthrok
Collaborator

mthrok commented Jun 29, 2023

> Also, there're some mentions of speex resampler in opusdec.c: https://github.com/xiph/opus-tools/blob/master/src/opusdec.c#L1157 ...

Reading https://hydrogenaud.io/index.php/topic,113655.0.html, it does not seem that the speex resampler contributes to the performance you are looking for.

Also see https://trac.ffmpeg.org/ticket/5240 for why FFmpeg does not recover the original sample rate; that decision makes sense.

@mthrok
Collaborator

mthrok commented Jun 29, 2023

Overall, I don't think it's worthwhile for torchaudio to bind libopus. The benefit does not seem to outweigh the cost. (Please know that nowadays I am maintaining this project almost alone, alongside all my other work.)

And if we were to add it, I would add a switch in torchaudio.load that checks whether the source is OPUS and branches out to a dedicated code path. However, the project you pointed to has already wrapped libopus, so you can already do that outside of torchaudio.load. And torchaudio is not aiming to be the fastest decoding library, so I recommend you simply use that library for OPUS.

@vadimkantorov
Author

vadimkantorov commented Jun 29, 2023

Hmm. Overall yes, it seems that opus does resampling using the speex resampler if a rate other than 48 kHz is required at decoding.

So in the context of torchaudio I would say:

  1. If decoding is done in chunks, it can be important to downsample each block right away without loading the whole 48 kHz decoded PCM into memory. 48 kHz PCM can eat a lot of RAM, especially for huge multi-hour files; 8 kHz versus 48 kHz can make a real difference memory-wise.

  2. If torchaudio supports reading opus via ffmpeg, it should have some perf tests covering the builtin and libopus decoders to guard against bugs like this: https://video.stackexchange.com/questions/36610/opus-decoding-in-ffmpeg-how-to-pass-target-sample-rate-and-ensure-libopus-decod and, in general, to compare against opusdec, as giant speech datasets are commonly stored in opus these days.

  3. It still might make sense to accept a target sample_rate as part of the torchaudio.load API, as users often need to resample to a single target sample rate.

  4. In general, I don't know what resampler torchaudio is using and whether the speex C code from libopus could be useful.

  5. I don't know if anyone needs an ffmpeg-less build of torchaudio, but in that case building directly against libopus/libflac might make sense, as they are quite small and self-contained libraries in terms of code footprint.

  6. About ffmpeg not restoring the original sample rate: I think it's actually not a very good thing and, as you pointed out, it's mostly about fitting the ffmpeg architecture. E.g. if torchaudio could first load the original sample rate from the opus header and then call ffmpeg / resample to it, it would provide the behavior a user would expect.
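
To put rough numbers on the memory point in item 1, here is a back-of-the-envelope sketch. It assumes mono audio held as float32 samples (4 bytes each); the helper name `pcm_bytes` and the 3-hour duration are illustrative.

```python
def pcm_bytes(duration_s, sample_rate, channels=1, bytes_per_sample=4):
    """Size in bytes of fully decoded PCM held in memory (float32 by default)."""
    return int(duration_s * sample_rate) * channels * bytes_per_sample


three_hours = 3 * 3600  # seconds

at_48k = pcm_bytes(three_hours, 48_000)  # 2_073_600_000 bytes
at_8k = pcm_bytes(three_hours, 8_000)    # 345_600_000 bytes
```

So a single 3-hour mono file takes about 2.07 GB when decoded at 48 kHz versus about 0.35 GB at 8 kHz, a factor of 6 that is multiplied again by the number of dataloader workers holding such buffers at the same time.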

@mthrok
Collaborator

mthrok commented Jun 29, 2023

> 3. It still might make sense to accept target sample_rate as part of torchaudio.load API as resampling to a single target sample rate is often needed by the users

I totally agree with this one. I actually tried to add this but got unreasonable pushback, so I could not do it. #816

I am okay with bringing this back.

> 1. If decoding is done by chunks, it can be important to downsample a block right away without loading the whole 48khz decoded PCM in memory. 48khz PCM can eat a lot of RAM especially for huge multi-hour files, 8khz or 48khz can make a difference for these memory-wise.
> 4. In general, I don't know what resampler is torchaudio using and if speex C code from libopus can be a useful code

FFmpeg and sox have their own resampling implementations which work in a streaming fashion, but soundfile does not. So when adding a target sample rate, consistency of the functionality across backends is an issue. (Well, we can start by saying that the soundfile backend does not support this.)

> 6. about ffmpeg not restoring the original sample rate: I think it's actually not a very good thing and as you pointed out it's mostly about fitting ffmpeg architecture. e.g. if torchaudio could first load the original sample rate from opus header and then call ffmpeg / resample to it, it would fulfill a feature a user would expect
> 2. If torchaudio supports reading opus via ffmpeg, it should have some perf tests to test builtin and libopus decoders to guard against bugs like this: https://video.stackexchange.com/questions/36610/opus-decoding-in-ffmpeg-how-to-pass-target-sample-rate-and-ensure-libopus-decod and in general to compare against opusdec, as giant speech datasets are commonly stored in opus these days

Reading the standard, the treatment of the rate is vague: https://datatracker.ietf.org/doc/html/rfc7845.html#section-5.1. At least I see why FFmpeg always resorts to 48 kHz, even if it might feel strange to users. (I also thought it was strange at first; I still do, but I also get how the FFmpeg developers think about it.) So I don't think it's a bug, yet everything becoming 48 kHz is indeed surprising and goes against the principle of least surprise. At the same time, libopus is only a reference implementation, so we don't need to stick to its extra behaviors that are not defined in the standard.
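
For reference, the rate in question lives in the identification header ("OpusHead" packet) described in that RFC section. Below is a minimal stdlib-only sketch of reading it, assuming the bytes passed in contain the complete ID header packet (it is carried in the first Ogg page, so the first few KB of a file should be enough). `parse_opus_head` and `OpusHead` are hypothetical names, and the optional channel mapping table is ignored.

```python
import struct
from typing import NamedTuple


class OpusHead(NamedTuple):
    version: int
    channels: int
    pre_skip: int
    input_sample_rate: int  # informational only; 0 means "unspecified"
    output_gain: int        # signed Q7.8 value in dB
    mapping_family: int


# Fixed part of the ID header: magic, version, channel count, pre-skip,
# input sample rate, output gain, channel mapping family (little-endian).
_HEAD = struct.Struct("<8sBBHIhB")


def parse_opus_head(data: bytes) -> OpusHead:
    """Locate the OpusHead packet in raw bytes and decode its fixed fields."""
    start = data.find(b"OpusHead")
    if start < 0 or len(data) - start < _HEAD.size:
        raise ValueError("no complete OpusHead packet found")
    _magic, *fields = _HEAD.unpack_from(data, start)
    return OpusHead(*fields)
```

Usage would look like `parse_opus_head(open(path, "rb").read(4096)).input_sample_rate`. Since the RFC treats this field as metadata about the original input rather than as the playback rate, a loader that wanted to restore the original rate would still have to decode at a supported rate and resample.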

> 5. I don't know if anyone needs ffmpeg-less build of torchaudio, but in this case directly building against libopus/libflac might make sense as they are quite small and self-container libraries in terms of code footprint.

I see your point, but for OPUS, I think one can work around this by using the Python wrapper you referred to. When you know that all the audio files in your dataset are OPUS, there should be no problem using it. I hear that conversion from a NumPy ndarray to a Torch Tensor is quite fast.

@vadimkantorov
Author

About adding an optional target sample_rate to torchaudio.load: I would say it's okay to add these kinds of high-level improvements for user convenience.

If users are interested in a particular backend, they can use it directly (btw, pysoundfile still does not support opus unless some patches are made). And yes, of course one can directly use ctypes libopus wrappers, but it's just less convenient and more boilerplate.

For soundfile, torchaudio could use its own builtin resampler to downsample. Currently, one most often has to do this kind of boilerplate post-processing anyway.
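
A minimal sketch of that boilerplate, wrapped in a hypothetical helper `load_resampled` (not a torchaudio API). It assumes `torchaudio.load` and `torchaudio.functional.resample` as the defaults; the `load_fn` and `resample_fn` parameters are illustrative hooks that make the decoding backend an explicit argument instead of global state.

```python
def load_resampled(path, target_sample_rate=None, *, load_fn=None, resample_fn=None):
    """Load audio and, if requested, resample it to target_sample_rate.

    Returns (waveform, sample_rate), mirroring torchaudio.load.
    """
    if load_fn is None or resample_fn is None:
        # Imported lazily so that fully custom backends do not need torchaudio.
        import torchaudio

        load_fn = load_fn or torchaudio.load
        resample_fn = resample_fn or torchaudio.functional.resample

    waveform, sample_rate = load_fn(path)
    if target_sample_rate is None or target_sample_rate == sample_rate:
        return waveform, sample_rate  # nothing to do
    resampled = resample_fn(waveform, sample_rate, target_sample_rate)
    return resampled, target_sample_rate
```

With the defaults, `load_resampled("speech.opus", 16000)` decodes the whole file at the decoder's rate and resamples afterwards, so it gives the convenience discussed here but not the memory savings of resampling during decoding.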
