-
Notifications
You must be signed in to change notification settings - Fork 661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
torchaudio.load to optionally accept a target sample_rate (and maybe backend=) #2586
Comments
It seems to me that libsndfile (underlying soundfile) also does not support passing a custom sample rate to configure the decoder (e.g. OPUS decoder). So to support this feature (primarily with OPUS or similar codecs that do resampling to target sample rate as part of decoding process), probably custom bindings to libopus (e.g. as in https://github.com/jlaine/opuslib/blob/master/opuslib/api/decoder.py) would be useful (if no per-frame calls are needed and a shared library is assumed, probably even ctypes-bindings would do as via this link) Created a question about this in libsndfile: libsndfile/libsndfile#886 |
also, unclear if currently global backend selection should be done in worker_init_fn... |
It also seems that pysoundfile has issues of reading opus, so the problem of correct resampling is a bit more pressing: bastibe/python-soundfile#252. Is torchaudio using pysoundfile or libsndfile directly? |
It also appears that ffmepg doesn't let the user to directly downsample to the target sample_rate during opus decoding: https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/libopusdec.c#L65 - it always sets sample_rate to 48khz. Ideally, we should be able to directly set it to the required sample_rate (and be able to read an original sample_rate from the header if the target sample_rate is unset). So maybe it might make sense to directly link torchaudio to libopus to support this regime? Also, builtin ffmpeg decoder seems to have severe perf problems when resampling is needed: https://video.stackexchange.com/questions/36610/opus-decoding-in-ffmpeg-how-to-pass-target-sample-rate-and-ensure-libopus-decod |
Hi - Just wanted to let you know that I read the messages, but I don't have the time to properly craft the reply to all the details. Regarding the resampling, looking at their CLI code they seem to use FFT-based downsampling. I am not an expert here, but from https://signalsprocessed.blogspot.com/2016/08/audio-resampling-in-python.html, this downsampling method is considered unsuitable to general audio processing. |
I don't think opus_compare program is relevant to this. Is it? I think it's just some test utility, and the relevant bits are found in decoding codebase My question on opus is on letting the user directly setting target sample_rate of the opus decoder structure And question on general torchaudio.load API is that is that it should accept sample_rate for either passing directly to the decoder (as in opus case or some other decoders which support it) or for doing resampling inside torchaudio.load if it's specified. It's very common need to force (by resampling if needed) some sample_rate out of the audio loading function... |
Would you point the part about the direct downsampling within libopus library you are talking about? Source or CLI help message or whatever. Decoders are generally only responsible for decoding, and resampling should not be part of it. If opus does, it's something special and I first need to understand what it is. |
Yes, opus is special about this. By default it decodes to 48khz (which is what ffmpeg bindings do) or whatever sample_rate stored in the opus file header - which is what opusdec does), but actually it can decode to any sample rate at decoding time directly: Here's how ffmpeg asks for 48khz: https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/libopusdec.c#L65 And similar code of opusdec does the same (e.g. see |
Also note that libopus is extremely easy to interface with as demonstrated by ffmpeg bindings above or by https://github.com/jlaine/opuslib/blob/master/opuslib/api/decoder.py and opus is quite well-spread now and the library is compact, so might make sense to compile against it directly as well |
I already knew this. My ask is where in the libopus does the resampling happening? I am asking this because
If the resampling is implemented on
And this sounds more like overriding than resampling. |
Okay, reading through libopusdec code it seems that the following function indeed does decoding and resampling. However, the structure of |
Also, there're some mentions of speex resampler in opusdec.c: https://github.com/xiph/opus-tools/blob/master/src/opusdec.c#L1157 ... |
Reading https://hydrogenaud.io/index.php/topic,113655.0.html, it does not seem speex resampler is contributing to the performance you are looking for. Also see https://trac.ffmpeg.org/ticket/5240 for why FFmpeg does not recover the original sample rate, and such decision makes sense. |
overall, I don't think it's worthwhile for torchaudio to bind libopus. It does not seem to overweight the cost. (please know that nowadays I am almost maintaining alone alongside all the other works) And if we are to add, I would add switch in |
Hmm. Overall yes, it seems that opus does resampling using speex resampler if not 48khz is required at decoding. So in the context of torchaudio I would say:
|
This one, I totally agree. I actually tried to add this but got a unreasonable push back so could not do it. #816 I am okay with bringing this back.
FFmpeg and sox have own resampling implementations which work on streaming fashion, but soundfile does not. So on adding target sample rate, the consistency of the functionality is an issue. (well we can start by saying soundfile backend does not support this)
Reading the standard, the treatment of rate is vague. https://datatracker.ietf.org/doc/html/rfc7845.html#section-5.1 At least, I see why FFmpeg always resorts to 48k Hz even if it might feel strange to users. (I also thought it was strange at first, still do but I also get how FFmpeg developers think about it.) So I don't think it's a bug, yet indeed everything becomes 48k Hz is surprising and it is agaisnt least-surprise principle of software. but at the same time, libopus is only a reference implementation, so we don't need to stick to its extra behaviors not defined in standard.
I see your point, but For OPUS, I think one can workaround by using the Python wrapper you referred. When you know that all the audios in your dataset are OPUS, there should be no problem using. I hear that conversion from NumPy NDArray to Torch Tensor is quite fast. |
About adding optional target sample_rate for torchaudio.load: I would say it's okay to add these kinds of high-level improvements for user convenience. If users are interested in a particular backend, they can use it directly (btw pysoundfile still does not support opus unless some patches are made). And yes, of course one can directly use ctypes libopus wrappers, but it's just less convenient and more boiler-plate. For soundfile, torchaudio could use its own builtin resampler to downsample. Currently most often one has to do this kind of boilerplate postproc anyway. |
🚀 The feature
E.g. OPUS format supports resampling as part of reading. There is no standard and uniform way of setting sample rate at decoding.
E.g. sox sets it always as 48khz: https://github.com/dmkrepo/libsox/blob/master/src/opus.c#L114 (unofficial repo)
while original opusdec itself tries to first copy it from original source sample rate stored in stream header: https://github.com/xiph/opus-tools/blob/master/src/opusdec.c#L897
Fixing sox to do what opusdec does probably should be a feature request to sox and to ffmpeg. But probably torchaudio should support passing some forced sample_rate and built-in resampling if decoder supports it
It may also be a good idea to directly accept a
backend=
argument as well. This would avoid maintaining it as a global variable and eliminate the need for dataloader worker init code for setting the backend. (Personally, I would even think that the global variable should be phased out in favor of an explicit argument with a default argument)Motivation, pitch
N/A
Alternatives
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: