Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sample_rate conversion option to sox_io_backend.load #816

Closed
wants to merge 3 commits into from

Conversation

mthrok
Copy link
Collaborator

@mthrok mthrok commented Jul 23, 2020

This PR adds sample_rate option to sox_io_backend.load function which performs resampling when loading audio data from file.

Inspired by this comment

On Python interface this change is backward compatible, but Torchscript registration schema is changed, (see register.cpp), therefore previously dumped Torchscript object that includes call to sox_io_backend.load will stop working. -> BC breaking.
Added backward compatibility. TODO: test -> #838

@mthrok mthrok changed the title [BC Breaking] Add sample_rate conversion option to sox_io_backend.load Add sample_rate conversion option to sox_io_backend.load Jul 23, 2020
@mthrok mthrok force-pushed the load-sample-rate branch 5 times, most recently from 5b91caf to 66499d4 Compare July 30, 2020 19:06
@mthrok mthrok force-pushed the load-sample-rate branch 12 times, most recently from 898213f to 4b767ee Compare August 6, 2020 18:48
@mthrok mthrok force-pushed the load-sample-rate branch from 4b767ee to 8a0b921 Compare August 7, 2020 19:15
@mthrok mthrok marked this pull request as ready for review August 7, 2020 19:45
@mthrok mthrok requested a review from vincentqb August 7, 2020 19:46

std::vector<std::vector<std::string>> effects;
if (sample_rate != -1) {
effects.emplace_back(
std::vector<std::string>{"rate", std::to_string(sample_rate)});
Copy link
Contributor

@vincentqb vincentqb Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is adding to the sox_effects chain a resampling transform, right? I feel the user could want many other transform, and that would add many flags. Could it work instead to point out that sox_effects_chain can combine those operations already? (as you point out here)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying implementation is same (which is why we can add sample rate conversion to load function), but the intended use of frontend functions are different. If users want to use filtering they can certainly do so by using sox effects. This is for convenience, and not for compatibility. Changing sampling rate is such a popular use case, like using the same dataset to train ASR models for different environments (realtime vs batch or on-device vs server etc...). So having convenient and fast way to change sampling rate is valuable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users also have torchaudio.transforms.Resample :) Are any of these two (Resample and sox_effects) measurably slower than this implementation?

I'm not keen on adding parameters like these to load functions when there are already simple ways of doing so. On top of adding parameters to the signature, this also does add layers of backward-compatibility etc to the codebase.

How about other backends like soundfile?

Copy link
Collaborator Author

@mthrok mthrok Aug 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users also have torchaudio.transforms.Resample :) Are any of these two (Resample and sox_effects) measurably slower than this implementation?

The difference is 120x

script
import time

import torchaudio

torchaudio.set_audio_backend('sox_io')

n_rep = 100
sr = 8000

t0 = time.monotonic()
for _ in range(n_rep):
    torchaudio.load('test/torchaudio_unittest/assets/steam-train-whistle-daniel_simon.wav', sample_rate=sr)
t1 = time.monotonic()


print((t1-t0) / 100)

t0 = time.monotonic()
for _ in range(n_rep):
    wave, sample_rate = torchaudio.load('test/torchaudio_unittest/assets/steam-train-whistle-daniel_simon.wav')
    torchaudio.compliance.kaldi.resample_waveform(wave, sample_rate, sr)
t1 = time.monotonic()

print((t1-t0) / 100)
$ python  foo.py
0.015911198877729474
0.18906223367899655

I'm not keen on adding parameters like these to load functions when there are already simple ways of doing so. On top of adding parameters to the signature, this also does add layers of backward-compatibility etc to the codebase.

How about other backends like soundfile?

This is BC compatible in both Python and C++ Torchscript schema. For other backend, we can add the argument and raise an error saying sample_rate is not supported in this backend until we actually add the implementation. Which is the same approach as the coming backend clean up on "sox_io" and "soundfile" backends.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason why it is a good idea to add sampling rate option to loading function is when extracting a portion of audio file. Like in #882 there are occasions where users want to load only a portion of audio, based on timestamps. Having the correct sampling rate is critical in this case to calculate the target frame positions, and this addition will guarantee that users get the sampling rate they requested even if the underlying audio file has a different sample rate.

In my experience with ASR trainings, the most common mistakes are wrong sampling rate and wrong number of channels. When training models for different purposes, it would really beneficial to load audio while changing sampling rate and extracting a portion like one will do with STM files, without pre-converting audio files and storing on the file system.

Copy link
Contributor

@vincentqb vincentqb Aug 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more consideration: could someone want to use a specific algorithm for resampling, instead of the default one? How would the behavior of the resampling be guaranteed across different backends?

When training models for different purposes, it would really beneficial to load audio while changing sampling rate and extracting a portion like one will do with STM files, without pre-converting audio files and storing on the file system.

We can already do that with in-memory sox effects, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more consideration: could someone want to use a specific algorithm for resampling, instead of the default one? How would the behavior of the resampling be guaranteed across different backends?

I do not think it is necessary to make that guarantee into a specification, because each backend has different limitations. If we make that guarantee into specification (except loading the data as is), we limit ourselves to only take the smallest subset of functionalities of all the backends.
Also, so far backends are for the purpose of platform-portability, and it is not meant to be swapped very often for the same installation.
If users can use the Resample transform for that purpose. I think, leveraging the backend's specific capability to provide more efficient way of perform it has added value.

We can already do that with in-memory sox effects, right?

Yes, but this way it's more efficient when one wants to trim the audio, because sox will discard the un-needed portion.

c10::intrusive_ptr<torchaudio::sox_utils::TensorSignal> load_audio_file(
const std::string& path,
const int64_t frame_offset = 0,
const int64_t num_frames = -1,
const bool normalize = true,
const bool channels_first = true);

// ver. 1 sample_rate is added
c10::intrusive_ptr<torchaudio::sox_utils::TensorSignal> load_audio_file_v1(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mruberry -- in case you would like to comment on how to deprecate torchscript APIs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you're adding a new argument with a default value. In that case you should be able to load older serialized Torchscript without issue, and I don't think you need to the old signature.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mruberry It seems not the case. I removed the BC code and BC test fails. https://app.circleci.com/pipelines/github/pytorch/audio/3270/workflows/0f877fa3-6256-4ef5-8d84-124684d17a97/jobs/93108

Sorry, @mthrok, I feel like I misled you but that's what I was told. Empirically it seems to not be the case and I'll have to go back and investigate. I suppose you can keep the old signature and remap it into the new one like you originally were if you want to continue being able to load older serialized Torchscript.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, @mthrok, I feel like I misled you but that's what I was told. Empirically it seems to not be the case and I'll have to go back and investigate. I suppose you can keep the old signature and remap it into the new one like you originally were if you want to continue being able to load older serialized Torchscript.

Do you have an estimate of when it would work without code for backward-compatibility?
I am wondering if I should wait if it's soon. That way we can avoid adding a new schema.

@@ -28,7 +28,8 @@ c10::intrusive_ptr<torchaudio::sox_utils::TensorSignal> load_audio_file(
const int64_t frame_offset = 0,
const int64_t num_frames = -1,
const bool normalize = true,
const bool channels_first = true);
const bool channels_first = true,
const int64_t sample_rate = -1);
Copy link
Contributor

@vincentqb vincentqb Aug 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the behavior of the -1 value documented?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is omitted because I judged that the code is self-explanatory.

This reverts commit abbb52e.
@mthrok mthrok closed this Nov 3, 2020
mthrok pushed a commit to mthrok/audio that referenced this pull request Dec 13, 2022
* Update dcgan_faces_tutorial.py

Previous comment might have been slightly misleading. A bit easier to understand, if made explicit, that gradients of errD_real and errD_fake w.r.t. parameters of netD get added up/accumulated because of the successive backward calls without a zero_grad() in between.

* Update dcgan_faces_tutorial.py

Co-authored-by: Brian Johnson <[email protected]>
Co-authored-by: holly1238 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants