-
Notifications
You must be signed in to change notification settings - Fork 661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sample_rate conversion option to sox_io_backend.load
#816
Conversation
sox_io_backend.load
sox_io_backend.load
5b91caf
to
66499d4
Compare
898213f
to
4b767ee
Compare
4b767ee
to
8a0b921
Compare
|
||
std::vector<std::vector<std::string>> effects; | ||
if (sample_rate != -1) { | ||
effects.emplace_back( | ||
std::vector<std::string>{"rate", std::to_string(sample_rate)}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is adding to the sox_effects chain a resampling transform, right? I feel the user could want many other transform, and that would add many flags. Could it work instead to point out that sox_effects_chain can combine those operations already? (as you point out here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The underlying implementation is same (which is why we can add sample rate conversion to load function), but the intended use of frontend functions are different. If users want to use filtering they can certainly do so by using sox effects. This is for convenience, and not for compatibility. Changing sampling rate is such a popular use case, like using the same dataset to train ASR models for different environments (realtime vs batch or on-device vs server etc...). So having convenient and fast way to change sampling rate is valuable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users also have torchaudio.transforms.Resample
:) Are any of these two (Resample
and sox_effects
) measurably slower than this implementation?
I'm not keen on adding parameters like these to load functions when there are already simple ways of doing so. On top of adding parameters to the signature, this also does add layers of backward-compatibility etc to the codebase.
How about other backends like soundfile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users also have
torchaudio.transforms.Resample
:) Are any of these two (Resample
andsox_effects
) measurably slower than this implementation?
The difference is 120x
script
import time
import torchaudio
torchaudio.set_audio_backend('sox_io')
n_rep = 100
sr = 8000
t0 = time.monotonic()
for _ in range(n_rep):
torchaudio.load('test/torchaudio_unittest/assets/steam-train-whistle-daniel_simon.wav', sample_rate=sr)
t1 = time.monotonic()
print((t1-t0) / 100)
t0 = time.monotonic()
for _ in range(n_rep):
wave, sample_rate = torchaudio.load('test/torchaudio_unittest/assets/steam-train-whistle-daniel_simon.wav')
torchaudio.compliance.kaldi.resample_waveform(wave, sample_rate, sr)
t1 = time.monotonic()
print((t1-t0) / 100)
$ python foo.py
0.015911198877729474
0.18906223367899655
I'm not keen on adding parameters like these to load functions when there are already simple ways of doing so. On top of adding parameters to the signature, this also does add layers of backward-compatibility etc to the codebase.
How about other backends like soundfile?
This is BC compatible in both Python and C++ Torchscript schema. For other backend, we can add the argument and raise an error saying sample_rate is not supported in this backend
until we actually add the implementation. Which is the same approach as the coming backend clean up on "sox_io" and "soundfile" backends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason why it is a good idea to add sampling rate option to loading function is when extracting a portion of audio file. Like in #882 there are occasions where users want to load only a portion of audio, based on timestamps. Having the correct sampling rate is critical in this case to calculate the target frame positions, and this addition will guarantee that users get the sampling rate they requested even if the underlying audio file has a different sample rate.
In my experience with ASR trainings, the most common mistakes are wrong sampling rate and wrong number of channels. When training models for different purposes, it would really beneficial to load audio while changing sampling rate and extracting a portion like one will do with STM files, without pre-converting audio files and storing on the file system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more consideration: could someone want to use a specific algorithm for resampling, instead of the default one? How would the behavior of the resampling be guaranteed across different backends?
When training models for different purposes, it would really beneficial to load audio while changing sampling rate and extracting a portion like one will do with STM files, without pre-converting audio files and storing on the file system.
We can already do that with in-memory sox effects, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more consideration: could someone want to use a specific algorithm for resampling, instead of the default one? How would the behavior of the resampling be guaranteed across different backends?
I do not think it is necessary to make that guarantee into a specification, because each backend has different limitations. If we make that guarantee into specification (except loading the data as is), we limit ourselves to only take the smallest subset of functionalities of all the backends.
Also, so far backends are for the purpose of platform-portability, and it is not meant to be swapped very often for the same installation.
If users can use the Resample transform for that purpose. I think, leveraging the backend's specific capability to provide more efficient way of perform it has added value.
We can already do that with in-memory sox effects, right?
Yes, but this way it's more efficient when one wants to trim the audio, because sox will discard the un-needed portion.
8a0b921
to
5cf9bf9
Compare
5cf9bf9
to
caf15aa
Compare
c10::intrusive_ptr<torchaudio::sox_utils::TensorSignal> load_audio_file( | ||
const std::string& path, | ||
const int64_t frame_offset = 0, | ||
const int64_t num_frames = -1, | ||
const bool normalize = true, | ||
const bool channels_first = true); | ||
|
||
// ver. 1 sample_rate is added | ||
c10::intrusive_ptr<torchaudio::sox_utils::TensorSignal> load_audio_file_v1( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @mruberry -- in case you would like to comment on how to deprecate torchscript APIs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you're adding a new argument with a default value. In that case you should be able to load older serialized Torchscript without issue, and I don't think you need to the old signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mruberry It seems not the case. I removed the BC code and BC test fails. https://app.circleci.com/pipelines/github/pytorch/audio/3270/workflows/0f877fa3-6256-4ef5-8d84-124684d17a97/jobs/93108
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mruberry It seems not the case. I removed the BC code and BC test fails. https://app.circleci.com/pipelines/github/pytorch/audio/3270/workflows/0f877fa3-6256-4ef5-8d84-124684d17a97/jobs/93108
Sorry, @mthrok, I feel like I misled you but that's what I was told. Empirically it seems to not be the case and I'll have to go back and investigate. I suppose you can keep the old signature and remap it into the new one like you originally were if you want to continue being able to load older serialized Torchscript.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, @mthrok, I feel like I misled you but that's what I was told. Empirically it seems to not be the case and I'll have to go back and investigate. I suppose you can keep the old signature and remap it into the new one like you originally were if you want to continue being able to load older serialized Torchscript.
Do you have an estimate of when it would work without code for backward-compatibility?
I am wondering if I should wait if it's soon. That way we can avoid adding a new schema.
caf15aa
to
abbb52e
Compare
@@ -28,7 +28,8 @@ c10::intrusive_ptr<torchaudio::sox_utils::TensorSignal> load_audio_file( | |||
const int64_t frame_offset = 0, | |||
const int64_t num_frames = -1, | |||
const bool normalize = true, | |||
const bool channels_first = true); | |||
const bool channels_first = true, | |||
const int64_t sample_rate = -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the behavior of the -1 value documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is omitted because I judged that the code is self-explanatory.
This reverts commit abbb52e.
f324375
to
1e722d8
Compare
* Update dcgan_faces_tutorial.py Previous comment might have been slightly misleading. A bit easier to understand, if made explicit, that gradients of errD_real and errD_fake w.r.t. parameters of netD get added up/accumulated because of the successive backward calls without a zero_grad() in between. * Update dcgan_faces_tutorial.py Co-authored-by: Brian Johnson <[email protected]> Co-authored-by: holly1238 <[email protected]>
This PR adds
sample_rate
option tosox_io_backend.load
function which performs resampling when loading audio data from file.Inspired by this comment
On Python interface this change is backward compatible, but Torchscript registration schema is changed, (seeregister.cpp
), therefore previously dumped Torchscript object that includes call tosox_io_backend.load
will stop working. -> BC breaking.Added backward compatibility. TODO: test -> #838