Add sample_rate conversion option to sox_io_backend.load #816

Closed
wants to merge 3 commits into from
44 changes: 44 additions & 0 deletions test/torchaudio_unittest/sox_io_backend/load_test.py
@@ -291,3 +291,47 @@ def test_channels_first(self, channels_first):
         found, _ = sox_io_backend.load(self.path, channels_first=channels_first)
         expected = self.original if channels_first else self.original.transpose(1, 0)
         self.assertEqual(found, expected)
+
+
+@skipIfNoExec('sox')
+@skipIfNoExtension
+class TestSampleRate(TempDirMixin, PytorchTestCase):
+    """Test the `sample_rate` option of `sox_io_backend.load`"""
+    path = None
+
+    def setUp(self):
+        super().setUp()
+        sample_rate = 16000
+        original = get_wav_data('float32', num_channels=2)
+        self.path = self.get_temp_path('original.wave')
+        save_wav(self.path, original, sample_rate)
+
+    @parameterized.expand([(8000, ), (44100, )], name_func=name_func)
+    def test_sample_rate(self, sample_rate):
+        """sample_rate changes sample rate"""
+        found, rate = sox_io_backend.load(self.path, sample_rate=sample_rate)
+        ref_path = self.get_temp_path('reference.wav')
+        sox_utils.run_sox_effect(self.path, ref_path, ['rate', f'{sample_rate}'])
+        expected, expected_rate = load_wav(ref_path)
+
+        assert rate == expected_rate
+        self.assertEqual(found, expected)
+
+    @parameterized.expand(list(itertools.product(
+        [8000, 44100],
+        [0, 1, 10, 100, 1000],
+        [-1, 1, 10, 100, 1000],
+    )), name_func=name_func)
+    def test_frame(self, sample_rate, frame_offset, num_frames):
+        """frame_offset and num_frames applied after sample_rate"""
+        found, rate = sox_io_backend.load(
+            self.path, frame_offset=frame_offset, num_frames=num_frames, sample_rate=sample_rate)
+
+        ref_path = self.get_temp_path('reference.wav')
+        sox_utils.run_sox_effect(self.path, ref_path, ['rate', f'{sample_rate}'])
+        reference, expected_rate = load_wav(ref_path)
+        frame_end = None if num_frames == -1 else frame_offset + num_frames
+        expected = reference[:, frame_offset:frame_end]
+
+        assert rate == expected_rate
+        self.assertEqual(found, expected)
10 changes: 8 additions & 2 deletions torchaudio/backend/sox_io_backend.py
@@ -40,6 +40,7 @@ def load(
         num_frames: int = -1,
         normalize: bool = True,
         channels_first: bool = True,
+        sample_rate: Optional[int] = None,
 ) -> Tuple[torch.Tensor, int]:
     """Load audio data from file.

@@ -84,11 +85,13 @@ def load(
             Path to audio file
         frame_offset (int):
             Number of frames to skip before start reading data.
+            If ``sample_rate`` is given, frames are counted in the resampled audio.
         num_frames (int):
             Maximum number of frames to read. ``-1`` reads all the remaining samples,
             starting from ``frame_offset``.
             This function may return the less number of frames if there is not enough
             frames in the given file.
+            If ``sample_rate`` is given, frames are counted in the resampled audio.
         normalize (bool):
             When ``True``, this function always return ``float32``, and sample values are
             normalized to ``[-1.0, 1.0]``.
@@ -98,15 +101,18 @@ def load(
         channels_first (bool):
             When True, the returned Tensor has dimension ``[channel, time]``.
             Otherwise, the returned Tensor's dimension is ``[time, channel]``.
+        sample_rate (int, optional):
+            If given, resample the audio to this rate on load; ``None`` keeps the original rate.

     Returns:
         torch.Tensor:
             If the input file has integer wav format and normalization is off, then it has
             integer type, else ``float32`` type. If ``channels_first=True``, it has
             ``[channel, time]`` else ``[time, channel]``.
     """
-    signal = torch.ops.torchaudio.sox_io_load_audio_file(
-        filepath, frame_offset, num_frames, normalize, channels_first)
+    sample_rate = -1 if sample_rate is None else sample_rate
+    signal = torch.ops.torchaudio.sox_io_load_audio_file_v1(
+        filepath, frame_offset, num_frames, normalize, channels_first, sample_rate)
     return signal.get_tensor(), signal.get_sample_rate()


3 changes: 3 additions & 0 deletions torchaudio/csrc/register.cpp
@@ -51,6 +51,9 @@ TORCH_LIBRARY(torchaudio, m) {
   m.def(
       "torchaudio::sox_io_load_audio_file",
       &torchaudio::sox_io::load_audio_file);
+  m.def(
+      "torchaudio::sox_io_load_audio_file_v1",
+      &torchaudio::sox_io::load_audio_file_v1);
   m.def(
       "torchaudio::sox_io_save_audio_file",
       &torchaudio::sox_io::save_audio_file);
18 changes: 18 additions & 0 deletions torchaudio/csrc/sox_io.cpp
@@ -53,6 +53,16 @@ c10::intrusive_ptr<TensorSignal> load_audio_file(
     const int64_t num_frames,
     const bool normalize,
     const bool channels_first) {
+  return load_audio_file_v1(path, frame_offset, num_frames, normalize, channels_first, -1);
+}
+
+c10::intrusive_ptr<TensorSignal> load_audio_file_v1(
+    const std::string& path,
+    const int64_t frame_offset,
+    const int64_t num_frames,
+    const bool normalize,
+    const bool channels_first,
+    const int64_t sample_rate) {
   if (frame_offset < 0) {
     throw std::runtime_error(
         "Invalid argument: frame_offset must be non-negative.");
@@ -61,8 +71,16 @@ c10::intrusive_ptr<TensorSignal> load_audio_file(
     throw std::runtime_error(
         "Invalid argument: num_frames must be -1 or greater than 0.");
   }
+  if (sample_rate == 0 || sample_rate < -1) {
+    throw std::runtime_error(
+        "Invalid argument: sample_rate must be -1 or greater than 0.");
+  }

   std::vector<std::vector<std::string>> effects;
+  if (sample_rate != -1) {
+    effects.emplace_back(
+        std::vector<std::string>{"rate", std::to_string(sample_rate)});
Contributor

@vincentqb vincentqb Aug 7, 2020

So this adds a resampling transform to the sox_effects chain, right? I feel users could want many other transforms, and supporting each one this way would add many flags. Could we instead point out that sox_effects_chain can already combine those operations (as you point out here)?

Collaborator Author

The underlying implementation is the same (which is why we can add sample rate conversion to the load function), but the intended uses of the frontend functions are different. If users want filtering, they can certainly get it through sox effects. This option is for convenience, not for compatibility. Changing the sampling rate is a very popular use case, for example using the same dataset to train ASR models for different environments (realtime vs. batch, on-device vs. server, etc.). So a convenient and fast way to change the sampling rate is valuable.
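
A minimal Python sketch of how the load arguments could map onto a sox effects chain, mirroring the C++ in this diff. `build_effects` is a hypothetical helper, not a torchaudio function, and the exact `trim` arguments are an assumption because that part of the diff is elided here.

```python
from typing import List, Optional


def build_effects(
    frame_offset: int = 0,
    num_frames: int = -1,
    sample_rate: Optional[int] = None,
) -> List[List[str]]:
    """Hypothetical helper: translate load arguments into a sox effects chain."""
    effects: List[List[str]] = []
    # Resample first, so the frame arguments below count resampled frames.
    if sample_rate is not None:
        effects.append(["rate", str(sample_rate)])
    # The "s" suffix tells sox that the value is a sample count, not seconds.
    # (Assumed form of the trim arguments; the diff does not show them.)
    if num_frames != -1:
        effects.append(["trim", f"{frame_offset}s", f"{num_frames}s"])
    elif frame_offset != 0:
        effects.append(["trim", f"{frame_offset}s"])
    return effects


print(build_effects(frame_offset=10, num_frames=100, sample_rate=8000))
# [['rate', '8000'], ['trim', '10s', '100s']]
```

Putting `rate` ahead of `trim` is what makes `frame_offset` and `num_frames` refer to the resampled signal, which is the behavior `test_frame` checks.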

Contributor

Users also have torchaudio.transforms.Resample :) Is either of these two (Resample and sox_effects) measurably slower than this implementation?

I'm not keen on adding parameters like these to load functions when there are already simple ways of doing so. On top of adding parameters to the signature, this also adds layers of backward compatibility etc. to the codebase.

How about other backends like soundfile?

Collaborator Author

@mthrok mthrok Aug 17, 2020

> Users also have torchaudio.transforms.Resample :) Is either of these two (Resample and sox_effects) measurably slower than this implementation?

The difference is about 12x (roughly 0.016 s vs. 0.189 s per call in the timings below).

script
import time

import torchaudio

torchaudio.set_audio_backend('sox_io')

n_rep = 100
sr = 8000

t0 = time.monotonic()
for _ in range(n_rep):
    torchaudio.load('test/torchaudio_unittest/assets/steam-train-whistle-daniel_simon.wav', sample_rate=sr)
t1 = time.monotonic()


print((t1 - t0) / n_rep)

t0 = time.monotonic()
for _ in range(n_rep):
    wave, sample_rate = torchaudio.load('test/torchaudio_unittest/assets/steam-train-whistle-daniel_simon.wav')
    torchaudio.compliance.kaldi.resample_waveform(wave, sample_rate, sr)
t1 = time.monotonic()

print((t1 - t0) / n_rep)
$ python  foo.py
0.015911198877729474
0.18906223367899655
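
For reference, the speedup implied by the two per-call timings above can be computed directly. This is only arithmetic on the reported numbers; `speedup` is a hypothetical helper name.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Per-call averages copied from the output above.
load_with_sample_rate = 0.015911198877729474
load_then_resample = 0.18906223367899655

ratio = speedup(load_then_resample, load_with_sample_rate)
print(f"{ratio:.1f}x")  # prints "11.9x"
```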

> I'm not keen on adding parameters like these to load functions when there are already simple ways of doing so. On top of adding parameters to the signature, this also adds layers of backward compatibility etc. to the codebase.
>
> How about other backends like soundfile?

This is backward compatible in both Python and the C++ TorchScript schema. For other backends, we can add the argument and raise an error saying that sample_rate is not supported by that backend until we actually add the implementation. This is the same approach as in the upcoming cleanup of the "sox_io" and "soundfile" backends.
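
A minimal sketch of that approach, assuming a backend that cannot resample while loading. `load_without_resampling`, its placeholder return value, and the error message are all hypothetical; this is not the soundfile backend's actual code.

```python
from typing import Optional, Tuple


def load_without_resampling(
    filepath: str,
    sample_rate: Optional[int] = None,
) -> Tuple[str, int]:
    """Hypothetical backend stub that accepts ``sample_rate`` but cannot honor it."""
    if sample_rate is not None:
        raise NotImplementedError(
            "sample_rate is not supported by this backend yet; "
            "load the file as is and resample afterwards."
        )
    # Placeholder result; a real backend would decode the file here.
    native_rate = 16000
    return filepath, native_rate


# Existing call sites that never pass sample_rate keep working.
print(load_without_resampling("speech.wav"))

try:
    load_without_resampling("speech.wav", sample_rate=8000)
except NotImplementedError as err:
    print(f"rejected: {err}")
```

Because the new argument defaults to `None`, callers written against the old signature are unaffected, and only callers that opt in see the error.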

Collaborator Author

Another reason to add a sampling rate option to the loading function is extracting a portion of an audio file. As in #882, there are occasions where users want to load only a portion of the audio, based on timestamps. Knowing the correct sampling rate is critical in this case for calculating the target frame positions, and this addition guarantees that users get the sampling rate they requested even if the underlying audio file has a different one.

In my experience with ASR training, the most common mistakes are a wrong sampling rate and a wrong number of channels. When training models for different purposes, it would be really beneficial to load audio while changing the sampling rate and extracting a portion, as one does with STM files, without pre-converting audio files and storing them on the file system.
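
A small sketch of the timestamp arithmetic described above, assuming segment boundaries are given in seconds (as in an STM-style listing). `segment_to_frames` is a hypothetical helper; rounding to the nearest frame is one reasonable choice, not something the PR specifies.

```python
from typing import Tuple


def segment_to_frames(
    start_sec: float, end_sec: float, sample_rate: int
) -> Tuple[int, int]:
    """Convert a [start, end) segment in seconds to (frame_offset, num_frames)."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if end_sec <= start_sec:
        raise ValueError("end_sec must be greater than start_sec")
    frame_offset = int(round(start_sec * sample_rate))
    num_frames = int(round(end_sec * sample_rate)) - frame_offset
    return frame_offset, num_frames


# The same segment lands on different frames depending on the rate,
# so the rate used here must match the rate of the loaded audio.
print(segment_to_frames(1.5, 4.0, 8000))   # (12000, 20000)
print(segment_to_frames(1.5, 4.0, 16000))  # (24000, 40000)
```

With the option proposed in this PR, the rate passed to this helper and the `sample_rate` passed to `load` would be the same value, so the frame positions stay valid whatever rate the file was stored at.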

Contributor

@vincentqb vincentqb Aug 28, 2020

One more consideration: could someone want to use a specific algorithm for resampling, instead of the default one? How would the behavior of the resampling be guaranteed across different backends?

> When training models for different purposes, it would be really beneficial to load audio while changing the sampling rate and extracting a portion, as one does with STM files, without pre-converting audio files and storing them on the file system.

We can already do that with in-memory sox effects, right?

Collaborator Author

> One more consideration: could someone want to use a specific algorithm for resampling, instead of the default one? How would the behavior of the resampling be guaranteed across different backends?

I do not think it is necessary to make that guarantee part of the specification, because each backend has different limitations. If we put that guarantee into the specification (for anything beyond loading the data as is), we limit ourselves to the smallest common subset of functionality across all the backends.
Also, so far backends exist for platform portability, and they are not meant to be swapped often within the same installation.
Users can use the Resample transform for that purpose. I think leveraging a backend's specific capability to provide a more efficient way of performing it has added value.

> We can already do that with in-memory sox effects, right?

Yes, but this way it's more efficient when one wants to trim the audio, because sox will discard the unneeded portion.

+  }
   if (num_frames != -1) {
     std::ostringstream offset, frames;
     offset << frame_offset << "s";
10 changes: 10 additions & 0 deletions torchaudio/csrc/sox_io.h
@@ -23,13 +23,23 @@ struct SignalInfo : torch::CustomClassHolder {

 c10::intrusive_ptr<SignalInfo> get_info(const std::string& path);

+// ver. 0
 c10::intrusive_ptr<torchaudio::sox_utils::TensorSignal> load_audio_file(
     const std::string& path,
     const int64_t frame_offset = 0,
     const int64_t num_frames = -1,
     const bool normalize = true,
     const bool channels_first = true);
+
+// ver. 1 sample_rate is added
+c10::intrusive_ptr<torchaudio::sox_utils::TensorSignal> load_audio_file_v1(
Contributor

cc @mruberry -- in case you would like to comment on how to deprecate torchscript APIs

Contributor

It looks like you're adding a new argument with a default value. In that case you should be able to load older serialized TorchScript without issue, and I don't think you need to keep the old signature.

Collaborator Author

Contributor

> @mruberry It seems not the case. I removed the BC code and BC test fails. https://app.circleci.com/pipelines/github/pytorch/audio/3270/workflows/0f877fa3-6256-4ef5-8d84-124684d17a97/jobs/93108

Sorry, @mthrok, I feel like I misled you but that's what I was told. Empirically it seems to not be the case and I'll have to go back and investigate. I suppose you can keep the old signature and remap it into the new one like you originally were if you want to continue being able to load older serialized Torchscript.

Collaborator Author

> Sorry, @mthrok, I feel like I misled you but that's what I was told. Empirically it seems to not be the case and I'll have to go back and investigate. I suppose you can keep the old signature and remap it into the new one like you originally were if you want to continue being able to load older serialized Torchscript.

Do you have an estimate of when this will work without backward-compatibility code?
If it's soon, I am wondering whether I should wait. That way we can avoid adding a new schema.
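
The "keep the old signature and remap it into the new one" idea can be sketched in plain Python. This is an analogy for the C++ in this diff, not the registered TorchScript ops themselves; the dictionary return value is a stand-in chosen only to make the forwarding visible.

```python
from typing import Any, Dict


def load_audio_file_v1(
    path: str,
    frame_offset: int = 0,
    num_frames: int = -1,
    normalize: bool = True,
    channels_first: bool = True,
    sample_rate: int = -1,
) -> Dict[str, Any]:
    """New entry point; returns its arguments so the forwarding can be checked."""
    return {
        "path": path,
        "frame_offset": frame_offset,
        "num_frames": num_frames,
        "normalize": normalize,
        "channels_first": channels_first,
        "sample_rate": sample_rate,
    }


def load_audio_file(
    path: str,
    frame_offset: int = 0,
    num_frames: int = -1,
    normalize: bool = True,
    channels_first: bool = True,
) -> Dict[str, Any]:
    """Old entry point, kept so previously serialized callers still resolve."""
    # Forward every argument by name; -1 means "keep the original rate".
    return load_audio_file_v1(
        path,
        frame_offset=frame_offset,
        num_frames=num_frames,
        normalize=normalize,
        channels_first=channels_first,
        sample_rate=-1,
    )


print(load_audio_file("speech.wav", 10, 100, False, True))
```

Forwarding by keyword makes it harder to drop or reorder an argument when two signatures differ by a single trailing parameter.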

+    const std::string& path,
+    const int64_t frame_offset = 0,
+    const int64_t num_frames = -1,
+    const bool normalize = true,
+    const bool channels_first = true,
+    const int64_t sample_rate = -1);
Contributor

@vincentqb vincentqb Aug 28, 2020

Where is the behavior of the -1 value documented?

Collaborator Author

It is omitted because I judged that the code is self-explanatory.
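
For readers of the thread, the `-1` convention in this diff can be summarized in a short Python sketch: the Python wrapper maps `None` to `-1`, and the C++ side rejects `0` and anything below `-1`. `to_native_sample_rate` is a hypothetical name that combines both steps for illustration.

```python
from typing import Optional


def to_native_sample_rate(sample_rate: Optional[int]) -> int:
    """Map the Python-level option to the integer passed to the C++ op."""
    # None (the Python default) becomes the sentinel -1: no resampling.
    native = -1 if sample_rate is None else sample_rate
    # Same condition as the C++ check in this diff.
    if native == 0 or native < -1:
        raise ValueError("sample_rate must be -1 or greater than 0.")
    return native


print(to_native_sample_rate(None))  # -1, keep the file's rate
print(to_native_sample_rate(8000))  # 8000, resample to 8 kHz
```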


 void save_audio_file(
     const std::string& file_name,
     const c10::intrusive_ptr<torchaudio::sox_utils::TensorSignal>& signal,