
RFC: Fast Audio Loading (from file) #1000

Closed

mthrok opened this issue Nov 3, 2020 · 10 comments

Comments

@mthrok (Collaborator) commented Nov 3, 2020

Background

Fast audio loading is critical to audio applications, and even more so for music data, which differs from speech data in the following ways.

  • Higher sampling rates such as 44.1k and 48k (8k and 16k are typical choices in speech).
  • Long durations, like 5 minutes (an utterance in speech is only a few seconds).
  • Training with random seeks is a de-facto standard method for tasks like source separation, VAD, diarization, and speaker/language identification.
  • Audio data are more often in formats other than WAV (mp3, aac, flac, ogg, etc.).

Proposal

Add a new I/O scheme to torchaudio that utilizes libraries providing faster decoding, wider codec coverage, and portability across OSs (Linux / macOS / Windows).
Currently torchaudio binds libsox, which is not supported on Windows. There are a variety of decoding libraries we can take advantage of, including:

  • minimp3 (CC0-1.0 License)
    Fast mp3 decoding library.
  • minimp4 (CC0-1.0 License)
    Similar to minimp3, an MP4 decoding library by the same author.
  • libsndfile (LGPL-2.1 License)
    Fast for wav format.*
    Also handles flac, ogg/vorbis
  • SpeexDSP (BSD-3-Clause License)
    Resampling library.
  • (Optionally) ffmpeg (libavcodec) (LGPL v2.1+, MIT/X11/BSD etc)
    Covers a much wider range of codecs, with higher decode/encode quality, but not as fast.
    Can handle AAC format (in addition to what is already listed above) and a lot more.

Unlike the existing torchaudio backends, which each implement the same generic interfaces, the new I/O will provide one unified Python interface across all supported platforms (Linux / macOS / Windows) and delegate library selection to the underlying C++ implementation.

Benchmarks for some of these libraries are available at https://github.com/faroit/python_audio_loading_benchmark. (Thanks @faroit!)

Non-Goals

In-memory decoding

In-memory decoding support is nice to have, but currently we do not know whether it is possible to pass memory objects from Python to C++ via TorchScript. For the sake of simplicity, we exclude this feature from the scope of this proposal. For a Python-only solution, see the gist in #800.
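
For illustration only, a minimal Python-only sketch of in-memory decoding (this is not the approach from #800; it assumes the third-party soundfile package, which accepts file-like objects):

import io

import soundfile
import torch

# Decode audio that is already held in memory as bytes (e.g. read from a
# database or over the network), without touching the filesystem again.
with open("foobar.wav", "rb") as f:
    data = f.read()

array, sample_rate = soundfile.read(io.BytesIO(data), dtype="float32", always_2d=True)
waveform = torch.from_numpy(array)  # shape: [time, channel]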

Streaming decoding

Streaming decoding support will be critical for real-time applications. However, it is difficult to design real-time decoding as a stand-alone module, because the design of the downstream processing (preprocessing, feeding the NN, consuming the result) is tightly coupled to the upstream I/O mechanism. Streaming decoding is therefore excluded from this proposal.

Effects (filterings)

ffmpeg supports filtering like libsox does. We could make that available as well, but it is outside the scope of fast audio loading.

Interface

Python frontend to the C++ interface. No (significant) logic should happen here.

# in torchaudio.io module
# we can call this from `torchaudio.load` too.
def load_audio_from_file(
    path: str,
    *,
    offset: Optional[float] = None,
    duration: Optional[float] = None,
    sample_rate: Optional[float] = None,
    normalize: bool = True,
    channels_first: bool = True,
    offset_unit: str = "second",
    format: Optional[str] = None,
) -> namedtuple(waveform, sample_rate):
"""Load audio from file

Args:
    path (str or pathlib.Path):
        Path to the audio file.

    offset (float, optional):
        Offset of reading, in the unit given by `offset_unit`.
        Defaults to the beginning of the audio file.

    duration (float, optional):
        Duration of reading, in the unit given by `offset_unit`.
        Defaults to the rest of the audio file.

    sample_rate (float, optional):
        When provided, the audio is resampled.

    normalize (bool, optional):
        When `True`, this function always returns `float32`, and
        sample values are normalized to `[-1.0, 1.0]`.
        If the input file is an integer WAV, passing `False` returns
        a Tensor with the corresponding integer dtype.
        This argument has no effect on formats other than
        integer WAV.

    channels_first (bool, optional):
        When `True`, the returned Tensor has dimension
        `[channel, time]` otherwise `[time, channel]`.

    offset_unit (str, optional):
        The unit of `offset` and `duration`:
        `"second"` or `"frame"`. (Defaults to `"second"`.)

    format (str, optional):
        Override the format detection.

Returns:
    namedtuple `(waveform, sample_rate)`:
        `waveform` is a Tensor and `sample_rate` is a float.
"""

Example Usage (Python)

import torchaudio


# Load the entire file
waveform, sample_rate = torchaudio.io.load_audio_from_file(
    "foobar.wav",
)

# Load the segment at 1.0 - 3.0 seconds
waveform, sample_rate = torchaudio.io.load_audio_from_file(
    "foobar.wav",
    offset = 1.,
    duration = 2.,
)

# Load the entire file, resample it to 8000 Hz
waveform, sample_rate = torchaudio.io.load_audio_from_file(
    "foobar.wav",
    sample_rate = 8000,
)

# Load the segment at 1.0 - 3.0 seconds, resample it to 8000 Hz
waveform, sample_rate = torchaudio.io.load_audio_from_file(
    "foobar.wav",
    offset = 1.,
    duration = 2.,
    sample_rate = 8000,
)

FAQ

Will the proposed API replace the current torchaudio.load ?

No, this proposal does not remove torchaudio.load or ask users to migrate to the new API. Instead, torchaudio.load will make use of the proposed API. (The details of how it does so are TBD.)

When we come to supporting other types of I/O, such as memory objects, file-like objects, or streaming objects, we will design those APIs separately and plug them into torchaudio.load.

This way we decouple the concerns and requirements while still being able to extend the functionality.
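
Purely as an illustration of that direction, since the dispatch mechanism itself is TBD, torchaudio.load could route path inputs to the new API along these lines:

import os

import torchaudio

def load(filepath, *args, **kwargs):
    # Illustrative only: path-like inputs go to the proposed fast loader;
    # other input types (file-like objects, buffers, streams) would get
    # their own APIs later and be plugged in here.
    if isinstance(filepath, (str, os.PathLike)):
        return torchaudio.io.load_audio_from_file(filepath, *args, **kwargs)
    raise TypeError(f"Unsupported input type: {type(filepath)}")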

mthrok added the RFC label Nov 3, 2020
vincentqb mentioned this issue Nov 3, 2020
@vincentqb (Contributor) commented:

Can you clarify how the resampling at load time mentioned in the interface relates to fast loading? I'm assuming the implicit advantage is that the resampling is done "streaming-style" by the backend while loading into memory, is that correct?

A side effect is that the resampling algorithm would depend on the backend.

@vincentqb (Contributor) commented:

(I would also flag the choice of namedtensor for further discussion.)

@andfoy commented Nov 9, 2020

Just passing by on this discussion: we've already enabled FFmpeg support (including packaging) in torchvision for all operating systems. Let us know if we can help you here.

cc @fmassa

@mthrok (Collaborator, Author) commented Nov 11, 2020

@andfoy Thanks for the offer. Yes, at the moment this RFC is about defining the goal. When I get to the implementation details, I will ask for help there.

@mthrok (Collaborator, Author) commented Dec 9, 2020

@andfoy I just realized that if torchaudio binds ffmpeg and ships a static binary, the versions that torchvision and torchaudio ship might collide, so if torchaudio decides to add an ffmpeg binding, we should definitely work together.

@andfoy commented Dec 9, 2020

@mthrok, the last point shouldn't be a problem once pytorch/vision#2818 is merged; that PR always relocates FFmpeg in torchvision in order to prevent collisions.

@fmassa (Member) commented Dec 11, 2020

Hi,

Chiming in with some background on what we did for torchvision.

In the 0.4.0 release, we introduced an API for reading video / audio which is very similar to the API you proposed:

def read_video(
    filename: str, start_pts: int = 0, end_pts: Optional[float] = None, pts_unit: str = "pts"
) -> Tuple[torch.Tensor, torch.Tensor, Dict[str, Any]]:

One difference is that we didn't expose any resampling method for changing the fps of the video (or audio); we left that as a function the user could apply outside.
The reason is that for videos you would most of the time need to read the whole video segment before changing the fps, which can be done outside of the loading function.
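
(For audio, the analogous "resample outside the loader" pattern might look like the sketch below; this is only an illustration using torchaudio.transforms.Resample, not part of the torchvision API being described.)

import torchaudio
from torchaudio.transforms import Resample

# Load at the file's native rate, then resample in a separate, user-controlled step.
waveform, orig_sr = torchaudio.load("foobar.wav")
waveform_8k = Resample(orig_freq=orig_sr, new_freq=8000)(waveform)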

While this API was OK for some types of tasks (like classification), for others it was clearly suboptimal.
As such, with the 0.8.0 release we introduced a new (beta) API for reading audio / video that instead returns an iterator over the file being read. Each call to next on it yields one "frame" (either an audio or a video frame).

Here is one example of how it can be used

from torchvision.io import VideoReader

# stream indicates if reading from audio or video
reader = VideoReader('path_to_video.mp4', stream='video')
# can change the stream after construction
# via reader.set_current_stream

# to read all frames in a video starting at 2 seconds
for frame in reader.seek(2):
    # frame is a dict with "data" and "pts" metadata
    print(frame["data"], frame["pts"])

# because reader is an iterator you can combine it with
# itertools
from itertools import takewhile, islice
# read 10 frames starting from 2 seconds
for frame in islice(reader.seek(2), 10):
    pass
    
# or to return all frames between 2 and 5 seconds
for frame in takewhile(lambda x: x["pts"] < 5, reader.seek(2)):
    pass  # process frames between 2 and 5 seconds

I would be glad to discuss the trade-offs of both APIs. Our decision to go with the more granular API was motivated by flexibility without hurting speed (which was backed by benchmarks we ran), but I understand that decoding audio might have different overheads and thus other things might need to be adjusted.

@cpuhrsch (Contributor) commented:

Some thoughts

Sample rate

I'm wondering whether we should return sample_rate alongside the waveform by default. Why not use torchaudio.info or such?

a) User wants sample rate under current proposal

wav, sample_rate = torchaudio.load(path)

b) User wants sample rate using torchaudio.info

sample_rate = torchaudio.info(path).sample_rate
wav = torchaudio.load(path)

c) User doesn't want sample rate under current proposal

wav, _ = torchaudio.load(path)

or

wav = torchaudio.load(path)[0] # Looks like they're taking the first channel?

d) User doesn't want sample rate using torchaudio.info

wav = torchaudio.load(path)

I claim that we can always augment this function to return a richer type which might include information about sample rates, bit depth or other metadata, but we shouldn't start out with that unless we have a concrete list of reasons.
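
(For concreteness, such a richer type could later look something like the hypothetical sketch below; the extra fields are examples, not a proposal.)

from typing import NamedTuple

import torch

class LoadedAudio(NamedTuple):
    # Hypothetical richer return type that load() could be augmented to return.
    waveform: torch.Tensor
    sample_rate: float
    bits_per_sample: int
    num_channels: int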

Reasons for returning sample_rate by default

  • torchaudio.load currently does it (why did we add it to begin with?)
  • Other libraries do it too (pysoundfile, scipy)
  • Multiple torchaudio functionals need it
  • It's faster (data?)

Reasons for not returning sample_rate by default

  • Metadata can be retrieved separately via an orthogonal info function

Using time or frame number for offsets

I think we should support both by dispatching on the type of offset and duration: if the type is integral, it's interpreted as a frame index; if it's floating point, it's interpreted as time.

Are there formats where there is no clear linear correspondence between the time and frame number?
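
(A minimal sketch of that type-based dispatch; the helper name is made up for illustration.)

def _to_frame_offset(value, sample_rate):
    # Integral values are interpreted as frame indices,
    # floating-point values as seconds.
    if isinstance(value, int):
        return value
    if isinstance(value, float):
        return int(round(value * sample_rate))
    raise TypeError(f"expected int (frames) or float (seconds), got {type(value)}")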

@cpuhrsch (Contributor) commented:

On a general note, I'd propose narrowing the scope of this RFC to just fast audio loading from paths, unless we broaden it to also include io buffers and streams. That is, I'm in favor of having something very specific like torchaudio.io.read_audio_from_path and then adding a mechanism to torchaudio.load that chooses the right factory function.

To de-risk an ill-designed grab bag of location-specific (io buffer, file path, stream, etc.) load functions, we could make this a prototype feature that's only available in the nightlies at first.

mthrok changed the title from [RFC] Fast Audio Loading to [RFC] Fast Audio Loading (from file) Dec 16, 2020
mthrok changed the title from [RFC] Fast Audio Loading (from file) to RFC: Fast Audio Loading (from file) Jan 1, 2021
vincentqb mentioned this issue Jan 8, 2021
@ghost commented Jun 8, 2021

Is there an example of reading an audio file into a torch::Tensor in C++?


Note: Resolved in #1562

mthrok closed this as completed Oct 19, 2023