Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix load from file object for small files and shorter bytes #1181

Merged
merged 3 commits into from
Jan 27, 2021

Conversation

mthrok
Copy link
Collaborator

@mthrok mthrok commented Jan 15, 2021

In #1158 file-like object support was added to load function.
The code works an object that implements read protocol read(size: int) -> bytes.
The implementation expected that the read method will always returns the bytes with the exact requested length, but that should not be required.

This PR relax it with couple of tweaks.

  1. Introduce read_fileobj helper function that calls read and retry until the requested length is fetched / EOF is reached.
  2. Mark explicitly the end of file object read, and will not call the method again once it reaches the EOF.
  3. Tweak the behavior around file opening (reading header and initializing decoder) so that buffer size can be changed, and it works fine on tiny audio data.

@mthrok mthrok force-pushed the tweak-load-fileobj branch 2 times, most recently from 823265e to 07da48d Compare January 15, 2021 23:36
@mthrok mthrok marked this pull request as ready for review January 20, 2021 04:55
@mthrok mthrok force-pushed the tweak-load-fileobj branch 4 times, most recently from 4a1640d to cbaa492 Compare January 22, 2021 02:30
@mthrok mthrok changed the title Guard the call to Python method in C++ Fix load from file object Jan 22, 2021
@mthrok mthrok added this to the v0.8 milestone Jan 25, 2021
@mthrok mthrok force-pushed the tweak-load-fileobj branch 3 times, most recently from a67210d to dd92058 Compare January 26, 2021 20:35
Copy link
Contributor

@datumbox datumbox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I left only 2 non-blocking comments.

// * This can be changed with `torchaudio.utils.sox_utils.set_buffer_size`.
auto capacity =
(sox_get_globals()->bufsiz > 256) ? sox_get_globals()->bufsiz : 256;
std::string buffer(capacity, '\0');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps there is a better type on std instead of string to define your buffer? Perhaps py::bytes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the implementation of py::bytes, they do not provide direct access to the underlying memory. The only way to get an access to the memory is converting it to std::string. So I think it is straightforward to use std::string

torchaudio/csrc/sox/utils.cpp Outdated Show resolved Hide resolved
@mthrok mthrok force-pushed the tweak-load-fileobj branch from 97f4b51 to f858a4b Compare January 27, 2021 17:21
@mthrok mthrok changed the title Fix load from file object Fix load from file object for small files and shorter bytes Jan 27, 2021
@mthrok mthrok merged commit 47d97e3 into pytorch:master Jan 27, 2021
@mthrok mthrok deleted the tweak-load-fileobj branch January 27, 2021 17:59
mthrok pushed a commit to mthrok/audio that referenced this pull request Feb 26, 2021
* Update index.rst

* Delete aws_distributed_training_tutorial.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants