-
Notifications
You must be signed in to change notification settings - Fork 664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LibriSpeech Base class for metadata #2646
Conversation
@@ -43,16 +43,16 @@ def download_librispeech(root, url): | |||
extract_archive(archive) | |||
|
|||
|
|||
def load_librispeech_item( | |||
def get_librispeech_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this to class method so that it's overridable from client code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same goes to load_librispeech_item
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this preferable to having users directly override the __getitem__
method of the dataset (as opposed to overriding a class method that __getitem__
calls which involves going a layer deeper)? also since load_librispeech_item
is currently being used by another dataset/accessible to users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
load_librispeech_item
should not be public. Can we follow up and make it private?
From the regular OOP perspective, the right approach is to let users overwrite __getitem__
. However, this forces the users to re-write/copy-paste almost the whole logic of path-parsing and text processing. There are situations where only decoding should be overwritten but since these functions being module-level plain function, it is not plausible. These functions being class method allow to override the logic in the subclass. That's the pattern we should have been followed in the dataset class implementations, but existing datasets not doing that led to the situation where new dataset did not follow either. See #910 (comment) for the basic idea.
torchaudio/datasets/librispeech.py
Outdated
) | ||
|
||
|
||
class LIBRISPEECHBase(Dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not keep dragging the bad naming scheme.
class LIBRISPEECHBase(Dataset): | |
class LibriSpeechBase(Dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, since we are introducing a new class, which gives rare opportunity to redefine the interface without BC, let's remove folder_in_archive
, and make root
optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, we should split download from the constructor, downloading should not be the responsibility of this class (or at least constructor).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, we can redefine/simplify the API for the base class. curious what's the reasoning for removing downloading from this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Downloading is a complex procedure. It involves network access, local file access and the permission.
There are so many ways to fail for downloading feature, for example, network unreachable, slow network, directory creation permission, file write permission, and file storage limit. The fact that the existing dataset interface ended up having folder_in_archive
argument is one testament.
Also the data integrity is not resolved with the current approach. It is easy to verify the integrity of downloaded archive by using SHA-256 hash, but what about the individual data? Say, I delete one of the WAV file from the directory, should the download function refill it somehow? How to detect it? All these specifications and responsibility of flawless executions should be placed in a separate location. Dataset class should not be responsible.
From the perspective of library aiming to be building block (or single responsibility principle), it's not something that should be performed inside of dataset constructor. It should be performed separately.
By introducing the Base dataset class, we can re-interpret the existing dataset classes as higher-level implementations, composed of multiple primitives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what should the default for root be if making it optional? I remember some conversation about this but don't think we reached a conclusion, and think this is something that could easily be addressed later on as making it optional would not be BC-breaking
b3a1a4a
to
01489af
Compare
|
||
|
||
class LibriSpeechBase(Dataset): | ||
"""Create a Dataset for *LibriSpeech* [:footcite:`7178964`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel the description can be different from LIBRISPEECH
, for example, explicitly mention the class is only for fetching metadata, without loading the actual waveforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we also add it to the documentation page?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not add it to docs yet since I wasn't sure the best way to do so with 2x API. Maybe something formatted like this?
...
LibriMix
+ LibriSpeech
- LibriSpeechBase
- LIBRISPEECH
LibriLightLimited
...
switching to a different approach for enabling easy access to metadata, similar to #910 (comment) |
Add
LIBRISPEECHBase
class that returns the same information asLIBRISPEECH
, but returns a file path instead of the loaded waveform. See #2539 for some context.This design choice results in 2x API for the dataset -- one for metadata mode, and one for wav mode. We may want to think of a better way to arrange the docs on the website since most variables are repeated, so while this under discussion, I have simply left it out of the docs build for now.
TODO: testing
cc: @leo19941227