Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LibriSpeech Base class for metadata #2646

Closed

Conversation

carolineechen
Copy link
Contributor

@carolineechen carolineechen commented Aug 23, 2022

Add LIBRISPEECHBase class that returns the same information as LIBRISPEECH, but returns a file path instead of the loaded waveform. See #2539 for some context.

This design choice results in 2x API for the dataset -- one for metadata mode, and one for wav mode. We may want to think of a better way to arrange the docs on the website since most variables are repeated, so while this under discussion, I have simply left it out of the docs build for now.

TODO: testing

cc: @leo19941227

@carolineechen carolineechen changed the title Add Librispeech Base class for metadata Add LibriSpeech Base class for metadata Aug 23, 2022
@@ -43,16 +43,16 @@ def download_librispeech(root, url):
extract_archive(archive)


def load_librispeech_item(
def get_librispeech_metadata(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to class method so that it's overridable from client code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same goes to load_librispeech_item.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this preferable to having users directly override the __getitem__ method of the dataset (as opposed to overriding a class method that __getitem__ calls which involves going a layer deeper)? also since load_librispeech_item is currently being used by another dataset/accessible to users

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_librispeech_item should not be public. Can we follow up and make it private?

From the regular OOP perspective, the right approach is to let users overwrite __getitem__. However, this forces the users to re-write/copy-paste almost the whole logic of path-parsing and text processing. There are situations where only decoding should be overwritten but since these functions being module-level plain function, it is not plausible. These functions being class method allow to override the logic in the subclass. That's the pattern we should have been followed in the dataset class implementations, but existing datasets not doing that led to the situation where new dataset did not follow either. See #910 (comment) for the basic idea.

)


class LIBRISPEECHBase(Dataset):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not keep dragging the bad naming scheme.

Suggested change
class LIBRISPEECHBase(Dataset):
class LibriSpeechBase(Dataset):

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since we are introducing a new class, which gives rare opportunity to redefine the interface without BC, let's remove folder_in_archive, and make root optional.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, we should split download from the constructor, downloading should not be the responsibility of this class (or at least constructor).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, we can redefine/simplify the API for the base class. curious what's the reasoning for removing downloading from this class?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downloading is a complex procedure. It involves network access, local file access and the permission.
There are so many ways to fail for downloading feature, for example, network unreachable, slow network, directory creation permission, file write permission, and file storage limit. The fact that the existing dataset interface ended up having folder_in_archive argument is one testament.

Also the data integrity is not resolved with the current approach. It is easy to verify the integrity of downloaded archive by using SHA-256 hash, but what about the individual data? Say, I delete one of the WAV file from the directory, should the download function refill it somehow? How to detect it? All these specifications and responsibility of flawless executions should be placed in a separate location. Dataset class should not be responsible.

From the perspective of library aiming to be building block (or single responsibility principle), it's not something that should be performed inside of dataset constructor. It should be performed separately.
By introducing the Base dataset class, we can re-interpret the existing dataset classes as higher-level implementations, composed of multiple primitives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what should the default for root be if making it optional? I remember some conversation about this but don't think we reached a conclusion, and think this is something that could easily be addressed later on as making it optional would not be BC-breaking



class LibriSpeechBase(Dataset):
"""Create a Dataset for *LibriSpeech* [:footcite:`7178964`].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the description can be different from LIBRISPEECH, for example, explicitly mention the class is only for fetching metadata, without loading the actual waveforms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we also add it to the documentation page?

Copy link
Contributor Author

@carolineechen carolineechen Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not add it to docs yet since I wasn't sure the best way to do so with 2x API. Maybe something formatted like this?

  ...
  LibriMix
+ LibriSpeech
   - LibriSpeechBase
   - LIBRISPEECH
  LibriLightLimited
  ...

@carolineechen
Copy link
Contributor Author

switching to a different approach for enabling easy access to metadata, similar to #910 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants