
Better structure dataset implementations #910

Closed · vincentqb opened this issue Sep 15, 2020 · 22 comments · Fixed by #1127

@vincentqb
Contributor

vincentqb commented Sep 15, 2020

The suggestion is to adjust all the datasets (e.g. speechcommands) to follow some changes done in tedlium, see example below.

Relates to #852, GTZAN #791, tedlium #882.
cc @mthrok @cpuhrsch

@mthrok
Collaborator

mthrok commented Sep 15, 2020

  • Replace walk by an explicit iterator over the files, as here

Note that "iterator" here is NOT a Python generator expression. None of the datasets require being iterable, and requiring them to be iterable adds no benefit yet adds constraints (see the sketch at the end of this comment).

  • Rename _walker to _filelist, as here.

What problem does this solve? I do not see the benefit of explicitly aligning the names of internal attributes.

  • Move _load_item as method to main class, as here.

The important thing here is that the arguments to _load_item are sufficient to locate a sample in the dataset.

  • Add a (static?) method for loading audio, as here.

No static, please. There is no point in making it static.

I think it is missing an important one: remove walk_files from the code base.
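
For illustration, an explicit listing (materialized up front) could look like the following sketch, with hypothetical names rather than an actual implementation:

from pathlib import Path

def _parse_filesystem(self, root):
    # List the matching files explicitly, sorted for determinism; the list
    # is materialized up front so that __len__ stays cheap.
    self._items = sorted(str(p) for p in Path(root).glob("**/*.wav"))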

@mthrok
Collaborator

mthrok commented Sep 15, 2020

Note that consistency is a tool to achieve a good user/developer experience; consistency itself is not a goal, nor the first choice of design. Consistency for the sake of consistency is an invalid choice. When talking about consistency, it should always be clear what benefit that consistency brings.

@vincentqb
Contributor Author

  • Replace walk by an explicit iterator over the files, as here

Note that "iterator" here is NOT a Python generator expression. None of the datasets require being iterable, and requiring them to be iterable adds no benefit yet adds constraints.

cc @cpuhrsch for the generator discussion. Do we have a document about these prior discussions?

I personally like the generator approach, as it mimics Linux piping. However, neither the data loader infrastructure nor JIT really supports generators. Most important to me, though, is that we find a "simple" design that we can replicate. Bonus points for a dataset that is easily convertible from map-style to iterable-style (where the length is not necessarily known or relevant, e.g. generator-based).

  • Rename _walker to _filelist, as here.

What problem does this solve? I do not see the benefit of explicitly aligning the names of internal attributes.

[updated in list] What I really meant here is to align internal attributes where possible.

My goal is to make creating a new dataset as simple and boilerplate as possible. Aligning attributes and implementations makes that easier. The goal of this list was to document where we are trying to go with the dataset implementations, and I'm noting a diverging point here compared to prior design choices.

  • Add a (static?) method for loading audio, as here.

No static, please. There is no point in making it static.

Sure, updated above.

I think it is missing an important one: remove walk_files from the code base.

Added above. This is related to the generator/iterator discussion.

@mthrok
Collaborator

mthrok commented Sep 16, 2020

I personally like the generator approach, as it mimics Linux piping. However, neither the data loader infrastructure nor JIT really supports generators. Most important to me, though, is that we find a "simple" design that we can replicate. Bonus points for a dataset that is easily convertible from map-style to iterable-style (where the length is not necessarily known or relevant, e.g. generator-based).

Making the datasets iterable is a different topic. The point is that the current datasets all materialize and consume the generator right after it is instantiated, for the sake of __len__. Using a generator expression when listing files is not helpful here. Converting datasets into generators is outside the scope of this issue. It's YAGNI.
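
Concretely, what the current datasets do is roughly this (a sketch; walk_files yields file paths lazily, but the result is materialized immediately):

from torchaudio.datasets.utils import walk_files

def __init__(self, root):
    # The generator is consumed right away so that __len__ can be answered,
    # which is why a generator expression buys nothing here.
    walker = walk_files(root, suffix=".wav", prefix=True)
    self._walker = list(walker)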

My goal is to make creating a new dataset as simple and boilerplate as possible. Aligning attributes and implementations makes that easier. The goal of this list was to document where we are trying to go with the dataset implementations, and I'm noting a diverging point here compared to prior design choices.

I do not think that's practical for a list attribute. For methods like _load_item and _load_audio it is good practice, because they clearly define what they are supposed to do and they give a good decoupling of the Dataset's logic (loading from parsing). For attribute names, however:

  1. Alignment of attribute names is hard to enforce (manual enforcement is not possible).
  2. What goes in there is under-defined. What if we want a dataset that contains multiple lists of different natures/lengths in the future?
  3. Internal attributes are not supposed to be used by client code, or documented.
  4. Each class should have the most descriptive name for the list of items it holds. (Actually, in the case of tedlium, it's a segment list, not a file list.)

Also, I think we should get rid of class attributes like _ext_audio. They are much easier to change if they are constructor arguments.
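
For illustration, the difference is roughly this (hypothetical classes, not actual torchaudio datasets):

from torch.utils.data import Dataset

# Before: a class attribute, fixed for every instance.
class MyDatasetOld(Dataset):
    _ext_audio = ".wav"

# After: a constructor argument, overridable per instance.
class MyDatasetNew(Dataset):
    def __init__(self, root: str, ext_audio: str = ".wav"):
        self._path = root
        self._ext_audio = ext_audio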

@mthrok
Collaborator

mthrok commented Sep 16, 2020

For example, we can define a base class for the dataset classes and use abc.abstractmethod to enforce the implementation of the _load_item and _load_audio methods. This will force developers of new dataset classes to separate the different aspects of what a Dataset implementation does.

In my opinion, the current Dataset implementations do the following three things.

  1. parse filesystem
  2. given an index find the key to identify the corresponding sample
  3. load sample

__getitem__ does 2 and the _load_item method does 3. Probably adding one for 1, like _parse_filesystem, also makes sense. Then using abc:

import abc

from torch.utils.data import Dataset


class _LocalFileDataset(Dataset, abc.ABC):

    @abc.abstractmethod
    def _parse_files(self):
        pass

    @abc.abstractmethod
    def _load_sample(self):
        pass

    @abc.abstractmethod
    def _load_audio(self):
        pass

will enforce the same structure.
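
With that in place, forgetting one of the methods fails at instantiation time, e.g. (sketch):

class BrokenDataset(_LocalFileDataset):
    def _parse_files(self):
        pass

    def _load_sample(self):
        pass

    # _load_audio is missing.

BrokenDataset()  # TypeError: Can't instantiate abstract class BrokenDataset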

@vincentqb
Contributor Author

I like the three steps you mentioned: parse the filesystem, identify the sample given the key, load the sample. And it could be a good idea to use abstractmethod to enforce the steps, though we haven't used that in the codebase yet.

I also agree that the generator-based approach may be cute but adds constraints that we shouldn't worry about now.

I've updated the list with this. Thoughts?

vincentqb changed the title from "Re-align Datasets" to "Align dataset implementations" on Sep 16, 2020
@lcances

lcances commented Sep 18, 2020

Hello, I agree that the current state of the datasets needs some guidelines to follow.

I am going to work now with the GTZAN and SpeechCommand datasets that are already present in Torchaudio. I can take care of "aligning" them.

  • _load_item

I think it is a good idea.

Putting the _load_item method inside the class is more flexible for end users who need to do specific processing or don't need all the returned elements.

In my case, for the ESC datasets, it allowed me, within 5 lines, to remove the sampling rate (which I didn't need) and to add a cache system so that loading the audio file and computing the Mel-spectrogram happen only once. I am not sure how I would have been able to do so without it.

from typing import Tuple

from torch import Tensor


class ESC10_NoSR(ESC10):  # ESC10 and cache_feature come from my own code
    @cache_feature
    def __getitem__(self, index: int) -> Tuple[Tensor, int]:
        x, sr, y = super().__getitem__(index)  # drop the sampling rate
        return x, y

  • _parse_files

I'm not sure about the proposed name.

In some cases, a metadata descriptor (often a .csv file) accompanies the dataset and can be more or less complicated. In that case, the files are grouped into the same directory and their names don't provide any information, so parsing them is pretty much useless. I see it more as a metadata preparation step that selects the folds (for cross-validation) if provided, the subset (train, val, test), or a specific dataset variant (10 classes vs. 50 classes for ESC, for instance).

  • transform arguments

In torchvision, the datasets have the arguments transform and target_transform. I find this interesting because it lets the transforms take advantage of the multiprocessing available with the DataLoader. I agree that some of this processing can be performed on the GPU, but data augmentation such as pitch shifting can be CPU intensive and could benefit from it.
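
For reference, the torchvision pattern I mean is roughly the following (a sketch of the pattern, not torchaudio's API):

def __getitem__(self, n: int):
    waveform, sample_rate, label = self._load_item(self._items[n])
    # Both transforms run inside the DataLoader worker processes, so
    # CPU-heavy augmentation such as pitch shifting is parallelized.
    if self.transform is not None:
        waveform = self.transform(waveform)
    if self.target_transform is not None:
        label = self.target_transform(label)
    return waveform, sample_rate, label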

@mthrok
Collaborator

mthrok commented Sep 18, 2020

Hi @leocances

Thanks for your input.

  • _parse_files

I'm not sure about the proposed name.

In some cases, a metadata descriptor (often a .csv file) accompanies the dataset and can be more or less complicated. In that case, the files are grouped into the same directory and their names don't provide any information, so parsing them is pretty much useless. I see it more as a metadata preparation step that selects the folds (for cross-validation) if provided, the subset (train, val, test), or a specific dataset variant (10 classes vs. 50 classes for ESC, for instance).

Sure, I did not spend any extra time coming up with a good name, so I am totally open to changing it. Just to note: since this is an internal method, it should be named in a way that developers (not end users) can intuitively understand what logic they should put in there.

  • transform arguments

In torchvision, the datasets have the arguments transform and target_transform. I find this interesting because it lets the transforms take advantage of the multiprocessing available with the DataLoader. I agree that some of this processing can be performed on the GPU, but data augmentation such as pitch shifting can be CPU intensive and could benefit from it.

I agree with your point: putting the transform into the loading part might give us an opportunity to improve performance. However, I suggest we separate that discussion from this issue, to keep the focus on the alignment this issue is trying to solve. @dongreenberg also mentioned the idea of putting transforms inside. We can have that discussion somewhere else.

@mthrok
Collaborator

mthrok commented Sep 18, 2020

In my case, for the ESC datasets, it allowed me, within 5 lines, to remove the sampling rate (which I didn't need) and to add a cache system so that loading the audio file and computing the Mel-spectrogram happen only once. I am not sure how I would have been able to do so without it.

class ESC10_NoSR(ESC10):
    @cache_feature
    def __getitem__(self, index: int) -> Tuple[Tensor, int]:
        x, sr, y = super().__getitem__(index)
        return x, y

That cache decorator is cool. 👍
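
For reference, a minimal sketch of what an index-keyed cache decorator like that could look like (just a guess; the actual cache_feature lives in your code):

import functools

def cache_feature(getitem):
    # Memoize __getitem__ results by index so that the audio is loaded and
    # the Mel-spectrogram computed only once per sample.
    @functools.wraps(getitem)
    def wrapper(self, index):
        cache = getattr(self, "_cache", None)
        if cache is None:
            cache = self._cache = {}
        if index not in cache:
            cache[index] = getitem(self, index)
        return cache[index]
    return wrapper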

@vincentqb
Contributor Author

I agree with your point: putting the transform into the loading part might give us an opportunity to improve performance. However, I suggest we separate that discussion from this issue, to keep the focus on the alignment this issue is trying to solve. @dongreenberg also mentioned the idea of putting transforms inside. We can have that discussion somewhere else.

Let's indeed move that discussion to #923.

@cpuhrsch
Contributor

I'm in favor of writing datasets that are easily copy-paste-editable. They should be very simple and easily modifiable for users who have advanced use cases (e.g. randomized slicing). We can in turn provide them with a few tools that are typically difficult to write (e.g. a highly efficient load function that can also read subsets) and really focus on those. The code of a dataset, in essence, best describes how a single data point is defined, via its explicit construction. This should be more user friendly than adding more flags, etc.

@vincentqb
Contributor Author

vincentqb commented Oct 14, 2020

To summarize the discussion above, the current suggestion is really just to adjust all the datasets (e.g. speechcommands) to follow some changes done in tedlium.

class SPEECHCOMMANDS(Dataset):

    # Move class attributes to constructor arguments.

    def __init__(self, ...):
        # ...
        self._parse_filesystem()

    def _parse_filesystem(self):
        # Populate self._items, the list mapping n to data point
        # identifiers, without os.walk.
        ...

    def _load_item(self, identifier):
        # Move the loading logic here instead of keeping it as a
        # module-level function.
        return item_tuple

    def __getitem__(self, n: int) -> Tuple[Tensor, int, str, str, int]:
        identifier = self._items[n]
        return self._load_item(identifier)

    def __len__(self) -> int:
        return len(self._items)
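
For instance, _parse_filesystem could build the list with pathlib rather than os.walk (a sketch; the glob pattern is dataset-specific, assuming the usual one-directory-per-label layout of SpeechCommands):

from pathlib import Path

def _parse_filesystem(self):
    # Collect relative paths of all .wav files, sorted for determinism.
    root = Path(self._path)
    self._items = sorted(str(p.relative_to(root)) for p in root.glob("*/*.wav"))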

See also top comment.

cc @cpuhrsch @mthrok

@lcances

lcances commented Oct 15, 2020

Hello,

As I said, I am currently working with some of the datasets that are already implemented in Torchaudio, such as gtzan and speechcommand.

By the end of next week, I should be able to work on these two to make the adjustments needed.

vincentqb changed the title from "Align dataset implementations" to "[DRAFT] Align dataset implementations" on Oct 16, 2020
@vincentqb
Contributor Author

As I said, I am currently working with some of the datasets that are already implemented in Torchaudio, such as gtzan and speechcommand.

By the end of next week, I should be able to work on these two to make the adjustments needed.

@leocances -- thanks a lot for the follow-up :) I've marked this issue as "draft" (for lack of a better flag :)) for now to indicate this is still under discussion.

@mthrok
Collaborator

mthrok commented Dec 22, 2020

@krishnakalyan3

If you are interested, you can take a look into this. The overall direction is described above, but it's not perfectly clarified, so there are some aspects that still need to be figured out.

I think the YesNo dataset can be a good starting point, as it is simple.

@mthrok
Collaborator

mthrok commented Dec 25, 2020

Looking at #1127 (thanks @krishnakalyan3), folder_in_archive should be deprecated and removed from all the Dataset implementations.

It is not consistent with the download + extract behavior and, in the first place, a changed directory structure is not something the library should expect and conform to. If a user has changed the directory structure, that's on the user, and the library should not take care of it.

@mthrok
Collaborator

mthrok commented Dec 25, 2020

Also, _ext_audio should be deprecated and removed.

@krishnakalyan3
Contributor

krishnakalyan3 commented Dec 25, 2020

@mthrok

  1. Removing _ext_audio -> I will replace it with a string.
  2. Deprecating and removing folder_in_archive -> I am also in favor of this; it can also simplify the dataset module in the future.
  3. What about deprecating and removing url? -> As a user, I would rarely specify my own URL unless I knew exactly what I was doing, since the extraction process is highly dependent on the dataset. (Modifying _RELEASE_CONFIGS directly would be a better option, I guess.)

@mthrok
Collaborator

mthrok commented Dec 27, 2020

@krishnakalyan3

Thanks for the feedback. Do you have any other thoughts while working on #1127?

  1. Removing _ext_audio -> I will replace it with a string.

Do you mean a string "constant"? I think that makes sense, especially for ".wav" formats.

  2. Deprecating and removing folder_in_archive -> I am also in favor of this; it can also simplify the dataset module in the future.

Yes. We had to do this with CommonVoice recently, and the resulting code became much simpler.

  3. What about deprecating and removing url?

For url, even though it is rarely used, there are multiple potential scenarios where the source becomes unavailable and the archive is re-hosted somewhere else:

  1. firewalls
  2. the original source becomes unavailable.

Modifying _RELEASE_CONFIGS directly would be a better option

But there are cases where users do not have admin privileges to modify the installed package, so it's better if users can provide their configuration from their client code. Maybe not as a single variable; making a custom configuration type that is specific to the dataset might be a possible option.
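
A sketch of what such a dataset-specific configuration type could look like (hypothetical names, nothing that exists today):

from dataclasses import dataclass

@dataclass
class DatasetSource:
    # Where to fetch the archive and how to verify it; client code constructs
    # its own instance, e.g. to point at an internal mirror.
    url: str
    checksum: str

A dataset would then accept something like SOMEDATASET(root, source=DatasetSource(url=..., checksum=...)), without the user touching the installed package.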

@krishnakalyan3
Contributor

I have also observed that the dataset (TEDLIUM, ...) has release as a parameter in __init__. For the sake of consistency, should we remove it?

@krishnakalyan3
Contributor

@vincentqb @mthrok, could you please advise whether everything looks okay? I am a little worried about keeping PRs open for a long time, as things tend to get stale.

mthrok self-assigned this on Jan 19, 2021
mthrok changed the title from "[DRAFT] Align dataset implementations" to "Better structure dataset implementations" on Jan 19, 2021
@mthrok
Collaborator

mthrok commented Jan 19, 2021

@krishnakalyan3

Sorry about the delayed response. I will take over this work.

Regarding #1127, I think it is a strict improvement in customizability, so I will go ahead and merge it.
Regarding #1143, ping me again in the PR (as a comment, so that I won't miss the notification) once you are ready.
