
Better structure dataset implementations #910

Closed · vincentqb opened this issue Sep 15, 2020 · 22 comments · Fixed by #1127

@vincentqb
Contributor

vincentqb commented Sep 15, 2020

The suggestion is to adjust all the datasets (e.g. speechcommands) to follow some changes done in tedlium, see example below.

Relates to #852, GTZAN #791, tedlium #882.
cc @mthrok @cpuhrsch

@mthrok
Collaborator

mthrok commented Sep 15, 2020

  • Replace walk by an explicit iterator over the files, as here

Note that "iterator" here is NOT a Python generator expression. None of the datasets require being iterable, and requiring them to be iterable adds no benefit yet adds constraints (see the sketch at the end of this comment).

  • Rename _walker to _filelist, as here.

What problem does this solve? I do not see the benefit of explicitly aligning the names of internal attributes.

  • Move _load_item as method to main class, as here.

The important thing here is that the arguments to _load_item are sufficient to locate a sample in the dataset.

  • Add a (static?) method for loading audio, as here.

No static, please. There is no point in making it static.

I think it is missing an important one: remove walk_files from the code base.
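
For illustration, an explicit listing (materialized up front) could look like the following sketch, with hypothetical names rather than an actual implementation:

from pathlib import Path

def _parse_filesystem(self, root):
    # List the matching files explicitly, sorted for determinism; the list
    # is materialized up front so that __len__ stays cheap.
    self._items = sorted(str(p) for p in Path(root).glob("**/*.wav"))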

@mthrok
Collaborator

mthrok commented Sep 15, 2020

Note that consistency is a tool to achieve a good user/developer experience; consistency itself is not a goal, nor the first choice of design. Consistency for the sake of consistency is an invalid choice. When talking about consistency, it should always be clear what benefit that consistency brings.

@vincentqb
Contributor Author

  • Replace walk by an explicit iterator over the files, as here

Note that "iterator" here is NOT a Python generator expression. None of the datasets require being iterable, and requiring them to be iterable adds no benefit yet adds constraints.

cc @cpuhrsch for the generator discussion. Do we have a document about these prior discussions?

I personally like the generator approach, as it mimics Linux piping. However, neither the data loader infrastructure nor JIT really supports generators. Most important to me, though, is that we find a "simple" design that we can replicate. Bonus points for a dataset that is easily convertible from map-style to iterable-style (where the length is not necessarily known or relevant, e.g. generator-based).

  • Rename _walker to _filelist, as here.

What problem does this solve? I do not see the benefit of explicitly aligning the names of internal attributes.

[updated in list] What I really meant here is to align internal attributes where possible.

My goal is to make creating a new dataset as simple and boilerplate as possible. Aligning attributes and implementations makes that easier. The goal of this list was to document where we are trying to go with the dataset implementations, and I'm noting a diverging point here compared to prior design choices.

  • Add a (static?) method for loading audio, as here.

No static, please. There is no point in making it static.

Sure, updated above.

I think it is missing an important one: remove walk_files from the code base.

Added above. This is related to the generator/iterator discussion.

@mthrok
Collaborator

mthrok commented Sep 16, 2020

I personally like the generator approach, as it mimics Linux piping. However, neither the data loader infrastructure nor JIT really supports generators. Most important to me, though, is that we find a "simple" design that we can replicate. Bonus points for a dataset that is easily convertible from map-style to iterable-style (where the length is not necessarily known or relevant, e.g. generator-based).

Making the datasets iterable is a different topic. The point is that the current datasets all materialize and consume the generator right after it is instantiated, for the sake of __len__. Using a generator expression when listing files is not helpful here. Converting datasets into generators is outside the scope of this issue. It's YAGNI.
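
Concretely, what the current datasets do is roughly this (a sketch; walk_files yields file paths lazily, but the result is materialized immediately):

from torchaudio.datasets.utils import walk_files

def __init__(self, root):
    # The generator is consumed right away so that __len__ can be answered,
    # which is why a generator expression buys nothing here.
    walker = walk_files(root, suffix=".wav", prefix=True)
    self._walker = list(walker)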

My goal is to make creating a new dataset as simple and boilerplate as possible. Aligning attributes and implementations makes that easier. The goal of this list was to document where we are trying to go with the dataset implementations, and I'm noting a diverging point here compared to prior design choices.

I do not think that's practical for a list attribute. For methods like _load_item and _load_audio it is good practice, because they clearly define what they are supposed to do and they give a good decoupling of the Dataset's logic (loading from parsing). For attribute names, however:

  1. Alignment of attribute names is hard to enforce (manual enforcement is not possible).
  2. What goes in there is under-defined. What if we want a dataset that contains multiple lists of different natures/lengths in the future?
  3. Internal attributes are not supposed to be used by client code, or documented.
  4. Each class should have the most descriptive name for the list of items it holds. (Actually, in the case of tedlium, it's a segment list, not a file list.)

Also, I think we should get rid of class attributes like _ext_audio. They are much easier to change if they are constructor arguments.
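
For illustration, the difference is roughly this (hypothetical classes, not actual torchaudio datasets):

from torch.utils.data import Dataset

# Before: a class attribute, fixed for every instance.
class MyDatasetOld(Dataset):
    _ext_audio = ".wav"

# After: a constructor argument, overridable per instance.
class MyDatasetNew(Dataset):
    def __init__(self, root: str, ext_audio: str = ".wav"):
        self._path = root
        self._ext_audio = ext_audio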

@mthrok
Collaborator

mthrok commented Sep 16, 2020

For example, we can define a base class for the dataset classes and use abc.abstractmethod to enforce the implementation of the _load_item and _load_audio methods. This will force developers of new dataset classes to separate the different aspects of what a Dataset implementation does.

In my opinion, the current Dataset implementations do the following three things.

  1. parse filesystem
  2. given an index find the key to identify the corresponding sample
  3. load sample

__getitem__ does 2 and the _load_item method does 3. Probably adding one for 1, like _parse_filesystem, also makes sense. Then using abc:

import abc

from torch.utils.data import Dataset


class _LocalFileDataset(Dataset, abc.ABC):

    @abc.abstractmethod
    def _parse_files(self):
        pass

    @abc.abstractmethod
    def _load_sample(self):
        pass

    @abc.abstractmethod
    def _load_audio(self):
        pass

will enforce the same structure.
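
With that in place, forgetting one of the methods fails at instantiation time, e.g. (sketch):

class BrokenDataset(_LocalFileDataset):
    def _parse_files(self):
        pass

    def _load_sample(self):
        pass

    # _load_audio is missing.

BrokenDataset()  # TypeError: Can't instantiate abstract class BrokenDataset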

@vincentqb
Contributor Author

I like the three steps you mentioned: parse the filesystem, identify the sample given the key, load the sample. And it could be a good idea to use abstractmethod to enforce the steps, though we haven't used that in the codebase yet.

I also agree that the generator-based approach may be cute but adds constraints that we shouldn't worry about now.

I've updated the list with this. Thoughts?

vincentqb changed the title from "Re-align Datasets" to "Align dataset implementations" on Sep 16, 2020
@lcances

lcances commented Sep 18, 2020

Hello, I agree that the current state of the datasets needs some guidelines to follow.

I am going to work now with the GTZAN and SpeechCommand datasets that are already present in Torchaudio. I can take care of "aligning" them.

  • _load_item

I think it is a good idea.

Putting the _load_item method inside the class is more flexible for end users who need to do specific processing or don't need all the returned elements.

In my case, for the ESC datasets, it allowed me, within 5 lines, to remove the sampling rate (which I didn't need) and to add a cache system so that loading the audio file and computing the Mel-spectrogram happen only once. I am not sure how I would have been able to do so without it.

from typing import Tuple

from torch import Tensor


class ESC10_NoSR(ESC10):  # ESC10 and cache_feature come from my own code
    @cache_feature
    def __getitem__(self, index: int) -> Tuple[Tensor, int]:
        x, sr, y = super().__getitem__(index)  # drop the sampling rate
        return x, y

  • _parse_files

I'm not sure about the proposed name.

In some cases, a metadata descriptor (often a .csv file) accompanies the dataset and can be more or less complicated. In that case, the files are grouped into the same directory and their names don't provide any information, so parsing them is pretty much useless. I see it more as a metadata preparation step that selects the folds (for cross-validation) if provided, the subset (train, val, test), or a specific dataset variant (10 classes vs. 50 classes for ESC, for instance).

  • transform arguments

In torchvision, the datasets have the arguments transform and target_transform. I find this interesting because it lets the transforms take advantage of the multiprocessing available with the DataLoader. I agree that some of this processing can be performed on the GPU, but data augmentation such as pitch shifting can be CPU intensive and could benefit from it.
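
For reference, the torchvision pattern I mean is roughly the following (a sketch of the pattern, not torchaudio's API):

def __getitem__(self, n: int):
    waveform, sample_rate, label = self._load_item(self._items[n])
    # Both transforms run inside the DataLoader worker processes, so
    # CPU-heavy augmentation such as pitch shifting is parallelized.
    if self.transform is not None:
        waveform = self.transform(waveform)
    if self.target_transform is not None:
        label = self.target_transform(label)
    return waveform, sample_rate, label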

@mthrok
Collaborator

mthrok commented Sep 18, 2020

Hi @leocances

Thanks for your input.

  • _parse_files

I'm not sure about the proposed name.

In some cases, a metadata descriptor (often a .csv file) accompanies the dataset and can be more or less complicated. In that case, the files are grouped into the same directory and their names don't provide any information, so parsing them is pretty much useless. I see it more as a metadata preparation step that selects the folds (for cross-validation) if provided, the subset (train, val, test), or a specific dataset variant (10 classes vs. 50 classes for ESC, for instance).

Sure, I did not spend any extra time coming up with a good name, so I am totally open to changing it. Just to note: since this is an internal method, it should be named in a way that developers (not end users) can intuitively understand what logic they should put in there.

  • transform arguments

In torchvision, the datasets have the arguments transform and target_transform. I find this interesting because it lets the transforms take advantage of the multiprocessing available with the DataLoader. I agree that some of this processing can be performed on the GPU, but data augmentation such as pitch shifting can be CPU intensive and could benefit from it.

I agree with your point: putting the transform into the loading part might give us an opportunity to improve performance. However, I suggest we separate that discussion from this issue, to keep the focus on the alignment this issue is trying to solve. @dongreenberg also mentioned the idea of putting transforms inside. We can have that discussion somewhere else.

@mthrok
Collaborator

mthrok commented Sep 18, 2020

In my case, for the ESC datasets, it allowed me, within 5 lines, to remove the sampling rate (which I didn't need) and to add a cache system so that loading the audio file and computing the Mel-spectrogram happen only once. I am not sure how I would have been able to do so without it.

class ESC10_NoSR(ESC10):
    @cache_feature
    def __getitem__(self, index: int) -> Tuple[Tensor, int]:
        x, sr, y = super().__getitem__(index)
        return x, y

That cache decorator is cool. 👍
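
For reference, a minimal sketch of what an index-keyed cache decorator like that could look like (just a guess; the actual cache_feature lives in your code):

import functools

def cache_feature(getitem):
    # Memoize __getitem__ results by index so that the audio is loaded and
    # the Mel-spectrogram computed only once per sample.
    @functools.wraps(getitem)
    def wrapper(self, index):
        cache = getattr(self, "_cache", None)
        if cache is None:
            cache = self._cache = {}
        if index not in cache:
            cache[index] = getitem(self, index)
        return cache[index]
    return wrapper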

@vincentqb
Contributor Author

I agree with your point: putting the transform into the loading part might give us an opportunity to improve performance. However, I suggest we separate that discussion from this issue, to keep the focus on the alignment this issue is trying to solve. @dongreenberg also mentioned the idea of putting transforms inside. We can have that discussion somewhere else.

Let's indeed move that discussion to #923.

@cpuhrsch
Contributor

I'm in favor of writing datasets that are easily copy-paste-editable. They should be very simple and easily modifiable for users who have advanced use cases (e.g. randomized slicing). We can in turn provide them with a few tools that are typically difficult to write (e.g. a highly efficient load function that can also read subsets) and really focus on those. The code of a dataset, in essence, best describes how a single data point is defined, via its explicit construction. This should be more user friendly than adding more flags, etc.

@vincentqb
Contributor Author

vincentqb commented Oct 14, 2020

To summarize the discussion above, the current suggestion is really just to adjust all the datasets (e.g. speechcommands) to follow some changes done in tedlium.

class SPEECHCOMMANDS(Dataset):

    # Move class attributes to constructor arguments.

    def __init__(self, ...):
        # ...
        self._parse_filesystem()

    def _parse_filesystem(self):
        # Populate self._items, the list mapping n to data point
        # identifiers, without os.walk.
        ...

    def _load_item(self, identifier):
        # Move the loading logic here instead of keeping it as a
        # module-level function.
        return item_tuple

    def __getitem__(self, n: int) -> Tuple[Tensor, int, str, str, int]:
        identifier = self._items[n]
        return self._load_item(identifier)

    def __len__(self) -> int:
        return len(self._items)
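
For instance, _parse_filesystem could build the list with pathlib rather than os.walk (a sketch; the glob pattern is dataset-specific, assuming the usual one-directory-per-label layout of SpeechCommands):

from pathlib import Path

def _parse_filesystem(self):
    # Collect relative paths of all .wav files, sorted for determinism.
    root = Path(self._path)
    self._items = sorted(str(p.relative_to(root)) for p in root.glob("*/*.wav"))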

See also top comment.

cc @cpuhrsch @mthrok

@lcances

lcances commented Oct 15, 2020

Hello,

As I said, I am currently working with some of the datasets that are already implemented in Torchaudio, such as gtzan and speechcommand.

By the end of next week, I should be able to work on these two to make the adjustments needed.

vincentqb changed the title from "Align dataset implementations" to "[DRAFT] Align dataset implementations" on Oct 16, 2020
@vincentqb
Contributor Author

As I said, I am currently working with some of the datasets that are already implemented in Torchaudio, such as gtzan and speechcommand.

By the end of next week, I should be able to work on these two to make the adjustments needed.

@leocances -- thanks a lot for the follow-up :) I've marked this issue as "draft" (for lack of a better flag :)) for now to indicate this is still under discussion.

@mthrok
Collaborator

mthrok commented Dec 22, 2020

@krishnakalyan3

If you are interested, you can take a look into this. The overall direction is described above, but it's not perfectly clarified, so there are some aspects that still need to be figured out.

I think the YesNo dataset can be a good starting point, as it is simple.

@mthrok
Collaborator

mthrok commented Dec 25, 2020

Looking at #1127 (thanks @krishnakalyan3), folder_in_archive should be deprecated and removed from all the Dataset implementations.

It is not consistent with the download + extract behavior and, in the first place, a changed directory structure is not something the library should expect and conform to. If a user has changed the directory structure, that's on the user, and the library should not take care of it.

@mthrok
Collaborator

mthrok commented Dec 25, 2020

Also, _ext_audio should be deprecated and removed.

@krishnakalyan3
Contributor

krishnakalyan3 commented Dec 25, 2020

@mthrok

  1. Removing _ext_audio -> I will replace it with a string.
  2. Deprecating and removing folder_in_archive -> I am also in favor of this; it can also simplify the dataset module in the future.
  3. What about deprecating and removing url? -> As a user, I would rarely specify my own URL unless I knew exactly what I was doing, since the extraction process is highly dependent on the dataset. (Modifying _RELEASE_CONFIGS directly would be a better option, I guess.)

@mthrok
Collaborator

mthrok commented Dec 27, 2020

@krishnakalyan3

Thanks for the feedback. Do you have any other thoughts while working on #1127?

  1. Removing _ext_audio -> I will replace it with a string.

Do you mean a string "constant"? I think that makes sense, especially for ".wav" formats.

  2. Deprecating and removing folder_in_archive -> I am also in favor of this; it can also simplify the dataset module in the future.

Yes. We had to do this with CommonVoice recently, and the resulting code became much simpler.

  3. What about deprecating and removing url?

For url, even though it is rarely used, there are multiple potential scenarios where the source becomes unavailable and the archive is re-hosted somewhere else:

  1. firewalls
  2. the original source becomes unavailable.

Modifying _RELEASE_CONFIGS directly would be a better option

But there are cases where users do not have admin privileges to modify the installed package, so it's better if users can provide their configuration from their client code. Maybe not as a single variable; making a custom configuration type that is specific to the dataset might be a possible option.
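
A sketch of what such a dataset-specific configuration type could look like (hypothetical names, nothing that exists today):

from dataclasses import dataclass

@dataclass
class DatasetSource:
    # Where to fetch the archive and how to verify it; client code constructs
    # its own instance, e.g. to point at an internal mirror.
    url: str
    checksum: str

A dataset would then accept something like SOMEDATASET(root, source=DatasetSource(url=..., checksum=...)), without the user touching the installed package.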

@krishnakalyan3
Contributor

I have also observed that the dataset (TEDLIUM, ...) has release as a parameter in __init__. For the sake of consistency, should we remove it?

@krishnakalyan3
Contributor

@vincentqb @mthrok, could you please advise whether everything looks okay? I am a little worried about keeping PRs open for a long time, as things tend to get stale.

mthrok self-assigned this on Jan 19, 2021
mthrok changed the title from "[DRAFT] Align dataset implementations" to "Better structure dataset implementations" on Jan 19, 2021
@mthrok
Collaborator

mthrok commented Jan 19, 2021

@krishnakalyan3

Sorry about the delayed response. I will take over this work.

Regarding #1127, I think it is a strict improvement in customizability, so I will go ahead and merge it.
Regarding #1143, ping me again in the PR (as a comment, so that I won't miss the notification) once you are ready.
