Better structure dataset implementations #910
Note that the iterator here is NOT a Python generator expression. None of the datasets require being iterable, and requiring them to be iterable adds no benefit, only constraints.
What problem does this solve? I do not see the benefit of explicitly aligning the names of internal attributes.
The important thing about this is that the arguments to
No static please. There is no point in making it static. I think the list is missing an important one: remove `walk_files` from the code base.
Note that consistency is a tool to achieve a good user/developer experience; consistency itself is not a goal, nor the first choice of design. Consistency for the sake of consistency is an invalid choice. When talking about consistency, it should always be clear what benefit that consistency brings.
cc @cpuhrsch for the generator discussion. Do we have a document about these prior discussions? I personally like the generator approach as it mimics Linux piping. However, neither the data loader infrastructure nor JIT really supports generators. Most important to me, though, is that we find a "simple" design that we can replicate. Bonus points for a dataset that is easily convertible from map-style to iterable-style (where length is not necessarily known or relevant, e.g. generator-based).
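For illustration, a minimal sketch of how such a conversion could look with the existing data loader infrastructure; the `IterWrapper` name is hypothetical, not something in the codebase:

```python
from torch.utils.data import Dataset, IterableDataset


class IterWrapper(IterableDataset):
    # Hypothetical adapter: exposes any map-style dataset as an
    # iterable-style one by yielding samples lazily, in index order.
    def __init__(self, dataset: Dataset):
        self.dataset = dataset

    def __iter__(self):
        for n in range(len(self.dataset)):
            yield self.dataset[n]
```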
[updated in list] What I really meant here is to align internal attributes where possible. My goal is to make creating a new dataset as simple and boilerplate as possible. Aligning attributes and implementations makes that easier. The goal of this list was to document where we are trying to go with the dataset implementation, and I'm noting a diverging point here compared to prior design choices.
Sure, updated above.
Added above. This is related to the generator/iterator discussion.
Making the dataset iterable is a different topic. The point is that the current datasets all materialize and consume the generator right after it is instantiated, for the sake of `__len__`.
I do not think that's practical for a list. For methods like
Also, I think we should get rid of class attributes like ...
For example, we can define a base class for the dataset classes and use `abstractmethod`. In my opinion, the current Dataset implementations do the following three things: parse the filesystem, identify a sample given a key, and load the sample. A base class with `abstractmethod` will enforce the same structure.
I like the three steps you mentioned: parse the filesystem, identify a sample given a key, load the sample. It could be a good idea to use `abstractmethod` to enforce the steps, though we haven't used that in the codebase yet. I also agree that the generator-based approach may be cute but adds constraints that we shouldn't worry about now. I've updated the list with this. Thoughts?
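As a rough illustration of that idea (a sketch only; `_DatasetBase` does not exist in the codebase, and the method names follow the list above):

```python
from abc import ABC, abstractmethod
from typing import Any, List

from torch.utils.data import Dataset


class _DatasetBase(Dataset, ABC):
    # Sketch: enforce the three steps (parse filesystem, identify a
    # sample given a key, load the sample) via abstract methods.
    def __init__(self) -> None:
        self._items: List[Any] = self._parse_filesystem()

    @abstractmethod
    def _parse_filesystem(self) -> List[Any]:
        """Return data point identifiers in a pre-determined order."""

    @abstractmethod
    def _load_item(self, identifier: Any) -> Any:
        """Load one sample given its identifier."""

    def __getitem__(self, n: int) -> Any:
        # Identify the sample given the key, then load it.
        return self._load_item(self._items[n])

    def __len__(self) -> int:
        return len(self._items)
```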
Hello, I agree that the current state of the datasets needs some guidelines to follow. I am going to work now with the GTZAN and SpeechCommands datasets that are already present in torchaudio. I can take care of "aligning" them.
I think it is a good idea. Putting the loading logic inside the class, in my case with the ESC datasets, allowed me within 5 lines to remove the sampling rate (which I didn't need) and to add a cache system so that loading the audio file and computing the Mel-spectrogram happen only once. I am not sure how I would have been able to do so without it.

```python
from typing import Tuple

from torch import Tensor


# ESC10 and cache_feature come from my own codebase.
class ESC10_NoSR(ESC10):
    @cache_feature
    def __getitem__(self, index: int) -> Tuple[Tensor, int]:
        x, sr, y = super().__getitem__(index)
        return x, y
```
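(For reference, a minimal sketch of what such a `cache_feature` decorator might look like; this is a guess, not @leocances's actual implementation:)

```python
import functools


def cache_feature(getitem):
    # Sketch: memoize Dataset.__getitem__ per index so expensive work
    # (decoding audio, computing a Mel-spectrogram) runs only once.
    # Note: cached samples stay in memory for the dataset's lifetime.
    @functools.wraps(getitem)
    def wrapper(self, index):
        cache = self.__dict__.setdefault("_feature_cache", {})
        if index not in cache:
            cache[index] = getitem(self, index)
        return cache[index]

    return wrapper
```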
I'm not sure about the proposed name. In some cases, a metadata descriptor (often a .csv file) accompanies the dataset and can be more or less complicated. In that case, the files are grouped into the same directory, and their names don't provide any information; parsing them is pretty much useless. I see it more as metadata preparation: selecting folders (for cross-validation) if provided, the subset (train, val, test), or a specific dataset variant (10 classes vs. 50 classes for ESC, for instance).
On torchvision, the datasets have the arguments `transform` and `target_transform`.
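For context, a sketch of that torchvision-style convention applied to a toy audio dataset (all names here are hypothetical):

```python
from typing import Callable, List, Optional, Tuple

from torch import Tensor
from torch.utils.data import Dataset


class ToyAudioDataset(Dataset):
    # Hypothetical dataset following torchvision's transform convention:
    # optional callables are applied to the sample and target in __getitem__.
    def __init__(
        self,
        items: List[Tuple[Tensor, int]],
        transform: Optional[Callable] = None,
        target_transform: Optional[Callable] = None,
    ) -> None:
        self._items = items
        self.transform = transform
        self.target_transform = target_transform

    def __getitem__(self, n: int) -> Tuple[Tensor, int]:
        waveform, label = self._items[n]
        if self.transform is not None:
            waveform = self.transform(waveform)
        if self.target_transform is not None:
            label = self.target_transform(label)
        return waveform, label

    def __len__(self) -> int:
        return len(self._items)
```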
Hi @leocances, thanks for your input.
Sure, I did not spend any extra time coming up with a good name, so I am totally open to changing it. Just to note: since this is an internal method, it should be named in a way that developers (not end users) can intuitively understand what logic they should put in there.
I agree with your point; putting the transform into the loading part might give us an opportunity to improve performance. However, I suggest we separate that discussion from this issue, for the sake of focusing on the alignment this issue is trying to solve. @dongreenberg also mentioned the idea of putting transforms inside. We can have that discussion somewhere else.
That cache decorator is cool. 👍
Let's indeed move that discussion to #923.
I'm in favor of writing datasets that are easily copy-paste-editable. They should be very simple and easily modifiable for users who have advanced use cases (e.g. randomized slicing). We can in turn provide them with a few tools that are typically difficult to write (e.g. a highly efficient load function that also allows reading subsets) and really focus on those. The code of a dataset in essence best describes how a single data point is defined via its explicit construction. This is more user-friendly than adding more flags etc.
To summarize the discussion above, the current suggestion is really just adjusting all the datasets (e.g. speechcommands) to follow some changes done in tedlium:

```python
class SPEECHCOMMANDS(Dataset):
    # Move class attributes into the constructor.
    def __init__(self, ...):
        # ...
        self._parse_filesystem()

    def _parse_filesystem(self):
        # Populate self._items, the list mapping n to data point
        # identifiers, without os.walk.
        ...

    def _load_item(self, identifier):
        # Move the load logic here instead of keeping it as a function
        # outside the class.
        return item_tuple

    def __getitem__(self, n: int) -> Tuple[Tensor, int, str, str, int]:
        identifier = self._items[n]
        return self._load_item(identifier)

    def __len__(self) -> int:
        return len(self._items)
```

See also the top comment.
Hello, as I said, I am currently working with some of the datasets that are already implemented in torchaudio, such as GTZAN and SpeechCommands. By the end of next week, I should be able to work on these two to make the needed adjustments.
@leocances -- thanks a lot for the follow-up :) I've marked this issue as "draft" (for lack of a better flag :) ) for now to indicate this is still in discussion.
If you are interested, you can take a look at this. The overall direction is described above, but it's not perfectly clarified, so there are some aspects that still need to be figured out. I think the YesNo dataset can be a good starting point, as it is simple.
Looking at #1127 (thanks @krishnakalyan3), the ... It does not provide consistent behavior with ...

Also ...
Thanks for the feedback. Do you have any other thoughts while working on #1127?
Do you mean string "constant"? I think that makes sense, especially for ".wav" formats.
Yes. We had to do this with CommonVoice recently, and the resulting code became much simpler.
For ...

But there are cases where users do not have admin privileges to modify the installed package, so it's better if users can provide their configuration from their client code. Maybe not as a single variable, but making a custom configuration type that is specific to the dataset might be a possible option.
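A sketch of that "custom configuration type" idea; the class name, fields, and URL are all hypothetical placeholders:

```python
from dataclasses import dataclass


@dataclass
class SpeechCommandsConfig:
    # Hypothetical per-dataset configuration supplied from client code,
    # replacing module-level class attributes; the URL is a placeholder.
    url: str = "https://example.com/speech_commands.tar.gz"
    folder_in_archive: str = "SpeechCommands"
    ext_audio: str = ".wav"


# Client code could then override settings without touching the package:
# dataset = SPEECHCOMMANDS(root, config=SpeechCommandsConfig(ext_audio=".flac"))
```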
I have also observed that the dataset (...)
@vincentqb @mthrok, could you please advise whether everything looks okay? I am a little worried about keeping PRs open for a long time, as things tend to get stale.
The suggestion is to adjust all the datasets (e.g. speechcommands) to follow some changes done in tedlium, see example below.
- `_parse_filesystem` method to extract a list of "data point identifiers" in a pre-determined order, replacing the generic `walk_files`, as here and in "Changed GTZAN so that it only traverses filenames belonging to the dataset" (#791).
- `_load_item` as a method, as here.

Relates to #852, GTZAN #791, tedlium #882.
cc @mthrok @cpuhrsch