Removing redundant dataset parameters / batching vs. seq_ordering #516
Comments
I agree on
|
The I've never used windowing and chunking, but the argument might be similar, right? Or are those things that would not make sense to be configured differently in different datasets? |
This is also slightly wrong (I think). I think you can just specify the dict as a string on command line (or we could make this work if we want). See also e.g. here. However, this current way is somewhat hardcoded to HDFDataset. At some point, we might have a better alternative (or use TFRecords directly or so). |
No, this is not similar for windowing. Windowing changes the dimension, so it must be consistent. Chunking is similar though. But as I argued above, I think this should not be changed (the behavior for the user, i.e. the config option; how it is handled internally, of course we can change this and clean it up; this is #376). I'm not sure if we have a conclusion on Maybe we could even go in the opposite direction, and disallow to set |
Ok, makes sense then to have it global.
No, please don't 😄 You often want different sequence orderings for train and eval. (I know there are defaults for dev and eval, but you should be able to overwrite those individually.) And maybe you have other datasets for fine-tuning etc. |
Also strong disagree on that one, |
No, this is not what I meant. I argued above that this can be removed as a global option. Sure it must be consistent but you could accomplish this in different ways, so having a global option might not always be right.
But this is also not what I meant. I just meant to not have this as a dataset option but having it separate, like we already have You can easily have different options for train/dev/eval. I'm just saying it doesn't need to be specified as a dataset option, because it might make sense to decouple it logically.
I did not talk about I don't understand the problem then? I'm just saying, you don't need This is just a very similar argument to #376. |
But this is the problem here, the parameter "batching" has nothing to do with "batching" (in the sense of how to create batches, e.g. sequences vs frames, local bucketing etc...) but is purely about the sequence order. |
Yes, this is what I wrote before. This is a naming issue. This is what I already suggested, to just rename this. E.g. rename it to But the whole discussion here was never about the name so far (except my comment earlier). It was about whether this should be an option (for the user) on the dataset directly, or separately in the config. |
Then I don't understand what you mean. Like having And for meta-datasets, e.g. CombinedDataset, having control over the ordering in each subdataset individually makes it very flexible. A simple example would be, if you do "random" shuffling on CombinedDataset level you don't need shuffling of the subdatasets, i.e. use "default". How would you configure that with global parameters? |
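The CombinedDataset case mentioned above could look roughly like the following config fragment. This is only a sketch: the file names, the sub-dataset keys "a" and "b", and the data_map entries are hypothetical placeholders. The point is just where seq_ordering sits at each level.

```python
# Hypothetical config fragment: file names and sub-dataset keys are placeholders.
train = {
    "class": "CombinedDataset",
    # Shuffling happens once, at the combined level.
    "seq_ordering": "random",
    "datasets": {
        # Sub-datasets keep their natural order; no extra shuffling needed.
        "a": {"class": "HDFDataset", "files": ["a.hdf"], "seq_ordering": "default"},
        "b": {"class": "HDFDataset", "files": ["b.hdf"], "seq_ordering": "default"},
    },
    # Assumed mapping from (sub-dataset key, its data key) to the combined data key.
    "data_map": {("a", "data"): "data", ("b", "data"): "data"},
}
```

With a single global ordering option there would be no obvious place to express that the outer level and the sub-datasets use different values.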
Yes, just like we already have it (but with different names).
Yes. I'm explicitly speaking independently of how we currently have it implemented internally. We should think about what makes sense logically: what is easier for the user, and what is easier to understand and more straightforward.
I don't understand what you mean? You configure I'm just arguing/discussing on the question here whether this should be an option to the dataset (in the
But this discussion here is not about that. Technically, the global
And Also, ignore the technical details of the current implementation. We can change anything we want. So the question just becomes how it should be (and whether that makes sense, or is technically possible; maybe we are overlooking some things). |
I vote against a separate option (named |
So, to clarify: Seq ordering should be coupled to the dataset? But chunking should not be coupled to the dataset? (#376). |
Correct.
|
Discussion related to #508
Many parameters that are part of the dataset can be set globally:
The question here is whether to prohibit setting any of them globally, and to allow them only locally as dataset parameters.
For seq_ordering this is especially problematic, as the global name was batching, which is definitely misleading. I saw configs where people defined both, and most users do not know they are related.
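To illustrate the two spellings side by side, here is a minimal sketch of a config fragment. The file names are hypothetical placeholders, and the chosen ordering values are only examples.

```python
# Hypothetical config fragment: file names are placeholders.

# Global form discussed in this issue: despite its name, it only sets the sequence order.
batching = "random"

# Dataset-local form: each dataset states its own ordering explicitly.
train = {"class": "HDFDataset", "files": ["train.hdf"], "seq_ordering": "random"}
dev = {"class": "HDFDataset", "files": ["dev.hdf"], "seq_ordering": "sorted"}
```

In the dataset-local form it is visible at a glance that train and dev use different orderings, and nothing suggests a connection to how batches are built.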