
Datagenerator #39

Merged
merged 64 commits into from
Jan 17, 2021

Conversation

WolfByttner
Contributor

No description provided.

Saran-nns and others added 13 commits January 2, 2021 23:31
When the sequence length is not uniform,
categories do not all have the same number
of samples. This can make certain categories get
dropped if they are in an unfortunate position in
the indices list.

The net effect is that some tests fail at random.
Since this behaviour is known (incomplete batches cannot
be used), it is not reasonable for tests to fail when
it is encountered. Thus all sequences are given the
same length for the time being.
When splitting samples into categories,
alignment issues can, again, cause problems.
However, with an evenly divisible number of
samples, and with equally large samplers,
the loaders consistently exhibit the correct behaviour.
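The commit messages above can be illustrated with a minimal, framework-agnostic sketch (hypothetical code, not the actual traja implementation): when every trajectory has the same length, sliding-window extraction yields the same number of samples per category, so an even split across categories never drops one of them.

```python
# Hypothetical sketch, not the actual traja code: with uniform sequence
# lengths, every category produces the same number of windows.

def sliding_windows(series, seq_len, stride=1):
    """Split one trajectory into fixed-length windows."""
    return [series[i:i + seq_len]
            for i in range(0, len(series) - seq_len + 1, stride)]

# Two categories with the SAME trajectory length...
cat_a = list(range(10))
cat_b = list(range(100, 110))

# ...yield the same number of samples each, so no category can be
# starved by landing in an incomplete (dropped) batch.
windows_a = sliding_windows(cat_a, seq_len=4)
windows_b = sliding_windows(cat_b, seq_len=4)
assert len(windows_a) == len(windows_b) == 7
```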
@Saran-nns Saran-nns self-requested a review January 10, 2021 13:49
@Saran-nns
Member

Saran-nns commented Jan 10, 2021

I'm not done with this branch yet. I haven't sent a PR, nor a review request. Anyway, thanks for the updates.
I need to look into the Pytest suite for a few reasons:

  1. We need to perform checks with all the models that we have (not just ODEs).
  2. Train and test splits were done optimally in the previous versions. There was no sharing of data between the train and test splits: (1,2,3,4,5) != (2,3,4,5,6) in time series.
  3. Weighted random sampling was removed in the recent updates. Why? WRS is the reason users may need the Traja data loaders; otherwise, they could simply use SubsetRandomSampler from PyTorch and do train/test splits based on their data (or category) manually.
  4. SubsetRandomSampling was already part of WeightedRandomSampler. It assigned weights to samples from each class to avoid overfitting to classes with many samples.
  5. Why do ODEs require category-wise train and test splits? For time series prediction/generation this shouldn't be the case. Is it more specific to a task that you are working on?
  6. The previous loaders should not make the training slower. They were just optimized to get SOTA performance. An improvement from 95.4 to 95.5 is always good enough to show better performance against ML benchmarks.
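For readers unfamiliar with the distinction drawn in points 3 and 4, here is a hedged sketch of what weighted random sampling buys you, in plain Python standing in for PyTorch's `torch.utils.data.WeightedRandomSampler` (the dataset and numbers are made up):

```python
import random
from collections import Counter

# Illustrative sketch: drawing samples with inverse-class-frequency weights
# rebalances an imbalanced dataset, which a plain subset sampler does not.
labels = [0] * 90 + [1] * 10                  # 90/10 class imbalance
counts = Counter(labels)
weights = [1.0 / counts[y] for y in labels]   # minority samples weigh more

random.seed(0)
drawn = random.choices(range(len(labels)), weights=weights, k=1000)
balance = Counter(labels[i] for i in drawn)
# Both classes now appear in roughly equal proportion (about 500 each).
```

In PyTorch, the same per-sample weights would be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)` and handed to the `DataLoader` via its `sampler` argument.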

@WolfByttner
Contributor Author

@Saran-nns I created this new follow-up issue. Since I don't know what the weighted random sampling loaders are supposed to do, I'd be grateful for a unit test that helps clarify things.

@WolfByttner
Contributor Author

@Saran-nns

  1. Which models should we test with? Could you provide the list here so we can create issues and add tests as appropriate, please?
  2. The issue was that the same category could be in both the train and test datasets. This caused some results to be 'polluted', especially in classification and regression training runs.
  3. There is no test for this. Please write one (or outline what it should do) and I can add the requisite functionality. Since the PR is already prepared, I suggest we merge it and add new functionality (with the new tests) in a different PR.
  4. See 3.
  5. Even datasets such as Fortasyn require it. When you have a trajectory of unknown provenance and you want to assign a class to it, you need it to not be part of the training set.
  6. The stride argument (now propagated all the way through) lets the user optimise the density of the sampling. stride = 1 recreates the previous behaviour.
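As a rough illustration of point 6 (hypothetical code, not the traja API): the stride controls how much consecutive windows overlap, with `stride=1` giving the densest sampling and `stride=seq_len` giving disjoint windows.

```python
# Sketch of the stride idea: stride=1 yields maximally dense, overlapping
# windows; stride=seq_len yields non-overlapping ones.

def strided_windows(series, seq_len, stride):
    return [series[i:i + seq_len]
            for i in range(0, len(series) - seq_len + 1, stride)]

series = list(range(8))
dense = strided_windows(series, seq_len=3, stride=1)   # overlapping
sparse = strided_windows(series, seq_len=3, stride=3)  # disjoint

assert len(dense) == 6 and dense[1] == [1, 2, 3]       # shares data with dense[0]
assert sparse == [[0, 1, 2], [3, 4, 5]]                # no shared data
```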

Previously, stride was an argument provided to the data generator.
This argument lets the user select a denser sampling of the dataset,
which is appropriate for category-wise sampling. Now the argument
is exposed to the end user.
@Saran-nns
Member

  1. Please provide the test notebook for LSTMs, AEs and VAEs.
  2. For classification and regression, categories should be in both the train and test datasets. What you are looking for is out-of-distribution (OOD) detection, which is out of scope for the moment. This way, we also fall short of training data, which could have a big impact on small datasets like the jaguar set with a sample length of 200.
  3. We do not need a test for weighted random sampling; it was already part of the PyTorch data loader object by default.
  4. We need to add it back. I am not sure why it was removed.
  5. The scalers are also removed. Scaling is done, but where are the scalers? How do we rescale the data during inference if the data generator doesn't return the scalers?
  6. Striding is a good idea, but it could be done with less damage to the previous version of the data generator.

So overall, what we might need to do:
  - No category-based train/test split, since we are not performing OOD detection.
  - Weighted random sampling for multiclass datasets should be the default, but keep subset random sampling optional (let the user decide).
  - Scalers should be returned from the data loader instance. This is also good for training ODE models.
  - Please also provide the unit test code that tests this module. It has to be updated.
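The point about the scalers can be sketched as follows (hypothetical minimal code, not the traja dataset API): the generator fits the scaler on the data and returns it alongside the scaled samples, so inference code can invert the transform.

```python
# Hypothetical sketch: return the fitted scaler together with the data so
# predictions can be mapped back to the original scale at inference time.

class MinMaxScaler:
    """Toy min-max scaler standing in for sklearn's MinMaxScaler."""
    def fit(self, data):
        self.lo, self.hi = min(data), max(data)
        return self
    def transform(self, data):
        span = self.hi - self.lo
        return [(x - self.lo) / span for x in data]
    def inverse_transform(self, data):
        span = self.hi - self.lo
        return [x * span + self.lo for x in data]

def make_dataset(raw):
    scaler = MinMaxScaler().fit(raw)
    return scaler.transform(raw), scaler   # hand the fitted scaler back too

scaled, scaler = make_dataset([10.0, 20.0, 30.0])
assert scaled == [0.0, 0.5, 1.0]
assert scaler.inverse_transform(scaled) == [10.0, 20.0, 30.0]
```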

@codecov-io

codecov-io commented Jan 13, 2021

Codecov Report

Merging #39 (8b745f2) into master (77a4871) will increase coverage by 7.22%.
The diff coverage is 92.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #39      +/-   ##
==========================================
+ Coverage   65.17%   72.40%   +7.22%     
==========================================
  Files          11       25      +14     
  Lines        1674     2551     +877     
==========================================
+ Hits         1091     1847     +756     
- Misses        583      704     +121     
Impacted Files Coverage Δ
traja/accessor.py 66.66% <ø> (ø)
traja/dataset/example.py 100.00% <ø> (ø)
traja/models/visualizer.py 0.00% <ø> (ø)
traja/parsers.py 62.29% <ø> (ø)
traja/models/inference.py 31.81% <16.66%> (ø)
traja/plotting.py 56.50% <85.71%> (+1.54%) ⬆️
traja/models/generative_models/vae.py 95.40% <89.47%> (ø)
traja/models/train.py 93.54% <90.14%> (ø)
traja/__init__.py 92.30% <100.00%> (ø)
traja/dataset/__init__.py 100.00% <100.00%> (ø)
... and 29 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@WolfByttner
Contributor Author

This PR also resolves #40.

Collaborator

@JustinShenk JustinShenk left a comment


LGTM

@JustinShenk JustinShenk merged commit e3dcf25 into master Jan 17, 2021
@JustinShenk JustinShenk deleted the datagenerator branch January 17, 2021 16:34
Saran-nns pushed a commit that referenced this pull request Aug 23, 2024