
Datagenerator #39

Merged
merged 64 commits into from
Jan 17, 2021

Conversation

WolfByttner
Contributor

No description provided.

Saran-nns and others added 13 commits January 2, 2021 23:31
When the sequence length is not uniform,
categories do not all have the same number
of samples. This can make certain categories get
dropped if they are in an unfortunate position in
the indices list.

The net effect is that some tests fail at random.
Since this behaviour is known (incomplete batches cannot
be used), it is not reasonable for tests to fail when
it is encountered. Thus all sequences are given the
same length for the time being.
When splitting samples into categories,
alignment issues can, again, cause problems.
However, with an evenly divisible number of
samples, and with equally large samplers,
the loaders consistently exhibit the correct behaviour.
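The commit messages above can be illustrated with a minimal, framework-agnostic sketch (hypothetical code, not the actual traja implementation): when every trajectory has the same length, sliding-window extraction yields the same number of samples per category, so an even split across categories never drops one of them.

```python
# Hypothetical sketch, not the actual traja code: with uniform sequence
# lengths, every category produces the same number of windows.

def sliding_windows(series, seq_len, stride=1):
    """Split one trajectory into fixed-length windows."""
    return [series[i:i + seq_len]
            for i in range(0, len(series) - seq_len + 1, stride)]

# Two categories with the SAME trajectory length...
cat_a = list(range(10))
cat_b = list(range(100, 110))

# ...yield the same number of samples each, so no category can be
# starved by landing in an incomplete (dropped) batch.
windows_a = sliding_windows(cat_a, seq_len=4)
windows_b = sliding_windows(cat_b, seq_len=4)
assert len(windows_a) == len(windows_b) == 7
```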
@Saran-nns Saran-nns self-requested a review January 10, 2021 13:49
@Saran-nns
Member

Saran-nns commented Jan 10, 2021

I'm not done with this branch yet. I haven't sent a PR, nor a review request. Anyway, thanks for the updates.
I need to look into the Pytest suite for a few reasons:

  1. We need to perform checks with all the models that we have (not just ODEs).
  2. Train and test splits were done optimally in the previous versions. There was no sharing of data between the train and test splits: (1,2,3,4,5) != (2,3,4,5,6) in time series.
  3. Weighted random sampling was removed in the recent updates. Why? WRS is the reason users may need the Traja data loaders; otherwise, they could simply use SubsetRandomSampler from PyTorch and do train/test splits based on their data (or category) manually.
  4. SubsetRandomSampling was already part of WeightedRandomSampler. It assigned weights to samples from each class to avoid overfitting to classes with many samples.
  5. Why do ODEs require category-wise train and test splits? For time series prediction/generation this shouldn't be the case. Is it more specific to a task that you are working on?
  6. The previous loaders should not make the training slower. They were just optimized to get SOTA performance. An improvement from 95.4 to 95.5 is always good enough to show better performance against ML benchmarks.
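For readers unfamiliar with the distinction drawn in points 3 and 4, here is a hedged sketch of what weighted random sampling buys you, in plain Python standing in for PyTorch's `torch.utils.data.WeightedRandomSampler` (the dataset and numbers are made up):

```python
import random
from collections import Counter

# Illustrative sketch: drawing samples with inverse-class-frequency weights
# rebalances an imbalanced dataset, which a plain subset sampler does not.
labels = [0] * 90 + [1] * 10                  # 90/10 class imbalance
counts = Counter(labels)
weights = [1.0 / counts[y] for y in labels]   # minority samples weigh more

random.seed(0)
drawn = random.choices(range(len(labels)), weights=weights, k=1000)
balance = Counter(labels[i] for i in drawn)
# Both classes now appear in roughly equal proportion (about 500 each).
```

In PyTorch, the same per-sample weights would be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)` and handed to the `DataLoader` via its `sampler` argument.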

@WolfByttner
Contributor Author

@Saran-nns I created this new follow-up issue. Since I don't know what the weighted random sampling loaders are supposed to do, I'd be grateful for a unit test that helps clarify things.

@WolfByttner
Contributor Author

@Saran-nns

  1. Which models should we test with? Could you provide the list here so we can create issues and add tests as appropriate, please?
  2. The issue was that the same category could be in both the train and test datasets. This caused some results to be 'polluted', especially in classification and regression training runs.
  3. There is no test for this. Please write one (or outline what it should do) and I can add the requisite functionality. Since the PR is already prepared, I suggest we merge it and add new functionality (with the new tests) in a different PR.
  4. See 3.
  5. Even datasets such as Fortasyn require it. When you have a trajectory of unknown provenance and you want to assign a class to it, you need it to not be part of the training set.
  6. The stride argument (now propagated all the way through) lets the user optimise the density of the sampling. stride = 1 recreates the previous behaviour.
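As a rough illustration of point 6 (hypothetical code, not the traja API): the stride controls how much consecutive windows overlap, with `stride=1` giving the densest sampling and `stride=seq_len` giving disjoint windows.

```python
# Sketch of the stride idea: stride=1 yields maximally dense, overlapping
# windows; stride=seq_len yields non-overlapping ones.

def strided_windows(series, seq_len, stride):
    return [series[i:i + seq_len]
            for i in range(0, len(series) - seq_len + 1, stride)]

series = list(range(8))
dense = strided_windows(series, seq_len=3, stride=1)   # overlapping
sparse = strided_windows(series, seq_len=3, stride=3)  # disjoint

assert len(dense) == 6 and dense[1] == [1, 2, 3]       # shares data with dense[0]
assert sparse == [[0, 1, 2], [3, 4, 5]]                # no shared data
```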

Previously, stride was an argument provided to the data generator.
This argument lets the user select a denser sampling of the dataset,
which is appropriate for category-wise sampling. Now the argument
is exposed to the end user.
@Saran-nns
Member

  1. Please provide the test notebook for LSTMs, AEs and VAEs.
  2. For classification and regression, categories should be in both the train and test datasets. What you are looking for is out-of-distribution (OOD) detection, which is out of scope for the moment. This way, we also fall short of training data, which could have a big impact on small datasets like the jaguar set with a sample length of 200.
  3. We do not need a test for weighted random sampling; it was already part of the PyTorch data loader object by default.
  4. We need to add it back. I am not sure why it was removed.
  5. The scalers are also removed. Scaling is done, but where are the scalers? How do we rescale the data during inference if the data generator doesn't return the scalers?
  6. Striding is a good idea, but it could be done with less damage to the previous version of the data generator.

So overall, what we might need to do:
  - No category-based train/test split, since we are not performing OOD detection.
  - Weighted random sampling for multiclass datasets should be the default, but keep subset random sampling optional (let the user decide).
  - Scalers should be returned from the data loader instance. This is also good for training ODE models.
  - Please also provide the unit test code that tests this module. It has to be updated.
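The point about the scalers can be sketched as follows (hypothetical minimal code, not the traja dataset API): the generator fits the scaler on the data and returns it alongside the scaled samples, so inference code can invert the transform.

```python
# Hypothetical sketch: return the fitted scaler together with the data so
# predictions can be mapped back to the original scale at inference time.

class MinMaxScaler:
    """Toy min-max scaler standing in for sklearn's MinMaxScaler."""
    def fit(self, data):
        self.lo, self.hi = min(data), max(data)
        return self
    def transform(self, data):
        span = self.hi - self.lo
        return [(x - self.lo) / span for x in data]
    def inverse_transform(self, data):
        span = self.hi - self.lo
        return [x * span + self.lo for x in data]

def make_dataset(raw):
    scaler = MinMaxScaler().fit(raw)
    return scaler.transform(raw), scaler   # hand the fitted scaler back too

scaled, scaler = make_dataset([10.0, 20.0, 30.0])
assert scaled == [0.0, 0.5, 1.0]
assert scaler.inverse_transform(scaled) == [10.0, 20.0, 30.0]
```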

@codecov-io

codecov-io commented Jan 13, 2021

Codecov Report

Merging #39 (8b745f2) into master (77a4871) will increase coverage by 7.22%.
The diff coverage is 92.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #39      +/-   ##
==========================================
+ Coverage   65.17%   72.40%   +7.22%     
==========================================
  Files          11       25      +14     
  Lines        1674     2551     +877     
==========================================
+ Hits         1091     1847     +756     
- Misses        583      704     +121     
Impacted Files Coverage Δ
traja/accessor.py 66.66% <ø> (ø)
traja/dataset/example.py 100.00% <ø> (ø)
traja/models/visualizer.py 0.00% <ø> (ø)
traja/parsers.py 62.29% <ø> (ø)
traja/models/inference.py 31.81% <16.66%> (ø)
traja/plotting.py 56.50% <85.71%> (+1.54%) ⬆️
traja/models/generative_models/vae.py 95.40% <89.47%> (ø)
traja/models/train.py 93.54% <90.14%> (ø)
traja/__init__.py 92.30% <100.00%> (ø)
traja/dataset/__init__.py 100.00% <100.00%> (ø)
... and 29 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@WolfByttner
Contributor Author

This PR also resolves #40.

Collaborator

@JustinShenk JustinShenk left a comment


LGTM

@JustinShenk JustinShenk merged commit e3dcf25 into master Jan 17, 2021
@JustinShenk JustinShenk deleted the datagenerator branch January 17, 2021 16:34
Saran-nns pushed a commit that referenced this pull request Aug 23, 2024