
Extensive traversal of cross-validation folds #95

Merged: 8 commits, Apr 6, 2020

Conversation

SaumyaSinghal
Contributor

@SaumyaSinghal SaumyaSinghal commented Mar 21, 2020

In the notebook I have tried to dig deep into the two main cross-validation techniques (with reference to issue #4), shuffle split and k-fold split, along with their stratified variants. It was an interesting exploration and I learnt a great deal about each. There are, however, some things I would like to incorporate later:

  • Keeping a completely unused test set for obtaining the final comparisons among all cross-validation techniques.
  • Trying a larger variety of fold counts. I have currently tried 10, 20, 30, ..., up to 90 folds; I would specifically like to also try 2, 3, ..., up to 9 folds. That is a minute's work to integrate into the existing code, but I guess I'll do it after the first point is ticked off.

I am eager for any suggestions for improvement or interesting directions for exploration :)

@dzeber
Contributor

dzeber commented Mar 27, 2020

Thanks for this excellent PR! I really appreciate the amount of depth that you went into in your exploration, your thoughtful experimentation plan, and the detailed documentation of the narrative and your interpretations in the notebook comments. This notebook really makes for compelling reading, which is the goal with this type of work.

If you would like to merge this as-is so as to record this contribution, that is fine. Otherwise, you're welcome to continue working on it. @SaumyaSinghal Please let me know either way in the comments. Also, please resolve the merge conflicts that are currently showing. I would start by removing the startup task notebook and rebasing onto the current main repo (upstream) master.

I think the interesting point of the comparison between ShuffleSplit and KFold CV is the difference between resampled (overlapping) splits and deterministic disjoint ones. However, varying n_splits for ShuffleSplit only changes the number of times it resamples splits of a fixed size. The k in KFold, on the other hand, determines not only the number of splits but also their relative sizes (e.g. for 5-fold, the test split is 20% of the dataset). This can also be seen in the graphs: the results for ShuffleSplit are essentially flat across i, whereas KFold shows more pronounced increasing or decreasing trends.
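A minimal sketch of this distinction, assuming scikit-learn's `KFold` and `ShuffleSplit` on a toy dataset (not code from the notebook):

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(100).reshape(-1, 1)  # toy dataset of 100 samples

# KFold: k controls both the number of splits and their relative sizes.
for k in (5, 10):
    sizes = {len(test) for _, test in KFold(n_splits=k).split(X)}
    print(f"KFold(n_splits={k}): test-fold sizes = {sizes}")
# KFold(n_splits=5): test-fold sizes = {20}
# KFold(n_splits=10): test-fold sizes = {10}

# ShuffleSplit: n_splits only controls how many resampled splits are drawn;
# the test size stays fixed at test_size regardless of n_splits.
for n in (5, 10):
    ss = ShuffleSplit(n_splits=n, test_size=0.2, random_state=0)
    sizes = {len(test) for _, test in ss.split(X)}
    print(f"ShuffleSplit(n_splits={n}): test sizes = {sizes}")
# ShuffleSplit(n_splits=5): test sizes = {20}
# ShuffleSplit(n_splits=10): test sizes = {20}
```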

I think a more insightful comparison would be to reparametrize your cv_shuffle_split so that it varies n_splits, test_size and train_size with each i to mimic the values implied by that i for KFold, e.g. when i=10, n_splits=10, train_size=0.9, test_size=0.1. Also, what was the reason for holding out the 10% in the ShuffleSplit? I didn't see how that impacted the analysis.
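One way that reparametrization could be sketched (the helper name here is hypothetical, not the notebook's cv_shuffle_split; this assumes scikit-learn's `ShuffleSplit`):

```python
from sklearn.model_selection import ShuffleSplit

def kfold_matched_shuffle_split(i, random_state=None):
    """Hypothetical helper: a ShuffleSplit configured to mimic the split
    sizes implied by i-fold CV, i.e. i resampled splits, each holding out
    1/i of the data for testing and training on the remaining (i-1)/i."""
    return ShuffleSplit(
        n_splits=i,
        test_size=1 / i,
        train_size=1 - 1 / i,
        random_state=random_state,
    )

# e.g. i=10 mirrors 10-fold CV: n_splits=10, train_size=0.9, test_size=0.1
cv = kfold_matched_shuffle_split(10, random_state=0)
```

With this, each i gives ShuffleSplit and KFold the same number of splits and the same split sizes, so any remaining difference in the curves is due to resampling versus disjoint partitioning.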

Personally I would use line plots rather than bars for the graphs, as I think that would make comparisons easier.

It would also be interesting to include a couple of really large values of i (computation permitting). Leave-one-out CV, which is essentially n-fold CV, is another common strategy and would make an interesting additional point of comparison.
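For reference, scikit-learn's `LeaveOneOut` produces exactly the splits of unshuffled n-fold CV; a small sketch (toy data, not from the notebook):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(6).reshape(-1, 1)  # tiny dataset so the output is readable

# Each test set is a single sample; without shuffling, n-fold KFold
# yields the same splits as leave-one-out.
loo_tests = [int(test[0]) for _, test in LeaveOneOut().split(X)]
nfold_tests = [int(test[0]) for _, test in KFold(n_splits=len(X)).split(X)]

print(loo_tests)                 # [0, 1, 2, 3, 4, 5]
print(loo_tests == nfold_tests)  # True
```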

Don't worry about holding out a separate test set here. That is definitely best practice in training a model with CV, but this analysis is more about exploring how CV works, and it's not necessary here.

@SaumyaSinghal
Contributor Author

Thank you so much for your feedback and suggestions! These are some very interesting points that you have mentioned and I definitely look forward to exploring them.

Also, what was the reason for holding out the 10% in the ShuffleSplit? I didn't see how that impacted the analysis.

I forgot to change this. Earlier this month I was holding out the 10% thinking I could somehow get it returned and use it as a separate test set (I wondered why else they would even have this option), but I later realized that is not how cross-validation works at all. I even changed ShuffleSplit to K-fold for the startup task, but I overlooked this one and left out 10% for no reason. I shall fix this, and comparing with matching train-test splits sounds like a very interesting idea too.

Please let me know either way in the comments. Also, please resolve the merge conflicts that are currently showing. I would start by removing the startup task notebook and rebasing onto the current main repo (upstream) master.

@dzeber I'd like to work on the aspects you've mentioned before merging this in. Regarding conflicts: there is an updated version of these files in master, but since master has also not been merged (I don't have a folder with my name in /dev yet), there is a conflict. I suppose I can just pull these two files into this branch as you have mentioned. My master is actually associated with PR #36 (I should have created a separate branch for that), and I think that should be merged before any other branch, so would you suggest this branch wait for it?

@SaumyaSinghal
Contributor Author

Oh my bad, one of my PRs was merged some hours ago, so I think a folder under my name exists now :)

@dzeber dzeber added the WIP work in progress, will not be reviewed until label is removed or review is explicitly requested label Apr 2, 2020
@dzeber
Contributor

dzeber commented Apr 2, 2020

I'd like to work on the aspects that you've mentioned before merging this in.

Sounds good! I've added the WIP tag to this effect - feel free to remove it when you're ready to merge.

@SaumyaSinghal
Contributor Author

Thank you. I am working on all the unfinished issues in parallel, so none of them yet has the degree of exploration I would like, but I hope to complete everything by tomorrow midnight and then get them merged :D

@SaumyaSinghal
Contributor Author

@dzeber this can be merged now. I have incorporated your suggestions (which have made the exploration much more interesting) and updated it accordingly :)

@mlopatka mlopatka merged commit 1b66259 into mozilla:master Apr 6, 2020