
Extensive traversal of cross-validation folds #95

Merged: 8 commits, Apr 6, 2020

Conversation

SaumyaSinghal
Contributor

@SaumyaSinghal SaumyaSinghal commented Mar 21, 2020

In the notebook I have tried to dig deep into the two main cross-validation techniques (with reference to issue #4), shuffle split and k-fold split, along with their stratified variants. It was an interesting exploration and I learnt a great deal about each. There are, however, some things I would like to incorporate later:

  • Keeping a completely unused test set for obtaining the final comparisons among all cross-validation techniques.
  • Trying a larger variety of fold counts. I have currently tried 10, 20, 30, ..., up to 90 folds; I would specifically like to also try 2, 3, ..., up to 9 folds. That is a minute's work to integrate into the existing code, but I guess I'll do it after the first point is ticked off.

I am eager for any suggestions for improvement or interesting directions for exploration :)

@dzeber
Contributor

dzeber commented Mar 27, 2020

Thanks for this excellent PR! I really appreciate the amount of depth that you went into in your exploration, your thoughtful experimentation plan, and the detailed documentation of the narrative and your interpretations in the notebook comments. This notebook really makes for compelling reading, which is the goal with this type of work.

If you would like to merge this as-is so as to record this contribution, that is fine. Otherwise, you're welcome to continue working on it. @SaumyaSinghal Please let me know either way in the comments. Also, please resolve the merge conflicts that are currently showing. I would start by removing the startup task notebook and rebasing onto the current main repo (upstream) master.

I think the interesting point of the comparison between ShuffleSplit and KFold CV is the difference between resampled (overlapping) splits and deterministic disjoint ones. However, varying n_splits for ShuffleSplit only changes the number of times it resamples splits of a fixed size. The k in KFold, on the other hand, determines not only the number of splits but also their relative sizes (e.g. for 5-fold, the test split is 20% of the dataset). This can also be seen in the graphs: the results for ShuffleSplit are essentially flat across i, whereas KFold shows more pronounced increasing or decreasing trends.
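A minimal sketch of this distinction, assuming scikit-learn's `KFold` and `ShuffleSplit` on a toy dataset (not code from the notebook):

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(100).reshape(-1, 1)  # toy dataset of 100 samples

# KFold: k controls both the number of splits and their relative sizes.
for k in (5, 10):
    sizes = {len(test) for _, test in KFold(n_splits=k).split(X)}
    print(f"KFold(n_splits={k}): test-fold sizes = {sizes}")
# KFold(n_splits=5): test-fold sizes = {20}
# KFold(n_splits=10): test-fold sizes = {10}

# ShuffleSplit: n_splits only controls how many resampled splits are drawn;
# the test size stays fixed at test_size regardless of n_splits.
for n in (5, 10):
    ss = ShuffleSplit(n_splits=n, test_size=0.2, random_state=0)
    sizes = {len(test) for _, test in ss.split(X)}
    print(f"ShuffleSplit(n_splits={n}): test sizes = {sizes}")
# ShuffleSplit(n_splits=5): test sizes = {20}
# ShuffleSplit(n_splits=10): test sizes = {20}
```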

I think a more insightful comparison would be to reparametrize your cv_shuffle_split so that it varies n_splits, test_size and train_size with each i to mimic the values implied by that i for KFold, e.g. when i=10, n_splits=10, train_size=0.9, test_size=0.1. Also, what was the reason for holding out the 10% in the ShuffleSplit? I didn't see how that impacted the analysis.
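One way that reparametrization could be sketched (the helper name here is hypothetical, not the notebook's cv_shuffle_split; this assumes scikit-learn's `ShuffleSplit`):

```python
from sklearn.model_selection import ShuffleSplit

def kfold_matched_shuffle_split(i, random_state=None):
    """Hypothetical helper: a ShuffleSplit configured to mimic the split
    sizes implied by i-fold CV, i.e. i resampled splits, each holding out
    1/i of the data for testing and training on the remaining (i-1)/i."""
    return ShuffleSplit(
        n_splits=i,
        test_size=1 / i,
        train_size=1 - 1 / i,
        random_state=random_state,
    )

# e.g. i=10 mirrors 10-fold CV: n_splits=10, train_size=0.9, test_size=0.1
cv = kfold_matched_shuffle_split(10, random_state=0)
```

With this, each i gives ShuffleSplit and KFold the same number of splits and the same split sizes, so any remaining difference in the curves is due to resampling versus disjoint partitioning.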

Personally I would use line plots rather than bars for the graphs, as I think that would make comparisons easier.

It would also be interesting to include a couple of really large values of i (computation permitting). Leave-one-out CV, which is essentially n-fold CV, is another common strategy and would make an interesting additional point of comparison.
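For reference, scikit-learn's `LeaveOneOut` produces exactly the splits of unshuffled n-fold CV; a small sketch (toy data, not from the notebook):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(6).reshape(-1, 1)  # tiny dataset so the output is readable

# Each test set is a single sample; without shuffling, n-fold KFold
# yields the same splits as leave-one-out.
loo_tests = [int(test[0]) for _, test in LeaveOneOut().split(X)]
nfold_tests = [int(test[0]) for _, test in KFold(n_splits=len(X)).split(X)]

print(loo_tests)                 # [0, 1, 2, 3, 4, 5]
print(loo_tests == nfold_tests)  # True
```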

Don't worry about holding out a separate test set here. That is definitely best practice in training a model with CV, but this analysis is more about exploring how CV works, and it's not necessary here.

@SaumyaSinghal
Contributor Author

Thank you so much for your feedback and suggestions! These are some very interesting points that you have mentioned and I definitely look forward to exploring them.

Also, what was the reason for holding out the 10% in the ShuffleSplit? I didn't see how that impacted the analysis.

I forgot to change this. Earlier this month I was holding out the 10% thinking I could somehow get it returned and use it as a separate test set (I wondered why else they would even have this option), but I later realized that is not how cross-validation works at all. I even changed ShuffleSplit to K-fold for the startup task, but I overlooked this one and left out 10% for no reason. I shall fix this, and comparing with matching train-test splits sounds like a very interesting idea too.

Please let me know either way in the comments. Also, please resolve the merge conflicts that are currently showing. I would start by removing the startup task notebook and rebasing onto the current main repo (upstream) master.

@dzeber I'd like to work on the aspects you've mentioned before merging this in. Regarding conflicts: there is an updated version of these files in master, but since master has also not been merged (I don't have a folder with my name in /dev yet), there is a conflict. I suppose I can just pull these two files into this branch as you have mentioned. My master is actually associated with PR #36 (I should have created a separate branch for that), and I think that should be merged before any other branch, so would you suggest this branch wait for it?

@SaumyaSinghal
Contributor Author

Oh my bad, one of my PRs was merged some hours ago, so I think a folder under my name exists now :)

@dzeber dzeber added the WIP work in progress, will not be reviewed until label is removed or review is explicitly requested label Apr 2, 2020
@dzeber
Contributor

dzeber commented Apr 2, 2020

I'd like to work on the aspects that you've mentioned before merging this in.

Sounds good! I've added the WIP tag to this effect - feel free to remove it when you're ready to merge.

@SaumyaSinghal
Contributor Author

Thank you. I am working on all the unfinished issues in parallel, so none of them yet has the degree of exploration I would like, but I hope to complete everything by tomorrow midnight and then get them merged :D

@SaumyaSinghal
Contributor Author

@dzeber this can be merged now. I have incorporated your suggestions (which have made the exploration much more interesting) and updated it accordingly :)

@mlopatka mlopatka merged commit 1b66259 into mozilla:master Apr 6, 2020