Split plot refinement (accurate labeling) #876
I used `get_splits(train_labels, valid_size=.2, stratify=True, random_state=23, shuffle=True)`. In this case, I would expect the second label to be "Valid" instead of "Test". I'm not specifying a test split, and `test_size` is zero by default. I am, however, specifying a `valid_size`, so the labels should be "Train" and "Valid", yet the plot labels the second split "Test".

I made a small change to `plot_splits()` to adapt the behavior to my needs. For some reason, it assumed that one split, i.e., two lists, always means (train, test). I realized that validation data is not optional in the split generation function, so I treated it as mandatory. As a result, the combination of only "Train" and "Test" is not possible.

Behavior now:
- `valid_size` and `test_size` -> three lists and labels "Train", "Valid" and "Test"
- `valid_size` -> two lists and labels "Train" and "Valid"
- `test_size` -> three lists and labels "Train", "Valid" and "Test" (since valid data is mandatory)
- `test_size` and set `valid_size` to 0 -> two lists and labels "Valid" and "Test" (in this case valid == train)

This is a reasonable labeling behavior in my opinion (under the assumption that validation data is mandatory).

I also set a default value for the new parameter in `plot_splits()` so that it doesn't cause any compatibility issues.
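
For illustration, the labeling rules described in this PR can be sketched as a small standalone function. This is not the code from the PR: `split_labels` is a hypothetical helper, and the nonzero default for `valid_size` is an assumption chosen so that passing only `test_size` still yields a validation split, as described above. The case where both sizes are zero is not covered by the description, so the sketch rejects it.

```python
def split_labels(valid_size=0.2, test_size=0.0):
    """Hypothetical helper (not part of the library): choose plot labels
    from the requested split sizes, following the rules in this PR."""
    if valid_size > 0 and test_size > 0:
        # Three lists: train, valid, test
        return ["Train", "Valid", "Test"]
    if valid_size > 0:
        # Two lists: train, valid (no test split requested)
        return ["Train", "Valid"]
    if test_size > 0:
        # valid_size == 0: the validation set coincides with the training set
        return ["Valid", "Test"]
    # Not described in the PR, so this sketch refuses to guess
    raise ValueError("no validation or test split requested")


# The four cases from the description
print(split_labels(valid_size=0.2, test_size=0.1))
print(split_labels(valid_size=0.2))
print(split_labels(test_size=0.1))
print(split_labels(valid_size=0, test_size=0.1))
```

Deriving the labels from the requested sizes rather than from the number of lists avoids the original ambiguity, where two lists could mean either (train, valid) or (valid, test).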