Added Random Split dialog #7519

Merged

derekagorhom
Contributor

@derekagorhom derekagorhom commented Jun 2, 2022

Fixes #7227
This PR replaces PR #7272
This dialogue can be found under Prepare > Data Reshape > Random Split.
This is ready for review.
I have made the changes requested by @lilyclements, except no. 4, since that needs @rdstern's input/approval.

@lloyddewit
Contributor

lloyddewit commented Jun 3, 2022

@N-thony - Please could you peer review?
@rdstern - Please could you test?
Thanks

@rdstern
Collaborator

rdstern commented Jun 6, 2022

@derekagorhom please wait for @lilyclements confirmation before continuing with this pull request. And make sure you fully understand the reasons for the suggested changes, before implementing them.
a) The R code needs to change. The rsample documentation shows why. You need to generate the sample into x first and then produce the training and testing sets from it. You must ensure they complement each other, i.e. that the testing subset is what is left over from the training set. If you generate them separately, as you are doing now, then some observations may be in both.
b) Produce and save both of them each time.
c) Don't include arguments in the code that are not needed. Currently you always include breaks and pool. You rarely need them.
So, the dialogue will change quite a lot.
d) You don't need the data selector to be visible unless you tick the stratifying factor.
e) Delete the label "Variables" and move the Stratifying factor checkbox to just above the control. (It can act as the label as well.)
f) Delete the groups control. (That's for stratifying by a numeric variable, which we are not allowing here.)
g) Omit the two checkboxes for saving the training and testing sets - let's always save both of them.
h) Here is a more speculative query for @lilyclements. Should we add a simple group box called Naming? Inside are two radio buttons. The first (default) is analysis/assessment and the second is training/testing. (This issue is discussed in the terminology section of the rsample guide.) If default, then the resulting data frames are called analysis and assessment (and then analysis1, assessment1, etc.). Otherwise they are called training and testing, etc.
i) Should we also save the rsample object? If so, possibly that should be optional, so have a checkbox, default unchecked.
j) Another @lilyclements question: there is an rsample2caret function. Should we allow this option? If so, then this could be the object stored instead?
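
As a rough illustration of point (a): the sketch below draws one random ordering of the row indices and then cuts it in two, so the testing rows are exactly what is left over from the training rows. It is plain VB.NET, not rsample or R-Instat code. `SplitIndices`, its parameters and the use of `Math.Floor` for the training size are all assumptions made for the example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module SplitSketch
    ' Hypothetical helper, not part of rsample or R-Instat.
    ' Shuffles the row indices once, then cuts the shuffled list in two, so the
    ' testing rows are exactly the rows that are not in the training set.
    Public Function SplitIndices(nRows As Integer, prop As Double, seed As Integer) As Tuple(Of List(Of Integer), List(Of Integer))
        Dim rng As New Random(seed)
        Dim indices As List(Of Integer) = Enumerable.Range(0, nRows).ToList()

        ' Fisher-Yates shuffle of the row indices.
        For i As Integer = indices.Count - 1 To 1 Step -1
            Dim j As Integer = rng.Next(i + 1)
            Dim tmp As Integer = indices(i)
            indices(i) = indices(j)
            indices(j) = tmp
        Next

        ' Assumption: the training size is rounded down.
        Dim nTrain As Integer = CInt(Math.Floor(nRows * prop))
        Dim training As List(Of Integer) = indices.Take(nTrain).ToList()
        Dim testing As List(Of Integer) = indices.Skip(nTrain).ToList()
        Return Tuple.Create(training, testing)
    End Function

    Sub Main()
        Dim parts As Tuple(Of List(Of Integer), List(Of Integer)) = SplitIndices(32, 0.75, 1)
        Console.WriteLine("training rows: {0}", parts.Item1.Count)
        Console.WriteLine("testing rows: {0}", parts.Item2.Count)
        Console.WriteLine("rows in both: {0}", parts.Item1.Intersect(parts.Item2).Count())
    End Sub
End Module
```

Because both lists come from the same shuffled sequence, no row can appear in both. Sampling the two sets independently gives no such guarantee.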

@lilyclements
Contributor

@derekagorhom for (a) do you know the R code to achieve this? From #7227

If "Sample" is selected then initial_split is the base function

  • data is the data in the data frame
  • prop is the proportion given in the fraction nud (default 0.75)
  • strata is default NULL. If ucrChkStratifyingFactor is checked, then strata is the variables in the receiver
  • breaks could be included if we allow numeric variables in the strata receiver.
  • pool is the number given in the pool nud (default 0.1, but takes values 0-1).

If "Time Series" is selected then initial_time_split is the base function

  • data is the data in the data frame
  • prop is the proportion given in the fraction nud (default 0.75)
  • lag is the number given in the lag nud (default 0, takes any numerical integer).

This creates an object, which we can call x for now. That is, x <- initial_split(....) (or initial_time_split(...)).

For "save training data", run training() around the saved object (output from initial_split, or initial_time_split)
And save the output as a new dataframe.
That is, training_data <- training(x)

For "save testing data", run testing() around the saved object (output from initial_split, or initial_time_split)
And save the output as a new dataframe.
That is, testing_data <- testing(x)

@rdstern, in reply to your points:
h) Why not let the user name the data frames whatever they wish, with the default being the data frame name followed by "_training" or "_testing" (e.g. "mtcars_training")? It is good to use the guide, but I'm not sure if that is overcomplicating it?
i) If we save the rsample object, what would we want to offer can be done with it? At the moment you can look at different objects, such as the row numbers that are used in one of the data sets. Is this something that could be viewed on the "View Objects" dialog?

j) The rsample2caret function does not work for initial_split unfortunately. It works for other functions (from what I can tell: vfold_cv(), bootstraps(), mc_cv(), rolling_origin())

E.g. for initial_split() we run,

car_split <- initial_split(mtcars)
train_data <- training(car_split)

for bootstraps() we run:

car_splits <- bootstraps(mtcars, times = 2)
car_splits$splits  # gives a list of all the objects, like "car_split" object above.
car_splits$splits[[1]]  # returns just the first object, like "car_split" above
train_data_1 <- training(car_splits$splits[[1]])  # to get the training data for the first data set
rsample2caret(car_splits, data = c("analysis", "assessment"))

@derekagorhom
Contributor Author

Thank you for the explanation @lilyclements, I will get to working on the Pull request.

@N-thony
Collaborator

N-thony commented Jun 28, 2022

> Thank you for the explanation @lilyclements, I will get to working on the Pull request.

@derekagorhom any progress?

@derekagorhom
Contributor Author

derekagorhom commented Mar 23, 2023

@lilyclements i have made the changes, you can review this now

@@ -59,7 +59,7 @@ Public Class dlgRandomSplit
ucrChkStratifyingFactor.AddToLinkedControls({ucrNudPool, ucrReceiverRanSplit}, {True}, bNewLinkedAddRemoveParameter:=True, bNewLinkedHideIfParameterMissing:=True, bNewLinkedUpdateFunction:=True, bNewLinkedChangeToDefaultState:=True, objNewDefaultState:=0.1)

ucrNudLag.SetParameter(New RParameter("lag", 3))
- ucrNudLag.SetMinMax(0, 100)
+ ucrNudLag.SetMinMax(Integer.MinValue, Integer.MaxValue)
ucrNudLag.Increment = 0.5
ucrNudLag.DecimalPlaces = 2

Contributor

lag should not have any decimal places

Contributor

@lilyclements lilyclements left a comment

image

When I open the dialog, OK is not enabled. Can you explain which ucr here needs to be filled/sorted for OK to be enabled?

@derekagorhom
Contributor Author

image

> When I open the dialog, OK is not enabled. Can you explain which ucr here needs to be filled/sorted for OK to be enabled?

For Sample, Stratifying Variable should be checked and ucrReceiverRanSplit needs to be filled (also none of the ucrNuds needs to be empty),
For Time series; ucrChklag needs to be checked (also none of the ucrNuds is supposed to be empty)

@lilyclements
Contributor

lilyclements commented Mar 23, 2023

> For Sample, Stratifying Variable should be checked and ucrReceiverRanSplit needs to be filled (also none of the ucrNuds needs to be empty), For Time series; ucrChklag needs to be checked (also none of the ucrNuds is supposed to be empty)

Good, I have two follow up points

  1. From what you have said, then what needs to be sorted here for OK to be enabled?

image

  2. Thank you for answering, it has helped clear up a bit of the confusion - I'll try to make sure I'm a bit clearer in the future :)
    But, actually, for Sample, if Stratifying Variable is checked, then ucrReceiverRanSplit needs to be filled (also none of the ucrNuds needs to be empty)
    For Time series, if ucrChklag is checked, none of the ucrNuds that are connected to that ucrChklag should be empty.

@derekagorhom
Contributor Author

@lilyclements i have made the changes now, can you have a look
thanks

Contributor

@lilyclements lilyclements left a comment

@derekagorhom this is working a lot better. But OK is still not enabled all the times that it should be.

For example, what is the reason for OK to not be enabled here:

image

We want to say if time series is checked and if lag is checked then ...
At the moment, I find that OK is only enabled if time series and lag are checked.
Do you see the difference there?

End If
Else
ucrBase.OKEnabled(False)
End If
Else
ucrBase.OKEnabled(False)

Contributor

@lilyclements lilyclements Mar 27, 2023

Why is this FALSE?
If rdoSample is checked, we do not need ucrChkStratifyingFactor to be checked to press OK

There's a similar case for rdoTimeSeries with ucrChkLag

If rdoSample.Checked Then
If ucrChkStratifyingFactor.Checked Then
If Not ucrReceiverRanSplit.IsEmpty AndAlso Not ucrNudBreaks.IsEmpty AndAlso Not ucrNudPool.IsEmpty Then
If ucrSaveTrainingData.IsComplete AndAlso ucrSaveTestingData.IsComplete AndAlso Not ucrNudFraction.IsEmpty Then

Contributor

@lilyclements lilyclements Mar 27, 2023

These ucrs (the two saves and the fraction nud) are not controlled by the ucrChkStratifyingFactor.
So, whether ucrChkStratifyingFactor is checked or not should not affect the state of these three controls. Does that make sense?
So in the If ucrChkStratifyingFactor.Checked [...] End If portion, we only want to contain ucrs that are controlled by ucrChkStratifyingFactor

Contributor

@lilyclements lilyclements Mar 27, 2023

Because these three controls affect the whole dialog, it makes sense to bring them to the "exterior" of the if-statement and so avoid repeating code.

If ucrSaveTrainingData.IsComplete AndAlso ucrSaveTestingData.IsComplete AndAlso Not ucrNudFraction.IsEmpty Then
  '<If statements for when rdos are checked >
Else
  ucrBase.OKEnabled(False)
End If

Then we can build it up from there. So, next we can add in our if statement for rdoSample and rdoTimeSeries:

If ucrSaveTrainingData.IsComplete AndAlso ucrSaveTestingData.IsComplete AndAlso Not ucrNudFraction.IsEmpty Then
  If rdoSample.Checked Then
    '<if statements for when rdo sample is checked >
  Else ' this here is if rdoTimeSeries.Checked
    '<if statements for when rdo time series is checked >
  End If
Else
  ucrBase.OKEnabled(False)
End If

Step 3: What if rdoSample is checked? Looking at the dialog and the controls, there is a checkbox for "Stratifying variable", which brings up three controls. So we want to add a statement that says: If ucrChkStratifyingFactor.Checked Then < new controls cannot be empty for OK to be enabled >
You have already done this on lines 162-163, so we can just add that in :)
Then we just need to handle the case where ucrChkStratifyingFactor isn't checked. In that case OK should work fine.

If ucrSaveTrainingData.IsComplete AndAlso ucrSaveTestingData.IsComplete AndAlso Not ucrNudFraction.IsEmpty Then
  If rdoSample.Checked Then
    If ucrChkStratifyingFactor.Checked Then
      If Not ucrReceiverRanSplit.IsEmpty AndAlso Not ucrNudBreaks.IsEmpty AndAlso Not ucrNudPool.IsEmpty Then
        ucrBase.OKEnabled(True)
      Else
        ucrBase.OKEnabled(False)
      End If
    Else ' And what if ucrChkStratifyingFactor isn't checked? Then OK should work fine.
      ucrBase.OKEnabled(True)
    End If
  Else ' this here is if rdoTimeSeries.Checked
    '<if statements for when rdo time series is checked >
  End If
Else
  ucrBase.OKEnabled(False)
End If

Then we just do the same for rdoTimeSeries.Checked.

Does this make sense?
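
The full rule, including the time series branch that is left as a placeholder above, might look something like the sketch below. It is written as a pure function over Booleans so that it can be run outside the dialog. `IsOkEnabled` and its parameter names are hypothetical, and the time series branch assumes that ucrChkLag gates only the lag nud, as described earlier in this thread.

```vbnet
Imports System

Module TestOkSketch
    ' Hypothetical pure-function version of the OK rule discussed above.
    ' Each argument stands for the state of one group of dialog controls.
    Public Function IsOkEnabled(savesAndFractionComplete As Boolean,
                                sampleSelected As Boolean,
                                stratifyChecked As Boolean,
                                stratifyControlsFilled As Boolean,
                                lagChecked As Boolean,
                                lagFilled As Boolean) As Boolean
        ' The two save controls and the fraction nud apply to the whole dialog.
        If Not savesAndFractionComplete Then
            Return False
        End If

        If sampleSelected Then
            ' The stratifying controls only matter when the checkbox is ticked.
            Return Not stratifyChecked OrElse stratifyControlsFilled
        End If

        ' Otherwise the time series option is selected (assumed to be the only alternative).
        Return Not lagChecked OrElse lagFilled
    End Function

    Sub Main()
        ' Sample selected, stratifying unchecked: OK should be enabled.
        Console.WriteLine(IsOkEnabled(True, True, False, False, False, False))
        ' Sample selected, stratifying checked but receiver empty: OK disabled.
        Console.WriteLine(IsOkEnabled(True, True, True, False, False, False))
    End Sub
End Module
```

Inside the dialog, the TestOK routine could pass the result straight to ucrBase.OKEnabled(...), with each argument read from the corresponding control.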

@derekagorhom
Contributor Author

@lilyclements i have made some changes to the testOK, can you have a look

Contributor

@lilyclements lilyclements left a comment

Looks great and works great. Really nice job!

@rdstern @lloyddewit over to you to review!

If ucrChkStratifyingFactor.Checked Then
If Not ucrReceiverRanSplit.IsEmpty AndAlso Not ucrNudBreaks.IsEmpty AndAlso Not ucrNudPool.IsEmpty Then
If ucrSaveTrainingData.IsComplete AndAlso ucrSaveTestingData.IsComplete AndAlso Not ucrNudFraction.IsEmpty Then
If Not ucrSaveTrainingData.IsComplete Or Not ucrSaveTestingData.IsComplete Or ucrNudFraction.IsEmpty Then

Contributor

Out of interest, what's the difference between Or and OrElse?

Contributor Author

@derekagorhom derekagorhom Mar 30, 2023

@lilyclements
Or requires both expressions to be evaluated, while OrElse would still work even if only one expression is evaluated. So in this case, if ucrSaveTrainingData (or ucrSaveTestingData/ucrNudFraction) is empty, then the TestOK will be disabled regardless of whether the other controls are true.

Contributor

Makes sense - thanks :)

Contributor

> Or requires both expressions to be evaluated, while OrElse would still work even if only one expression is evaluated. So in this case, if ucrSaveTrainingData (or ucrSaveTestingData/ucrNudFraction) is empty, then the TestOK will be disabled regardless of whether the other controls are true.

@derekagorhom @lilyclements I'm not sure I completely agree with this. Both Or and OrElse provide a logical 'Or' operation and will always return the same logical result.
The only difference is that OrElse is a short-circuiting operator, Or is not. This means that with OrElse if the left-hand side is true then VB will not bother to evaluate the right-hand side (because the 'Or' condition has already been met). In contrast Or will always evaluate both sides. There may be rare conditions when you want to do this, e.g. if you want to force 2 functions to be called:

If myFunction1() Or myFunction2() Then

But this could be confusing for other developers so better to call functions explicitly above the 'If'.
OrElse is useful for avoiding exceptions, e.g.:

If myObject Is Nothing OrElse myObject.name = "Derrick" Then

If we used Or above, and myObject was Nothing, then it would raise an exception.
In summary, OrElse is more efficient and may be safer. Therefore best to always use OrElse (apart from very rare circumstances).
Does this make sense?
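
The difference can be seen in a small, self-contained sketch. The helper names (`NoteCall`, `CallsMadeWithOr`, `CallsMadeWithOrElse`, `IsMissingOrEmpty`) are made up for this example and are not part of R-Instat. The first two functions count how many operands are actually evaluated, and the third shows the null-safe pattern.

```vbnet
Imports System

Module ShortCircuitSketch
    Private callCount As Integer = 0

    ' Records that an operand was evaluated, then returns the given value.
    Private Function NoteCall(result As Boolean) As Boolean
        callCount += 1
        Return result
    End Function

    ' Or evaluates both operands, even though the left one is already True.
    Public Function CallsMadeWithOr() As Integer
        callCount = 0
        Dim unused As Boolean = NoteCall(True) Or NoteCall(False)
        Return callCount
    End Function

    ' OrElse stops after the left operand because the result is already known.
    Public Function CallsMadeWithOrElse() As Integer
        callCount = 0
        Dim unused As Boolean = NoteCall(True) OrElse NoteCall(False)
        Return callCount
    End Function

    ' Safe when text is Nothing: text.Length is only read if the first test is False.
    Public Function IsMissingOrEmpty(text As String) As Boolean
        Return text Is Nothing OrElse text.Length = 0
    End Function

    Sub Main()
        Console.WriteLine(CallsMadeWithOr())          ' prints 2
        Console.WriteLine(CallsMadeWithOrElse())      ' prints 1
        Console.WriteLine(IsMissingOrEmpty(Nothing))  ' prints True
    End Sub
End Module
```

Both operators return the same logical result in each case. Only the number of operands evaluated differs, which is why replacing Or with OrElse in the TestOK condition does not change when OK is enabled.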

Contributor

@lloyddewit lloyddewit left a comment

@derekagorhom Thank you, this looks good.
Please could you read my comment and if you agree, then change the Ors to OrElse? It's not actually essential but it makes the code more consistent with the rest of R-Instat.
If you disagree, then that's also fine, I'm happy to discuss.
Thanks

@derekagorhom
Contributor Author

@lloyddewit i have made the changes.... OrElse works fine also. thank you for the suggestion

@lloyddewit
Contributor

@rdstern Please could you test? thanks

@N-thony
Collaborator

N-thony commented Jul 7, 2023

> @rdstern Please could you test? thanks

@rdstern any plan about testing this?

@N-thony
Collaborator

N-thony commented Jul 24, 2023

> > @rdstern Please could you test? thanks
>
> @rdstern any plan about testing this?

@rdstern this is awaiting your testing.

Collaborator

@rdstern rdstern left a comment

This seems to be working. @derekagorhom there needs to be a (possibly separate) pull request to add the rsample package to R-Instat.
As a detail I found it odd that there was the usual data selector, when there was no receiver - unless you included the stratified option. But I found it was working well - at least for the ordinary option. I didn't check the time series. I tried with the hh data from the MICS survey and it seems to be working fine. So I am approving.

@lloyddewit lloyddewit changed the title from "Random split dialogue" to "Added Random Split dialog" Aug 2, 2023
@lloyddewit lloyddewit merged commit f6e4aeb into IDEMSInternational:master Aug 2, 2023
@derekagorhom derekagorhom deleted the Random_Split_Dialogue branch August 11, 2023 08:08
Successfully merging this pull request may close these issues.

Add a new dialogue possibly called Random Split using rsample?