Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving the Describe > One Variable > Summarise dialogue #7123

Closed
wants to merge 26 commits into from

Conversation

Ivanluv
Copy link
Contributor

@Ivanluv Ivanluv commented Jan 13, 2022

fixes #7081
@africanmathsinitiative/developers this is ready for review
@rdstern In this PR I have added the functionality to control the orientation, i.e. choosing whether the variables or the summaries go down in the summary table

@rdstern
Copy link
Collaborator

rdstern commented Jan 14, 2022

@Ivanluv well done. That looks great for a start.
a) Omit the Treat Columns as Factor checkbox, but always do that, so the 2 other controls are always visible.
b) Change the Label for the first from Summary to Summaries. Default is for it to be checked.
c) Now a puzzle that might be more difficult to resolve, namely the receiver can take any type of variable. So could be factor, or ordered factor or date, etc. Currently it gives an error for dates or factors or character variables. Daters should be ok (in the long run) to solve, and also need to be possible in the general tabulation. I am happy for most summaries to be missing when they don't apply. But I would like to be able to cope with those that do. I hope you may have ideas on what can be done in mmtable2, e.g. dates, and what might be impossible.
d) Could you also check on different types of output - this applies to the general tabulation as well. Currently it is html. @Patowhiz is looking into whether that may be possible in the output window. I wonder whether there might be other options, particularly simple ones that could go into the current output window.
e) Oh I forgot - I think only half the corrections have been included - so the answers are wrong!
image
Here they are from the new dialogue - with the totals line included - so those are double the true values. But the columns referred to are incorrect!

@rdstern
Copy link
Collaborator

rdstern commented Jan 20, 2022

@Ivanluv it now seems to get the correct summaries, but in the wrong order.
image
This was the dialogue:
image

a) You can see that Blue Sky remains at the top in the results, but its summary is at the bottom!
b) The summary for R-Instat is in second place (towards the top - but it should be second from the bottom.
c) The labelling has become alphabetical.

And, may need @lilyclements help - it doesn't work for factor variables. etc. I would like it to do something for variables of any type. Not sure what to do here?

Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments given for changes

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Jan 24, 2022

@lilyclements any suggestions here

@lilyclements
Copy link
Contributor

lilyclements commented Jan 26, 2022

@Ivanluv have you tried updating your branch? I am not getting this error if I try it on the version on PR #7107

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Jan 27, 2022

And, may need @lilyclements help - it doesn't work for factor variables. etc. I would like it to do something for variables of any type. Not sure what to do here?

@lilyclements I have updated this branch and its now working okay

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Jan 27, 2022

And, may need @lilyclements help - it doesn't work for factor variables. etc. I would like it to do something for variables of any type. Not sure what to do here?

@lilyclements do you have any ideas how we could fix this?

Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ivanluv this is now working fine. Now I tried with other types of data (not-numeric) and it doesn't like that.
Now I thought you might need help from @lilyclements for this. But then I tried with our summary system and that handles all types of data perfectly - even character variables. And dates come out fine as dates in the summary. So I wonder if that can provide the code to be used before going into mmtable?
I treid with dodoma data, so there is a date variable and also month as a factor. Then I made month into character. It just gives NA when it can't work out the summary!

On the actual dialogue could you have the Variables as rows checked by default.

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Jan 27, 2022

@rdstern by the "summary system" do you mean summary tables?

@rdstern
Copy link
Collaborator

rdstern commented Jan 27, 2022

No, I meant Prepare > Data: Reshape > Column Summaries. The first receiver is Variable(s) to Summarise and it allows all types of variable - as does your Describe > One Variable > Summarise. And it also has the same sub-dialogue as yours, and copes with any summary for any type! Just what we need in your One Variable dialogue!

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Jan 27, 2022

@rdstern should we have the results stored as it is being done in the Describe > One Variable > Summarise?

@rdstern
Copy link
Collaborator

rdstern commented Jan 27, 2022

@Ivanluv you ask - " should we have the results stored as it is being done in the Describe > One Variable > Summarise?" I assume you mean as is done in the Prepare > One Variable > Column Summaries"

I would not, at least for now. Once we are able to do sensible summaries for variables of all types, then I would like to merge and then reflect and consider.

What is done here is a great update, and very useful. I am more concerned here with the tables of results. And they are able to use any of the summaries we have in our system. There will be more to do, but that can come later. We need it to stay simple, but could usefully add the options for the missing values - now that we have them in a package.
Then I am waiting for Patrick to sort out html for the output window, which will be great.

Then (as you say) there will be 2 further directions:
a) More flexibility on the output. We could add the presentation options from the Summary tables work. But let's finish that on the summary tables dialogue first. Then this dialogue could allow users to practice with very simple structures.
b) We could permit the save - as you say.

We also apply these ideas to the 2-way and 3-way tables - if we have the 3-way - I think so!

Then in simple graphs and tables (which IS descriptive statistics after all) we have a package to be envied!
This will be good for both teaching and also doing!

And I continue while I am thinking what's needed. I use this, because one-variable summaries is soooooo simple. Who would have thought you could do so much with it!

In the presentation of the results we can choose the order of the summaries in the general case, but not here. And, even in the general case, we can't yet choose the labels for those summaries - but we will in the General Summaries. Then the challenge in this dialogue is to make it as attractive and flexible as possible - but not complicated. Here it must stay also simple for the user.

Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ivanluv not sure what has happened here? This now seems like the old dialogue again?

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Jan 28, 2022

@rdstern Am still making the changes for the dialogue.

Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Ivanluv I assume these changes are "work in progress?", but I had a quick look.
a) The options are now visible all the time. Should not be, for the default where the summary is used.
b) The default of ticking the Variables as rows is still to be done.
c) The summary gives an error now.

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Jan 31, 2022

@rdstern could you have a look

@lilyclements
Copy link
Contributor

@Ivanluv if the user selects to modify/choose the summary statistics to display, you are currently using $calculate_summary not $summary_table.

$summary_table is a wrapper for $calculate_summary therefore you do not need to use $pivot_longer with $summary_table as that is all incorporated into the code.
Additionally, the alphabetical order would be sorted with $summary_table since this is already written in.
Is there a reason to not use summary_table?

In PR #7230, the code is fixed to use date variables in the summary functions. Once this is merged in, then update your master and hopefully that will work! (Or you can merge it into this PR if you think that's more appropriate).

@lilyclements
Copy link
Contributor

@Ivanluv I'm not sure if this will fix the proportions problem. If it is still a problem after changing the function then I can look into it then

@lilyclements
Copy link
Contributor

@Ivanluv how is this work going?

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Mar 8, 2022

@lilyclements am waiting for PR #7230 to be merged

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Mar 14, 2022

@rdstern @lilyclements I have merged PR #7230 into this and made the changes as suggested by @lilyclements i.e using the summary_table function instead of the $calculate_summary function. There is however a challenge when trying to find summaries not supported by certain data types eg getting the sum of factor columns as below.

image

image

image

After the error the table won't be produced even if one of the selected columns supports the summary

@rdstern
Copy link
Collaborator

rdstern commented Mar 16, 2022

It is looking nice
image

Note sensible summaries for date variables. And I made month into an ordered factor, so it can get the min and max! That's nice to explain to people!

But we do need to solve the problem that @Ivanluv mentioned, namely that it can't cope with some summary statistics - like sum, or mean, for a factor. When values are missing, and that option isn't given, then that cell is blank. We need the same for those statistics for a factor variable, etc.

One small change if @Ivanluv can't solve the current error, is to move the Missing option to below the other 2 checkboxes. Then it will be in the same relative position as in the summary dialogue.
But we can't merge until we can present tables of results for datasets (like the simple survey) that mix factors and variates.

@lilyclements
Copy link
Contributor

lilyclements commented Mar 24, 2022

So the sum/mean for factor is a problem in calculate_summary function. The same error reported by @Ivanluv can occur elsewhere in R-Instat. For example, in the "Column Summaries" dialog -

image

I've tried to work on getting this into the calculate_summary function, but think I need assistance from @dannyparsons

@@ -1418,28 +1423,29 @@ DataBook$set("public", "summary_table", function(data_name, columns_to_summarise
for (facts in power_sets_summary) {
if (length(facts) == 0) facts <- c()
summary_margins_df <- data_book$get_data_frame(data_name = data_name) %>%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change data_book to say self? Realised I missed these earlier.

dplyr::select(c(factors, columns_to_summarise)) %>%
tidyr::pivot_longer(cols = columns_to_summarise)
dplyr::select(c(tidyselect::all_of(factors), tidyselect::all_of(columns_to_summarise))) %>%
tidyr::pivot_longer(cols = columns_to_summarise, values_transform = list(value = as.character))
data_book$import_data(data_tables = list(summary_margins_df = summary_margins_df))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

dplyr::select(c(factors, columns_to_summarise)) %>%
tidyr::pivot_longer(cols = columns_to_summarise)
dplyr::select(c(tidyselect::all_of(factors), tidyselect::all_of(columns_to_summarise))) %>%
tidyr::pivot_longer(cols = columns_to_summarise, values_transform = list(value = as.character))
data_book$import_data(data_tables = list(summary_margins_df = summary_margins_df))
summary_margins[[length(summary_margins) + 1]] <- data_book$calculate_summary(data_name = "summary_margins_df", columns_to_summarise = "value", summaries = summaries, factors = facts, store_results = FALSE, drop = drop, na.rm = na.rm, return_output = TRUE, weights = weights, result_names = result_names, percentage_type = percentage_type, perc_total_columns = perc_total_columns, perc_total_factors = perc_total_factors, perc_total_filter = perc_total_filter, perc_decimal = perc_decimal, margin_name = margin_name, additional_filter = additional_filter, perc_return_all = FALSE, ...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

dplyr::select(c(factors, columns_to_summarise)) %>%
tidyr::pivot_longer(cols = columns_to_summarise)
dplyr::select(c(tidyselect::all_of(factors), tidyselect::all_of(columns_to_summarise))) %>%
tidyr::pivot_longer(cols = columns_to_summarise, values_transform = list(value = as.character))
data_book$import_data(data_tables = list(summary_margins_df = summary_margins_df))
summary_margins[[length(summary_margins) + 1]] <- data_book$calculate_summary(data_name = "summary_margins_df", columns_to_summarise = "value", summaries = summaries, factors = facts, store_results = FALSE, drop = drop, na.rm = na.rm, return_output = TRUE, weights = weights, result_names = result_names, percentage_type = percentage_type, perc_total_columns = perc_total_columns, perc_total_factors = perc_total_factors, perc_total_filter = perc_total_filter, perc_decimal = perc_decimal, margin_name = margin_name, additional_filter = additional_filter, perc_return_all = FALSE, ...)
data_book$delete_dataframes(data_names = "summary_margins_df")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finally here. Thanks!

@lilyclements
Copy link
Contributor

@rdstern @lilyclements I have merged PR #7230 into this and made the changes as suggested by @lilyclements i.e using the summary_table function instead of the $calculate_summary function. There is however a challenge when trying to find summaries not supported by certain data types eg getting the sum of factor columns as below.

@Ivanluv @rdstern I have now fixed this bug in PR #7338

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Mar 31, 2022

@lilyclements I have pulled you changes from PR #7338 and am getting the following error
image

image
image

Copy link
Contributor

@lilyclements lilyclements left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ivanluv is village or size a character variable? If so then the following should fix it. If not, then the following will fix another potential bug so can you implement the following anyway:

property = "class")
# note: important that there *is* a space between | for str_detect function
if (any(stringr::str_detect("factor | Date", col_data_type))){
curr_data_list[[c_data_label]] <- curr_data_list[[c_data_label]] %>%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ivanluv can you change “factor | Date” to
“factor | character | Date”

it is important to have a space between the | too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lilyclements the bug is now fixed

@rdstern
Copy link
Collaborator

rdstern commented Apr 1, 2022

@Ivanluv was this supposed to work now? I tried with a minimum for a factor and it said that isn't allowed.

@rdstern
Copy link
Collaborator

rdstern commented Apr 7, 2022

@Ivanluv and @lilyclements I am hoping this will be included in the Version 0.7.5 update? That's in 2 weeks time. OK?

@lilyclements
Copy link
Contributor

lilyclements commented Apr 8, 2022

@Ivanluv was this supposed to work now? I tried with a minimum for a factor and it said that isn't allowed.

@rdstern which combinations do we not want to work?

Factors and characters:
minimum/maximum
mean
median
sum
range
SD
variance

Ordered factors:
mean
median
sum
range
SD
variance

Dates:
sum
SD/var?

I have not yet put in that min/max (amongst others in the list above) shouldn't work for factors or characters, but that is very quick to add. I'll add any other combinations when you've responded to this so I know if there's any I'm missing here. Thanks!

@rdstern
Copy link
Collaborator

rdstern commented Apr 8, 2022

@lilyclements and @Ivanluv I'd prefer to look at this the other way round, namely what is sensible for different types. I am just keen that we have an option to mix the data types in this dialogue, rather than having to think carefully what to put in the the receiver.
So numeric allows everything. Logical and date are special cases of numeric
a) Ordinary factors (or character) only allow number of values, including number of unique values, number of missing, etc.
b) Ordered factors allow max and min (and they should therefore allow range - perhaps we don't yet use the range function?
c) Ordered factors currently don't allow anything else. I actually don't wee why they can't allow median and other percentiles, because they are ordered. But let's stick with what R does for now.
d) So only the numeric types allow means, sd, skewness and everything else. And for now percentiles, and proportions, etc.

@Ivanluv
Copy link
Contributor Author

Ivanluv commented Apr 12, 2022

@lilyclements what is you suggestion on how to proceed ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve the Describe > One Variable > Summarise dialogue and possibly 2-variable and 3-variable?
3 participants