Describe > Two/Three Variables > Summarise #4952

Open
rdstern opened this issue Sep 14, 2018 · 37 comments · Fixed by #8960

@rdstern
Collaborator

rdstern commented Sep 14, 2018

There is a lot to improve here.

I suggest a few steps, so (at least) the dialogues are consistent. But then I think discussion is needed with @dannyparsons and @volloholic to check on the strategy. Once that's agreed, and I would like to write the discussion points in this issue, then I suggest it could be an interesting and important task largely for @Muthenya , with support from the others?

So an initial suggestion. The 2 dialogues remain inconsistent and should not be. Summarise has a single receiver first and then a multiple receiver. Graph has the same, but the other way round.

I suggest that most people will be considering 2 specific variables when they visit this dialogue. So could we have the same idea as in the specific graphs, namely
a) The first receiver is always a single receiver.
b) The second one is also a single receiver (by default), but with the same button as on the initial receiver for the Describe > Specific > Boxplot, etc. So it says Single, and can be changed to Multiple, in which case it becomes a Multiple receiver.

I don't necessarily expect @dannyparsons and @volloholic to agree, even with this, but propose that the extra button will allow the idea of the dialogue to be explained clearly.

@rdstern
Collaborator Author

rdstern commented Sep 14, 2018

Next proposal for these 2 dialogues:
Currently the main dialogue hides anything to do with the type for the variables. There are currently 4 options, namely:

  1. Numeric by Numeric
  2. Numeric by Factor
  3. Factor by Numeric
  4. Factor by Factor

On the Describe > Two Variable > Summarise there is currently a sub-dialogue - I guess there will be! On the Describe > Two Variable > Graphics there is:

image

So, suddenly you do have to understand this idea!
I suggest, instead, that we have the new-style radio buttons at the top of each (main) dialogue. This will make the selection of the variables easier. Then the sub-dialogues could also be tabbed and the Options button takes you to the correct tab.

@Muthenya
Contributor

Muthenya commented Nov 6, 2018

@rdstern is this still awaiting further discussion?

@Muthenya
Contributor

@rdstern?

@dannyparsons
Contributor

I would like to make some progress on improving this. Here are some suggestions:

  • The dialogs have a multiple receiver, followed by a single receiver

For the summarise dialog the summaries will be:

  • Numeric by numeric: correlations
  • Numeric by categorical: numerical summaries by the levels of the factor
  • Categorical by numeric: ANOVA table
  • Categorical by categorical: Frequency tables

This is sort of how it works now, so I think this was discussed before.
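
A minimal base-R sketch of the mapping listed above. The helper `two_var_summary()` is hypothetical (it is not an R-Instat function), and the built-in `mtcars` and `warpbreaks` data stand in for real data:

```r
# Hypothetical helper (not part of R-Instat): choose a summary from the
# types of the two variables, following the four cases listed above.
two_var_summary <- function(data, first, second) {
  a <- data[[first]]
  b <- data[[second]]
  if (is.numeric(a) && is.numeric(b)) {
    # Numeric by numeric: correlation
    cor(a, b, use = "complete.obs")
  } else if (is.numeric(a) && is.factor(b)) {
    # Numeric by categorical: a numerical summary for each factor level
    tapply(a, b, mean, na.rm = TRUE)
  } else if (is.factor(a) && is.numeric(b)) {
    # Categorical by numeric: ANOVA table with the numeric as response
    anova(lm(b ~ a))
  } else if (is.factor(a) && is.factor(b)) {
    # Categorical by categorical: two-way frequency table
    table(a, b)
  } else {
    stop("Unsupported combination of variable types")
  }
}

num_num <- two_var_summary(mtcars, "mpg", "wt")
num_cat <- two_var_summary(warpbreaks, "breaks", "wool")
cat_num <- two_var_summary(warpbreaks, "wool", "breaks")
cat_cat <- two_var_summary(warpbreaks, "wool", "tension")
```

Character and date columns fall through to the error branch here, which is exactly the open question raised below.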

Questions

  • To address Roger's point about the types of variables being hidden on the main dialog, I suggest having Options below the Selector which says the type of variables selected and then gives some options for the summaries or just gives the type of variables and leads to the sub dialog. I think the idea was to keep these dialogs very simple so maybe we shouldn't add too many options?
  • Should we (optionally) allow the dialog also to be used in a way where the second variable is just a "by" factor, i.e. it is like the One Variable dialog but able to do it by the levels of a factor? Currently the multiple receiver only allows one type of data, so it can't have different types like the One Variable dialogs.
  • What do we do for other types - character and dates particularly? Dates can sort of be treated like numeric, even for ANOVA? Character can be treated as categorical?

@rdstern
Collaborator Author

rdstern commented Sep 22, 2019

I wonder, with the two variable situation, whether we (at least) have two radio buttons called By and 2 Variables? The By is simply the one variable dialogue with results By a second variable. This would be like the grouped data frame idea in dplyr.

I suggest we consider the three variable dialogues (and possibly add a 4 level item to the menu at the same time). I suggest this will be useful, and continue David's initial idea.

So the three-level could (at least initially) be simply 2 By, and By. For consistency we might include a 3rd button which is 3 Variables, but this would be disabled for now.

Here the 2 By is the same as the one Variable multiple receiver, split by 2 factor variables. The By is all the 2-variable options split by one factor.

If it looks useful, then we could add the 4 Variables situation, which might just be the 2-variables summaries split by 2 factors.

Of course there are other options for 4 variables, but many analyses seem to stop at 2-way tables, etc. (and there is a fair bit to teach here) and we do want to encourage users to move to the more general situation. So, at least for now, I suggest we don't worry about too many 3-variable tables, etc.

This split will mean that the one-variable By will allow the multiple receiver to permit any, or all, variables, while the two-variable summaries can restrict the multiple receiver to be of a single type.

Allowing the By up to 2 factors also fits well with the graphics, where the default can be for a by to be a facet.
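
The By and 2 By ideas above can be sketched in base R. This is only an illustration on the built-in `warpbreaks` data, not R-Instat code: the same one-variable summary is split first by one factor and then by two.

```r
# Illustration only (not R-Instat code): one-variable summaries "By"
# one or two factors, using the built-in warpbreaks data.

# By: mean of breaks for each level of one factor
by_one <- aggregate(breaks ~ tension, data = warpbreaks, FUN = mean)

# 2 By: the same summary split by two factors
by_two <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)

print(by_one)
print(by_two)
```

The second call gives one row per combination of the two factors, which corresponds to one facet per combination on the graphics side.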

@dannyparsons
Contributor

How should date columns be treated? We had thought like a numeric column, but you can't do correlations with them like a numeric column, and they also can't be used as the response variable in an ANOVA table.
We can either convert them to numeric and use the underlying numbers or date types could be excluded from the selector if we don't want to use them in these cases.

@rdstern
Collaborator Author

rdstern commented Nov 15, 2019

Interesting. Is this a question of logic, or an R question? I can think of examples where correlations or regression could be useful.

  1. I use ODK for a survey. For each respondent it includes the interviewer, the date/time when the survey process started and the duration of the survey in minutes. I would like to know the relationship between the date/time and the duration.
  2. I record the date of the start of the rains and also the latitude of the farm. What is the relationship between the start date and the latitude?

Is this more a question of a suitable origin being needed, or perhaps there is usually a logical start so it is a difference in dates that is being used. In many studies there is a natural origin (often zero), while dates have an arbitrary origin. I get this problem when looking at trends in temperature, with year as the x - but could be daily with a date as the x. Then (with year) the origin is year zero, which is a long time ago! In a practical sense this can mess up the regression modelling, so better to have a more sensible origin. Is the paper by Cox useful here? Perhaps the issue of it being a date is less relevant than the fact there are instances where just making it numeric is not sensible, because the variability of the different date/times may be very low compared to the size of the observations?

@dannyparsons
Contributor

The question was sort of both. In R these give an error, but you can convert to numeric to get days since 1970/1/1. As you say, I think the origin is arbitrary, since it's the differences and not the actual values that are usually of interest.
So I think it's sensible to treat dates as numeric for this dialog.
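
A small base-R sketch of this point, using made-up survey dates and durations (the values are illustrative only). Converting a `Date` to numeric gives days since 1970-01-01, and the correlation is unchanged if a more natural origin, such as the first survey date, is used instead:

```r
# Made-up example data: survey dates and interview durations in minutes.
dates <- as.Date(c("2019-11-01", "2019-11-05", "2019-11-12", "2019-11-20"))
duration <- c(42, 38, 35, 30)

# Underlying numbers: days since 1970-01-01
days_since_epoch <- as.numeric(dates)

# Alternative origin: days since the first survey
days_since_start <- as.numeric(dates - min(dates))

# Correlation only depends on differences, so the origin does not matter
r_epoch <- cor(days_since_epoch, duration)
r_start <- cor(days_since_start, duration)
print(c(r_epoch, r_start))
```

The origin still matters for the intercept of a fitted regression line, so a sensible origin may be worth offering when a model formula is displayed.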

@rdstern changed the title from "Describe > Two Variables > Summarise and Graph" to "Describe > Two Variables > Summarise" on Dec 18, 2019
@rdstern
Collaborator Author

rdstern commented Dec 28, 2019

I assume this is the place to comment on the 2-variable summarise. I am using @dannyparsons new version.
a) A detail - the dialogue title for the Graph is Two Variable Graph - which seems sensible. Here the Title is Describe Two Variables. I suggest Two Variable Summary.
b) I tried the same
b) The graph dialogue now has some useful options. Could we please have the same for the summary. In particular they would be great for the situation with categorical by categorical.

Currently you only get the Counts. You don't get the margins, I think?
Could we have a box with the 4 options as check-boxes, namely Counts, Row%, Column% (or Col%), Cell%.
Ideally you could have them all. If that is a mess (maybe, because there are also multiple sets of tables) then perhaps you initially choose one of them. Or, if there is just one variable in the first receiver, then you allow all, but if more than 1, then you choose a single option. (There would be some sense to that, in that you would then, to some extent, be comparing the different variables, and hence you choose on which summary to compare them.)

The margins are also interesting. Sometimes you want one, but not both. Initially I would be happy with a single checkbox so you either get the margins or not. Ideally there would eventually also be another checkbox, perhaps only visible when you ask for one of the percents. It could be labelled as "Counts for 100%". (This is what Genstat does as default.)
The 100% is useful for teaching, as it reminds you what is 100%. But once you know that, then the table is much more useful if the 100% is replaced (perhaps an option to add?) by the Counts - that answers the question of "percent of what".
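
The four options and the margins can be sketched in base R. This is an illustration on the built-in `warpbreaks` data, not the R-Instat implementation; the last table shows one possible reading of "Counts for 100%", with the row counts in place of a column of 100s.

```r
# Illustration only: a two-way frequency table with the four proposed
# options, using the built-in warpbreaks data.
counts <- table(warpbreaks$wool, warpbreaks$tension, dnn = c("wool", "tension"))

row_pct  <- 100 * prop.table(counts, margin = 1)  # Row%
col_pct  <- 100 * prop.table(counts, margin = 2)  # Column%
cell_pct <- 100 * prop.table(counts)              # Cell%

# Margins: add row and column totals to the counts
counts_with_margins <- addmargins(counts)

# "Counts for 100%": row percentages, with the row count shown in place
# of the 100% margin, which answers "percent of what"
row_pct_counts <- cbind(round(row_pct, 1), Count = rowSums(counts))

print(counts_with_margins)
print(row_pct_counts)
```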

@rdstern added this to the 0.7.4 milestone on Oct 29, 2021
@rdstern
Collaborator Author

rdstern commented Oct 29, 2021

@Ivanluv the suggestions just above are for the situation with categorical by categorical.
There is also the trivial one of changing the name on the top of the dialogue to Two Variables Summarise.
Then all options may change, but the urgent one is Categorical by Categorical. Here is the current dialogue and results for the standard rice survey data:
image

There are no options and the display is pretty awful!
I suggest this dialogue could become one of our "work-horses" for many users doing simple analyses. They often do look at the results from 2 variables. So the initial set of suggestions is to have a good display of 2-way frequency tables, and this makes a set of tables that are special cases of our new Describe > Specific > Frequency Tables dialogue. The layout shown above could still be included, perhaps with percentages included and these could then be saved into a new data frame. It would be slightly different, because count would be the name of the variable - and there could be others (percents) and there would be a first variable called Table. And there should be an option "Include zeros". This - with these changes, would also be one option for the display in the output window.

@Ivanluv
Contributor

Ivanluv commented Nov 10, 2021

@rdstern should I use the sjPlot::sjtab function as the one in Describe > Specific > Frequency Tables dialogue to implement the improvements you have suggested above?

@rdstern
Collaborator Author

rdstern commented Jan 6, 2022

@Ivanluv now I see at least one example of the questions where you expected an answer - and I missed it, and you didn't remind me! Also perhaps that you would like more specific direction - though (as a programmer) you are asking more detail from me than (as a user) I have.
I assume sjtab is a possibility. But I was hoping it would be a simple case for mmtable2 as the default. That is what we are using for the general tabulation now, so it would be a nice introduction to that package to use it here (and in the 3-way, once we get that dialogue). That may be obvious to you now, but otherwise, perhaps @lilyclements could confirm or deny?

@lilyclements
Contributor

Using mmtable2 here seems like a good solution, and implementation-wise (I assume) is very similar to the work already done in the summaries dialog.

@Ivanluv
Contributor

Ivanluv commented Jan 19, 2022

@lilyclements the object produced by
data_book$frequency_tables(data_name="survey", x_col_names=c("village","fertgrp"), y_col_name="variety", store_results=FALSE, as_html=FALSE) is of type NULL. How can I pass it to mmtable?

@dannyparsons modified the milestones: 0.7.4, 0.7.5 on Jan 24, 2022
@lilyclements
Contributor

@lilyclements the object produced by data_book$frequency_tables(data_name="survey", x_col_names=c("village","fertgrp"), y_col_name="variety", store_results=FALSE, as_html=FALSE) is of type NULL. How can I pass it to mmtable?

It cannot be passed to mmtable2 in its current form. If you run this code in R, you can see the output is two tables:

image

Out of interest, how does frequency_tables differ to using summary_tables and limiting to the frequency-type variables? (like we are in the frequency tables dialog) Is there a reason this is being used here?

@rdstern
Collaborator Author

rdstern commented Jan 26, 2022

I wonder if this starts to raise the more general question of whether we have a separate Describe > Specific > Frequency and Summary tables dialogue? Should we consider having a Tables dialogue with a frequency and Summary button at the top?

The main differences in the frequency tables is just the need for the percentages.

@dannyparsons
Contributor

That sounds sensible.

@rdstern changed the title from "Describe > Two Variables > Summarise" to "Describe > Two/Three Variables > Summarise" on Nov 5, 2023
@rdstern
Collaborator Author

rdstern commented Nov 10, 2023

@lilyclements many thanks for that. With your nice neat layout above why not have another summary table from Categorical by Numeric by Categorical? So it is the same as Categorical by Categorical by Numeric?

@lilyclements
Contributor

@rdstern I've amended my table to reflect those changes :)

@rdstern
Collaborator Author

rdstern commented Nov 10, 2023

@derekagorhom can you try this one? Perhaps even share with Raphael, if he is ready? It is a good one to build carefully on the 2-variable code and includes a lot of statistics too! It would be good for Sabi to test.

@rdstern
Collaborator Author

rdstern commented Nov 23, 2023

@derekagorhom I can understand why you have been quiet on this one, given all you have been doing concerned with the AIMS course. Are you happy to work on this one, once that is over, or do you have too many other tasks just now?

@derekagorhom
Contributor

@rdstern sorry for the late reply. Yes, I will work on it next week, but if someone else would like to attempt it, that is fine with me.

@rdstern
Collaborator Author

rdstern commented Dec 6, 2023

@derekagorhom this is an important dialog we need to get working. I am starting to be concerned that you may be spending too long helping on the new visualise dialog, which is fun but much less important. I had hoped that work on this one might have started while Lily was visiting. Now it could involve @fran2or for support. I'd be happy for him to be spending a bit more time on R-Instat, and this one is also now in the climatic menu as well as in describe.

@rdstern
Collaborator Author

rdstern commented Dec 8, 2023

@derekagorhom you have been very quiet these last 2 weeks. Is everything ok?

@derekagorhom
Contributor

derekagorhom commented Dec 8, 2023

@rdstern sorry for being quiet on this issue.
I was able to implement CxCxC and CxCxN for this option.
I am having a problem adding the CxNxN function because of how the summaries were programmed:
image
For the three variable option, the second variable only displays as categorical even when it is a numeric value...
I was hoping to get it fixed with Antoine by Monday.

@rdstern
Collaborator Author

rdstern commented Dec 10, 2023

@derekagorhom that's great - many thanks. I was only concerned if the work hadn't started.
Four of the 8 options are N by something, by something and we assume all can be ANOVA, so they should be quick, once you start on them.

@rdstern
Collaborator Author

rdstern commented Apr 22, 2024

@Vitalis95 I have yet to check your recent pull request. But I had a good discussion with @volloholic and am now ready to list the way I suggest this dialog, with our summary methods, should work.
I am now even more comfortable with the main change, starting with the 3-way, that when the multiple (first) receiver is numeric, then the summary should include anova. Here I am therefore specifying what the 2-variable should do. The next entry will move to the 3-variable. But I suggest we merge once the 2-variable is presentable, and then keep the 3-variable hidden. We may even merge when the 2-variable is sort of ok, even if all the improvements are not yet implemented.

  1. So here goes for 2 variable - initially:
    a) Multiple Categoric, second Categoric, gives frequency tables, one table for each of the multiple receiver. That is as now, but I think currently it may put everything into one big frequency table. This is now a set of separate 2-way frequency tables, now we can do that.
    b) Multiple Categoric, second Numeric. Summary tables, where the (maybe multiple) summaries are for each factor. So I think it should be multiple summaries as columns, by each of the categorical variables in turn. So again multiple tables - now we can do that!
    c) Multiple numeric now gives ANOVA table, whether second is numeric or categorical. It gives a separate ANOVA table for each of the variables in the multiple receiver.
    This is done for one of the options already, but it uses a function. I would prefer it to use just the code for the commands instead, so you see that it is using lm. This is now done already (by @lilyclements and @derekagorhom) in the (more complicated) 3-way case. I hope you can adapt that code.

That's stage 1 for 2 variables. Notice we have lost the correlations option. Don't delete that code, because I suggest we still need it, see below.
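
A minimal sketch of case (c) using `lm` directly, so the model being fitted is visible. The helper name `one_way_anovas()` is hypothetical and the built-in `iris` data stands in for real data; this is not the R-Instat code.

```r
# Hypothetical helper (not R-Instat code): one ANOVA table per variable
# in the multiple receiver (y_vars), each fitted against the same x.
one_way_anovas <- function(data, y_vars, x_var) {
  tables <- lapply(y_vars, function(y) {
    # Builds the formula y ~ x_var from the column names
    model <- lm(reformulate(x_var, response = y), data = data)
    anova(model)
  })
  setNames(tables, y_vars)
}

y_vars <- c("Sepal.Length", "Petal.Length")

# Second variable categorical
by_factor <- one_way_anovas(iris, y_vars, "Species")

# Second variable numeric
by_numeric <- one_way_anovas(iris, y_vars, "Sepal.Width")

print(by_factor$Sepal.Length)
print(by_numeric$Sepal.Length)
```

The same call covers a numeric or a categorical second variable, which matches the proposal that a numeric multiple receiver always gives ANOVA tables.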

So two (new) improvements:

a) Numeric by Numeric we could also have the correlations. So add a checkbox Correlations. Default unchecked. If checked, then it gives the ANOVA anyway, plus the correlations. (Later we may add another checkbox perhaps saying Model where we give the formula for the regression line. Again default is unchecked.)
b) Numeric by Categorical. Have a checkbox with label Means. If checked it gives the Means as well as the ANOVA table.

c) And another change, maybe later. (But I think it is a real "goodie" and the first steps can be done now!) Add a checkbox saying Swap y and x. Default unchecked. For now make it disabled.

d) In the variables for this Summary (top radio button) Add (y) to the name, so it becomes First Variables (y): And also Second Variable (x):

I would like to merge initially at this stage. Then continue with the rest below:

Initially I am just interested in
a) Numeric by Numeric: Then it gives the ANOVA with the same y (second variable) and each x in turn. The default is ANOVA for lots of alternative y variables and same x.
b) Categorical by Numeric becomes effectively Numeric by Categorical, so it is now an ANOVA with one factor (as the ordinary Numeric by Categorical would be), but with the same y and lots of categorical x's.
(I'll worry about the other combinations later! I hope we don't need to change anything there! So the Swap y and x checkbox is currently disabled for Categorical by Categorical and Numeric by Categorical.)

@Vitalis95
Contributor

@rdstern, @lilyclements some clarifications on the following:
In the 2-variable summaries, Categorical by Numeric currently gives an ANOVA table. Should it be summary tables, so that when we swap it gives ANOVA tables?
Also, Numeric by Categorical gives summary tables. Should it be ANOVA tables, or both?

@lilyclements
Contributor

lilyclements commented Apr 29, 2024

@Vitalis95

If the y is numerical, and the x is categorical, it should give an ANOVA table. Is this what you mean by categorical by numeric? (Apologies, I can get confused!)

If the x is numerical, and the y is categorical, we can get summaries. If the y is categorical, then we shouldn't have an ANOVA table. (an ANOVA table is fitted to a model where y is normally distributed)

@rdstern
Collaborator Author

rdstern commented Apr 30, 2024

@Vitalis95 we can chat today. I think you are correct and that's what I posted last week. You may want to read that post again?
Numeric (Multiple) by Categorical now gives ANOVA and so does Numeric (Multiple) by Numeric.
With Numeric (Multiple) by Numeric you now also add a Correlations checkbox, default unchecked.
With Numeric (Multiple) by Categorical you now add a Means checkbox. Default unchecked. If checked it also gives a table of means.
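
What the two checkboxes would add can be sketched in base R. This is an illustration on the built-in `iris` data, not the R-Instat implementation; the logical flags are hypothetical stand-ins for the checkbox states.

```r
# Hypothetical stand-ins for the two checkboxes (default would be FALSE)
show_means <- TRUE
show_correlations <- TRUE

# Numeric by Categorical: the ANOVA table, plus a table of means if checked
fit_cat <- aov(Sepal.Length ~ Species, data = iris)
anova_cat <- anova(lm(Sepal.Length ~ Species, data = iris))
means_table <- if (show_means) model.tables(fit_cat, type = "means") else NULL

# Numeric by Numeric: the ANOVA table, plus the correlation if checked
anova_num <- anova(lm(Sepal.Length ~ Sepal.Width, data = iris))
correlation <- if (show_correlations) cor(iris$Sepal.Length, iris$Sepal.Width) else NULL

print(anova_cat)
print(means_table)
print(anova_num)
print(correlation)
```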

@rdstern
Collaborator Author

rdstern commented Jul 12, 2024

This still needs the 3 variable, so I'm re-opening

@Vitalis95
Contributor

@lilyclements, for the 3 variables, when means=TRUE,

y_col_names_list <- "yield"
purrr::walk(.x=y_col_names_list, .f= ~ data_book$anova_tables2(data="survey",  x_col_names=c("variety", "fertgrp"), y_col_name=.x, signif.stars=FALSE, sign_level=FALSE, means=TRUE, total=TRUE))
rm(y_col_names_list)

We get the following error;

image

Please can you also add the interaction term

@Vitalis95
Contributor

@lilyclements, the include_margins argument in summary_table produces an error. It used to work before.

image

here is the code;

survey <- data_book$get_data_frame(data_name="survey")
last_table <- survey %>% pivot_wider(names_from={{ .x }}, values_from=value) %>% purrr::map(.f=~data_book$summary_table(data_name="survey", percentage_type="factors", perc_total_factors="variety", summaries=count_label, include_margins=TRUE, margin_name="All", treat_columns_as_factor=FALSE, columns_to_summarise=.x, factors=c("village",.x)) %>% pivot_wider(names_from={{ .x }}, values_from=value) %>% gt::gt(), .x="variety")
data_book$add_object(data_name="survey", object_name="last_table", object_type_label="table", object_format="html", object=last_table)
data_book$get_object_data(data_name="survey", object_name="last_table", as_file=TRUE)

@lilyclements
Contributor

lilyclements commented Oct 21, 2024

@Vitalis95 thanks for this. To fix this, can you amend the anova_tables2 function in the data_object_R6.R file to be:

(Really simple - just changing the line

    if (class(mod$model[[x_col_names]]) %in% c("numeric", "integer")){

to

    if (class(mod$model[[x_col_names[[1]]]]) %in% c("numeric", "integer")){

)
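
One plausible reading of why this one-line change helps, sketched on the built-in `warpbreaks` data rather than the survey data (the actual error message is only in the screenshot above, so this is an assumption). With two x columns, `[[` receives a length-2 character vector and attempts recursive indexing, which fails; taking the first name gives a single column to check.

```r
# Illustration only: a model with two x columns, as in the 3-variable case
mod <- lm(breaks ~ wool + tension, data = warpbreaks)
x_col_names <- c("wool", "tension")

# Old check: `[[` with a length-2 vector tries mod$model[["wool"]][["tension"]]
old_check <- tryCatch(class(mod$model[[x_col_names]]), error = function(e) e)

# New check: look at the first x column only
new_check <- class(mod$model[[x_col_names[[1]]]])

print(inherits(old_check, "error"))
print(new_check)
```

Note that the new check only inspects the first x variable, so a mix of numeric and categorical x columns is classified by whichever comes first.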

If it's easier: The entire function should now be:

DataSheet$set("public", "anova_tables2", function(x_col_names, y_col_name, total = FALSE, signif.stars = FALSE, sign_level = FALSE, means = FALSE) {
  if (missing(x_col_names) || missing(y_col_name)) stop("Both x_col_names and y_col_names are required")
  if (sign_level || signif.stars) message("This is no longer descriptive")
  if (sign_level) end_col = 5 else end_col = 4
  
  # Construct the formula
  if (length(x_col_names) == 1) {
    formula_str <- paste0(as.name(y_col_name), "~ ", as.name(x_col_names))
  } else if (length(x_col_names) > 1) {
    formula_str <- paste0(as.name(y_col_name), "~ ", as.name(paste(x_col_names, collapse = " + ")))
  }

  # Fit the model
  mod <- lm(formula = as.formula(formula_str), data = self$get_data_frame())
  anova_mod <- anova(mod)[1:end_col] %>% tibble::as_tibble(rownames = " ")

  # Add the total row if requested
  if (total) anova_mod <- anova_mod %>% tibble::add_row(` ` = "Total", dplyr::summarise(., across(where(is.numeric), sum)))
  anova_mod$`F value` <- round(anova_mod$`F value`, 4)
  if (sign_level) anova_mod$`Pr(>F)` <- format.pval(anova_mod$`Pr(>F)`, digits = 4, eps = 0.001)
  cat(paste0("ANOVA of ", formula_str, ":\n"))
  print(anova_mod)
  cat("\n")
  # Optionally print means
  if (means) {
    if (class(mod$model[[x_col_names[[1]]]]) %in% c("numeric", "integer")){
      cat("Model coefficients:\n")
      print(mod$coefficients)
      cat("\n")
    } else {
      cat(paste0("Means table of ", y_col_name, ":\n"))
      print(model.tables(aov(mod), type = "means"))
      cat("\n")
    }
  }
})
