Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor prepare_serodata #191

Closed
zmcucunuba opened this issue May 15, 2024 · 3 comments
Closed

refactor prepare_serodata #191

zmcucunuba opened this issue May 15, 2024 · 3 comments

Comments

@zmcucunuba
Copy link
Member

zmcucunuba commented May 15, 2024

At the moment, prepare_serodata only produces age_mean_f and adds some columns such as binomial confidence intervals. They could remain. But perhaps we don't need the sample_size column, which simply calculates the total sample size and adds that column, which may be misleading. In the short term, we must remove the sample_size column.

In the mid-long term, we need prepare_serodata to quality-control the data for the user.

@ben18785
Copy link
Collaborator

A few thoughts:

  • We already do checking of the user inputs within the fit_seromodel function through validate_prepared_serodata function. To me, this means that it's not necessary to force users to go through a pre-fitting quality checking.
  • prepare_serodata outputs a data frame with lots of columns in it. Only a small subset of these are necessary for fitting the model: total, counts, age_mean_f (I think just renamed age_middle) and tsur (this only being needed if there are multiple serosurveys).
  • The other columns outputted by prepare_serodata are useful for plotting.

A proposal:

  • prepare_serodata is kept but perhaps renamed to prepare_serodata_for_plotting. I would suggest modifying it to change column names as per the below.
  • For fitting via fit_seromodel, we require a user only to supply a data frame with age_min, age_max, n_tested, n_seropositive and year_survey. We make it clear within the function documentation that we just use the mean of age_min and age_max for the age when performing inference. This gives us flexibility for the future when we may want to account for the uncertainty in the distribution of ages within the bins when doing inference. This would mean that we would need to change validate_prepared_serodata (perhaps renamed as validate_serodata) to handle the new inputs.

@zmcucunuba
Copy link
Member Author

zmcucunuba commented May 17, 2024

Yes, I agree—many thanks. We agreed with Sumali to have something like:

prepare_sero_data <- function(my_raw_sero_data) {

# Use my_raw_sero_data to produce a data.frame with 5 names: 
sero_data <- data.frame(n_tested, n_seropositive, year_survey, age_min, age_max)
return(sero_data)

}

`sero_data` <- prepare_sero_data(my_raw_sero_data)

sero_data witll be used in other functions like set_sero_model, fit_sero_modeland others related to #190

@ntorresd
Copy link
Member

As pointed out by @ben18785, the only needed variables for inference are the age group marker age_group, n_seropositive and n_sample (old sample_size). From age_min and age_max we can obtain age_group (if not present) using add_age_group_to_serosurvey(); this is done both for inference and plotting purposes. This is why from v1.0.1 we introduced prepare_serosurvey_for_plotting(), which only adds the binomial confidence intervals (this function is exported).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants