1 write get baseline predictions function #4
Conversation
This is looking good! I added a few comments -- mostly minor, one question of substance about how to handle the two different output types.
#' @param seed integer specifying a seed to set for reproducible results.
#'   Defaults to NULL, in which case no seed is set.
I'd vote to take this argument out, and if the user wants reproducible results they can just set the seed before calling the function.
I found that the results actually weren't reproducible unless I set the seed within the function, which is why I added the argument.
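For illustration, here is a minimal sketch of the pattern under discussion: an optional `seed` argument that defaults to `NULL` and, when supplied, is set inside the function body. The function name, arguments, and baseline logic here are hypothetical, not the package's actual API.

```r
# Hypothetical sketch of a seeded baseline predictor (not the package's real
# function): draws simulated trajectories around the last observed value.
predict_baseline <- function(obs, n_sims = 100, seed = NULL) {
  # Setting the seed inside the function guarantees reproducibility for this
  # call regardless of any RNG use between the caller's set.seed() and here.
  if (!is.null(seed)) set.seed(seed)
  last <- obs[length(obs)]
  # Hypothetical noise model: normal noise scaled by observed step-to-step sd.
  last + rnorm(n_sims, mean = 0, sd = sd(diff(obs)))
}
```

With this pattern, two calls with the same `seed` return identical results, which is the behavior the argument was added to guarantee.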
num_locs <- length(unique(target_ts[["location"]]))
if (num_locs != 1) {
  cli::cli_abort("{.arg target_ts} contains {.val {num_locs}} locations, but only one may be provided.")
}
I believe that this is the only place in this function where the `location` column is used. It seems like this limits the flexibility of the function: the input data must encode the location in a column named `location`, and the validation will not check for other types of disaggregation, e.g. by age group.

As a proposal, what if we got rid of the requirement of a column named `location` and of this check that there is only a single location, and replaced them with a check that there are no duplicated values in the `time_index` column? This is not exactly logically equivalent, but in most cases a data set with more than one location will also contain duplicates in the time index.

Another idea, heading in the other direction, would be to check that there is only one value in any column not named `time_index` or `observation`. That would mean we could still do this check, but it would apply generally and would not rely on this specific column name.

I'm open to other ideas too, or to keeping this as is for now and filing an issue to address it later.
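Both proposed alternatives can be sketched together. The function below is an illustration, not the package's code: it uses base `stop()` in place of `cli::cli_abort()`, and the column names `time_index` and `observation` are assumptions carried over from the discussion above.

```r
# Sketch of the two proposed validation strategies (hypothetical helper name).
validate_target_ts <- function(target_ts) {
  # Proposal 1: no duplicated values in the time_index column. A data set
  # covering more than one location will usually trip this check.
  if (anyDuplicated(target_ts[["time_index"]]) > 0) {
    stop("`target_ts` contains duplicated `time_index` values.")
  }
  # Proposal 2: every column other than time_index and observation must hold
  # a single value, so the check generalizes beyond a column named "location".
  other_cols <- setdiff(names(target_ts), c("time_index", "observation"))
  for (col in other_cols) {
    if (length(unique(target_ts[[col]])) > 1) {
      stop(sprintf("Column `%s` must contain a single unique value.", col))
    }
  }
  invisible(target_ts)
}
```

Note the two checks are complementary: the first catches duplicated time points directly, while the second catches any grouping column with multiple values even when the time index happens not to repeat.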
I wrote the code like this based on the hubverse framework, in which target data in a time series format always has the three columns `time_index`, `location`, and `observation`; the validation function for target time series data makes the same assumption. I don't know whether this framework has changed for the hubverse, or whether we'd prefer to do something different for this package, but I think we should discuss and then decide at the package level, rather than just for this function in particular.
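For concreteness, a minimal example of target time series data in the three-column shape described above. The values are invented; only the column names and types follow the convention under discussion.

```r
# Illustrative target_ts in the assumed hubverse three-column format:
# one row per time point for a single location.
target_ts <- data.frame(
  time_index  = as.Date(c("2023-01-07", "2023-01-14")),
  location    = c("US", "US"),
  observation = c(1200, 1340)
)
```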
One other request: could we add the GitHub Action to run unit tests on pull requests to this repo? To avoid adding more stuff to this PR, it may be good to add that in a separate PR we could merge first.
Co-authored-by: Evan Ray <[email protected]>