Scenario File Checks
The complete list of checks is available on the Scenario Modeling Hub website - Validation Documentation
Each submission will be validated using the `validate_submission()` function from the SMHvalidation R package.
The package is currently available on GitHub. To install it, run the following commands:
install.packages("remotes")
remotes::install_github("midas-network/SMHvalidation",
build_vignettes = TRUE)
This package can also be manually installed by directly cloning/forking/downloading the package from GitHub.
To load the package, execute the following command:
library(SMHvalidation)
The package contains a `validate_submission()` function allowing the user to check their SMH submissions locally.
To test a submission file, the function requires two parameters:
- `path`: path to the submission file (or folder, for partitioned data) to test. A vector of parquet files can also be provided; in this case, the validation is run on the aggregation of all the parquet files together, and each file individually should match the expected SMH standard. If `partition` is not set to `NULL`, the path to the folder containing only the partitioned data should be provided.
- `js_def`: path to the JSON file containing round definitions (names of columns, target names, ...) following the `tasks.json` Hubverse format. The information in the JSON file can be separated into multiple groups for each round:
  - The "task_ids" object defines both labels and contents for each column in submission files. Any unique combination of the values defines a single modeling task. For example, for SMH it can be the columns: `"scenario_id"`, `"location"`, `"origin_date"`, `"horizon"`, `"target"`, `"age_group"`, `"race_ethnicity"`.
  - The "output_type" object defines accepted representations for each task. For example, for SMH it concerns the columns: `"output_type"`, `"output_type_id"`, `"run_grouping"`, `"stochastic_run"` and `"value"`.
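To see how the two groups map onto a file, the sketch below parses a heavily simplified, made-up fragment shaped like a Hubverse `tasks.json` and lists the names each group contributes. The JSON content is illustrative only (it is not the real SMH configuration, and real files hold more fields and values), and using the `jsonlite` package to read it is an assumption made for this example.

```r
library(jsonlite)

# Illustrative fragment only: a real tasks.json has more fields and values.
tasks_txt <- '{
  "rounds": [{
    "round_id": "2024-08-11",
    "model_tasks": [{
      "task_ids": {
        "scenario_id": {"required": ["A-2024-08-01"], "optional": null},
        "location":    {"required": ["US"], "optional": ["06"]},
        "origin_date": {"required": ["2024-08-11"], "optional": null},
        "horizon":     {"required": [1, 2], "optional": null},
        "target":      {"required": ["inc hosp"], "optional": null}
      },
      "output_type": {
        "quantile": {
          "output_type_id": {"required": [0.25, 0.5, 0.75]},
          "value": {"type": "double", "minimum": 0}
        }
      }
    }]
  }]
}'

# Keep the nested list structure instead of simplifying to vectors.
cfg <- fromJSON(tasks_txt, simplifyVector = FALSE)
task <- cfg$rounds[[1]]$model_tasks[[1]]

task_id_cols <- names(task$task_ids)     # columns that identify a modeling task
output_types <- names(task$output_type)  # accepted representations
print(task_id_cols)
print(output_types)
```

The same navigation (`rounds`, then `model_tasks`, then `task_ids` or `output_type`) can be used to inspect a real round definition before running the validation.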
Some additional optional parameters are available:
- `lst_gs`: named list of data frames containing the observed data. We highly recommend using the output of the `SMHvalidation::pull_gs_data()` function as input. This function generates the output in the expected format with the expected data. For more information, please see `?pull_gs_data()`. This parameter can be set to `NULL` (default) to skip the comparison between the values and the observed data.
- `pop_path`: path to a table containing the population size of each geographical entity by FIPS (in a column "location") and by location name. For example, the path to the locations file in the COVID19 Scenario Modeling Hub GitHub repository. This parameter can be set to `NULL` (default) to skip the comparison between the values and the population size data.
- `merge_sample_col`: Boolean indicating whether, in the submission file(s), the output type `"sample"` has the `"output_type_id"` column set to `NA` and the information is instead contained in 2 columns: `"run_grouping"` and `"stochastic_run"`. By default, `FALSE`.
- `partition`: character vector indicating if the submission file is partitioned and, if so, which field (or column) names correspond to the path segments. By default, `NULL` (no partition). See the arrow R package for more information, especially the `arrow::write_dataset()` and `arrow::open_dataset()` functions. Warning: if the submission file is in a "partitioned" format, the `path` parameter should point to a folder containing ONLY the "partitioned" files. If any other file is present in the directory, it will be included in the validation.
- `n_decimal`: integer, number of decimal places accepted in the column `"value"` (only for the `"sample"` output type). If `NULL` (default), no limit is expected.
- `round_id`: character string, round identifier. This identifier is used to extract the associated round information from the `js_def` parameter. If `NULL` (default), it is extracted from `path`.
- `verbose`: Boolean; if `TRUE` (default), the output report contains additional information about the sample pairing. **Only available for submissions with the sample output type.**
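As an illustration of the `partition` parameter, the sketch below writes a tiny made-up table as a partitioned dataset with `arrow::write_dataset()` and reads it back with `arrow::open_dataset()`. The data, the temporary folder, and the choice of `hive_style = FALSE` (so that path segments hold only the values, with the field names passed separately) are assumptions; check the round documentation for the layout it actually expects.

```r
library(arrow)

# Made-up rows; a real submission has more columns and many more rows.
df <- data.frame(
  origin_date = as.Date("2024-08-11"),
  target = c("inc hosp", "inc hosp", "cum hosp"),
  output_type = "quantile",
  output_type_id = c(0.25, 0.75, 0.5),
  value = c(10, 20, 35)
)

# A folder that will contain ONLY the partitioned files.
out_dir <- file.path(tempdir(), "MyTeam-MyModel")
unlink(out_dir, recursive = TRUE)

# Field names that become path segments, in this order.
partition <- c("origin_date", "target")
write_dataset(df, out_dir, format = "parquet",
              partitioning = partition, hive_style = FALSE)

# The same field names tell arrow how to interpret the path segments.
ds <- open_dataset(out_dir, partitioning = partition)
n_rows <- nrow(ds)
print(list.files(out_dir, recursive = TRUE))
```

A folder written this way would then be passed as `path`, together with the same `partition` vector, to `validate_submission()`.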
To test the model output projections from Round 1 2024-2025, please use at least version 0.1.1 of the validation package:
It is important to set the working directory to the folder containing all the data required to run the validation. Here, for example, we use a path to `flu-scenario-modeling-hub/` and the projection from the "MyTeam-MyModel" group.
Then, all the parameters can be set:
setwd("~/flu-scenario-modeling-hub")
# Path to the file to validate
projection_path <- "data-processed/MyTeam-MyModel/2024-08-11-MyTeam-MyModel.gz.parquet"
# Path to JSON file containing round definitions
js_def <- "hub-config/tasks.json"
# path to a table containing the population size of each geographical entity by FIPS
pop_path <- "data-locations/locations.csv"
Following the documentation associated with the round, available in the data-processed/README.md, some optional parameters should be set:
- `n_decimal = 1`
- `merge_sample_col = TRUE`, as the sample pairing information is expected to be available in two columns (`run_grouping` and `stochastic_run`)
validate_submission(projection_path, js_def, pop_path = pop_path, n_decimal = 1,
merge_sample_col = TRUE)
The function can generate 3 different outputs (additional pairing information on the `sample` output type might also be added to the report):
- message when the submission does not contain any issues
- warning + report message when the submission contains one or multiple minor issues that do not prevent the submission from being included.
- error + report message when the submission contains one or multiple minor and/or major issues that prevent the submission from being included. In this case the submission file will have to be updated to be included in the corresponding SMH round.
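When validation runs inside a script, the three outcomes can be told apart with base R condition handling. The helper below is a hypothetical sketch (`classify_validation()` is not part of SMHvalidation) and assumes the function signals ordinary R messages, warnings, and errors as described above. It is demonstrated on stand-in expressions rather than on a real submission.

```r
# Hypothetical helper: evaluates an expression and labels the outcome.
classify_validation <- function(expr) {
  tryCatch(
    {
      expr  # the argument is evaluated here; plain messages pass through
      "no issue"
    },
    warning = function(w) "minor issue(s)",
    error = function(e) "major issue(s)"
  )
}

# Stand-ins for the three kinds of output.
res_ok <- classify_validation(message("all the validation checks were successful"))
res_warning <- classify_validation(warning("minor issue"))
res_error <- classify_validation(stop("major issue"))
print(c(res_ok, res_warning, res_error))
```

With the package installed, the call would look like `classify_validation(validate_submission(projection_path, js_def))`. Note that a `tryCatch()` warning handler stops evaluation at the first warning, so this pattern suits a quick pass/fail check more than collecting the full report.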
For example, a successful run returns a message such as:
Run validation on files: 2024-08-11-MyTeam-MyModel.gz.parquet
End of validation check: all the validation checks were successful
Before submitting, please verify that the submission file(s) are in the expected `data-processed/` team-model folder and that no additional folders are in the repository, to avoid issues during the automatic validation.
If you want to test a previous round's submission:
As the submission file format was updated in 2024, please use a past version of the package to validate previous rounds. As the validation requirements and parameter behavior have evolved over time, please refer to the past version of the documentation included in the package to run the validation function.
The SMHvalidation R package contains plotting functionality to output a plot of each location and target, with all scenarios and observed data incorporated. The visualization function accepts only the quantile output type.
To run this visualization locally:
lst_gs <- NULL # set to NULL to not compare to observed data
generate_validation_plots(projection_path, lst_gs, save_path = getwd(), partition = c("origin_date", "target"))
The function will generate a PDF file with the visualizations.
If projection files are submitted with the quantile output type, the visualization function is called during the automatic validation and the output PDF file is made available as an "artifact" of the GitHub Action. To find it, click on 'details' on the right of the 'Validate submission' GitHub Action check; the PDF is available in a ZIP file among the artifacts. For more information, please see here