Scenario File Checks

Lucie Contamin edited this page Aug 14, 2024 · 9 revisions

File Check List

The complete list of checks is available on the Scenario Modeling Hub website - Validation Documentation

File Checks Running Locally

Each submission is validated using the validate_submission() function from the SMHvalidation R package.

The package is currently available on GitHub. To install it, run:

install.packages("remotes")
remotes::install_github("midas-network/SMHvalidation", 
                        build_vignettes = TRUE) 

This package can also be manually installed by directly cloning/forking/downloading the package from GitHub.

To load the package, execute the following command:

library(SMHvalidation)

The package contains a validate_submission() function allowing the user to check their SMH submissions locally.

Prerequisite

To test a submission file, the function requires multiple parameters:

  • path: path to the submission file (or folder, for partitioned data) to test. A vector of parquet files can also be provided; in this case, the validation is run on the aggregation of all the parquet files, and each file individually should match the expected SMH standard. If partition is not set to NULL, path should point to the folder containing only the partitioned data.

  • js_def: path to JSON file containing round definitions: names of columns, target names, ... following the tasks.json Hubverse format

    • the information in the JSON file can be separated in multiple groups for each round:
      • The "task_ids" object defines both labels and contents for each column in submission files. Any unique combination of the values defines a single modeling task. For example, for SMH it can be the columns: "scenario_id", "location", "origin_date", "horizon", "target", "age_group", "race_ethnicity".
      • The "output_type" object defines accepted representations for each task. For example, for SMH it concerns the columns: "output_type", "output_type_id", "run_grouping", "stochastic_run" and "value".
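
To illustrate how unique combinations of "task_ids" values define modeling tasks, here is a minimal sketch in base R. The values below are hypothetical and much shorter than what a real tasks.json contains; only the column names come from the list above.

```r
# Hypothetical, shortened task_ids values; a real tasks.json lists many more.
task_ids <- list(
  scenario_id = c("A-2024-08-01", "B-2024-08-01"),
  location    = c("US", "06"),
  target      = c("inc hosp", "cum hosp"),
  horizon     = 1:3
)

# One row per unique combination of values, i.e. one row per modeling task.
tasks <- expand.grid(task_ids, stringsAsFactors = FALSE)

nrow(tasks)  # 2 * 2 * 2 * 3 = 24 modeling tasks
head(tasks, 3)
```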

Some additional optional parameters are available:

  • lst_gs: named list of data frames containing the observed data. We highly recommend using the output of the SMHvalidation::pull_gs_data() function as input, as it generates the output in the expected format with the expected data. For more information, please see ?pull_gs_data(). This parameter can be set to NULL (default) to skip the comparison between the values and the observed data.
  • pop_path: path to a table containing the population size of each geographical entity by FIPS (in a column "location") and by location name. For example, path to the locations file in the COVID19 Scenario Modeling Hub GitHub repository. This parameter can be set to NULL (default) to not run a comparison between the value and the population size data.
  • merge_sample_col: Boolean indicating whether, in the submission file(s), the output type "sample" has the "output_type_id" column set to NA and the pairing information is instead contained in two columns: "run_grouping" and "stochastic_run". By default, FALSE.
  • partition: character vector indicating if the submission file is partitioned and, if so, which field (or column) names correspond to the path segments. By default, NULL (no partition). See the arrow R package for more information, especially the arrow::write_dataset() and arrow::open_dataset() functions. Warning: if the submission files are in a "partitioned" format, the path parameter should point to a folder containing ONLY the "partitioned" files. If any other file is present in the folder, it will be included in the validation.
  • n_decimal: integer, number of decimal places accepted in the column "value" (only for "sample" output type); if NULL (default), no limit is applied.
  • round_id: character string, round identifier. This identifier is used to extract the associated round information from the js_def parameter. If NULL (default), extracted from path.
  • verbose: Boolean, if TRUE (default) the output report will contain additional information about the sample pairing. Only available for submissions with the "sample" output type.
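
Because any extra file in a partitioned folder is included in the validation, it can help to inspect the folder first. The sketch below uses only base R; unexpected_files() is a hypothetical helper (not part of SMHvalidation), and the folder layout is an assumed example with one path segment per partition field.

```r
# `unexpected_files()` is a hypothetical helper, not part of SMHvalidation.
# It lists every file under `dir` that does not end in ".parquet".
unexpected_files <- function(dir) {
  files <- list.files(dir, recursive = TRUE)
  files[!grepl("\\.parquet$", files)]
}

# Demonstration on a temporary folder laid out like a partitioned submission
# (assumed layout: <origin_date>/<target>/<file>.parquet).
demo_dir <- file.path(tempfile("partition-demo-"), "MyTeam-MyModel")
part_dir <- file.path(demo_dir, "2024-08-11", "inc hosp")
dir.create(part_dir, recursive = TRUE)
file.create(file.path(part_dir, "part-0.parquet"))
file.create(file.path(demo_dir, "notes.txt"))

unexpected_files(demo_dir)  # "notes.txt"
```

An empty result suggests the folder contains only parquet files and can be passed as path together with the partition parameter.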
    

Run the validation

To test the model output projections from Round 1 2024-2025, please use version 0.1.1 or later of the validation package.
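
One possible way to confirm the version requirement is to compare version strings with base R. In this sketch, meets_minimum() is a hypothetical helper; the check against the installed package only runs if SMHvalidation is available locally.

```r
# `meets_minimum()` is a hypothetical helper; it compares version strings
# with base R's package_version().
meets_minimum <- function(installed, minimum = "0.1.1") {
  package_version(installed) >= package_version(minimum)
}

meets_minimum("0.1.0")  # FALSE
meets_minimum("0.1.1")  # TRUE
meets_minimum("0.2.0")  # TRUE

# Check the locally installed package, if any.
if (requireNamespace("SMHvalidation", quietly = TRUE)) {
  meets_minimum(as.character(utils::packageVersion("SMHvalidation")))
}
```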

Prerequisite

It is important to set the working directory to the folder containing all the data required to run the validation. In this example, the working directory is flu-scenario-modeling-hub/ and the projection comes from the "MyTeam-MyModel" group.

Then, all the parameters can be set:

setwd("~/flu-scenario-modeling-hub")
# Path to the file to validate
projection_path <- "data-processed/MyTeam-MyModel/2024-08-11-MyTeam-MyModel.gz.parquet"
# Path to JSON file containing round definitions
js_def <- "hub-config/tasks.json"
# path to a table containing the population size of each geographical entity by FIPS
pop_path <- "data-locations/locations.csv"
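
Before calling the validation, a quick pre-check of the inputs can catch a wrong working directory or a misnamed file. This is a sketch in base R: missing_inputs() and follows_naming() are hypothetical helpers, and the naming rule (a date, then the team-model folder name) is an assumption inferred from the example path above, not a rule taken from the package.

```r
# Hypothetical helper: return the paths that do not exist
# (relative paths are resolved against the working directory).
missing_inputs <- function(paths) {
  paths[!file.exists(paths)]
}

# Hypothetical helper: TRUE if the file name starts with a YYYY-MM-DD date
# followed by the name of its parent (team-model) folder.
follows_naming <- function(path) {
  date_prefix <- "^[0-9]{4}-[0-9]{2}-[0-9]{2}-"
  file_name <- basename(path)
  team_model <- basename(dirname(path))
  grepl(date_prefix, file_name) &&
    startsWith(sub(date_prefix, "", file_name), team_model)
}

follows_naming(
  "data-processed/MyTeam-MyModel/2024-08-11-MyTeam-MyModel.gz.parquet"
)  # TRUE

# Usage with the variables defined above:
# missing_inputs(c(projection_path, js_def, pop_path))
```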

Validation

Following the documentation associated with the round, available in the data-processed/README.md, some optional parameters should be set:

  • n_decimal = 1
  • merge_sample_col = TRUE, as the sample pairing information is expected in two columns (run_grouping and stochastic_run)

validate_submission(projection_path, js_def, pop_path = pop_path, n_decimal = 1, 
                    merge_sample_col = TRUE)

Output

The function can generate 3 different outputs (additional pairing information on the sample output type might also be added to the report):

  • message when the submission does not contain any issues
  • warning + report message when the submission contains one or multiple minor issues that do not prevent the submission from being included.
  • error + report message when the submission contains one or multiple minor and/or major issues that prevent the submission from being included. In this case the submission file will have to be updated to be included in the corresponding SMH round.

An example run can return a message such as:

Run validation on files: 2024-08-11-MyTeam-MyModel.gz.parquet
End of validation check: all the validation checks were successful
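
When running the validation from a script, the three outcomes listed above can be told apart with base R condition handling. This is a minimal sketch, assuming the warning and error outcomes are signaled as standard R conditions; classify_outcome() is a hypothetical helper and is not part of SMHvalidation.

```r
# `classify_outcome()` is a hypothetical helper. `run` is a function with no
# arguments that performs the validation call.
classify_outcome <- function(run) {
  tryCatch(
    {
      run()  # plain messages are not intercepted and print as usual
      list(status = "ok", report = NULL)
    },
    warning = function(w) list(status = "minor", report = conditionMessage(w)),
    error = function(e) list(status = "blocking", report = conditionMessage(e))
  )
}

# Stand-in functions mimicking the three outcomes:
classify_outcome(function() message("all checks passed"))$status  # "ok"
classify_outcome(function() warning("minor issue"))$status        # "minor"
classify_outcome(function() stop("major issue"))$status           # "blocking"

# Usage with the real call (requires SMHvalidation and the hub files):
# res <- classify_outcome(function() {
#   validate_submission(projection_path, js_def, pop_path = pop_path,
#                       n_decimal = 1, merge_sample_col = TRUE)
# })
```

Note that tryCatch() stops at the first warning, so the returned report holds only that condition's message.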

Before submitting, please verify that the submission file(s) are in the expected data-processed/ team-model folder and that no additional folders are in the repository, to avoid issues during the automatic validation.

Previous round

To test a submission from a previous round, please use a past version of the package, as the submission file format was updated in 2024. Because the validation requirements and parameter behavior have evolved over time, please refer to the documentation included in that past version of the package to run the validation function.

File Visualization Running Locally (only for quantile values)

The SMHvalidation R package contains plotting functionality to output a plot of each location and target, with all scenarios and observed data incorporated. The visualization function accepts only the quantile output type.

To run this visualization locally:

lst_gs <- NULL # set to NULL to not compare to observed data
generate_validation_plots(projection_path, lst_gs, save_path = getwd(), partition = c("origin_date", "target"))

The function will generate a PDF file with the visualizations.
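
Since the visualization accepts only the quantile output type, it can be useful to check whether a submission contains quantiles before plotting. The sketch below works on a data frame that has already been loaded; has_quantiles() is a hypothetical helper, and the toy data only reuse the SMH column names listed earlier on this page.

```r
# `has_quantiles()` is a hypothetical helper, not part of SMHvalidation.
has_quantiles <- function(df) {
  "output_type" %in% names(df) &&
    any(df$output_type == "quantile", na.rm = TRUE)
}

# Toy submissions, for illustration only.
samples_only <- data.frame(
  output_type = "sample",
  output_type_id = NA,
  run_grouping = 1:2,
  stochastic_run = 1:2,
  value = c(10.5, 12.1)
)
with_quantiles <- data.frame(
  output_type = "quantile",
  output_type_id = c(0.025, 0.5, 0.975),
  value = c(8, 11, 15)
)

has_quantiles(samples_only)    # FALSE
has_quantiles(with_quantiles)  # TRUE
```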

If projection files are submitted with the quantile output type, the visualization function is called during the automatic validation and the output PDF file is available as an "artifact" of the GitHub Action. To retrieve it, click on 'details' to the right of the 'Validate submission' GitHub Action check; the PDF is contained in a ZIP file. For more information, please see here