Submission format

Hugo Gruson edited this page Feb 21, 2022 · 5 revisions

Each forecast should be stored as a comma-separated value (csv) file in your data-processed/team-model folder.

The csv file must use a standardised file name, and contain specific variable names and values which identify the forecast you are submitting. This allows us to evaluate and compare across forecasts. The automatic check validates both the file name and the file contents to ensure the file can be used in the visualisation and ensemble forecasting.

File name

Each forecast file within the subdirectory should have the following name format:

YYYY-MM-DD-team-model.csv

Forecast date

The date YYYY-MM-DD is the forecast date. This should be the last day of the submission period (Monday).

team-model

The team and model in this file name must match the name of the data-processed directory this file is in.
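
The file name rules above can be checked before submitting. The following is a minimal sketch, not the hub's own validation code: the helper name `parse_forecast_filename` and the example team-model name are hypothetical, and it only tests the three rules stated on this page (name pattern, Monday forecast date, matching directory name).

```python
import re
from datetime import date

# Hypothetical helper; the pattern follows YYYY-MM-DD-team-model.csv.
FILENAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-(.+)\.csv$")


def parse_forecast_filename(filename, directory_name):
    """Return (forecast_date, team_model) or raise ValueError."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match YYYY-MM-DD-team-model.csv")
    # date.fromisoformat raises ValueError for impossible dates such as 2022-02-30.
    forecast_date = date.fromisoformat(match.group(1))
    if forecast_date.weekday() != 0:  # Monday is 0
        raise ValueError(f"{forecast_date} is not a Monday")
    team_model = match.group(2)
    if team_model != directory_name:
        raise ValueError(
            f"team-model {team_model!r} does not match directory {directory_name!r}"
        )
    return forecast_date, team_model


# "example-model" is a made-up team-model name used only for illustration.
print(parse_forecast_filename("2022-02-21-example-model.csv", "example-model"))
```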

File format

Required variables

The csv file must contain only the following columns (in any order). No additional columns are allowed.

column | column type | description
forecast_date | date | Date as YYYY-MM-DD, last day (Monday) of submission window
scenario_id (optional) | string | One of "forecast" or a specified "scenario ID". If this column is not included it will be assumed that its value is "forecast" for all rows
target | string | "# wk ahead inc case", "# wk ahead inc death" or "# wk ahead inc hosp" where # is usually between 1 and 4
target_end_date | date | Date as YYYY-MM-DD, the last day (Saturday) of the target week
location | string | An ISO-2 country code
type | string | One of "point" or "quantile"
quantile | numeric | For quantile forecasts, one of the 23 quantiles in c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
value | numeric | The predicted count, a non-negative integer number of new cases, hospitalisations or deaths in the forecast week
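
As an illustration of the "only these columns" rule, the sketch below compares a csv header against the column names in the table. The function name `check_columns` and the sample header are hypothetical and this is not the hub's automatic check; it only reports missing and unexpected column names.

```python
import csv
import io

# Column names taken from the table above; scenario_id is the only optional one.
REQUIRED_COLUMNS = {
    "forecast_date", "target", "target_end_date",
    "location", "type", "quantile", "value",
}
OPTIONAL_COLUMNS = {"scenario_id"}


def check_columns(header):
    """Return (missing, extra) column names for a csv header row."""
    columns = set(header)
    missing = REQUIRED_COLUMNS - columns
    extra = columns - REQUIRED_COLUMNS - OPTIONAL_COLUMNS
    return sorted(missing), sorted(extra)


# Columns may appear in any order.
sample = "location,target,type,quantile,value,forecast_date,target_end_date\n"
header = next(csv.reader(io.StringIO(sample)))
print(check_columns(header))  # ([], [])
```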

Notes on each variable

forecast_date

This should correspond with the date in the filename: see above.

scenario_id

This optional column identifies whether a model is predicting a forecast, or using a scenario. The value of scenario_id should be a character (string) and one of:

  • "forecast", indicating that the values are true forecasts, i.e. reflect probabilities of observing future values in the truth data
  • a valid scenario ID

Initially, only forecasts will be accepted, but with the ECDC, we are developing scenarios, e.g. around vaccination and policies. See scenarios for details.

If this column is not included it will be assumed that its value is "forecast" for all rows in the file.

target

Values in the target column must be a character (string) and be one of the following specific targets:

  • "# wk ahead inc case"
  • "# wk ahead inc hosp"
  • "# wk ahead inc death"
"# wk ahead"

"#" will usually be a number between 1 and 4.

For the week ahead horizon, we use Epidemiological Weeks (EW) defined by the US CDC. Each week starts on Sunday and ends on Saturday. See here for more detail on EW weeks, and the template file for csv files converting between dates and EW weeks.

"inc"

All forecasts should be for the incident (weekly) count of cases, hospitalisations or deaths predicted by the model during the week that is # weeks after forecast_date.

Predictions for this target will be evaluated compared to the number of new reported cases, as recorded by JHU.

target_end_date

Values in the target_end_date column must be a date in the format YYYY-MM-DD.

This is the date for the forecast target and will be the Saturday at the end of the week time period. We provide a template csv to convert between an Epidemiological Week and its end date.
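
For teams that prefer to compute the date rather than look it up, here is a small sketch. It rests on an assumption this page does not state explicitly: that the "1 wk ahead" target week is the epidemiological week containing the Monday forecast date, so it ends on the following Saturday. Check the result against the template csv before relying on it. The function name `target_end_date_for` is hypothetical.

```python
from datetime import date, timedelta


def target_end_date_for(forecast_date, weeks_ahead):
    """Saturday ending the target week, under the assumption described above."""
    if forecast_date.weekday() != 0:  # Monday is 0
        raise ValueError(f"{forecast_date} is not a Monday")
    if weeks_ahead < 1:
        raise ValueError("weeks_ahead must be at least 1")
    # Assumption: horizon 1 ends 5 days after the Monday; each further week adds 7 days.
    return forecast_date + timedelta(days=5 + 7 * (weeks_ahead - 1))


for horizon in range(1, 5):
    print(horizon, target_end_date_for(date(2022, 2, 21), horizon).isoformat())
```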

location

Values in the location column must be one of the ISO 3166-1 alpha-2 (ISO-2) geocodes. We provide a geocode file to convert between country names and ISO-2 code (column "iso2c"), or if using R, you can use the countrycode package.

type

Values in the type column are one of

  • “point”
  • “quantile”

This value indicates whether that row corresponds to a point forecast or a quantile forecast. Point forecasts are used in visualisation, while quantile forecasts are used in visualisation and in ensemble construction, as long as all the quantiles given above are present. Both are considered in the evaluation, but with a focus on models that do provide quantiles.

Forecasts must include exactly 1 “point” forecast for each unique combination of location and target (usually 1 to 4 week ahead incident cases or deaths).

quantile

For quantile forecasts, this value indicates the quantile for the value in this row, in the format "0.###". Teams should provide the following 23 quantiles:

c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)

i.e.

0.010 0.025 0.050 0.100 0.150 0.200 0.250 0.300 0.350 0.400 0.450 0.500 0.550 0.600 0.650 0.700 0.750 0.800 0.850 0.900 0.950 0.975 0.990

Together with the single point forecast, this means that there should be 24 rows for every location-target pair.
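
For teams not working in R, the same 23 quantile levels can be generated in Python. This is a minimal sketch using only the standard library; the name `REQUIRED_QUANTILES` is illustrative.

```python
# Equivalent of c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99).
# round(..., 2) removes floating point noise such as 0.15000000000000002.
REQUIRED_QUANTILES = (
    [0.01, 0.025]
    + [round(0.05 * i, 2) for i in range(1, 20)]
    + [0.975, 0.99]
)

print(len(REQUIRED_QUANTILES))  # 23
# Formatted as "0.###", the format used in the quantile column.
print(" ".join(f"{q:.3f}" for q in REQUIRED_QUANTILES))
```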

If type is “point”, the quantile column value should be set to “NA”.
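
The row rules for type and quantile can be sketched as a check over rows read with `csv.DictReader`. The function name `check_rows` is hypothetical and the hub's automatic check may differ; this version only flags location-target pairs without exactly one point row, point rows whose quantile is not "NA", and missing or unexpected quantile levels.

```python
from collections import defaultdict

# The 23 required quantile levels, rounded to avoid floating point noise.
REQUIRED = {0.01, 0.025, 0.975, 0.99} | {round(0.05 * i, 2) for i in range(1, 20)}


def check_rows(rows):
    """rows: dicts with 'location', 'target', 'type' and 'quantile' keys (all strings)."""
    points = defaultdict(int)
    quantiles = defaultdict(set)
    problems = []
    for row in rows:
        key = (row["location"], row["target"])
        if row["type"] == "point":
            points[key] += 1
            if row["quantile"] != "NA":
                problems.append(f"{key}: point row should have quantile NA")
        elif row["type"] == "quantile":
            quantiles[key].add(round(float(row["quantile"]), 3))
        else:
            problems.append(f"{key}: unknown type {row['type']!r}")
    for key in sorted(set(points) | set(quantiles)):
        if points[key] != 1:
            problems.append(f"{key}: expected 1 point row, found {points[key]}")
        missing = sorted(REQUIRED - quantiles[key])
        extra = sorted(quantiles[key] - REQUIRED)
        if missing:
            problems.append(f"{key}: missing quantiles {missing}")
        if extra:
            problems.append(f"{key}: unexpected quantiles {extra}")
    return problems
```

An empty list means every location-target pair has its point row and all 23 quantile rows.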

value

Values should be non-negative, integer counts.

  • For a “point” prediction, value is simply the value of that point prediction for the target and location associated with that row.
  • For a “quantile” prediction, value is the inverse of the cumulative distribution function (CDF) for the target and location, evaluated at the quantile level given in that row.
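
If a model produces predictive samples rather than a closed-form distribution, one way to obtain quantile values is the inverse of the empirical CDF. The sketch below is only one possible convention (the smallest sample whose empirical CDF reaches the quantile level), not a method prescribed by the hub; the function name `empirical_quantile` and the sample values are made up for illustration.

```python
import math


def empirical_quantile(samples, q):
    """Inverse empirical CDF: smallest sample x with (share of samples <= x) >= q."""
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    ordered = sorted(samples)
    index = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[index]


# Made-up integer predictive draws for one location and target.
samples = [12, 15, 9, 20, 14, 11, 18, 13, 16, 10]
print(empirical_quantile(samples, 0.5))   # 13
print(empirical_quantile(samples, 0.99))  # 20
```

Because the samples are integer counts, the returned values are already non-negative integers, as the value column requires.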