Extend Chronos evaluation to all 28 datasets from the paper #281
Main changes:

- The runtime of `StatisticalEnsemble` exceeds 24 hours for some datasets, which makes sequential evaluation infeasible and motivates per-dataset (parallel) runs; see the sketch after this list.
- Refactored `src/` into a package `src/eval_utils` that can be installed via `pip` using `pyproject.toml`.
- Added Docker support (`Dockerfile`, `build_docker.sh`, `.dockerignore`).
- Updated `StatsForecast` to fix crashes caused by the AutoARIMA model on some datasets (e.g., `car_parts`).
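One way to work around the sequential-runtime issue above is to launch one evaluation process per dataset. A minimal sketch of such parallel runs is shown below; the module path `eval_utils.evaluate`, its CLI flags, and the dataset names are assumptions for illustration, not the PR's actual entry point:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# A subset of the 28 benchmark datasets; names here are placeholders.
DATASETS = ["car_parts", "covid_deaths", "m1_monthly", "tourism_quarterly"]

def run_eval(dataset: str) -> int:
    # Each dataset runs in its own process (or Docker container), so a
    # 24h+ StatisticalEnsemble fit on one dataset does not block the rest.
    cmd = ["python", "-m", "eval_utils.evaluate", "--dataset", dataset]
    return subprocess.run(cmd, check=False).returncode

with ThreadPoolExecutor(max_workers=4) as pool:
    for name, code in zip(DATASETS, pool.map(run_eval, DATASETS)):
        print(f"{name}: exit code {code}")
```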
Extended comparison of Chronos against the statistical ensemble
We present an extension of Nixtla's original comparison between Chronos [1] and the SCUM ensemble [2]. In this analysis of over 200K unique time series across 28 datasets from Benchmark II in the Chronos paper [1], we show that zero-shot Chronos models perform comparably to this strong ensemble of four statistical models while being significantly faster on average. We follow the original study as closely as possible, including loading task definitions from GluonTS and computing metrics using utilsforecast.
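To make that setup concrete, the load-and-score loop can be sketched as follows. This is a simplified illustration, not the benchmark code in this PR: the dataset name, the stand-in naive "model", and the seasonality value are placeholders.

```python
from functools import partial

import pandas as pd
from gluonts.dataset.repository import get_dataset
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mase, smape

# Load the task definition (series, frequency, horizon) from GluonTS.
dataset = get_dataset("m4_hourly")  # placeholder dataset name
horizon = dataset.metadata.prediction_length
freq = dataset.metadata.freq

# Flatten the test split into a long data frame: (unique_id, ds, y).
frames = []
for i, entry in enumerate(dataset.test):
    ds = pd.date_range(entry["start"].to_timestamp(),
                       periods=len(entry["target"]), freq=freq)
    frames.append(pd.DataFrame({"unique_id": i, "ds": ds, "y": entry["target"]}))
df = pd.concat(frames, ignore_index=True)

# The last `horizon` points of each series form the evaluation window.
valid = df.groupby("unique_id").tail(horizon).copy()
train = df.drop(valid.index)

# Stand-in "model": repeat each series' last training value over the
# horizon. A real run would insert Chronos / StatisticalEnsemble forecasts.
valid["naive"] = train.groupby("unique_id")["y"].tail(1).repeat(horizon).values

# Score with utilsforecast; MASE needs the training data and a seasonality
# (24 for hourly data, used here as a placeholder).
metrics = evaluate(valid, metrics=[partial(mase, seasonality=24), smape],
                   train_df=train)
print(metrics.groupby("metric")["naive"].mean())
```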
Empirical Evaluation
This study considers over 200K unique time series from Benchmark II in the Chronos paper, spanning various time series domains, frequencies, history lengths, and prediction horizons. These datasets were not used to train Chronos, so this is a zero-shot evaluation of Chronos against a statistical ensemble fitted on each dataset. We report results for two Chronos sizes, Large and Mini, to highlight the trade-off between forecast quality and inference speed. As in the original benchmark, we include comparisons against the seasonal naive baseline. For each model, we also report the aggregated relative score, i.e., the geometric mean of the relative improvement over seasonal naive across datasets (see Sec. 5.4 of [1] for details).
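Concretely, the aggregation step is just a geometric mean of per-dataset relative scores; a tiny sketch with made-up numbers:

```python
import numpy as np

# Per-dataset scores (e.g., MASE) for a model and for seasonal naive.
# These numbers are made up purely for illustration.
model_scores = np.array([0.80, 1.20, 0.50])
snaive_scores = np.array([1.00, 1.50, 0.40])

# Relative score per dataset, aggregated via the geometric mean across
# datasets (Sec. 5.4 of [1]); values < 1 mean better than seasonal naive.
relative = model_scores / snaive_scores
agg_relative_score = np.exp(np.log(relative).mean())
print(f"aggregated relative score: {agg_relative_score:.3f}")
```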
Results
The CRPS, MASE, sMAPE, and inference time (in seconds) for each model across the 28 datasets are tabulated below. The best and second-best results are highlighted in bold and underlined, respectively. Note that the use of sMAPE is discouraged by forecasting experts; we report it here only for completeness and parity with the previous benchmark.
Notes
- We used `batch_size=8` for all Chronos models. However, on the `g5.2xlarge` instance used in the benchmark, a batch size of 16 for Chronos (large) and 64 for Chronos (mini) can be used safely; see the sketch after this list.
- The runtime of `StatisticalEnsemble` is on average ~45 seconds higher than in the original benchmark. This does not affect the overall conclusions or the runtime ranking of `StatisticalEnsemble` and the Chronos models.
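In this setting, the batch size simply controls how many series are passed to the pipeline per call. A minimal sketch using the public `chronos` API (the context tensors and prediction length are placeholders):

```python
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-mini",  # Chronos (mini)
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

# One 1D context tensor per series; random placeholders for illustration.
series = [torch.randn(200) for _ in range(256)]
batch_size = 64  # per the note above, safe for Chronos (mini) on g5.2xlarge

forecasts = []
for i in range(0, len(series), batch_size):
    # predict() accepts a list of contexts and returns samples of shape
    # [batch, num_samples, prediction_length].
    forecasts.append(pipeline.predict(series[i : i + batch_size],
                                      prediction_length=24))
samples = torch.cat(forecasts)  # [num_series, num_samples, 24]
```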
References
[1] Chronos: Learning the Language of Time Series
[2] A Simple Combination of Univariate Models