Extend Chronos evaluation to all 28 datasets from the paper #281

Closed
wants to merge 1 commit

Conversation

@shchur commented Apr 3, 2024

  • Extend the benchmark suite to 28 datasets
  • Provide an option to run experiments in parallel on AWS Batch using Metaflow. This change is necessary because the runtime of the StatisticalEnsemble exceeds 24 hours for some datasets, which makes sequential evaluation infeasible.
    • Instead of using relative imports, wrap the code in src/ into a package src/eval_utils that can be installed via pip using pyproject.toml.
    • Add files necessary to build a Docker container used by Metaflow (Dockerfile, build_docker.sh, .dockerignore).
  • Add a fallback model for StatsForecast to fix crashes caused by the AutoARIMA model on some datasets (e.g., car_parts)
  • Cap the context length of statistical models to the last 5000 observations of each series to avoid extremely long runtimes (both changes are sketched after this list)
  • Update the README with full results and instructions on how to run the code with Metaflow
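
For illustration, here is a minimal sketch of what the fallback model and the context-length cap could look like with the StatsForecast API. This is not the actual code in this PR; the synthetic series, frequency, season length, and horizon are placeholders.

```python
import numpy as np
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import (
    AutoARIMA,
    AutoCES,
    AutoETS,
    DynamicOptimizedTheta,
    SeasonalNaive,
)

season_length = 7   # placeholder; chosen per dataset frequency in the benchmark
horizon = 14        # placeholder prediction horizon
max_context = 5000  # keep only the last 5000 observations of each series

# Tiny synthetic long-format dataset standing in for a real benchmark dataset.
rng = np.random.default_rng(0)
n = 400
train_df = pd.DataFrame({
    "unique_id": "series_0",
    "ds": pd.date_range("2020-01-01", periods=n, freq="D"),
    "y": rng.normal(size=n).cumsum(),
})

# Cap the context length; for the long series in the benchmark this trims
# each series to its last 5000 observations to avoid extreme runtimes.
train_df = train_df.groupby("unique_id").tail(max_context)

sf = StatsForecast(
    models=[
        AutoARIMA(season_length=season_length),
        AutoETS(season_length=season_length),
        AutoCES(season_length=season_length),
        DynamicOptimizedTheta(season_length=season_length),
    ],
    freq="D",  # placeholder frequency
    # If a model raises an error on some series (e.g., AutoARIMA on car_parts),
    # its forecast is replaced by this fallback instead of crashing the run.
    fallback_model=SeasonalNaive(season_length=season_length),
    n_jobs=-1,
)
forecasts = sf.forecast(df=train_df, h=horizon)
```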

Extended comparison of Chronos against the statistical ensemble

We present an extension of the original comparison by Nixtla of Chronos [1] against the SCUM ensemble [2]. In this analysis of over 200K unique time series across 28 datasets from Benchmark II in the Chronos paper [1], we show that the zero-shot Chronos models perform comparably to this strong ensemble of 4 statistical models while being significantly faster on average. We follow the original study as closely as possible, including loading task definitions from GluonTS and computing metrics using utilsforecast.
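
To make the metric computation concrete, the sketch below scores two hypothetical forecast columns with utilsforecast on a tiny synthetic dataset. In the actual benchmark the prediction length and seasonality come from the GluonTS task definitions, and the model columns hold real forecasts; everything below is a placeholder.

```python
from functools import partial

import numpy as np
import pandas as pd
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mase, smape

season_length = 7  # placeholder; taken from the task definition in practice
horizon = 14       # placeholder prediction length

# Synthetic long-format data: `train_df` holds in-sample values (needed by MASE),
# `test_df` holds the ground truth plus one column of predictions per model.
rng = np.random.default_rng(0)
ds = pd.date_range("2024-01-01", periods=100 + horizon, freq="D")
train_df = pd.DataFrame({"unique_id": "series_0", "ds": ds[:100], "y": rng.normal(size=100)})
test_df = pd.DataFrame({
    "unique_id": "series_0",
    "ds": ds[100:],
    "y": rng.normal(size=horizon),
    "Chronos-Large": rng.normal(size=horizon),        # placeholder predictions
    "StatisticalEnsemble": rng.normal(size=horizon),  # placeholder predictions
})

metrics_df = evaluate(
    test_df,
    metrics=[partial(mase, seasonality=season_length), smape],
    train_df=train_df,
    models=["Chronos-Large", "StatisticalEnsemble"],
)
print(metrics_df)
```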

Empirical Evaluation

This study considers over 200K unique time series from Benchmark II in the Chronos paper, spanning various time series domains, frequencies, history lengths, and prediction horizons. Chronos was not trained on these datasets, so this is a zero-shot evaluation of Chronos against the statistical ensemble fitted on these datasets. We report results for two sizes of Chronos, Large and Mini, to highlight the trade-off between forecast quality and inference speed. As in the original benchmark, we also include the seasonal naive baseline. For each model, we additionally report the aggregated relative score: the geometric mean, across datasets, of the model's score divided by the score of the seasonal naive baseline (see Sec. 5.4 of [1] for details).
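
As a hypothetical illustration of this aggregation, the snippet below computes the geometric mean of per-dataset scores relative to the seasonal naive baseline from a small made-up score table; values below 1.0 mean the model beats the baseline on average.

```python
import numpy as np
import pandas as pd

def aggregate_relative_score(scores: pd.DataFrame, baseline: str = "SeasonalNaive") -> pd.Series:
    """Geometric mean over datasets of each model's score divided by the baseline's score."""
    relative = scores.div(scores[baseline], axis=0)  # per-dataset relative scores
    return np.exp(np.log(relative).mean(axis=0))     # geometric mean across datasets

# Made-up MASE values for two datasets, purely for illustration.
scores = pd.DataFrame(
    {"Chronos-Large": [0.80, 1.20], "StatisticalEnsemble": [0.85, 1.10], "SeasonalNaive": [1.00, 1.50]},
    index=["dataset_a", "dataset_b"],
)
print(aggregate_relative_score(scores))
```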

Results

The CRPS, MASE, sMAPE, and inference time (in seconds) for each model across the 28 datasets are tabulated below. The best and second-best results are highlighted in bold and underlined, respectively. Note that forecasting experts discourage the use of sMAPE; we report it here only for completeness and parity with the previous benchmark.

[Table: full benchmark results (CRPS, MASE, sMAPE, inference time) across all 28 datasets]

Notes

  • The original study by Nixtla used batch_size=8 for all Chronos models. However, on the g5.2xlarge instance used in the benchmark, we can safely use a batch size of 16 for Chronos (large) and a batch size of 64 for Chronos (mini); see the inference sketch after these notes.
  • The original Nixtla benchmark re-used compiled Numba code across experiments, which is not feasible in the current setup because of the distributed compute environment. As a result, the reported runtime of the StatisticalEnsemble is on average ~45 seconds higher than in the original benchmark. This does not affect the overall conclusions or the runtime ranking of the StatisticalEnsemble and the Chronos models.
  • Due to differences in task definitions and metric implementations, the numbers in the above table are not directly comparable with the results reported in the Chronos paper.
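
For reference, here is a minimal sketch of batched zero-shot inference with the chronos package. This is not the benchmark's actual driver code; the contexts and prediction length are placeholders.

```python
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-large",
    device_map="cuda",       # the benchmark ran on a g5.2xlarge GPU instance
    torch_dtype=torch.bfloat16,
)

batch_size = 16              # 64 for amazon/chronos-t5-mini
prediction_length = 24       # placeholder; taken from the task definition in practice

# Placeholder contexts: one 1D tensor of past observations per series.
contexts = [torch.randn(512) for _ in range(100)]

samples = []
for start in range(0, len(contexts), batch_size):
    batch = contexts[start : start + batch_size]
    # predict returns samples of shape (len(batch), num_samples, prediction_length)
    samples.append(pipeline.predict(batch, prediction_length))
samples = torch.cat(samples, dim=0)
# Point forecasts, e.g., the per-step median across samples.
point_forecast = samples.float().median(dim=1).values
```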

References

[1] Chronos: Learning the Language of Time Series
[2] A Simple Combination of Univariate Models


@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@mergenthaler (Member)

@shchur, any comments on our update?

@shchur (Author) commented May 7, 2024

Hi @mergenthaler, can you please clarify which update you are referring to?

@AzulGarza (Member)

@shchur, closing this PR since it depends on shchur#1. Feel free to reopen once we have comments on the depending PR.

@AzulGarza closed this May 8, 2024