fix: Chronos inference in foundation ts arena #382

Open
wants to merge 1 commit into
base: main

@abdulfatir abdulfatir commented Jun 3, 2024

Thank you for evaluating Chronos again. It's great to see it performing accurately on this benchmark as well.

We found some problems with the way inference is being done for Chronos:

This PR fixes these issues. The following table shows a comparison of Chronos (Large)'s performance before (taken from the original table in this repo) and after these fixes, and also reports the performance of other variants of Chronos. These experiments were performed on a g5.4xlarge instance, as in the original study.

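| Model | MASE (Monthly) | MASE (Weekly) | MASE (Daily) | MASE (Hourly) | Inference time, min (Monthly) | Inference time, min (Weekly) | Inference time, min (Daily) | Inference time, min (Hourly) |
|---|---|---|---|---|---|---|---|---|
| Chronos-Large (Before) | 0.960 | 0.709 | 0.652 | 0.735 | 38.581 | 5.081 | 7.908 | 11.662 |
| Chronos-Large | 0.950 | 0.704 | 0.652 | 0.654 | 5.402 | 5.054 | 7.882 | 11.500 |
| Chronos-Base | 0.966 | 0.709 | 0.663 | 0.646 | 1.966 | 1.712 | 2.940 | 4.714 |
| Chronos-Small | 0.982 | 0.724 | 0.669 | 0.671 | 0.689 | 0.550 | 0.986 | 1.818 |
| Chronos-Mini | 0.968 | 0.736 | 0.682 | 0.729 | 0.476 | 0.356 | 0.688 | 1.371 |
| Chronos-Tiny | 0.976 | 0.765 | 0.686 | 0.799 | 0.316 | 0.212 | 0.427 | 0.965 |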

We observe:

  • improvements in the MASE for Monthly (~1%) and Hourly (~11%) datasets.
  • a significant improvement (~38 min to ~5 min) in the inference time for the Monthly subset, which has many very short time series.
  • smaller Chronos models provide a quality-speed trade-off, with the Base model performing almost as well as Large while being much faster, and even the Mini model performing better than most baselines in the original study.
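
As a quick sanity check, the percentages quoted above can be reproduced from the two Chronos-Large rows of the table. The snippet below is only an illustrative sketch: the helper names (`relative_improvement`, `speedup`) are hypothetical and not part of this repo, and the numbers are copied by hand from the table.

```python
# Chronos-Large results copied from the table above.
# MASE is unitless (lower is better); inference time is in minutes.
FREQS = ["Monthly", "Weekly", "Daily", "Hourly"]
mase_before = [0.960, 0.709, 0.652, 0.735]
mase_after = [0.950, 0.704, 0.652, 0.654]
time_before = [38.581, 5.081, 7.908, 11.662]
time_after = [5.402, 5.054, 7.882, 11.500]


def relative_improvement(before: float, after: float) -> float:
    """Percentage reduction of a lower-is-better metric."""
    return 100.0 * (before - after) / before


def speedup(before: float, after: float) -> float:
    """How many times faster the 'after' run is."""
    return before / after


# Map each frequency to (MASE improvement in %, inference speed-up factor).
summary = {
    freq: (relative_improvement(mb, ma), speedup(tb, ta))
    for freq, mb, ma, tb, ta in zip(
        FREQS, mase_before, mase_after, time_before, time_after
    )
}

for freq, (mase_gain, factor) in summary.items():
    print(f"{freq:8s} MASE improvement: {mase_gain:5.2f}%  speed-up: {factor:4.2f}x")
```

With these inputs, the Monthly and Hourly MASE improvements come out at roughly 1% and 11%, and the Monthly inference time drops by a factor of about 7, consistent with the observations above.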

Here's how the average MASE ranking plots look before and after the fix:
[image: average MASE ranking plots before and after the fix]

After the fix, Chronos-Large achieves the best overall rank (center plot). Chronos-Base obtains the same overall ranking as TimesFM and TimeGPT (right plot).

For the fidelity of the study, we recommend that the authors update their results and discussions accordingly, ideally after an independent verification with the latest code change (see usage below). Thank you again for your effort!

Usage

  • Download the data and set up the environment as described here.
  • Run python eval-chronos.py to re-evaluate (only) Chronos.

CLAassistant commented Jun 3, 2024

All committers have signed the CLA.

@abdulfatir
Author

@AzulGarza @cchallu @mergenthaler Did you get a chance to take a look at this? I hope the main results in the repo can be updated soon so people do not get an inaccurate impression.

@AzulGarza
Member

hey @abdulfatir! thank you. could you please sign the CLA?

@abdulfatir
Author

@AzulGarza thanks for your reply. Signed.

@abdulfatir
Author

@AzulGarza @mergenthaler @cchallu Any update on this?
