
Potential problem of the CSI score #1

Open

sxjscience opened this issue Jul 27, 2021 · 3 comments

@sxjscience

Dear authors of the SEVIR benchmark, we have recently been running the SEVIR benchmark and noticed a potential problem with the current implementation of the CSI score:

return (hits+1e-6)/(hits+misses+fas+1e-6)

When the threshold is large, it is possible that the model's hits, misses, and fas are all zero. The current formulation produces csi=1.0 in such a case (while it should give csi=0.0). To avoid this problem, a better formula might be

return hits / (hits+misses+fas+1e-6)
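To make the failure mode concrete, here is a minimal NumPy sketch (toy data; the `contingency` helper is hypothetical, not the benchmark's code) where a threshold above the data range empties the contingency table:

```python
import numpy as np

def contingency(pred, target, threshold):
    # Count hits, misses, and false alarms at the given threshold.
    p, t = pred >= threshold, target >= threshold
    return np.sum(p & t), np.sum(~p & t), np.sum(p & ~t)

rng = np.random.default_rng(0)
pred = rng.uniform(0, 100, size=(4, 64, 64))    # toy forecast batch
target = rng.uniform(0, 100, size=(4, 64, 64))  # toy observation batch

# No pixel exceeds 200, so hits = misses = fas = 0.
hits, misses, fas = contingency(pred, target, threshold=200)

print((hits + 1e-6) / (hits + misses + fas + 1e-6))  # current formula  -> 1.0
print(hits / (hits + misses + fas + 1e-6))           # proposed formula -> 0.0
```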

CC @gaozhihan also

@gaozhihan

The same problem also appears in the implementations of POD, SUCR, and BIAS. We found that the CSI score is the most vulnerable one, and it can be quite misleading, especially when batch_size is small.

@markveillette
Collaborator

Thanks for the feedback -- I remember struggling with this choice. In general these metrics are unreliable on small batches, so I recommend that they only be used for evaluation over your entire test set. If running in batches is necessary, one can also tally hits, misses, and false alarms per batch and combine them at the end.
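For example, a rough sketch of that tallying, reusing the hypothetical `contingency` helper from the sketch above and assuming some iterable `test_batches` of (forecast, observation) pairs:

```python
# Accumulate contingency counts across all batches, then score once at the end,
# instead of averaging per-batch scores.
total_hits = total_misses = total_fas = 0
for pred, target in test_batches:  # hypothetical iterable of (forecast, observation) batches
    h, m, f = contingency(pred, target, threshold=74)
    total_hits += h
    total_misses += m
    total_fas += f

csi = total_hits / (total_hits + total_misses + total_fas + 1e-6)
```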

That being said, I agree there is probably a better way to handle cases where 0/0 is possible. Laplace smoothing (https://en.wikipedia.org/wiki/Additive_smoothing) solves this problem by assuming a Dirichlet prior over the probabilities of the three categories (hit, miss, false alarm). Using this, another possibility would be

return (hits+alpha) / (hits+misses+fas+3*alpha)

which would be more of a compromise between your suggestion and the way it is now. Here alpha is a parameter (which we can keep at 1e-6 or increase as necessary).
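To see how the variants compare, note that they differ exactly in the all-zero case; a quick sketch:

```python
hits = misses = fas = 0
alpha = 1e-6

print((hits + 1e-6) / (hits + misses + fas + 1e-6))        # current:  1.0
print(hits / (hits + misses + fas + 1e-6))                 # proposed: 0.0
print((hits + alpha) / (hits + misses + fas + 3 * alpha))  # smoothed: 1/3, for any alpha > 0
```

In other words, the smoothed form settles the 0/0 case at alpha/(3*alpha) = 1/3 regardless of alpha, sitting between the two extremes.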

@gaozhihan

gaozhihan commented Aug 24, 2021

Thanks for your reply. It's quite helpful.
BTW I want to ask another question. As you mentioned, taking the mean of per-batch skill scores is different from computing the scores from accumulated hits, misses, and fas. In your implementation, I see that you report the scores for each forecast lead time:
https://github.com/MIT-AI-Accelerator/neurips-2020-sevir/blob/master/test_nowcast.py#L142-L168
My question is: how did you get the final scores? Did you directly take the mean over the 12 lead times, or accumulate all hits, misses, and fas first and then compute the scores? Which is the better way to measure performance?
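To make the question concrete, here is a toy two-lead example (made-up counts) where the two aggregation orders disagree:

```python
# Hypothetical per-lead contingency counts: (hits, misses, fas).
leads = [(10, 0, 0), (1, 9, 10)]

# (a) score each lead, then average the scores
per_lead = [h / (h + m + f + 1e-6) for h, m, f in leads]
mean_of_scores = sum(per_lead) / len(per_lead)   # (1.0 + 0.05) / 2 = 0.525

# (b) accumulate counts over the leads, then score once
H, M, F = (sum(c) for c in zip(*leads))
score_of_sums = H / (H + M + F + 1e-6)           # 11 / 30 ≈ 0.367
```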
