How should multi-model statistics work for a single cube? #1211

Closed
zklaus opened this issue Jul 5, 2021 · 17 comments · Fixed by #1849

Comments

@zklaus

zklaus commented Jul 5, 2021

The new multi-model statistics code refuses to calculate statistics on a single model.
That is one approach to things, but for most statistics it's perfectly possible to compute them on a single cube as well, albeit perhaps not super useful.

To me, it seems the two alternatives here are

  • Refuse computation on a single cube
  • Compute according to the following table that lists all supported statistics
| Statistic | Result   |
|-----------|----------|
| mean      | input    |
| median    | input    |
| max       | input    |
| min       | input    |
| std       | 0 (zero) |
| pXX.YY    | ???      |
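
The proposed results can be checked with Python's standard library alone. This is a minimal sketch on plain floats, not ESMValCore code and not iris cubes; the name `single_model` and its value are made up for illustration. One assumption is worth stating: the 0 for std holds for the population standard deviation (dividing by N), while the sample standard deviation (dividing by N - 1) is undefined for a single value.

```python
import statistics

# One value standing in for a single grid point of a single cube
# (hypothetical data, not an iris cube).
single_model = [287.4]

results = {
    "mean": statistics.mean(single_model),
    "median": statistics.median(single_model),
    "max": max(single_model),
    "min": min(single_model),
    # Population standard deviation (divides by N): zero for one value.
    "std": statistics.pstdev(single_model),
}

for name, value in results.items():
    print(f"{name}: {value}")

# The sample standard deviation (divides by N - 1) needs two values.
try:
    statistics.stdev(single_model)
except statistics.StatisticsError as error:
    print(f"sample std undefined: {error}")
```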

What do you prefer, @ESMValGroup/esmvaltool-developmentteam ?

zklaus added this to the v2.3.1 milestone Jul 5, 2021
@stefsmeets
Contributor

stefsmeets commented Jul 5, 2021

The result for pXX.YY should also be the input.
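
One way to see this, assuming the linear-interpolation definition of a percentile (NumPy's default; the thread does not say which definition ESMValCore uses): with a single value there is nothing to interpolate between, so every percentile collapses to that value. The helper `percentile` below is written for illustration and is not a library function.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    # Fractional position of the percentile within the sorted sample.
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    # Interpolate between the two neighbouring values.
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


single = [287.4]
for q in (5, 50, 99.5):
    # With one value, rank is always 0, so the input comes back unchanged.
    print(q, percentile(single, q))
```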

@bouweandela
Member

I would prefer a refusal to compute, because trying to compute a multimodel statistic on a single model probably means that there is a mistake in the recipe.

@zklaus
Author

zklaus commented Jul 6, 2021

I also think refusing to compute makes sense. @remi-kazeroni, @katjaweigel I think you guys stumbled over this recently. What do you think?

@zklaus
Author

zklaus commented Jul 7, 2021

There seem to be no opposing views, so we'll keep the current behavior.

@schlunma
Contributor

I'd like to re-open this discussion. I think multi-model statistics should work for single cubes. My main arguments are:

  • This behavior is super counterintuitive since basically every other serious Python library allows calculating statistics over a single element, e.g., np.mean([1.0]) = 1.0, np.std([1.0]) = 0.0, da.arange(1.0).mean().compute() = 0.0, or pd.DataFrame({'a': [1]}).mean() = a 1.0.
  • With the introduction of ensemble_statistics and the fact that many models only provide one ensemble member you can quickly run into the "Cannot perform multicube statistics for a single cube." issue when using this preprocessor. This can be resolved by using multiple preprocessors, but this unreasonably complicates the recipe and is really frustrating.

schlunma reopened this Nov 18, 2022
@schlunma
Contributor

@ESMValGroup/esmvaltool-coreteam any opinions on that?

@axel-lauer
Contributor

I agree with that. From a user perspective, it would be great if the multi-model statistics could handle single elements.

@schlunma
Contributor

I just realized that I already opened another issue for that back in February since I didn't search for closed issues 😄

#1469

I will close that one and continue the discussion here.

@bouweandela
Member

bouweandela commented Nov 22, 2022

> With the introduction of ensemble_statistics and the fact that many models only provide one ensemble member you can quickly run into the "Cannot perform multicube statistics for a single cube." issue when using this preprocessor.

That sounds like a good reason to allow it for ensemble_statistics.

@schlunma
Contributor

Yes, but since ensemble_statistics basically only calls multi_model_statistics internally, it would be much easier to allow it for multi_model_statistics and get ensemble_statistics for free.

    def ensemble_statistics(products, statistics,
                            output_products, span='overlap'):
        """Entry point for ensemble statistics.

        An ensemble grouping is performed on the input products.
        The statistics are then computed calling
        the :func:`esmvalcore.preprocessor.multi_model_statistics` module,
        taking the grouped products as an input.

        Parameters
        ----------
        products: list
            Cubes (or products) over which the statistics will be computed.
        statistics: list
            Statistical metrics to be computed, e.g. [``mean``, ``max``]. Choose
            from the operators listed in the iris.analysis package. Percentiles can
            be specified like ``pXX.YY``.
        output_products: dict
            For internal use only. A dict with statistics names as keys and
            preprocessorfiles as values. If products are passed as input, the
            statistics cubes will be assigned to these output products.
        span: str (default: 'overlap')
            Overlap or full; if overlap, statistics are computed on common time-
            span; if full, statistics are computed on full time spans, ignoring
            missing data.

        Returns
        -------
        set
            A set of output_products with the resulting ensemble statistics.

        See Also
        --------
        :func:`esmvalcore.preprocessor.multi_model_statistics` for
        the full description of the core statistics function.
        """
        # Group by everything except the ensemble member, then delegate.
        ensemble_grouping = ('project', 'dataset', 'exp', 'sub_experiment')
        return multi_model_statistics(
            products=products,
            span=span,
            statistics=statistics,
            output_products=output_products,
            groupby=ensemble_grouping,
            keep_input_datasets=False
        )
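
To illustrate how single-member groups arise from this grouping, here is a rough sketch that groups by the same four keys. It uses plain dictionaries with made-up metadata and a hypothetical `group_by_keys` helper; it is not how ESMValCore implements `groupby` internally.

```python
from collections import defaultdict

ENSEMBLE_GROUPING = ('project', 'dataset', 'exp', 'sub_experiment')

# Made-up metadata; real products carry many more attributes.
products = [
    {'project': 'CMIP6', 'dataset': 'MODEL-A', 'exp': 'historical',
     'sub_experiment': None, 'ensemble': 'r1i1p1f1'},
    {'project': 'CMIP6', 'dataset': 'MODEL-A', 'exp': 'historical',
     'sub_experiment': None, 'ensemble': 'r2i1p1f1'},
    {'project': 'CMIP6', 'dataset': 'MODEL-B', 'exp': 'historical',
     'sub_experiment': None, 'ensemble': 'r1i1p1f1'},
]


def group_by_keys(items, keys):
    """Group items by the tuple of values found under the given keys."""
    groups = defaultdict(list)
    for item in items:
        groups[tuple(item[key] for key in keys)].append(item)
    return dict(groups)


groups = group_by_keys(products, ENSEMBLE_GROUPING)
for key, members in groups.items():
    # key[1] is the dataset name; MODEL-B ends up in a group of one.
    print(key[1], len(members))
```

In this toy setup MODEL-B contributes a single ensemble member, so its group contains one item, which is exactly the case where a statistics function that rejects single inputs would fail.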

@Peter9192
Contributor

Would it make sense to log a warning when multimodel stats is called on a single cube? Then we make an effort to alert users and have them think twice, without completely blocking their flow.
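
A sketch of what such a warning could look like, using the standard `logging` module. The function `mean_with_warning` and its message are invented for illustration and operate on plain numbers rather than cubes; nothing here reflects the actual ESMValCore implementation.

```python
import logging
import statistics

logger = logging.getLogger(__name__)


def mean_with_warning(values):
    """Return the mean, warning when only one input is given."""
    if len(values) == 1:
        # Alert the user, but do not block the computation.
        logger.warning(
            "Computing statistics over a single input; "
            "the result is trivially equal to the input."
        )
    return statistics.mean(values)


logging.basicConfig(level=logging.WARNING)
print(mean_with_warning([287.4]))          # logs a warning first
print(mean_with_warning([1.0, 2.0, 3.0]))  # no warning
```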

@valeriupredoi
Contributor

valeriupredoi commented Dec 7, 2022

Ah, but how did I manage to miss this exciting discussion 😁 Thanks for re-opening the discussion, Manu! I'd go a bit more extremo than what @Peter9192 suggests and ask for a check: if ensemble_statistics is called, then fine, single-cube statistics can be performed; if not, don't do it. It's statistically irrelevant, and single-model results may be wrongly accepted as multi-model results 👍

@dhohn
Contributor

dhohn commented Dec 8, 2022

Is all that babysitting necessary? A set with one member is well defined, as are the summary statistics for it. IMHO there should at most be a warning that a trivial operation is conducted, if that at all, and be done with it.

@schlunma
Contributor

schlunma commented Dec 8, 2022

I kinda disagree here. From a mathematical point of view, most of these operations are well-defined for single cubes. For example, the arithmetic mean

$$ \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n $$

makes perfect sense for $N=1$. Why should we not allow it? As I mentioned above, basically all other Python packages allow that.

@schlunma
Contributor

schlunma commented Dec 8, 2022

> Is all that babysitting necessary? A set with one member is well defined, as are the summary statistics for it. IMHO there should at most be a warning that a trivial operation is conducted, if that at all, and be done with it.

Yes, fully agree here!

@valeriupredoi
Contributor

@schlunma @dhohn et al - my apologies, I forgot to mention here that I had an inner monologue and I too agree we should be fine with N = 1 without any bells and whistles/red flags etc - am gonna go review Manu's PR so we can have it in 2.8 release 👍

@schlunma
Contributor

Perfect, thanks V!! 🍻
