Using weighted mean estimator for bootstrapped confidence intervals in seaborn plots #3563

Closed
iainAtIon opened this issue Nov 17, 2023 · 13 comments · Fixed by #3586

@iainAtIon

Hi, I would like to use a weighted mean estimator for calculating confidence intervals on various seaborn plots. In the past I have done this via a 'hack' suggested here which uses complex numbers to encode the data and its weights before passing to a seaborn plotting function.

Unfortunately, as of upgrading to seaborn v0.13.0, this approach no longer works as it seems like the complex numbers are cast to reals at some point in the plotting process (and hence lose part of the data). This had previously worked up until v0.12.2.

I appreciate this was always a bit of a hack, but would either of the following be possible:
a) Add native support for weighted mean estimators to the seaborn plotting functions or,
b) Restore this hacky behaviour for now in a future release

I have tried alternatives such as storing the data and its weights in tuples or dataclasses, however neither of these approaches work as the data types are not numeric.

Language and package versions:

  • Python v3.11.5
  • numpy v1.26.2
  • matplotlib v3.8.1
  • pandas v2.1.3

Example code:

import pandas as pd
import seaborn as sns
import numpy as np

randomGenerator = np.random.default_rng(123)

values = randomGenerator.normal(10, 5, size=(100,))
weights = randomGenerator.uniform(size=(100,))

dataFrame = pd.DataFrame({'values': values, 'weights': weights})
dataFrame['valuesWithWeights'] = dataFrame['values'] + 1j * dataFrame['weights']

def WeightedMean(valuesWithWeights, **kwargs):
    values, weights = np.real(valuesWithWeights), np.imag(valuesWithWeights)
    weightedSum = np.sum((weights * values)) / np.sum(weights)
    return weightedSum

sns.barplot(data=dataFrame, y='valuesWithWeights', estimator=WeightedMean)

Output using seaborn v0.12.2

image

Output using seaborn v0.13.0

c:\Temp\seaborn_test\seaborn-venv\Lib\site-packages\matplotlib\cbook.py:1699: ComplexWarning: Casting complex values to real discards the imaginary part
  return math.isfinite(val)
c:\Temp\seaborn_test\seaborn-venv\Lib\site-packages\pandas\core\dtypes\astype.py:134: ComplexWarning: Casting complex values to real discards the imaginary part
  return arr.astype(dtype, copy=True)
C:\Users\idunn\AppData\Local\Temp\ipykernel_40880\4206068624.py:3: RuntimeWarning: invalid value encountered in scalar divide
  weightedSum = np.sum((weights * values)) / np.sum(weights)
c:\Temp\seaborn_test\seaborn-venv\Lib\site-packages\numpy\lib\nanfunctions.py:1384: RuntimeWarning: All-NaN slice encountered
  return _nanquantile_unchecked(

image
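
A note on the warnings above: they are consistent with the complex column being cast to float somewhere in the v0.13.0 pipeline. The sketch below uses only numpy (it is not seaborn's actual code path, and the helper name `weighted_mean_from_complex` is made up for illustration). It reproduces the symptom: once the imaginary part is discarded, every recovered weight is zero and the estimator divides zero by zero.

```python
import warnings

import numpy as np

rng = np.random.default_rng(123)
values = rng.normal(10, 5, size=100)
weights = rng.uniform(size=100)

# The "hack": real part carries the value, imaginary part carries the weight.
encoded = values + 1j * weights


def weighted_mean_from_complex(encoded):
    """Decode values and weights from a (supposedly) complex vector."""
    vals, wts = np.real(encoded), np.imag(encoded)
    return np.sum(wts * vals) / np.sum(wts)


# With the complex dtype intact, the estimator works as intended.
intact = weighted_mean_from_complex(encoded)

# Casting to float keeps only the real part (numpy emits a ComplexWarning).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    as_float = encoded.astype(float)

# All weights are now zero, so the result is 0.0 / 0.0, i.e. NaN.
with np.errstate(invalid="ignore"):
    broken = weighted_mean_from_complex(as_float)
```

Here `intact` matches `np.average(values, weights=weights)`, while `broken` is NaN, which would explain the "All-NaN slice" warning and the empty plot.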

@mwaskom
Owner

mwaskom commented Nov 19, 2023

Hm, I do remember that complex dtype trick — very clever but still ultimately a hack. I don't know exactly what broke it in v0.13 (the relevant code was more or less completely rewritten) and it's pretty unlikely that it's going to come back as a supported use case.

That said, seaborn has support for weights in a few other places (i.e. the distribution plots) and it probably makes some sense to have them in the categorical plots too. In fact, with the v0.13 rewrite, it'll be a lot easier to add. But there are still a few challenges:

  • In the interest of API consistency, we'd want to add it in as many functions as make sense. E.g. if it's gonna be in barplot, it definitely needs to be in pointplot too, and then probably lineplot. And then since there's support in kdeplot already we'd want to support them in violinplot. What about boxplot and boxenplot? I'm not totally sure it makes sense in that context but maybe it does. Since the boxplot stat computation delegates to matplotlib, that might be an issue.
  • Because barplot and pointplot accept an arbitrary estimator, the API is trickier than in the distribution plots. If you assign a weights variable, do we assume that you've also passed an estimator function with a weights parameter? And if you don't change the estimator, should the presence of weights imply np.average rather than mean? If not, should it silently ignore weights or raise?
  • In the objects interface, there's a stronger rule that all mapping parameters are singular, but there is precedent for weights as a plural in a few functions. That's unfortunate, but just noting that we'd have to go one way or the other and neither is ideal.
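
To make the second bullet more concrete, here is one hypothetical policy sketched in plain numpy. The function name `aggregate` and its behaviour are assumptions for illustration only, not seaborn's implementation. It takes the option where the default estimator plus weights implies `np.average`, a custom estimator is called with weights only if it declares a `weights` parameter, and anything else raises rather than silently ignoring the weights.

```python
import inspect

import numpy as np


def aggregate(values, weights=None, estimator="mean"):
    """Hypothetical dispatch policy for an estimator with optional weights."""
    values = np.asarray(values, dtype=float)
    if isinstance(estimator, str) and estimator != "mean":
        raise ValueError("this sketch only understands the string 'mean'")

    if weights is None:
        # No weights: behave exactly as an unweighted estimator would.
        if estimator == "mean":
            return float(np.mean(values))
        return float(estimator(values))

    weights = np.asarray(weights, dtype=float)
    if estimator == "mean":
        # Default estimator plus weights implies a weighted average.
        return float(np.average(values, weights=weights))

    # A custom callable must explicitly declare a `weights` parameter.
    if "weights" in inspect.signature(estimator).parameters:
        return float(estimator(values, weights=weights))

    # Raise instead of silently dropping the weights.
    raise TypeError("weights were given, but the estimator has no `weights` parameter")
```

For example, `aggregate([1.0, 3.0], weights=[3.0, 1.0])` gives 1.5, whereas passing the same weights with `estimator=lambda v: v.max()` raises a `TypeError`.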

FWIW while there isn't currently a weighted-average stat in the objects interface I'd be on board with adding one, and doing so would skirt most of these questions.

@mwaskom
Owner

mwaskom commented Nov 19, 2023

I guess also the other part of what makes this challenging in the function interface is that the bootstrap logic would need to work a little bit differently to bootstrap the observations and weights together.
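
As a rough illustration of bootstrapping the observations and weights together, the sketch below resamples row indices once per iteration and applies them to both arrays, so each value stays paired with its weight. It is a minimal percentile bootstrap written for this thread; the name `weighted_mean_ci` and its parameters are assumptions, not seaborn's internal bootstrap code.

```python
import numpy as np


def weighted_mean_ci(values, weights, n_boot=1000, level=95, seed=None):
    """Percentile-bootstrap CI for a weighted mean (illustrative sketch)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(values)

    boots = np.empty(n_boot)
    for i in range(n_boot):
        # One set of resampled indices, applied to values AND weights,
        # so that every observation keeps its own weight.
        idx = rng.integers(0, n, size=n)
        boots[i] = np.average(values[idx], weights=weights[idx])

    tail = (100 - level) / 2
    low, high = np.percentile(boots, [tail, 100 - tail])
    return float(low), float(high)
```

Resampling the two arrays independently would break the pairing and give an interval for a different quantity, which is presumably why the existing single-vector bootstrap cannot be reused unchanged.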

@iainAtIon
Author

Hey Michael, thanks for the reply. Yeah, I figured that restoring this behaviour would be a long shot; I only asked in case there was a simple one- or two-line fix that would do the job. But I guess if this area of the code base has been rewritten then that will be a non-starter.

Agreed that this would be useful on other functions besides barplot (I actually use it with lineplot most often; I just used barplot for the example as I figured it would be clearer). I can see the use case for all the plots you mentioned, but appreciate there might be implementation details which make some more difficult than others. In terms of behaviour with/without weights, perhaps it would make sense to make all estimators weighted versions by default and assume that if no weights are passed by the user then this equates to using equal weights for each observation. Estimators passed would then need a signature something like estimator(data, **kwargs), where one of **kwargs could be weights=. I don't know the full list of estimators which are already supported by seaborn and how feasible it would be to convert them to weighted versions, although the np.mean/average example you gave above seems like it would be relatively straightforward.
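
For concreteness, an estimator following the signature proposed above might look like the sketch below. The name `weighted_mean` and the `weights` keyword are illustrative assumptions; nothing here is part of seaborn.

```python
import numpy as np


def weighted_mean(data, **kwargs):
    """Weighted mean that falls back to equal weights when none are given."""
    data = np.asarray(data, dtype=float)
    weights = kwargs.get("weights")
    if weights is None:
        # Equal weights reproduce the ordinary (unweighted) mean.
        weights = np.ones_like(data)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * data) / np.sum(weights))
```

Called as `weighted_mean([1, 2, 3])` it returns 2.0, the ordinary mean, while `weighted_mean([1, 2, 3], weights=[0, 0, 1])` returns 3.0.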

When you say a weighted stat, do you mean something like statsmodels DescrStatsW?

Anyway, I see this is not a straightforward feature so wouldn't expect something quick. However, I'd be more than happy to continue discussing potential implementations if/when this is added.

@mwaskom
Owner

mwaskom commented Nov 20, 2023

> I don't know the full list of estimators which are already supported by seaborn and how feasible it would be to convert them to weighted versions

Well that's the thing: there's no "list of supported estimators". From a seaborn perspective, the estimator just needs to be a function that takes a vector and returns a scalar (or the name of a method on a pandas series that operates that way). There are no other operational constraints. Which is why adding hard-to-explain nuances like "if you pass weights, then estimator must be a callable with a weights parameter" adds an API complexity cost.

@iainAtIon
Author

iainAtIon commented Nov 21, 2023

OK, I understand the complexity this adds to the current API. And it is possible that weights would not be the only type of additional parameter one might want to pass to an estimator.

Then perhaps an alternative is to allow the user to pass an arbitrary data type to the plotters (but only as a single vector), and it is their responsibility to ensure that the estimator that is passed is compatible with the data type they've supplied. Similar to the original complex number system above, but perhaps slightly more formal. Perhaps this is what you meant in your original reply?

> FWIW while there isn't currently a weighted-average stat in the objects interface I'd be on board with adding one, and doing so would skirt most of these questions.

This would allow something like the following. It currently throws the error shown below, but perhaps it's simple to relax the restriction that the data type must be numeric?

import pandas as pd
import seaborn as sns
import numpy as np
from dataclasses import dataclass

@dataclass(frozen=True)
class ValueWithWeight:
    value : float
    weight : float

def WeightedMean(valuesWithWeights, **kwargs):
    values = np.array([x.value for x in valuesWithWeights])
    weights = np.array([x.weight for x in valuesWithWeights])
    return (values * weights).sum() / weights.sum()

randomGenerator = np.random.default_rng(123)

values = randomGenerator.normal(10, 5, size=(100,))
weights = randomGenerator.uniform(size=(100,))

dataFrame = pd.DataFrame({'values': values, 'weights': weights})
dataFrame['valuesWithWeights'] = [ValueWithWeight(value, weight) for value, weight in zip(dataFrame['values'], dataFrame['weights'])]

sns.barplot(data=dataFrame, y='valuesWithWeights', estimator=WeightedMean)
TypeError                                 Traceback (most recent call last)
File lib.pyx:2368, in pandas._libs.lib.maybe_convert_numeric()

TypeError: Invalid object type

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Untitled-1.ipynb Cell 5 line 2
     22 dataFrame['valuesWithWeights'] = [ValueWithWeight(value, weight) for value, weight in zip(dataFrame['values'], dataFrame['weights'])]
     24 WeightedMean(dataFrame['valuesWithWeights'])
---> 26 sns.barplot(data=dataFrame, y='valuesWithWeights', estimator=WeightedMean)

File c:\Temp\seaborn_test\seaborn-venv\Lib\site-packages\seaborn\categorical.py:2364, in barplot(data, x, y, hue, order, hue_order, estimator, errorbar, n_boot, units, seed, orient, color, palette, saturation, fill, hue_norm, width, dodge, gap, log_scale, native_scale, formatter, legend, capsize, err_kws, ci, errcolor, errwidth, ax, **kwargs)
   2361 # Deprecations to remove in v0.15.0.
   2362 err_kws, capsize = p._err_kws_backcompat(err_kws, errcolor, errwidth, capsize)
-> 2364 p.plot_bars(
   2365     aggregator=aggregator,
   2366     dodge=dodge,
   2367     width=width,
   2368     gap=gap,
   2369     color=color,
   2370     fill=fill,
   2371     capsize=capsize,
   2372     err_kws=err_kws,
   2373     plot_kws=kwargs,
   2374 )
   2376 p._add_axis_labels(ax)
   2377 p._adjust_cat_axis(ax, axis=p.orient)

File c:\Temp\seaborn_test\seaborn-venv\Lib\site-packages\seaborn\categorical.py:1264, in _CategoricalPlotter.plot_bars(self, aggregator, dodge, gap, width, fill, color, capsize, err_kws, plot_kws)
   1260     plot_kws.setdefault("linewidth", 1.5 * mpl.rcParams["lines.linewidth"])
   1262 err_kws.setdefault("linewidth", 1.5 * mpl.rcParams["lines.linewidth"])
-> 1264 for sub_vars, sub_data in self.iter_data(iter_vars,
   1265                                          from_comp_data=True,
   1266                                          allow_empty=True):
   1268     ax = self._get_axes(sub_vars)
   1270     agg_data = sub_data if sub_data.empty else (
   1271         sub_data
   1272         .groupby(self.orient)
   1273         .apply(aggregator, agg_var)
   1274         .reset_index()
   1275     )

File c:\Temp\seaborn_test\seaborn-venv\Lib\site-packages\seaborn\_base.py:902, in VectorPlotter.iter_data(self, grouping_vars, reverse, from_comp_data, by_facet, allow_empty, dropna)
    899 grouping_vars = [var for var in grouping_vars if var in self.variables]
    901 if from_comp_data:
--> 902     data = self.comp_data
    903 else:
    904     data = self.plot_data

File c:\Temp\seaborn_test\seaborn-venv\Lib\site-packages\seaborn\_base.py:999, in VectorPlotter.comp_data(self)
    994 if var in self.var_levels:
    995     # TODO this should happen in some centralized location
    996     # it is similar to GH2419, but more complicated because
    997     # supporting `order` in categorical plots is tricky
    998     orig = orig[orig.isin(self.var_levels[var])]
--> 999 comp = pd.to_numeric(converter.convert_units(orig)).astype(float)
   1000 transform = converter.get_transform().transform
   1001 parts.append(pd.Series(transform(comp), orig.index, name=orig.name))

File c:\Temp\seaborn_test\seaborn-venv\Lib\site-packages\pandas\core\tools\numeric.py:222, in to_numeric(arg, errors, downcast, dtype_backend)
    220 coerce_numeric = errors not in ("ignore", "raise")
    221 try:
--> 222     values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload]  # noqa: E501
    223         values,
    224         set(),
    225         coerce_numeric=coerce_numeric,
    226         convert_to_masked_nullable=dtype_backend is not lib.no_default
    227         or isinstance(values_dtype, StringDtype),
    228     )
    229 except (ValueError, TypeError):
    230     if errors == "raise":

File lib.pyx:2410, in pandas._libs.lib.maybe_convert_numeric()

TypeError: Invalid object type at position 0

@mwaskom
Owner

mwaskom commented Dec 4, 2023

See #3580 for what this looks like in the objects interface.

@iainAtIon
Author

Hi Mark thanks for working on this! Much appreciated. Is the plan for this to be included in the next seaborn release?

Also, as a more general question about seaborn - are you gradually moving towards developing new features for the objects interface only? Or will support for the function interface continue?

@mwaskom
Owner

mwaskom commented Dec 5, 2023

> Is the plan for this to be included in the next seaborn release?

Seaborn doesn't have an explicit roadmap or release schedule, but once the PR is merged it would be in master and would be part of the next release (0.13.1).

> Are you gradually moving towards developing new features for the objects interface only? Or will support for the function interface continue?

I'd say these are two different things. "Support" for the function interface in the form of bug fixes and core functionality will continue. And even in terms of major feature development, you'll note that the 0.13.0 release was focused on the function interface, with lots of new features. But one motivation for the development of the objects interface is that the function interface's design can impose API-usability cost to new features, and also it can be annoying to need to add some functionality in lots of places, as I've sort of elaborated above.

So there will be some kinds of features where we could add them to the function interface, but it would be a lot easier and cleaner to just do it in the objects interface. I'm not totally sure which class this issue falls into though; I think that basic support for a weighted mean + CI would be pretty straightforward (since the computational code is already shared by the two interfaces) and then it's just a matter of coming to terms with the additional complexity where weights would work only with specific choices of estimator and errorbar. Since it would work with the default values, I think that's probably ok.

@iainAtIon
Author

Thanks Michael, understood

@mwaskom
Owner

mwaskom commented Dec 5, 2023

Would weighted mean+ci be sufficient for you? Do you have a use case for other estimators / errorbars?

@iainAtIon
Author

Personally, I just use the weighted mean at the moment. So that would be sufficient. In terms of plotting, I tend to use a mixture of sns.lineplot and sns.barplot for visualising these.

@mwaskom
Owner

mwaskom commented Dec 6, 2023

Please take a look at #3586 and see if that would work for you.

@iainAtIon
Author

That looks great! I've pulled the branch and tested locally with a couple of examples and it works as I would expect. Thanks a lot for implementing this - really appreciate it!
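
For readers arriving later, the original example might be rewritten roughly as below. This is a sketch under assumptions the thread does not spell out: that seaborn 0.13.1 or later is installed, and that the function interface exposes the feature as a `weights` parameter, as the released documentation describes. The wrapper name `plot_weighted_bar` is invented here, and seaborn and pandas are imported lazily so the numpy part runs on its own.

```python
import numpy as np

rng = np.random.default_rng(123)
frame = {
    "values": rng.normal(10, 5, size=100),
    "weights": rng.uniform(size=100),
}

# The bar height one would expect for a weighted mean of the whole column.
expected = np.average(frame["values"], weights=frame["weights"])


def plot_weighted_bar(frame):
    """Draw a weighted-mean bar with a bootstrapped CI (assumes seaborn >= 0.13.1)."""
    # Lazy imports: only needed when the plot is actually drawn.
    import pandas as pd
    import seaborn as sns

    df = pd.DataFrame(frame)
    # Assumption: `weights` names the column holding the observation weights;
    # the estimator and errorbar are left at their defaults.
    return sns.barplot(data=df, y="values", weights="weights")
```

Calling `plot_weighted_bar(frame)` should then draw a single bar at roughly `expected`, with no need for the complex-number encoding. As noted earlier in the thread, weights are only expected to work with specific choices of estimator and errorbar, so a custom `estimator` is left out here.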
