Add MultiIndex._data and MultiIndex.array #27138
Labels
- Closing Candidate: May be closeable, needs more eyeballs
- Enhancement
- ExtensionArray: Extending pandas with custom dtypes or arrays.
- MultiIndex
- Needs Discussion: Requires discussion from core team before further action

I propose adding a `MultiIndex._data` attribute of type `List[Categorical]`, where all the underlying data of a MultiIndex would be stored. A `MultiIndex.array` property that accesses `_data` would also be added.

This has the advantage of collecting the data underlying a MultiIndex into one human-readable data structure, and it also makes zero-copy access to the data very easy. For example, `mi.array[1]` would return the data of the second level as a `Categorical`, in an easy-to-read form.

With the above changes, a `MultiIndex` could be explained as just "a container over a list of Categoricals", which is easier to explain than the current model. The `MultiIndex` could also be related to `CategoricalIndex`, which is "a container over a single Categorical".

This change means that `MultiIndex.levels` will become a property that returns a `FrozenList(cat.categories for cat in self._data)`, and `MultiIndex.codes` will be a property that returns `FrozenList(cat.codes for cat in self._data)`. `MultiIndex.array` will be added and will simply be a property that returns a FrozenList of `self._data`.

Performance will not be affected, as most operations would still go through `MultiIndex.codes` and `MultiIndex.levels`.
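
To make the "container over a list of Categoricals" idea concrete, here is a minimal pure-Python sketch. It does not use pandas: `ToyCategorical` and `ToyMultiIndex` are hypothetical stand-ins invented for illustration, and a plain `tuple` stands in for `FrozenList`. It is only meant to show how `levels`, `codes` and `array` could all be derived from a single `_data` list, not how pandas would implement it.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ToyCategorical:
    """Hypothetical stand-in for a Categorical: unique categories plus integer codes."""

    categories: Tuple[str, ...]
    codes: Tuple[int, ...]

    @classmethod
    def from_values(cls, values: Sequence[str]) -> "ToyCategorical":
        # Sorted unique values become the categories; each value is
        # then encoded as its position among the categories.
        categories = tuple(sorted(set(values)))
        position = {cat: i for i, cat in enumerate(categories)}
        return cls(categories, tuple(position[v] for v in values))


class ToyMultiIndex:
    """Hypothetical container over a list of categoricals (assumption, not pandas API)."""

    def __init__(self, data: List[ToyCategorical]) -> None:
        self._data = list(data)

    @property
    def array(self) -> Tuple[ToyCategorical, ...]:
        # A tuple stands in for FrozenList; the elements are the same
        # objects held in _data, so no level data is copied.
        return tuple(self._data)

    @property
    def levels(self) -> Tuple[Tuple[str, ...], ...]:
        return tuple(cat.categories for cat in self._data)

    @property
    def codes(self) -> Tuple[Tuple[int, ...], ...]:
        return tuple(cat.codes for cat in self._data)


mi = ToyMultiIndex(
    [
        ToyCategorical.from_values(["a", "a", "b"]),
        ToyCategorical.from_values(["x", "y", "x"]),
    ]
)
second_level = mi.array[1]  # data of the second level, as one object
```

In this sketch `levels` and `codes` are computed on access from `_data`, which mirrors the proposal that they become properties rather than stored attributes.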

Moving names from MultiIndex.levels to MultiIndex._names

Currently the levels' names are stored in each level's `name` attribute. This is not very compatible with extracting the categories from `_data`. (The `.categories` is actually part of the dtype, which ideally should be immutable, so we shouldn't set or change its name attribute.)

To make my suggestion practically possible, the level names should be stored in `MultiIndex._names` instead, and `MultiIndex.names` will become a property that reads from/writes to `MultiIndex._names`. I think this change simplifies the MultiIndex a bit, as data and names are dealt with separately. This is a small backward-incompatible change, though.

So, I suggest making two PRs:
- Add `_data` and `array`, and change `levels` and `codes` into properties.