Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH union_categoricals supports ignore_order GH13410 #15219

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions doc/source/categorical.rst
Original file line number Diff line number Diff line change
Expand Up @@ -693,6 +693,17 @@ The below raises ``TypeError`` because the categories are ordered and not identi
Out[3]:
TypeError: to union ordered Categoricals, all categories must be the same

.. versionadded:: 0.20.0

Ordered categoricals with different categories or orderings can be combined by
using the ``ignore_ordered=True`` argument.

.. ipython:: python

a = pd.Categorical(["a", "b", "c"], ordered=True)
b = pd.Categorical(["c", "b", "a"], ordered=True)
union_categoricals([a, b], ignore_order=True)

``union_categoricals`` also works with a ``CategoricalIndex``, or ``Series`` containing
categorical data, but note that the resulting array will always be a plain ``Categorical``

Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -157,6 +157,7 @@ Other enhancements
- HTML table output skips ``colspan`` or ``rowspan`` attribute if equal to 1. (:issue:`15403`)

.. _ISO 8601 duration: https://en.wikipedia.org/wiki/ISO_8601#Durations
- ``ignore_ordered`` argument added to ``pd.types.concat.union_categoricals``; setting the argument to true will ignore the ordered attribute of unioned categoricals (:issue:`13410`) . See the :ref:`categorical union docs <categorical.union>` for more information.

.. _whatsnew_0200.api_breaking:

Expand Down
54 changes: 54 additions & 0 deletions pandas/tests/tools/test_concat.py
Original file line number Diff line number Diff line change
Expand Up @@ -1662,6 +1662,60 @@ def test_union_categoricals_ordered(self):
with tm.assertRaisesRegexp(TypeError, msg):
union_categoricals([c1, c2])

def test_union_categoricals_ignore_order(self):
# GH 15219
c1 = Categorical([1, 2, 3], ordered=True)
c2 = Categorical([1, 2, 3], ordered=False)

res = union_categoricals([c1, c2], ignore_order=True)
exp = Categorical([1, 2, 3, 1, 2, 3])
tm.assert_categorical_equal(res, exp)

msg = 'Categorical.ordered must be the same'
with tm.assertRaisesRegexp(TypeError, msg):
union_categoricals([c1, c2], ignore_order=False)

res = union_categoricals([c1, c1], ignore_order=True)
exp = Categorical([1, 2, 3, 1, 2, 3])
tm.assert_categorical_equal(res, exp)

res = union_categoricals([c1, c1], ignore_order=False)
exp = Categorical([1, 2, 3, 1, 2, 3],
categories=[1, 2, 3], ordered=True)
tm.assert_categorical_equal(res, exp)

c1 = Categorical([1, 2, 3, np.nan], ordered=True)
c2 = Categorical([3, 2], categories=[1, 2, 3], ordered=True)

res = union_categoricals([c1, c2], ignore_order=True)
exp = Categorical([1, 2, 3, np.nan, 3, 2])
tm.assert_categorical_equal(res, exp)

c1 = Categorical([1, 2, 3], ordered=True)
c2 = Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True)

res = union_categoricals([c1, c2], ignore_order=True)
exp = Categorical([1, 2, 3, 1, 2, 3])
tm.assert_categorical_equal(res, exp)

res = union_categoricals([c2, c1], ignore_order=True,
sort_categories=True)
exp = Categorical([1, 2, 3, 1, 2, 3], categories=[1, 2, 3])
tm.assert_categorical_equal(res, exp)

c1 = Categorical([1, 2, 3], ordered=True)
c2 = Categorical([4, 5, 6], ordered=True)
result = union_categoricals([c1, c2], ignore_order=True)
expected = Categorical([1, 2, 3, 4, 5, 6])
tm.assert_categorical_equal(result, expected)

msg = "to union ordered Categoricals, all categories must be the same"
with tm.assertRaisesRegexp(TypeError, msg):
union_categoricals([c1, c2], ignore_order=False)

with tm.assertRaisesRegexp(TypeError, msg):
union_categoricals([c1, c2])

def test_union_categoricals_sort(self):
# GH 13846
c1 = Categorical(['x', 'y', 'z'])
Expand Down
16 changes: 12 additions & 4 deletions pandas/types/concat.py
Original file line number Diff line number Diff line change
Expand Up @@ -208,7 +208,7 @@ def _concat_asobject(to_concat):
return _concat_asobject(to_concat)


def union_categoricals(to_union, sort_categories=False):
def union_categoricals(to_union, sort_categories=False, ignore_order=False):
"""
Combine list-like of Categorical-like, unioning categories. All
categories must have the same dtype.
Expand All @@ -222,6 +222,11 @@ def union_categoricals(to_union, sort_categories=False):
sort_categories : boolean, default False
If true, resulting categories will be lexsorted, otherwise
they will be ordered as they appear in the data.
ignore_order: boolean, default False
If true, the ordered attribute of the Categoricals will be ignored.
Results in an unordered categorical.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded 0.20.0 tag

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the versionadded tag


.. versionadded:: 0.20.0

Returns
-------
Expand All @@ -235,7 +240,7 @@ def union_categoricals(to_union, sort_categories=False):
- all inputs are ordered and their categories are not identical
- sort_categories=True and Categoricals are ordered
ValueError
Emmpty list of categoricals passed
Empty list of categoricals passed
"""
from pandas import Index, Categorical, CategoricalIndex, Series

Expand Down Expand Up @@ -264,15 +269,15 @@ def _maybe_unwrap(x):
ordered = first.ordered
new_codes = np.concatenate([c.codes for c in to_union])

if sort_categories and ordered:
if sort_categories and not ignore_order and ordered:
raise TypeError("Cannot use sort_categories=True with "
"ordered Categoricals")

if sort_categories and not categories.is_monotonic_increasing:
categories = categories.sort_values()
indexer = categories.get_indexer(first.categories)
new_codes = take_1d(indexer, new_codes, fill_value=-1)
elif all(not c.ordered for c in to_union):
elif ignore_order or all(not c.ordered for c in to_union):
# different categories - union and recode
cats = first.categories.append([c.categories for c in to_union[1:]])
categories = Index(cats.unique())
Expand All @@ -297,6 +302,9 @@ def _maybe_unwrap(x):
else:
raise TypeError('Categorical.ordered must be the same')

if ignore_order:
ordered = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ordered is already False? (line 263) Is this still needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if statement on line 264 can be entered if the ordered categoricals have the same categories and order.

is_dtype_equal checks categories and ordering


return Categorical(new_codes, categories=categories, ordered=ordered,
fastpath=True)

Expand Down