Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop concat from attempting to sort mismatched columns by default #20613

Merged
merged 38 commits into from
May 1, 2018
Merged
Show file tree
Hide file tree
Changes from 37 commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
913723b
Stop concat from attempting to sort mismatched columns by default
brycepg Apr 5, 2018
5da763f
Merge remote-tracking branch 'upstream/master' into brycepg-master
TomAugspurger Apr 26, 2018
02b2db9
Updates
TomAugspurger Apr 26, 2018
a497763
Test fallout
TomAugspurger Apr 26, 2018
954a1b6
Updated append
TomAugspurger Apr 26, 2018
2a20377
versionadded
TomAugspurger Apr 26, 2018
35570c4
Squash more test warnings
TomAugspurger Apr 27, 2018
983d0c1
py2 compat
TomAugspurger Apr 27, 2018
27d2d32
Merge remote-tracking branch 'upstream/master' into brycepg-master
TomAugspurger Apr 27, 2018
4960e3f
Document outer is not affected
TomAugspurger Apr 27, 2018
8bbbdd5
Docs
TomAugspurger Apr 27, 2018
dcfa6d0
Sort for intersection
TomAugspurger Apr 28, 2018
2eaeb1e
More tests
TomAugspurger Apr 28, 2018
bc7dd48
Test fixup.
TomAugspurger Apr 29, 2018
b3f95dd
Stop concat from attempting to sort mismatched columns by default
brycepg Apr 5, 2018
f37d7ef
Updates
TomAugspurger Apr 26, 2018
e467f91
Test fallout
TomAugspurger Apr 26, 2018
058fae5
Updated append
TomAugspurger Apr 26, 2018
04e5151
versionadded
TomAugspurger Apr 26, 2018
c864679
Squash more test warnings
TomAugspurger Apr 27, 2018
7e975c9
py2 compat
TomAugspurger Apr 27, 2018
a8ba430
Document outer is not affected
TomAugspurger Apr 27, 2018
62b1e7b
Docs
TomAugspurger Apr 27, 2018
0ace673
Sort for intersection
TomAugspurger Apr 28, 2018
d5cafdf
More tests
TomAugspurger Apr 28, 2018
ce8ff05
Test fixup.
TomAugspurger Apr 29, 2018
ce756d4
ugh
TomAugspurger Apr 29, 2018
362e84d
quoting
TomAugspurger Apr 30, 2018
0210d33
Clarify
TomAugspurger Apr 30, 2018
06772b4
Removed unnescesary check
TomAugspurger Apr 30, 2018
95cdf67
Merge remote-tracking branch 'upstream/master' into brycepg-master
TomAugspurger Apr 30, 2018
d10f5bd
Merge remote-tracking branch 'brycepg/master' into brycepg-master
TomAugspurger Apr 30, 2018
e47cbb9
Prune tests
TomAugspurger Apr 30, 2018
0182c98
Default sort
TomAugspurger Apr 30, 2018
7e58998
Make both tests happy
TomAugspurger Apr 30, 2018
5b58e75
Explicit columns
TomAugspurger Apr 30, 2018
074d03c
List of series
TomAugspurger Apr 30, 2018
5e1b024
test, fix pivot
TomAugspurger May 1, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 16 additions & 8 deletions doc/source/merging.rst
Original file line number Diff line number Diff line change
Expand Up @@ -153,10 +153,10 @@ Set logic on the other axes
~~~~~~~~~~~~~~~~~~~~~~~~~~~

When gluing together multiple DataFrames, you have a choice of how to handle
the other axes (other than the one being concatenated). This can be done in
the other axes (other than the one being concatenated). This can be done in
the following three ways:

- Take the (sorted) union of them all, ``join='outer'``. This is the default
- Take the union of them all, ``join='outer'``. This is the default
option as it results in zero information loss.
- Take the intersection, ``join='inner'``.
- Use a specific index, as passed to the ``join_axes`` argument.
Expand All @@ -167,10 +167,10 @@ behavior:
.. ipython:: python

df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
'D': ['D2', 'D3', 'D6', 'D7'],
'F': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 6, 7])
result = pd.concat([df1, df4], axis=1)
'D': ['D2', 'D3', 'D6', 'D7'],
'F': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 6, 7])
result = pd.concat([df1, df4], axis=1, sort=False)


.. ipython:: python
Expand All @@ -181,8 +181,16 @@ behavior:
labels=['df1', 'df4'], vertical=False);
plt.close('all');

Note that the row indexes have been unioned and sorted. Here is the same thing
with ``join='inner'``:
.. warning::

.. versionchanged:: 0.23.0

The default behavior with ``join='outer'`` is to sort the other axis
(columns in this case). In a future version of pandas, the default will
be to not sort. We specified ``sort=False`` to opt in to the new
behavior now.

Here is the same thing with ``join='inner'``:

.. ipython:: python

Expand Down
30 changes: 30 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -639,6 +639,36 @@ Returning a ``Series`` allows one to control the exact return structure and colu

df.apply(lambda x: Series([1, 2, 3], index=['D', 'E', 'F']), axis=1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a ref here

.. _whatsnew_0230.api_breaking.concat:

Concatenation will no longer sort
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In a future version of pandas :func:`pandas.concat` will no longer sort the non-concatenation axis when it is not already aligned.
The current behavior is the same as the previous (sorting), but now a warning is issued when ``sort`` is not specified and the non-concatenation axis is not aligned (:issue:`4588`).

.. ipython:: python
:okwarning:

df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}, columns=['b', 'a'])
df2 = pd.DataFrame({"a": [4, 5]})

pd.concat([df1, df2])

To keep the previous behavior (sorting) and silence the warning, pass ``sort=True``

.. ipython:: python

pd.concat([df1, df2], sort=True)

To accept the future behavior (no sorting), pass ``sort=False``

.. ipython

pd.concat([df1, df2], sort=False)

Note that this change also applies to :meth:`DataFrame.append`, which has also received a ``sort`` keyword for controlling this behavior.


.. _whatsnew_0230.api_breaking.build_changes:

Expand Down
11 changes: 6 additions & 5 deletions pandas/_libs/lib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -157,7 +157,7 @@ def fast_unique_multiple(list arrays):

@cython.wraparound(False)
@cython.boundscheck(False)
def fast_unique_multiple_list(list lists):
def fast_unique_multiple_list(list lists, bint sort=True):
cdef:
list buf
Py_ssize_t k = len(lists)
Expand All @@ -174,10 +174,11 @@ def fast_unique_multiple_list(list lists):
if val not in table:
table[val] = stub
uniques.append(val)
try:
uniques.sort()
except Exception:
pass
if sort:
try:
uniques.sort()
except Exception:
pass

return uniques

Expand Down
2 changes: 1 addition & 1 deletion pandas/core/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -507,7 +507,7 @@ def is_any_frame():
for r in compat.itervalues(result))

if isinstance(result, list):
return concat(result, keys=keys, axis=1), True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thre are a bunch more concats in this same section. basically they control the resulting ordeing of the aggregation. I guess sort=True is fine here (for the other cases).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe these should be sort=False, not sure what is actually affected (as these might be aligned ops already)

return concat(result, keys=keys, axis=1, sort=True), True

elif is_any_frame():
# we have a dict of DataFrames
Expand Down
16 changes: 13 additions & 3 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -6038,7 +6038,8 @@ def infer(x):
# ----------------------------------------------------------------------
# Merging / joining methods

def append(self, other, ignore_index=False, verify_integrity=False):
def append(self, other, ignore_index=False,
verify_integrity=False, sort=None):
"""
Append rows of `other` to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
Expand All @@ -6051,6 +6052,14 @@ def append(self, other, ignore_index=False, verify_integrity=False):
If True, do not use the index labels.
verify_integrity : boolean, default False
If True, raise ValueError on creating index with duplicates.
sort : boolean, default None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put before verify_integrity

Sort columns if the columns of `self` and `other` are not aligned.
The default sorting is deprecated and will change to not-sorting
in a future version of pandas. Explicitly pass ``sort=True`` to
silence the warning and sort. Explicitly pass ``sort=False`` to
silence the warning and not sort.

.. versionadded:: 0.23.0

Returns
-------
Expand Down Expand Up @@ -6162,7 +6171,8 @@ def append(self, other, ignore_index=False, verify_integrity=False):
else:
to_concat = [self, other]
return concat(to_concat, ignore_index=ignore_index,
verify_integrity=verify_integrity)
verify_integrity=verify_integrity,
sort=sort)

def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
sort=False):
Expand Down Expand Up @@ -7481,7 +7491,7 @@ def _list_of_series_to_arrays(data, columns, coerce_float=False, dtype=None):
from pandas.core.index import _get_objs_combined_axis

if columns is None:
columns = _get_objs_combined_axis(data)
columns = _get_objs_combined_axis(data, sort=False)

indexer_cache = {}

Expand Down
3 changes: 2 additions & 1 deletion pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -1098,7 +1098,8 @@ def reset_identity(values):
group_names = self.grouper.names

result = concat(values, axis=self.axis, keys=group_keys,
levels=group_levels, names=group_names)
levels=group_levels, names=group_names,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, there are like 10 calls to concat. I think should be explicit about sort

sort=False)
else:

# GH5610, returns a MI, with the first level being a
Expand Down
53 changes: 40 additions & 13 deletions pandas/core/indexes/api.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
import textwrap
import warnings

from pandas.core.indexes.base import (Index,
_new_Index,
_ensure_index,
Expand All @@ -17,6 +20,16 @@
from pandas._libs import lib
from pandas._libs.tslib import NaT

_sort_msg = textwrap.dedent("""\
Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False
""")


# TODO: there are many places that rely on these private methods existing in
# pandas.core.index
__all__ = ['Index', 'MultiIndex', 'NumericIndex', 'Float64Index', 'Int64Index',
Expand All @@ -31,33 +44,40 @@
'_all_indexes_same']


def _get_objs_combined_axis(objs, intersect=False, axis=0):
def _get_objs_combined_axis(objs, intersect=False, axis=0, sort=True):
# Extract combined index: return intersection or union (depending on the
# value of "intersect") of indexes on given axis, or None if all objects
# lack indexes (e.g. they are numpy arrays)
obs_idxes = [obj._get_axis(axis) for obj in objs
if hasattr(obj, '_get_axis')]
if obs_idxes:
return _get_combined_index(obs_idxes, intersect=intersect)
return _get_combined_index(obs_idxes, intersect=intersect, sort=sort)


def _get_combined_index(indexes, intersect=False):
def _get_combined_index(indexes, intersect=False, sort=False):
# TODO: handle index names!
indexes = com._get_distinct_objs(indexes)
if len(indexes) == 0:
return Index([])
if len(indexes) == 1:
return indexes[0]
if intersect:
index = Index([])
elif len(indexes) == 1:
index = indexes[0]
elif intersect:
index = indexes[0]
for other in indexes[1:]:
index = index.intersection(other)
return index
union = _union_indexes(indexes)
return _ensure_index(union)
else:
index = _union_indexes(indexes, sort=sort)
index = _ensure_index(index)

if sort:
try:
index = index.sort_values()
except TypeError:
pass
return index


def _union_indexes(indexes):
def _union_indexes(indexes, sort=True):
if len(indexes) == 0:
raise AssertionError('Must have at least 1 Index to union')
if len(indexes) == 1:
Expand All @@ -74,7 +94,8 @@ def conv(i):
i = i.tolist()
return i

return Index(lib.fast_unique_multiple_list([conv(i) for i in inds]))
return Index(
lib.fast_unique_multiple_list([conv(i) for i in inds], sort=sort))

if kind == 'special':
result = indexes[0]
Expand All @@ -89,13 +110,19 @@ def conv(i):
index = indexes[0]
for other in indexes[1:]:
if not index.equals(other):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this already passed in as sort=True?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean this is deeply buried here, why is this showing a user facing warning? IOW what is the user call that triggers this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW what is the user call that triggers this?

pd.concat or DataFrame.append with unaligned objects and an unspecified sort.

Where else would it go? Before this point we don't know what the indexes to align are, and whether or not they're aligned.

if sort is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move _unique_indices from a nested function to module level (e.g. same as _union_indices (and maybe conform the spelling indices / indexes), and simply add a sort= kwarg (which you are already passing into fast_unique_multiple_lists). Then you can do the warning there. just makes this whole function a bit simpler.

# TODO: remove once pd.concat sort default changes
warnings.warn(_sort_msg, FutureWarning, stacklevel=8)
sort = True

return _unique_indices(indexes)

name = _get_consensus_names(indexes)[0]
if name != index.name:
index = index._shallow_copy(name=name)
return index
else:
else: # kind='list'
return _unique_indices(indexes)


Expand Down
5 changes: 4 additions & 1 deletion pandas/core/panel.py
Original file line number Diff line number Diff line change
Expand Up @@ -1499,8 +1499,11 @@ def _extract_axis(self, data, axis=0, intersect=False):
raw_lengths.append(v.shape[axis])

if have_frames:
# we want the "old" behavior here, of sorting only
# 1. we're doing a union (intersect=False)
# 2. the indices are not aligned.
index = _get_objs_combined_axis(data.values(), axis=axis,
intersect=intersect)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this always warns? shouldn't this be sort=False?

intersect=intersect, sort=None)

if have_raw_arrays:
lengths = list(set(raw_lengths))
Expand Down
24 changes: 20 additions & 4 deletions pandas/core/reshape/concat.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@

def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually move before verify_integrity

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls do this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this an API breaking change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it more logical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How so?

I'm OK with breaking API when necessary, but this seems unnecessary.

sort=None, copy=True):
"""
Concatenate pandas objects along a particular axis with optional set logic
along the other axes.
Expand Down Expand Up @@ -60,6 +60,19 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
verify_integrity : boolean, default False
Check whether the new concatenated axis contains duplicates. This can
be very expensive relative to the actual data concatenation
sort : boolean, default None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Sort non-concatenation axis if it is not already aligned when `join`
is 'outer'. The current default of sorting is deprecated and will
change to not-sorting in a future version of pandas.

Explicitly pass ``sort=True`` to silence the warning and sort.
Explicitly pass ``sort=False`` to silence the warning and not sort.

This has no effect when ``join='inner'``, which already preserves
the order of the non-concatenation axis.

.. versionadded:: 0.23.0

copy : boolean, default True
If False, do not copy data unnecessarily

Expand Down Expand Up @@ -209,7 +222,7 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
ignore_index=ignore_index, join=join,
keys=keys, levels=levels, names=names,
verify_integrity=verify_integrity,
copy=copy)
copy=copy, sort=sort)
return op.get_result()


Expand All @@ -220,7 +233,8 @@ class _Concatenator(object):

def __init__(self, objs, axis=0, join='outer', join_axes=None,
keys=None, levels=None, names=None,
ignore_index=False, verify_integrity=False, copy=True):
ignore_index=False, verify_integrity=False, copy=True,
sort=False):
if isinstance(objs, (NDFrame, compat.string_types)):
raise TypeError('first argument must be an iterable of pandas '
'objects, you passed an object of type '
Expand Down Expand Up @@ -355,6 +369,7 @@ def __init__(self, objs, axis=0, join='outer', join_axes=None,
self.keys = keys
self.names = names or getattr(keys, 'names', None)
self.levels = levels
self.sort = sort

self.ignore_index = ignore_index
self.verify_integrity = verify_integrity
Expand Down Expand Up @@ -447,7 +462,8 @@ def _get_comb_axis(self, i):
data_axis = self.objs[0]._get_block_manager_axis(i)
try:
return _get_objs_combined_axis(self.objs, axis=data_axis,
intersect=self.intersect)
intersect=self.intersect,
sort=self.sort)
except IndexError:
types = [type(x).__name__ for x in self.objs]
raise TypeError("Cannot concatenate list of {types}"
Expand Down
6 changes: 3 additions & 3 deletions pandas/tests/frame/test_combine_concat.py
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,7 @@ def test_append_series_dict(self):

result = df.append(series[::-1][:3], ignore_index=True)
expected = df.append(DataFrame({0: series[::-1][:3]}).T,
ignore_index=True)
ignore_index=True, sort=True)
assert_frame_equal(result, expected.loc[:, result.columns])

# can append when name set
Expand All @@ -119,8 +119,8 @@ def test_append_list_of_series_dicts(self):
# different columns
dicts = [{'foo': 1, 'bar': 2, 'baz': 3, 'peekaboo': 4},
{'foo': 5, 'bar': 6, 'baz': 7, 'peekaboo': 8}]
result = df.append(dicts, ignore_index=True)
expected = df.append(DataFrame(dicts), ignore_index=True)
result = df.append(dicts, ignore_index=True, sort=True)
expected = df.append(DataFrame(dicts), ignore_index=True, sort=True)
assert_frame_equal(result, expected)

def test_append_empty_dataframe(self):
Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/frame/test_constructors.py
Original file line number Diff line number Diff line change
Expand Up @@ -1071,6 +1071,17 @@ def test_constructor_list_of_series(self):
expected = DataFrame.from_dict(sdict, orient='index')
tm.assert_frame_equal(result, expected)

def test_constructor_list_of_series_aligned_index(self):
series = [pd.Series(i, index=['b', 'a', 'c'], name=str(i))
for i in range(3)]
result = pd.DataFrame(series)
expected = pd.DataFrame({'b': [0, 1, 2],
'a': [0, 1, 2],
'c': [0, 1, 2]},
columns=['b', 'a', 'c'],
index=['0', '1', '2'])
tm.assert_frame_equal(result, expected)

def test_constructor_list_of_derived_dicts(self):
class CustomDict(dict):
pass
Expand Down
Loading