Stop concat from attempting to sort mismatched columns by default #20613

Merged: 38 commits, May 1, 2018
Changes from 1 commit

Commits
913723b
Stop concat from attempting to sort mismatched columns by default
brycepg Apr 5, 2018
5da763f
Merge remote-tracking branch 'upstream/master' into brycepg-master
TomAugspurger Apr 26, 2018
02b2db9
Updates
TomAugspurger Apr 26, 2018
a497763
Test fallout
TomAugspurger Apr 26, 2018
954a1b6
Updated append
TomAugspurger Apr 26, 2018
2a20377
versionadded
TomAugspurger Apr 26, 2018
35570c4
Squash more test warnings
TomAugspurger Apr 27, 2018
983d0c1
py2 compat
TomAugspurger Apr 27, 2018
27d2d32
Merge remote-tracking branch 'upstream/master' into brycepg-master
TomAugspurger Apr 27, 2018
4960e3f
Document outer is not affected
TomAugspurger Apr 27, 2018
8bbbdd5
Docs
TomAugspurger Apr 27, 2018
dcfa6d0
Sort for intersection
TomAugspurger Apr 28, 2018
2eaeb1e
More tests
TomAugspurger Apr 28, 2018
bc7dd48
Test fixup.
TomAugspurger Apr 29, 2018
b3f95dd
Stop concat from attempting to sort mismatched columns by default
brycepg Apr 5, 2018
f37d7ef
Updates
TomAugspurger Apr 26, 2018
e467f91
Test fallout
TomAugspurger Apr 26, 2018
058fae5
Updated append
TomAugspurger Apr 26, 2018
04e5151
versionadded
TomAugspurger Apr 26, 2018
c864679
Squash more test warnings
TomAugspurger Apr 27, 2018
7e975c9
py2 compat
TomAugspurger Apr 27, 2018
a8ba430
Document outer is not affected
TomAugspurger Apr 27, 2018
62b1e7b
Docs
TomAugspurger Apr 27, 2018
0ace673
Sort for intersection
TomAugspurger Apr 28, 2018
d5cafdf
More tests
TomAugspurger Apr 28, 2018
ce8ff05
Test fixup.
TomAugspurger Apr 29, 2018
ce756d4
ugh
TomAugspurger Apr 29, 2018
362e84d
quoting
TomAugspurger Apr 30, 2018
0210d33
Clarify
TomAugspurger Apr 30, 2018
06772b4
Removed unnescesary check
TomAugspurger Apr 30, 2018
95cdf67
Merge remote-tracking branch 'upstream/master' into brycepg-master
TomAugspurger Apr 30, 2018
d10f5bd
Merge remote-tracking branch 'brycepg/master' into brycepg-master
TomAugspurger Apr 30, 2018
e47cbb9
Prune tests
TomAugspurger Apr 30, 2018
0182c98
Default sort
TomAugspurger Apr 30, 2018
7e58998
Make both tests happy
TomAugspurger Apr 30, 2018
5b58e75
Explicit columns
TomAugspurger Apr 30, 2018
074d03c
List of series
TomAugspurger Apr 30, 2018
5e1b024
test, fix pivot
TomAugspurger May 1, 2018
doc/source/whatsnew/v0.23.0.txt (1 change: 1 addition & 0 deletions)
@@ -1160,6 +1160,7 @@ Reshaping
- Bug in :meth:`DataFrame.astype` where column metadata is lost when converting to categorical or a dictionary of dtypes (:issue:`19920`)
- Bug in :func:`cut` and :func:`qcut` where timezone information was dropped (:issue:`19872`)
- Bug in :class:`Series` constructor with a ``dtype=str``, previously raised in some cases (:issue:`19853`)
- Stop :func:`concat` and ``Dataframe.append`` from sorting columns by default. Use ``sort=True`` to retain old behavior (:issue:`4588`)

Contributor:
This should be sort=True by default to preserve backwards compatibility, right?

Or rather, I think the eventual goal is to have sort=False be the default, so for now it should be:

  • sort=None is the default
  • If the default is passed, use sort=True and warn that the default is changing in the future
  • If True or False is passed explicitly, no warning.
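
For illustration only, a minimal sketch of the warn-on-default pattern described above (the helper name and message wording are hypothetical, not the actual pandas implementation):

    import warnings

    def _maybe_resolve_sort(sort, columns_differ):
        # sort is the user-supplied value; columns_differ is True when the
        # objects being concatenated do not share identical columns, i.e.
        # the only situation in which sorting changes the result.
        if sort is None:
            if columns_differ:
                warnings.warn(
                    "Sorting because the non-concatenation axis is not "
                    "aligned. The current default (sorting) is deprecated "
                    "and will change to not sorting in a future version. "
                    "Pass sort explicitly to silence this warning.",
                    FutureWarning)
            sort = True  # keep the old sorting behaviour for now
        return sort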


Contributor:
actually this needs a sub-section. This is a rather large change (even if it's None by default), so highlighting it is best. Please show an example of the previous and new behavior.
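
For example (illustrative frames, assuming the keyword lands as proposed in this PR):

    import pandas as pd

    df1 = pd.DataFrame([[1, 2]], columns=["b", "a"])
    df2 = pd.DataFrame([[3, 4]], columns=["b", "c"])

    # Previous behaviour: the union of the columns was always sorted
    # lexically, so the result had columns ['a', 'b', 'c'].
    # Proposed behaviour with sort=False: input column order is preserved.
    res = pd.concat([df1, df2], sort=False)
    print(res.columns.tolist())  # ['b', 'a', 'c']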


Contributor Author (brycepg):
@TomAugspurger
If I do what @jorisvandenbossche suggests, then sort=True will not be backwards compatible, because it will sort the axes in question regardless of whether the columns are mismatched.
I could have sort=None be the default, give a warning, and revert to the old behavior. In future versions this behavior of only sorting the axes some of the time would not be available, because it doesn't make sense, and concat could default to sort=False.


Member:
@brycepg I think we can do both (it might complicate the code a bit, but not too much I think, as in _get_combined_index those cases are already handled separately). As @TomAugspurger suggests, the default can be None for now, so we can raise a warning in the appropriate cases:

  • when all axes are equal (current case when there is already no sorting): no warning should be raised, as the future default of sort=False will not change anything, but add the ability to also sort the index with sort=True
  • when not all axes are equal (current case that the unwanted sorting happens): issue a warning that in the future this will no longer sort
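
A short illustration of those two cases (hypothetical frames):

    import pandas as pd

    aligned = [pd.DataFrame([[1, 2]], columns=["a", "b"]),
               pd.DataFrame([[3, 4]], columns=["a", "b"])]
    misaligned = [pd.DataFrame([[1, 2]], columns=["a", "b"]),
                  pd.DataFrame([[3, 4]], columns=["c", "a"])]

    pd.concat(aligned)     # columns already identical: no warning under the proposal
    pd.concat(misaligned)  # columns differ: FutureWarning that the default will stop sorting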


Other
^^^^^
pandas/_libs/lib.pyx (11 changes: 6 additions & 5 deletions)
@@ -157,7 +157,7 @@ def fast_unique_multiple(list arrays):

@cython.wraparound(False)
@cython.boundscheck(False)
def fast_unique_multiple_list(list lists):
def fast_unique_multiple_list(list lists, bint sort=True):
cdef:
list buf
Py_ssize_t k = len(lists)
@@ -174,10 +174,11 @@ def fast_unique_multiple_list(list lists):
if val not in table:
table[val] = stub
uniques.append(val)
try:
uniques.sort()
except Exception:
pass
if sort:
try:
uniques.sort()
except Exception:
pass

return uniques

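
For readers who don't follow Cython, a rough pure-Python equivalent of the helper above with the new flag (a sketch, not the compiled implementation):

    def fast_unique_multiple_list(lists, sort=True):
        # Collect values in first-seen order across all input lists.
        table, uniques = {}, []
        for buf in lists:
            for val in buf:
                if val not in table:
                    table[val] = None
                    uniques.append(val)
        if sort:
            try:
                uniques.sort()
            except Exception:
                # mixed, unorderable types: keep first-seen order
                pass
        return uniques
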
pandas/core/frame.py (8 changes: 6 additions & 2 deletions)
@@ -5982,7 +5982,8 @@ def infer(x):
# ----------------------------------------------------------------------
# Merging / joining methods

def append(self, other, ignore_index=False, verify_integrity=False):
def append(self, other, ignore_index=False,
verify_integrity=False, sort=False):

Contributor:
sort before verify_integrity

Contributor:
@jreback why do you want sort before verify_integrity?

"""
Append rows of `other` to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
@@ -5995,6 +5996,8 @@ def append(self, other, ignore_index=False, verify_integrity=False):
If True, do not use the index labels.
verify_integrity : boolean, default False
If True, raise ValueError on creating index with duplicates.
sort: boolean, default False
Sort columns if given object doesn't have the same columns

Contributor:
needs a versionadded.

use "does not" rather than "doesn't"


Returns
-------
@@ -6103,7 +6106,8 @@ def append(self, other, ignore_index=False, verify_integrity=False):
else:
to_concat = [self, other]
return concat(to_concat, ignore_index=ignore_index,
verify_integrity=verify_integrity)
verify_integrity=verify_integrity,
sort=sort)

def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
sort=False):
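
A quick usage sketch of the keyword as it stands in this commit (illustrative data; note that at this point in the review the parameter defaults to sort=False):

    import pandas as pd

    df1 = pd.DataFrame([[1, 2]], columns=["b", "a"])
    df2 = pd.DataFrame([[3, 4]], columns=["b", "c"])

    # sort=False keeps the column order of the inputs; sort=True restores
    # the previous lexically-sorted columns.
    out = df1.append(df2, ignore_index=True, sort=False)
    print(out.columns.tolist())  # ['b', 'a', 'c']
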
pandas/core/indexes/api.py (13 changes: 7 additions & 6 deletions)
@@ -31,17 +31,17 @@
'_all_indexes_same']


def _get_objs_combined_axis(objs, intersect=False, axis=0):
def _get_objs_combined_axis(objs, intersect=False, axis=0, sort=True):
# Extract combined index: return intersection or union (depending on the
# value of "intersect") of indexes on given axis, or None if all objects
# lack indexes (e.g. they are numpy arrays)
obs_idxes = [obj._get_axis(axis) for obj in objs
if hasattr(obj, '_get_axis')]
if obs_idxes:
return _get_combined_index(obs_idxes, intersect=intersect)
return _get_combined_index(obs_idxes, intersect=intersect, sort=sort)


def _get_combined_index(indexes, intersect=False):
def _get_combined_index(indexes, intersect=False, sort=True):
# TODO: handle index names!
indexes = com._get_distinct_objs(indexes)
if len(indexes) == 0:
@@ -53,11 +53,11 @@ def _get_combined_index(indexes, intersect=False):
for other in indexes[1:]:
index = index.intersection(other)
return index
union = _union_indexes(indexes)
union = _union_indexes(indexes, sort=sort)
return _ensure_index(union)


def _union_indexes(indexes):
def _union_indexes(indexes, sort=True):
if len(indexes) == 0:
raise AssertionError('Must have at least 1 Index to union')
if len(indexes) == 1:
@@ -74,7 +74,8 @@ def conv(i):
i = i.tolist()
return i

return Index(lib.fast_unique_multiple_list([conv(i) for i in inds]))
return Index(
lib.fast_unique_multiple_list([conv(i) for i in inds], sort=sort))

if kind == 'special':
result = indexes[0]
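
Conceptually, the change above just threads sort down to the index-union step; for the plain-list path it amounts to something like this sketch (helper name hypothetical):

    def union_in_order(indexes, sort=True):
        # Union of several label sequences, preserving first-seen order
        # unless sorting is requested.
        seen, out = set(), []
        for idx in indexes:
            for label in idx:
                if label not in seen:
                    seen.add(label)
                    out.append(label)
        return sorted(out) if sort else out

    union_in_order([["c", "a"], ["c", "z"]], sort=False)  # ['c', 'a', 'z']
    union_in_order([["c", "a"], ["c", "z"]], sort=True)   # ['a', 'c', 'z']
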
pandas/core/reshape/concat.py (13 changes: 9 additions & 4 deletions)
@@ -20,7 +20,7 @@

def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True):

Contributor:
actually move before verify_integrity

Contributor:
please do this

Contributor:
Why make this an API breaking change?

Contributor:
because it is more logical.

Contributor:
How so?

I'm OK with breaking API when necessary, but this seems unnecessary.

sort=False, copy=True):
"""
Concatenate pandas objects along a particular axis with optional set logic
along the other axes.
@@ -60,6 +60,8 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
verify_integrity : boolean, default False
Check whether the new concatenated axis contains duplicates. This can
be very expensive relative to the actual data concatenation
sort : boolean, default False

Contributor:
similar to above

Sort columns if all passed object columns are not the same
copy : boolean, default True
If False, do not copy data unnecessarily

@@ -209,7 +211,7 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
ignore_index=ignore_index, join=join,
keys=keys, levels=levels, names=names,
verify_integrity=verify_integrity,
copy=copy)
copy=copy, sort=sort)
return op.get_result()


@@ -220,7 +222,8 @@ class _Concatenator(object):

def __init__(self, objs, axis=0, join='outer', join_axes=None,
keys=None, levels=None, names=None,
ignore_index=False, verify_integrity=False, copy=True):
ignore_index=False, verify_integrity=False, copy=True,
sort=False):
if isinstance(objs, (NDFrame, compat.string_types)):
raise TypeError('first argument must be an iterable of pandas '
'objects, you passed an object of type '
@@ -355,6 +358,7 @@ def __init__(self, objs, axis=0, join='outer', join_axes=None,
self.keys = keys
self.names = names or getattr(keys, 'names', None)
self.levels = levels
self.sort = sort

self.ignore_index = ignore_index
self.verify_integrity = verify_integrity
@@ -447,7 +451,8 @@ def _get_comb_axis(self, i):
data_axis = self.objs[0]._get_block_manager_axis(i)
try:
return _get_objs_combined_axis(self.objs, axis=data_axis,
intersect=self.intersect)
intersect=self.intersect,
sort=self.sort)
except IndexError:
types = [type(x).__name__ for x in self.objs]
raise TypeError("Cannot concatenate list of {types}"
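
The keyword also governs the row index when concatenating along axis=1, since that is the non-concatenation axis there; an illustrative check (made-up values):

    import pandas as pd

    s1 = pd.Series([1.0, 2.0, 3.0], index=["c", "a", "b"], name="A")
    s2 = pd.Series([4.0, 5.0, 6.0, 7.0], index=["d", "a", "b", "c"], name="B")

    pd.concat([s1, s2], axis=1, sort=True)   # row index sorted: a, b, c, d
    pd.concat([s1, s2], axis=1, sort=False)  # first-seen order: c, a, b, d
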
pandas/tests/reshape/test_concat.py (35 changes: 28 additions & 7 deletions)
@@ -5,7 +5,7 @@
from numpy.random import randn

from datetime import datetime
from pandas.compat import StringIO, iteritems, PY2
from pandas.compat import StringIO, iteritems
import pandas as pd
from pandas import (DataFrame, concat,
read_csv, isna, Series, date_range,
@@ -852,8 +852,9 @@ def test_append_dtype_coerce(self):
dt.datetime(2013, 1, 2, 0, 0),
dt.datetime(2013, 1, 3, 0, 0),
dt.datetime(2013, 1, 4, 0, 0)],
name='start_time')], axis=1)
result = df1.append(df2, ignore_index=True)
name='start_time')],
axis=1, sort=True)
result = df1.append(df2, ignore_index=True, sort=True)
assert_frame_equal(result, expected)

def test_append_missing_column_proper_upcast(self):
@@ -1011,7 +1012,8 @@ def test_concat_ignore_index(self):
frame1.index = Index(["x", "y", "z"])
frame2.index = Index(["x", "y", "q"])

v1 = concat([frame1, frame2], axis=1, ignore_index=True)
v1 = concat([frame1, frame2], axis=1,
ignore_index=True, sort=True)

nan = np.nan
expected = DataFrame([[nan, nan, nan, 4.3],
@@ -1463,7 +1465,7 @@ def test_concat_series_axis1(self):
# must reindex, #2603
s = Series(randn(3), index=['c', 'a', 'b'], name='A')
s2 = Series(randn(4), index=['d', 'a', 'b', 'c'], name='B')
result = concat([s, s2], axis=1)
result = concat([s, s2], axis=1, sort=True)
expected = DataFrame({'A': s, 'B': s2})
assert_frame_equal(result, expected)

@@ -2070,8 +2072,6 @@ def test_concat_order(self):
for i in range(100)]
result = pd.concat(dfs).columns
expected = dfs[0].columns
if PY2:
expected = expected.sort_values()
tm.assert_index_equal(result, expected)

def test_concat_datetime_timezone(self):
Expand Down Expand Up @@ -2155,3 +2155,24 @@ def test_concat_empty_and_non_empty_series_regression():
expected = s1
result = pd.concat([s1, s2])
tm.assert_series_equal(result, expected)


def test_concat_preserve_column_order_differing_columns():
# GH 4588 regression test
# for new columns in concat
dfa = pd.DataFrame(columns=['C', 'A'], data=[[1, 2]])
dfb = pd.DataFrame(columns=['C', 'Z'], data=[[5, 6]])
result = pd.concat([dfa, dfb])
assert result.columns.tolist() == ['C', 'A', 'Z']

Contributor:
create an expected frame and use assert_frame_equal
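
One way the requested rewrite could look (the expected frame below is hypothetical; check_dtype=False sidesteps int/float upcasting details):

    import numpy as np
    import pandas as pd
    import pandas.util.testing as tm

    def test_concat_preserve_column_order_differing_columns():
        # GH 4588: new columns should be appended, not sorted in
        dfa = pd.DataFrame(columns=['C', 'A'], data=[[1, 2]])
        dfb = pd.DataFrame(columns=['C', 'Z'], data=[[5, 6]])
        result = pd.concat([dfa, dfb], sort=False)
        expected = pd.DataFrame([[1, 2, np.nan], [5, np.nan, 6]],
                                columns=['C', 'A', 'Z'], index=[0, 0])
        tm.assert_frame_equal(result, expected, check_dtype=False)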



def test_concat_preserve_column_order_uneven_data():
# GH 4588 regression test
# add to column, concat with uneven data
df = pd.DataFrame()
df['b'] = [1, 2, 3]
df['c'] = [1, 2, 3]
df['a'] = [1, 2, 3]
df2 = pd.DataFrame({'a': [4, 5]})
df3 = pd.concat([df, df2])

Contributor:
use result = here, i.e. assign the concat output to a variable named result

assert df3.columns.tolist() == ['b', 'c', 'a']

Contributor:
same