Stop concat from attempting to sort mismatched columns by default #20613
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5982,7 +5982,8 @@ def infer(x): | |
# ---------------------------------------------------------------------- | ||
# Merging / joining methods | ||
|
||
- def append(self, other, ignore_index=False, verify_integrity=False): | ||
+ def append(self, other, ignore_index=False, | ||
+ verify_integrity=False, sort=False): | ||
Review comment: sort before verify_integrity
Review comment: @jreback why do you want sort before verify_integrity? |
||
""" | ||
Append rows of `other` to the end of this frame, returning a new | ||
object. Columns not in this frame are added as new columns. | ||
|
@@ -5995,6 +5996,8 @@ def append(self, other, ignore_index=False, verify_integrity=False): | |
If True, do not use the index labels. | ||
verify_integrity : boolean, default False | ||
If True, raise ValueError on creating index with duplicates. | ||
+ sort: boolean, default False | ||
+ Sort columns if given object doesn't have the same columns | ||
Review comment: needs a versionadded. use "does not" |
||
|
||
Returns | ||
------- | ||
|
@@ -6103,7 +6106,8 @@ def append(self, other, ignore_index=False, verify_integrity=False): | |
else: | ||
to_concat = [self, other] | ||
return concat(to_concat, ignore_index=ignore_index, | ||
- verify_integrity=verify_integrity) | ||
+ verify_integrity=verify_integrity, | ||
+ sort=sort) | ||
|
||
def join(self, other, on=None, how='left', lsuffix='', rsuffix='', | ||
sort=False): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,7 +20,7 @@ | |
|
||
def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, | ||
keys=None, levels=None, names=None, verify_integrity=False, | ||
- copy=True): | ||
Review comment: actually move before verify_integrity
Review comment: pls do this
Review comment: Why make this an API breaking change?
Review comment: because it's more logical.
Review comment: How so? I'm OK with breaking API when necessary, but this seems unnecessary. |
||
+ sort=False, copy=True): | ||
""" | ||
Concatenate pandas objects along a particular axis with optional set logic | ||
along the other axes. | ||
|
@@ -60,6 +60,8 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, | |
verify_integrity : boolean, default False | ||
Check whether the new concatenated axis contains duplicates. This can | ||
be very expensive relative to the actual data concatenation | ||
+ sort : boolean, default False | ||
Review comment: similar to above |
||
+ Sort columns if all passed object columns are not the same | ||
copy : boolean, default True | ||
If False, do not copy data unnecessarily | ||
|
||
|
@@ -209,7 +211,7 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, | |
ignore_index=ignore_index, join=join, | ||
keys=keys, levels=levels, names=names, | ||
verify_integrity=verify_integrity, | ||
- copy=copy) | ||
+ copy=copy, sort=sort) | ||
return op.get_result() | ||
|
||
|
||
|
@@ -220,7 +222,8 @@ class _Concatenator(object): | |
|
||
def __init__(self, objs, axis=0, join='outer', join_axes=None, | ||
keys=None, levels=None, names=None, | ||
- ignore_index=False, verify_integrity=False, copy=True): | ||
+ ignore_index=False, verify_integrity=False, copy=True, | ||
+ sort=False): | ||
if isinstance(objs, (NDFrame, compat.string_types)): | ||
raise TypeError('first argument must be an iterable of pandas ' | ||
'objects, you passed an object of type ' | ||
|
@@ -355,6 +358,7 @@ def __init__(self, objs, axis=0, join='outer', join_axes=None, | |
self.keys = keys | ||
self.names = names or getattr(keys, 'names', None) | ||
self.levels = levels | ||
+ self.sort = sort | ||
|
||
self.ignore_index = ignore_index | ||
self.verify_integrity = verify_integrity | ||
|
@@ -447,7 +451,8 @@ def _get_comb_axis(self, i): | |
data_axis = self.objs[0]._get_block_manager_axis(i) | ||
try: | ||
return _get_objs_combined_axis(self.objs, axis=data_axis, | ||
- intersect=self.intersect) | ||
+ intersect=self.intersect, | ||
+ sort=self.sort) | ||
except IndexError: | ||
types = [type(x).__name__ for x in self.objs] | ||
raise TypeError("Cannot concatenate list of {types}" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,7 +5,7 @@ | |
from numpy.random import randn | ||
|
||
from datetime import datetime | ||
- from pandas.compat import StringIO, iteritems, PY2 | ||
+ from pandas.compat import StringIO, iteritems | ||
import pandas as pd | ||
from pandas import (DataFrame, concat, | ||
read_csv, isna, Series, date_range, | ||
|
@@ -852,8 +852,9 @@ def test_append_dtype_coerce(self): | |
dt.datetime(2013, 1, 2, 0, 0), | ||
dt.datetime(2013, 1, 3, 0, 0), | ||
dt.datetime(2013, 1, 4, 0, 0)], | ||
- name='start_time')], axis=1) | ||
- result = df1.append(df2, ignore_index=True) | ||
+ name='start_time')], | ||
+ axis=1, sort=True) | ||
+ result = df1.append(df2, ignore_index=True, sort=True) | ||
assert_frame_equal(result, expected) | ||
|
||
def test_append_missing_column_proper_upcast(self): | ||
|
@@ -1011,7 +1012,8 @@ def test_concat_ignore_index(self): | |
frame1.index = Index(["x", "y", "z"]) | ||
frame2.index = Index(["x", "y", "q"]) | ||
|
||
- v1 = concat([frame1, frame2], axis=1, ignore_index=True) | ||
+ v1 = concat([frame1, frame2], axis=1, | ||
+ ignore_index=True, sort=True) | ||
|
||
nan = np.nan | ||
expected = DataFrame([[nan, nan, nan, 4.3], | ||
|
@@ -1463,7 +1465,7 @@ def test_concat_series_axis1(self): | |
# must reindex, #2603 | ||
s = Series(randn(3), index=['c', 'a', 'b'], name='A') | ||
s2 = Series(randn(4), index=['d', 'a', 'b', 'c'], name='B') | ||
- result = concat([s, s2], axis=1) | ||
+ result = concat([s, s2], axis=1, sort=True) | ||
expected = DataFrame({'A': s, 'B': s2}) | ||
assert_frame_equal(result, expected) | ||
|
||
|
@@ -2070,8 +2072,6 @@ def test_concat_order(self): | |
for i in range(100)] | ||
result = pd.concat(dfs).columns | ||
expected = dfs[0].columns | ||
- if PY2: | ||
- expected = expected.sort_values() | ||
tm.assert_index_equal(result, expected) | ||
|
||
def test_concat_datetime_timezone(self): | ||
|
@@ -2155,3 +2155,24 @@ def test_concat_empty_and_non_empty_series_regression(): | |
expected = s1 | ||
result = pd.concat([s1, s2]) | ||
tm.assert_series_equal(result, expected) | ||
|
||
|
||
+ def test_concat_preserve_column_order_differing_columns(): | ||
+ # GH 4588 regression test | ||
+ # for new columns in concat | ||
+ dfa = pd.DataFrame(columns=['C', 'A'], data=[[1, 2]]) | ||
+ dfb = pd.DataFrame(columns=['C', 'Z'], data=[[5, 6]]) | ||
+ result = pd.concat([dfa, dfb]) | ||
+ assert result.columns.tolist() == ['C', 'A', 'Z'] | ||
Review comment: create an expected frame and use assert_frame_equal |
||
|
||
|
||
+ def test_concat_preserve_column_order_uneven_data(): | ||
+ # GH 4588 regression test | ||
+ # add to column, concat with uneven data | ||
+ df = pd.DataFrame() | ||
+ df['b'] = [1, 2, 3] | ||
+ df['c'] = [1, 2, 3] | ||
+ df['a'] = [1, 2, 3] | ||
+ df2 = pd.DataFrame({'a': [4, 5]}) | ||
+ df3 = pd.concat([df, df2]) | ||
Review comment: use result = |
||
+ assert df3.columns.tolist() == ['b', 'c', 'a'] | ||
Review comment: same |

This should be `sort=True` by default to preserve backwards compatibility, right? Or rather, I think the eventual goal is to have `sort=False` be the default, so for now the default should be `sort=None`, which behaves like `sort=True` and warns that the default is changing in the future.

actually this needs a sub-section. this is a rather large change (even if it's `None` by default). highlighting it is best. pls show an example of previous and new
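
For illustration only, the previous and new column ordering can be sketched in plain Python, without pandas. The helpers `union_in_order`, `previous_columns`, and `new_columns` are hypothetical names made up for this sketch and are not part of pandas. The "previous" rule (sort only when the column sets differ) is an assumption taken from how the behavior is described in this thread. The column labels come from the regression tests in this PR.

```python
def union_in_order(column_lists):
    """Union of column labels, keeping first-seen order (no sorting)."""
    seen = []
    for cols in column_lists:
        for col in cols:
            if col not in seen:
                seen.append(col)
    return seen


def previous_columns(column_lists):
    """Assumed previous behavior: sort the union only when columns differ."""
    first = list(column_lists[0])
    if all(list(cols) == first for cols in column_lists):
        return first
    return sorted(union_in_order(column_lists))


def new_columns(column_lists):
    """Behavior proposed in this PR: never sort, keep first-seen order."""
    return union_in_order(column_lists)


frames = [['C', 'A'], ['C', 'Z']]
print(previous_columns(frames))  # ['A', 'C', 'Z']
print(new_columns(frames))       # ['C', 'A', 'Z']
```

The second regression test follows the same pattern: columns `['b', 'c', 'a']` combined with `['a']` keep the order `['b', 'c', 'a']` under the new rule.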

@TomAugspurger If I do what @jorisvandenbossche suggests, then `sort=True` will not be backwards compatible, because it will sort the axes in question regardless of whether the columns are mismatched. I could have `sort=None` be the default, give a warning, and revert to the old behavior. In future versions this behavior of only sometimes sorting the axes would not be available, because it doesn't make sense, and concat could default to `sort=False`.

@brycepg I think we can do both (it might complicate the code a bit, but not too much I think, as in `_get_combined_index` those cases are already handled separately). As @TomAugspurger suggests, the default can be `None` for now, so we can raise a warning in the appropriate cases: `sort=False` will not change anything, but add the ability to also sort the index with `sort=True`.
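
The three-state default discussed above could look roughly like the following. This is a minimal sketch in plain Python, not the code in `_get_combined_index`. The function name `combined_columns` and the warning text are hypothetical. It assumes the legacy behavior is "sort only when the columns are mismatched", as described in this thread.

```python
import warnings


def combined_columns(column_lists, sort=None):
    """Combine column labels from several frames (hypothetical helper).

    sort=None  -> legacy behavior: sort only on mismatch, and warn
    sort=False -> never sort, keep first-seen order
    sort=True  -> always sort
    """
    first = list(column_lists[0])
    mismatched = any(list(cols) != first for cols in column_lists)

    # Union of labels in first-seen order.
    union = []
    for cols in column_lists:
        for col in cols:
            if col not in union:
                union.append(col)

    if sort is None:
        if mismatched:
            # Warn only in the case where the result will change later.
            warnings.warn(
                "Sorting because the columns are not aligned; pass "
                "sort=True or sort=False to silence this warning.",
                FutureWarning,
                stacklevel=2,
            )
        sort = mismatched

    return sorted(union) if sort else union


frames = [['C', 'A'], ['C', 'Z']]
print(combined_columns(frames, sort=False))  # ['C', 'A', 'Z']
print(combined_columns(frames, sort=True))   # ['A', 'C', 'Z']
```

Callers who pass `sort` explicitly never see the warning, so only code relying on the implicit legacy ordering is nudged to choose.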