pd.concat() crashes if dataframe contains duplicate indices but not df.join() #36263

xuancong84 · 2020-09-10T04:47:03Z

I just found out that when two DataFrames are concatenated horizontally and one of them has duplicate index values, pd.concat() raises a ValueError, but df.join() does not. Instead, df.join() spreads the values across all rows that share the same index value. Is this behavior by design? Thanks!

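import numpy as np
import pandas as pd
from IPython.display import display

# Each frame has one duplicated index value (3 in df1, 2 in df2).
df1 = pd.DataFrame(np.random.randn(5), index=[0,1,2,3,3], columns=['a'])
df2 = pd.DataFrame(np.random.randn(5), index=[0,1,2,2,4], columns=['b'])
dfj = df1.join(df2, how='outer')
display(df1, df2, dfj)
dfc = pd.concat([df1, df2], axis=1)
display(dfc)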

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-150ce3e802ea> in <module>
      3 dfj = df1.join(df2, how='outer')
      4 display(df1, df2, dfj)
----> 5 dfc = pd.concat([df1, df2], axis=1)
      6 display(dfc)

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    282     )
    283 
--> 284     return op.get_result()
    285 
    286 

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py in get_result(self)
    495 
    496             new_data = concatenate_block_managers(
--> 497                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy
    498             )
    499             if not self.copy:

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   2025         blocks.append(b)
   2026 
-> 2027     return BlockManager(blocks, axes)

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    137 
    138         if do_integrity_check:
--> 139             self._verify_integrity()
    140 
    141         self._consolidate_check()

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    332         for block in self.blocks:
    333             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 334                 construction_error(tot_items, block.shape[1:], self.axes)
    335         if len(self.items) != tot_items:
    336             raise AssertionError(

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
   1692     if block_shape[0] == 0:
   1693         raise ValueError("Empty data passed with indices specified.")
-> 1694     raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
   1695 
   1696 

ValueError: Shape of passed values is (9, 2), indices imply (7, 2)

Ideally, when the DataFrames have duplicate indices, pd.concat() could behave like df.join(); at the very least it should NOT crash.
I suggest introducing an additional argument to handle duplicate indices. For example, if the same index value has X (> 0) rows in df1 and Y (> 0) rows in df2, a dup_index argument could accept the following values:

combinatorial: after merging, it will have X*Y rows for every combination possibility.
outer-top-align: after merging, it will have max(X, Y) rows, in which the rows align from top
outer-bottom-align: after merging, it will have max(X, Y) rows, in which the rows align from bottom
inner-top-align: after merging, it will have min(X, Y) rows, in which the rows align from top
inner-bottom-align: after merging, it will have min(X, Y) rows, in which the rows align from bottom
raise: raise an exception with the warning message
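
The top-align options above can be approximated with existing pandas calls. The following is a minimal sketch, not a pandas API: `concat_top_align` and its inner function are hypothetical names, and the approach (numbering repeated index values with `cumcount` so the combined key becomes unique) is one assumed way to realize the proposal. With `join="outer"` it gives max(X, Y) rows per index value, and with `join="inner"` it gives min(X, Y).

```python
import pandas as pd


def concat_top_align(df1, df2, join="outer"):
    """Hypothetical helper: column-wise concat that pairs duplicate index values top-down."""

    def number_duplicates(df):
        # 0, 1, 2, ... for each repeated index value, in original row order.
        occurrence = df.groupby(level=0).cumcount().to_numpy()
        out = df.copy()
        # (index value, occurrence number) is unique, so concat can align on it.
        out.index = pd.MultiIndex.from_arrays([df.index, occurrence])
        return out

    wide = pd.concat([number_duplicates(df1), number_duplicates(df2)], axis=1, join=join)
    # Sorting groups rows by index value; it does not preserve the input row order.
    return wide.sort_index().droplevel(-1)


df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=[0, 1, 2, 3, 3])
df2 = pd.DataFrame({"b": [10.0, 20.0, 30.0, 40.0, 50.0]}, index=[0, 1, 2, 2, 4])

outer = concat_top_align(df1, df2, join="outer")  # max(X, Y) rows per index value
inner = concat_top_align(df1, df2, join="inner")  # min(X, Y) rows per index value
```

The combinatorial option corresponds to what df.join() already does, so it is not repeated here. The bottom-align variants would need the occurrence numbers counted from the end of each group instead.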


jreback · 2020-12-24T15:09:00Z

this should now raise (need tests)

cc @ivirshup @phofl
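
Until the raising behavior is covered by tests, a caller can make the failure explicit on their side. This is a sketch under assumptions: `concat_columns_strict` is a hypothetical wrapper, and `ValueError` is chosen here for illustration; the exception pandas itself raises may differ between versions.

```python
import pandas as pd


def concat_columns_strict(frames):
    """Hypothetical wrapper: refuse column-wise concat when any index has duplicates."""
    for position, frame in enumerate(frames):
        if not frame.index.is_unique:
            # Report which index values are repeated in the offending frame.
            dupes = frame.index[frame.index.duplicated()].unique().tolist()
            raise ValueError(
                f"frame {position} has duplicate index values {dupes}; "
                "deduplicate or use DataFrame.join instead"
            )
    return pd.concat(frames, axis=1)


df1 = pd.DataFrame({"a": range(5)}, index=[0, 1, 2, 3, 3])
df2 = pd.DataFrame({"b": range(5)}, index=[0, 1, 2, 2, 4])

try:
    concat_columns_strict([df1, df2])
except ValueError as err:
    message = str(err)  # names the first frame whose index is not unique
```

The check stops at the first offending frame, which keeps the message short; collecting duplicates from every frame would be an equally reasonable design.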

kasmith11 · 2021-09-22T17:24:08Z

take

xuancong84 added the Bug and Needs Triage labels Sep 10, 2020

dsaxton added the Reshaping label and removed the Needs Triage label Sep 10, 2020

phofl self-assigned this Sep 10, 2020

phofl mentioned this issue Sep 11, 2020

[BUG]: Fix ValueError in concat() when at least one Index has duplicates #36290

Merged

5 tasks

jreback added this to the 1.2 milestone Sep 13, 2020

jreback modified the milestones: 1.2, Contributions Welcome Nov 19, 2020

jreback closed this as completed in #36290 Nov 19, 2020

jreback mentioned this issue Dec 24, 2020

[BUG] Concat duplicates errors (or lack there of) #38654

Merged

6 tasks

jreback reopened this Dec 24, 2020

jreback modified the milestones: 1.2, 1.2.1 Dec 24, 2020

This was referenced Dec 30, 2020

[WIP] Test (and more fixes) for duplicate indices with concat #38745

Closed

API/ ENH: Unambiguous indexing should be allowed, even if duplicates are present #38797

Open

jreback modified the milestones: 1.2.1, 1.3 Jan 6, 2021

simonjayhawkins modified the milestones: 1.3, Contributions Welcome May 24, 2021

mroeschke added the good first issue and Needs Tests labels and removed the Bug label Aug 13, 2021

github-actions bot assigned kasmith11 Sep 22, 2021

kasmith11 mentioned this issue Sep 22, 2021

GH36236 - Additional Test for df.join() #43700

Closed

4 tasks

zhengfeiwang mentioned this issue Feb 9, 2022

TST: add test for DataFrame with duplicate indices concat #45888

Merged

4 tasks

mroeschke modified the milestones: Contributions Welcome, 1.5 Feb 10, 2022

mroeschke closed this as completed in #45888 Feb 11, 2022
