API: type conversions on merges #15332

Open
chris-b1 opened this issue Feb 7, 2017 · 5 comments
Labels
Bug; Categorical (Categorical Data Type); Dtype Conversions (Unexpected or buggy dtype conversions); Error Reporting (Incorrect or improved errors from pandas); Needs Discussion (Requires discussion from core team before further action)

Comments

@chris-b1
Contributor

chris-b1 commented Feb 7, 2017

Currently any type conversions on merge are silent, e.g.

In [24]: a = pd.DataFrame({'cat_key': pd.Categorical(['a', 'b', 'c']), 'int_key': [1, 2, 3]})

In [25]: b = pd.DataFrame({'cat_key': pd.Categorical(['b', 'a', 'c']), 'values': [1, 2, 3]})

In [26]: a.merge(b).dtypes
Out[26]: 
cat_key    object
int_key     int64
values      int64
dtype: object

In [29]: b2 = pd.DataFrame({'int_key': [2.0, 1.0, 3.0], 'values': [1, 2, 3]})

In [30]: a.merge(b2)
Out[30]: 
  cat_key  int_key  values
0       a        1       2
1       b        2       1
2       c        3       3

In [31]: a.merge(b2).dtypes
Out[31]: 
cat_key    object
int_key     int64
values      int64
dtype: object

#15321 will make [26] preserve a categorical dtype, but if the categories don't overlap, it will be converted to object.

So, should there be something like a conversions='ignore'|'warn'|'error' option?
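
To make the proposal concrete, here is a minimal sketch of how such an option might behave, written as a wrapper around `DataFrame.merge` rather than as a change to pandas. The function name `merge_checked` and its `conversions` parameter are hypothetical; pandas has no such keyword. The wrapper flags a key column whenever its dtype in the result differs from its dtype in either input.

```python
import warnings

import pandas as pd
from pandas.api.types import is_dtype_equal


def merge_checked(left, right, on, conversions="ignore", **kwargs):
    """Hypothetical wrapper: merge, then report dtype conversions on the keys.

    `conversions` is an assumed parameter, not a pandas keyword.
    """
    if conversions not in ("ignore", "warn", "error"):
        raise ValueError(f"invalid conversions option: {conversions!r}")
    keys = [on] if isinstance(on, str) else list(on)
    result = left.merge(right, on=keys, **kwargs)
    if conversions == "ignore":
        return result
    for key in keys:
        out = result[key].dtype
        # A conversion happened if the result dtype differs from either input.
        converted = not (
            is_dtype_equal(out, left[key].dtype)
            and is_dtype_equal(out, right[key].dtype)
        )
        if converted:
            msg = (
                f"key {key!r}: left {left[key].dtype}, "
                f"right {right[key].dtype}, result {out}"
            )
            if conversions == "error":
                raise TypeError(msg)
            warnings.warn(msg, UserWarning, stacklevel=2)
    return result
```

With the frames above, `merge_checked(a, b2, on='int_key', conversions='error')` would raise, because the two `int_key` columns have different dtypes and the result can match at most one of them.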

@chris-b1 chris-b1 added the API Design, Categorical (Categorical Data Type), and Dtype Conversions (Unexpected or buggy dtype conversions) labels Feb 7, 2017
@jreback jreback added the Error Reporting (Incorrect or improved errors from pandas) label Feb 7, 2017
@jorisvandenbossche jorisvandenbossche added the Needs Discussion (Requires discussion from core team before further action) label Feb 7, 2017
@jorisvandenbossche
Member

jorisvandenbossche commented Feb 7, 2017

In the integer case, what would be the rule? I would actually expect the result to be floats, not ints (at least in the non-left-join case).

I am not sure it is worth adding a keyword for this.
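
For reference, NumPy's own promotion rule agrees with the float expectation: combining int64 and float64 gives float64. The snippet below only illustrates that rule. Whether merge should follow it is the open question here, and nothing in it is taken from the pandas merge code.

```python
import numpy as np

# NumPy's promotion of the two key dtypes from the int_key example.
promoted = np.result_type(np.int64, np.float64)
print(promoted)  # float64
```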

@jreback jreback added this to the Next Major Release milestone Dec 10, 2017
@jreback
Contributor

jreback commented Dec 10, 2017

xref #18674

@jreback
Contributor

jreback commented Dec 10, 2017

cc @reidy-p

This is a little bit trickier. We need to allow merges on compatible categorical types, and I think we should raise on incompatible ones, but there might be some cases where we are OK to merge (note that these should still merge, just be converted to object).

E.g. if the categoricals have the same categories and are unordered, but the categories are just listed in a different order, I think this might be OK:

In [2]: dtype1 = CategoricalDtype(list('abc'),ordered=False)

In [3]: dtype2 = CategoricalDtype(list('bac'),ordered=False)

In [4]: dtype1 == dtype2
Out[4]: True
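
As a small extension of the comparison above, the sketch below contrasts the unordered case with the ordered and mixed cases, using the same `CategoricalDtype` equality check. The variable names are illustrative only.

```python
from pandas.api.types import CategoricalDtype

abc = list("abc")
bac = list("bac")

# Unordered: the order in which categories are listed is ignored.
same_unordered = CategoricalDtype(abc, ordered=False) == CategoricalDtype(bac, ordered=False)

# Ordered: the order of the categories is part of the dtype.
same_ordered = CategoricalDtype(abc, ordered=True) == CategoricalDtype(bac, ordered=True)

# An ordered dtype never compares equal to an unordered one.
mixed = CategoricalDtype(abc, ordered=False) == CategoricalDtype(abc, ordered=True)

print(same_unordered, same_ordered, mixed)  # True False False
```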

@reidy-p
Contributor

reidy-p commented Dec 10, 2017

@jreback thanks.

At the moment, merging on categorical columns seems to retain the categorical type when:

  1. Columns have same categories and both are unordered
  2. Columns have same categories and same ordering

But many other cases lead to casting to object. For example:

  1. Columns have same categories but different ordering

In [1]: cat1 = pd.Categorical(list('abc'), ordered=True, categories=['c', 'b', 'a'])
In [2]: cat2 = pd.Categorical(list('bac'), ordered=True, categories=['a', 'b', 'c'])

In [3]: a = pd.DataFrame({'A': cat1})
In [4]: b = pd.DataFrame({'A': cat2})
In [5]: a.merge(b, on='A', how='outer').dtypes
Out[5]:
A    object
dtype: object

  2. One categorical column is ordered while the other is not

In [6]: cat1 = pd.Categorical(list('abc'), ordered=False)
In [7]: cat2 = pd.Categorical(list('bac'), ordered=True, categories=['a', 'b', 'c'])

In [8]: a = pd.DataFrame({'A': cat1})
In [9]: b = pd.DataFrame({'A': cat2})
In [10]: a.merge(b, on='A', how='outer').dtypes
Out[10]:
A    object
dtype: object

  3. Columns have different categories

In [11]: cat1 = pd.Categorical(list('abcd'), ordered=False)
In [12]: cat2 = pd.Categorical(list('bac'), ordered=False)

In [13]: a = pd.DataFrame({'A': cat1})
In [14]: b = pd.DataFrame({'A': cat2})
In [15]: a.merge(b, on='A', how='outer').dtypes
Out[15]:
A    object
dtype: object

And there are probably other cases where casting to object occurs too.

Should we raise instead of casting to object in the above cases? And are there any cases where we should allow casting to object and not raise?
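
One way the "raise instead of casting" idea could look is a pre-merge check on the key dtypes. This is only a sketch: `check_categorical_keys` is a hypothetical helper, not pandas API, and it assumes two categorical keys are compatible exactly when their dtypes compare equal, which matches the two retained cases listed earlier in this comment.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def check_categorical_keys(left, right, on):
    """Hypothetical pre-merge check; raises if categorical keys are incompatible."""
    ldtype, rdtype = left[on].dtype, right[on].dtype
    both_categorical = isinstance(ldtype, CategoricalDtype) and isinstance(
        rdtype, CategoricalDtype
    )
    if not both_categorical:
        return  # this sketch only covers categorical-vs-categorical keys
    if ldtype == rdtype:
        return  # same categories, and same ordering if ordered
    if ldtype.ordered != rdtype.ordered:
        reason = "one key is ordered and the other is not"
    elif set(ldtype.categories) == set(rdtype.categories):
        reason = "same categories but different ordering"
    else:
        reason = "different categories"
    raise TypeError(f"cannot merge on {on!r}: {reason}")
```

Calling `check_categorical_keys(a, b, on='A')` before `a.merge(b, on='A')` would then raise for each of the three cases above and pass silently for the two cases that currently keep the categorical dtype.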

@jreback
Contributor

jreback commented Dec 10, 2017

cc @TomAugspurger @chris-b1

5 participants