
pd.eval division operation upcasts float32 to float64 #12388

Closed
jennolsen84 opened this issue Feb 19, 2016 · 13 comments
Labels: Dtype Conversions (unexpected or buggy dtype conversions), Numeric Operations (arithmetic, comparison, and logical operations)
@jennolsen84 (Contributor)

The current behavior is inconsistent with normal Python division of two DataFrames (see the code sample below).

Pandas upcasts both terms to 64-bit floats when it detects a division, see:

https://github.com/pydata/pandas/blob/528108bba4104b939bcfe6923677ddacc916ff00/pandas/computation/ops.py#L453

I think numexpr can handle mixed types too, and upcast automatically, though I am not 100% sure. I can submit a PR, but how do you recommend fixing this? Something like the following?

if truediv or PY3:
    for term in com.flatten(self):
        try:
            dt = term.values.dtype  # can .values be expensive?
        except AttributeError:
            dt = type(term)

        # leave float32 terms alone; cast everything else to float64
        if dt != np.float32:
            _cast_inplace([term], np.float_)

The downside is that if someone does 2 + df, they will probably still end up with an upcast result. But this proposal is still better than what we have today.

I might rewrite the above using filter too, but at this point I just wanted to discuss the general approach.
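
For illustration, a filter-based variant might look roughly like this (a hypothetical standalone sketch: `dtype_of` and `terms_to_cast` are made-up names, and `terms` stands in for `com.flatten(self)` in the pandas internals):

```python
import numpy as np

def dtype_of(term):
    # Works for pandas objects (.values.dtype), ndarrays (.dtype),
    # and plain scalars (fall back to the Python type).
    values = getattr(term, 'values', term)
    return getattr(values, 'dtype', type(term))

def terms_to_cast(terms):
    # Keep only the terms that are not already float32; these are the
    # candidates that _cast_inplace would then convert.
    return [t for t in terms if dtype_of(t) != np.float32]
```

For example, `terms_to_cast([float32_array, float64_array, 2.0])` would return only the float64 array and the scalar.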

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(3, dtype=np.float32))
print('normal', (df/df).values.dtype)
print('pd_eval', pd.eval('df/df').values.dtype)
assert ((df/df).dtypes == pd.eval('df/df').dtypes).all()

Expected Output

normal float32 
pd_eval float32

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: None
pip: 8.0.2
setuptools: 19.6.2
Cython: 0.23.4
numpy: 1.10.4
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.4.2
sqlalchemy: 1.0.9
pymysql: 0.6.7.None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None
@jreback (Contributor) commented Feb 19, 2016

In operations involving a scalar and an array, Numexpr uses the normal rules of casting, in contrast with NumPy, where array types take priority. For example, if 'a' is an array of type `float32` and 'b' is a scalar of type `float64` (or the Python `float` type, which is equivalent), then 'a*b' returns a `float64` in Numexpr, but a `float32` in NumPy (i.e. array operands take priority in determining the result type). If you need to keep the result a `float32`, be sure you use a `float32` scalar too.

(This is different from what you are describing, but we should probably handle it nonetheless.) I would do this test/casting in _cast_inplace itself.
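
The NumPy side of that contrast is easy to verify without numexpr; a minimal check in plain NumPy:

```python
import numpy as np

a = np.arange(3, dtype=np.float32)

# In NumPy, a plain Python float scalar does not upcast a float32 array...
print((a * 2.0).dtype)         # float32

# ...whereas a float64 *array* operand does participate in promotion.
print((a * np.ones(3)).dtype)  # float64
```

Per the numexpr docs quoted above, numexpr would instead return float64 in the scalar case.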

@jennolsen84 (Contributor, Author)

numpy behavior seems to make more sense.

pd.eval('3.5 / float32array')

is much easier to write than:

s = np.float32('3.5')
pd.eval('s / float32array')

Also, someone who hadn't read the numexpr docs very carefully would miss that little detail.

Therefore, should we mimic numpy behavior?

As for _cast_inplace, should we modify the signature? After the changes, it would be a much more specialized function. It looks like it is only used in one place, so we have that going for us.

@jennolsen84 (Contributor, Author)

Thought about it some more

We could look at the whole expression, and come up with an output datatype:

if all arrays in the expression are float32 or int:
    output dtype = float32
else:
    output dtype = float64

This still has corner cases, like adding two int32 arrays, which would come out as a float dtype. It is unclear what the output of adding two int32 arrays should be: if the numbers are small, an int32 output array is OK, but if the numbers are big you need int64 arrays. A way around this would be to let the user specify an out parameter. We could then do extra checks to warn the user about incompatibilities, e.g. if two float64s are being added but the output dtype is float32.

So, the proposal now becomes:

  1. Add an out parameter to let the user specify the destination of the result; it must be an ndarray or a pandas object (so it has either .dtype or .values.dtype).
  2. Choose the output array dtype to be one of {float64, float32}, depending on the dtypes of the arrays in the expression: float32 is chosen if all arrays in the expression have a dtype of float32 or any of the ints, otherwise float64 is chosen.
  3. Warn if out is specified and is a float32 array, but the input contains a float64 array.
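
Step 2's dtype-selection rule could be sketched roughly as follows (`proposed_output_dtype` is a hypothetical helper illustrating the proposal, not a pandas or numexpr API):

```python
import numpy as np

def proposed_output_dtype(arrays):
    # Step 2 of the proposal: float32 if every array in the expression
    # is float32 or integer-typed, otherwise float64.
    if all(a.dtype == np.float32 or np.issubdtype(a.dtype, np.integer)
           for a in arrays):
        return np.dtype(np.float32)
    return np.dtype(np.float64)
```

For example, a mix of float32 and int32 arrays would select float32, while any float64 array in the expression would select float64.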

@jreback (Contributor) commented Feb 19, 2016

I don't recall why we are casting in the first place. I would ideally like to defer this entirely to the engine.
@chris-b1 @cpcloud any recollection?

If not, then I would be OK with passing a dtype= argument for casting, defaulting to the minimum casting needed (this just adds another layer of indirection, but I guess it needs to be done).

@jennolsen84 (Contributor, Author)

Should we go with the numpy casting behavior (instead of numexpr's)? The numpy behavior is consistent with pandas when numexpr is not used.

So, what we'd have to do here is down-cast constants from float64 to float32, if and only if all arrays are float32. E.g., numpy and pandas will use float64 as the output dtype when int32 arrays are multiplied with a float32 constant, so it seems like the float32-array case is the main thing we have to worry about.

e.g.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.Series(np.arange(5, dtype=np.float32)) * 2.0
Out[3]: 
0    0
1    2
2    4
3    6
4    8
dtype: float32

In [11]: a = pd.Series(np.arange(5, dtype=np.int32)) * np.float32(1.1)
In [12]: a
Out[12]: 
0    0.0
1    1.1
2    2.2
3    3.3
4    4.4
dtype: float64

In [13]: np.arange(5, dtype=np.int32) * np.float32(1.1)
Out[13]: array([ 0.        ,  1.10000002,  2.20000005,  3.30000007,  4.4000001 ])
In [14]: z = np.arange(5, dtype=np.int32) * np.float32(1.1)
In [15]: z.dtype
Out[15]: dtype('float64')

@jreback (Contributor) commented Feb 23, 2016

I think you have to upcast by default; the only case where I wouldn't is if the user indicated (with dtype=) that it's OK to proceed, and then I would simply cast things to the passed dtype so the underlying engine wouldn't upcast.

@jennolsen84 (Contributor, Author)

But wouldn't this result in inconsistent behavior between normal pandas binary operations (like s * 2.0, which does not upcast s if it is a float32 Series) and pd.eval('s * 2.0'), which would end up upcasting?

@jreback (Contributor) commented Feb 23, 2016

@jennolsen84 hmm, that is a good point. I am just trying to avoid having pandas do any casting here. What if we remove that and just let the engine do it? (I don't really recall why this is special-cased here.) Or, if we are forced to do it, then I guess you are right: we would have to do a lowest-common-denominator cast (maybe using np.find_common_type).
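
For reference, NumPy's promotion rules can be queried directly; `np.result_type` answers the same lowest-common-denominator question as the `np.find_common_type` suggestion above (find_common_type was later deprecated and removed from NumPy, so result_type is used in this sketch):

```python
import numpy as np

# np.result_type applies NumPy's type-promotion rules to its arguments.
print(np.result_type(np.float32, np.float32))  # float32
print(np.result_type(np.float32, np.int32))    # float64 (int32 can't fit in float32)
print(np.result_type(np.float32, np.float64))  # float64
```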

@jennolsen84 (Contributor, Author)

how about this as a start? jennolsen84@c82819f

I manually tested it, and the behavior is now consistent with the non-numexpr code path. I am trying to avoid casting unnecessarily, as you recommended, and letting the lower-level libraries take care of things.

I ran the nosetests, and all existing tests pass.

If the commit looks good to you, I can add in some tests, add to docs, etc. and submit a PR.

@jennolsen84 (Contributor, Author)

@jreback can you please take another look at the commit? I addressed your comment, and I am not sure if you missed it.

@jreback (Contributor) commented Mar 1, 2016

@jennolsen84 yeah, just getting back to this.

Your solution seems fine. However, I still don't understand why it is necessary to upcast (and only for division); what does numexpr do if you don't upcast? Is it wrong?

@jennolsen84 (Contributor, Author)

We're casting to float32 in all ops (not just division).

The division case was another place where pandas was casting to float64, so I had to make a change there as well.

The reason the cast happens at all is that, for some reason, numexpr casts a 64-bit float scalar times a 32-bit float array to 64-bit floats. I am not sure why. This is inconsistent with numpy, unnecessarily slower, and uses more RAM.

I will submit a PR (with whatsnew and tests)

@jreback (Contributor) commented Mar 1, 2016

Thanks @jennolsen84, why don't you submit it and we'll go from there.
