BUG: assign doesn't cast SparseDataFrame to DataFrame #19178
Conversation
The problem here is that calling assign on a SparseDataFrame should cast the result to a DataFrame, mainly because SparseDataFrames are a special case.
pandas/core/frame.py (Outdated)
```diff
-        data = self.copy()
+        # See GH19163
+        data = self.copy().to_dense()
```
Could you define assign on SparseDataFrame and only densify if necessary?
and actually you don't want to densify; rather you want to do something like this (in SparseDataFrame):

```python
def assign(self, **kwargs):
    # coerce to a DataFrame
    self = DataFrame(self)
    return self.assign(**kwargs)
```

this actually ends up copying twice though. So the real solution is to move the guts of DataFrame.assign to _assign (and leave the copy part in .assign), then call ._assign in the sparse version.
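A minimal sketch of that split, written as free functions (`_assign` here is the hypothetical helper being proposed, not an existing pandas method, and `SparseDataFrame` is the PR-era class):

```python
import pandas as pd

def _assign(data, **kwargs):
    # the guts of assign: `data` is assumed to already be a fresh copy,
    # so the computed columns can be attached in place
    for name, value in kwargs.items():
        data[name] = value(data) if callable(value) else value
    return data

def dense_assign(df, **kwargs):
    # DataFrame.assign would keep its single copy here...
    return _assign(df.copy(), **kwargs)

def sparse_assign(sdf, **kwargs):
    # ...while the sparse version relies on DataFrame(sdf) already
    # copying during the coercion, avoiding the double copy
    return _assign(pd.DataFrame(sdf), **kwargs)
```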
```diff
@@ -55,6 +55,13 @@ def test_assign(self):
         result = df.assign(A=lambda x: x.A + x.B)
         assert_frame_equal(result, expected)

+        # SparseDataFrame
```
make a separate test
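For instance, something along these lines (a sketch against the PR-era API; the frame contents and expected result are illustrative and assume the densifying behavior of the frame.py diff above):

```python
import pandas as pd
from pandas.util.testing import assert_frame_equal

def test_assign_sparse():
    # same assign pattern as the dense test, but starting sparse
    df = pd.SparseDataFrame({'A': [1, 2], 'B': [3, 4]})
    result = df.assign(C=lambda x: x.A + x.B)
    expected = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [4, 6]})
    assert_frame_equal(result, expected)
```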
doc/source/whatsnew/v0.23.0.txt (Outdated)
```diff
@@ -448,6 +448,7 @@ Reshaping
 - Bug in :func:`cut` which fails when using readonly arrays (:issue:`18773`)
 - Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`)
 - Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`)
+- Bug in :func:`DataFrame.assign` which doesn't cast ``SparseDataFrame`` as ``DataFrame``. (:issue:`19163`)
```
use :class:`DataFrame` and so on here
So I updated the PR locally but... I feel the real problem is inside SparseArray. If you init a SparseArray with False and pass in an index, it will assume the dtype is float64, which coerces False to 0.0. Why does it assume the dtype is float64? Shouldn't it infer it based on infer_dtype_from or something similar? Thanks!
Can you give an example?
Sure @TomAugspurger

```python
In [4]: pd.SparseArray(False, index=[1], fill_value=False)
Out[4]:
[0.0]
Fill: False
IntIndex
Indices: array([], dtype=int32)
```

The line that coerces it is here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/sparse/array.py#L198
Versus

```python
In [5]: pd.SparseArray(False, fill_value=False)
Out[5]:
[False]
Fill: False
IntIndex
Indices: array([], dtype=int32)
```
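After the fix, the indexed constructor should line up with the un-indexed one, since the dtype is inferred from the scalar (expected behavior based on the patch, not captured from a session; PR-era API):

```python
arr = pd.SparseArray(False, index=[1], fill_value=False)
assert arr.dtype == bool   # previously float64, with values like [0.0]
```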
Hello @hexgnu! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 (Comment last updated on February 12, 2018 at 11:40 UTC.)
Codecov Report

```
@@            Coverage Diff             @@
##           master   #19178      +/-   ##
==========================================
+ Coverage   91.58%   91.59%   +<.01%
==========================================
  Files         150      150
  Lines       48807    48806       -1
==========================================
+ Hits        44702    44703       +1
+ Misses       4105     4103       -2
```

Continue to review the full report at Codecov.
Alright, I think I have found the issue. I added some tests and it seems to chooch. Please review and give me any feedback. Thanks @TomAugspurger and @jreback
Thanks for tracking this down @hexgnu, just a couple of clarifying questions. And could you add a simple test that hits this directly? Something that constructs `arr = SparseArray(value, index=[0, 1])` and asserts that `arr.dtype` matches the correct dtype for various values of `value`.
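Something like the following, say (a sketch: the parametrized values and expected dtypes are illustrative, `index=` is the PR-era constructor argument, and `fill_value` mirrors the examples above):

```python
import numpy as np
import pytest
from pandas import SparseArray

@pytest.mark.parametrize('value, expected_dtype', [
    (False, np.bool_),
    (1, np.int64),
    (1.5, np.float64),
])
def test_scalar_with_index_infers_dtype(value, expected_dtype):
    # constructing from a scalar plus an index should take the dtype
    # from the scalar instead of defaulting to float64
    arr = SparseArray(value, index=[0, 1], fill_value=value)
    assert arr.dtype == expected_dtype
```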
doc/source/whatsnew/v0.23.0.txt (Outdated)
```diff
@@ -491,7 +491,7 @@ Groupby/Resample/Rolling
 Sparse
 ^^^^^^

--
+- Bug in :class:`SparseArray` where if a scalar and index are passed in it will coerce to float64 regardless of scalar's dtype. (:issue:`19163`)
```
Could you clarify what "passed in" refers to here? Is it specifically `.assign`? Or any method setting / updating the sparse array?
pandas/core/sparse/array.py (Outdated)
```diff
@@ -195,7 +195,7 @@ def __new__(cls, data, sparse_index=None, index=None, kind='integer',
                 data = np.nan
             if not is_scalar(data):
                 raise Exception("must only pass scalars with an index ")
-            values = np.empty(len(index), dtype='float64')
+            values = np.empty(len(index), dtype=infer_dtype_from(data)[0])
```
Is `infer_dtype_from_scalar` more appropriate here, since we've validated that `data` is a scalar?
Yeah, that's a good point, since infer_dtype_from just checks is_scalar again.
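(That's visible from the helper's shape: `infer_dtype_from` is essentially a dispatcher, paraphrased here from pandas.core.dtypes.cast rather than quoted verbatim.)

```python
from pandas.api.types import is_scalar
from pandas.core.dtypes.cast import (infer_dtype_from_array,
                                     infer_dtype_from_scalar)

def infer_dtype_from(val):
    # roughly what pandas.core.dtypes.cast.infer_dtype_from does:
    # dispatch on is_scalar -- a check the SparseArray constructor
    # has already performed by the time it calls this
    if is_scalar(val):
        return infer_dtype_from_scalar(val)
    return infer_dtype_from_array(val)
```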
@TomAugspurger I did add a test in pandas/tests/sparse/test_array.py that tests this directly instead of indirectly.
@jreback @TomAugspurger ping. This is now green after fixing a linting error. Let me know if you would like me to squash any of these commits.
Looks good. I'll merge later today, in case @jreback has any comments.
pandas/core/sparse/array.py (Outdated)
```diff
@@ -161,7 +161,8 @@ def __new__(cls, data, sparse_index=None, index=None, kind='integer',
                 data = np.nan
             if not is_scalar(data):
                 raise Exception("must only pass scalars with an index ")
-            values = np.empty(len(index), dtype='float64')
+            values = np.empty(len(index),
+                              dtype=infer_dtype_from_scalar(data)[0])
```
we have a routine for this: construct_1d_from_scalar in pandas.core.dtypes.cast
for the empty/fill step
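Spelled out by hand, that empty/fill step is the pattern such a routine would encapsulate (a sketch; the helper name below is illustrative, and the exact name/signature in pandas.core.dtypes.cast may differ):

```python
import numpy as np
from pandas.core.dtypes.cast import infer_dtype_from_scalar

def construct_1d_from_scalar_sketch(value, length):
    # allocate with the dtype inferred from the scalar, then fill --
    # the two steps the SparseArray constructor currently spells out inline
    dtype, value = infer_dtype_from_scalar(value)
    arr = np.empty(length, dtype=dtype)
    arr.fill(value)
    return arr
```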
doc/source/whatsnew/v0.23.0.txt (Outdated)
```diff
@@ -555,6 +555,7 @@ Sparse

 - Bug in which creating a ``SparseDataFrame`` from a dense ``Series`` or an unsupported type raised an uncontrolled exception (:issue:`19374`)
 - Bug in :class:`SparseDataFrame.to_csv` causing exception (:issue:`19384`)
+- Bug in constructing a :class:`SparseArray`: if `data` is a scalar and `index` is defined it will coerce to float64 regardless of scalar's dtype. (:issue:`19163`)
```
Whoops, just noticed we don't have SparseArray in our API docs, so this link won't work. Just ``SparseArray`` until we get that sorted out. Also double ticks around ``data`` and ``index``.
thanks @hexgnu

great job on sparse fixes! keep em coming!