API: concat on sparse values #25719

Merged 1 commit on Mar 19, 2019
36 changes: 36 additions & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -64,6 +64,42 @@ is respected in indexing. (:issue:`24076`, :issue:`16785`)
df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))
df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']

Concatenating Sparse Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Contributor:

add a ref

Member:

I have seen you make this review comment before, but IMO there is not much added value in asking for this. It doesn't change anything in the built documentation, and it is only needed if you link to the section from somewhere else in our docs (and in this case, I think it is rather unlikely that we will ever link to it).

Contributor:

Disagree. As a matter of course, if we have subsections we should have links. It's not much of a burden and it promotes consistency.

Member:

It's just that you are asking for something that is basically useless (and we often already ask a lot of contributors across many review rounds), except for the consistency argument you give. IMO that's not worth the overhead of always adding it, but OK, we can disagree on that.

> as a matter of course if we have subsections we should have links

Note that this does not introduce links. Anchors that let people link to that section in the HTML docs are generated automatically by Sphinx; you don't need this label for that.

Contributor:

Again, disagree. Consistency is SO important in the pandas docs / code. If it can be automated (either the actual code / docs or a check for it), great; otherwise the point is NOT to special-case the world. Having to make that judgement call on each PR would create a big inconsistency across PRs and maintainers.

Member:

> having to make that judgement call on each PR would make a big inconsistency across PR's & maintainers.

The thing is that you don't need to make a judgement call here: a label simply isn't needed for newly added sections. You only need to add one when an internal link elsewhere in the docs points to the section.

Contributor:

you are still missing the point

pls add the ref here - arguing about reviews is a waste of time


When passed DataFrames whose values are sparse, :func:`concat` will now return a
Series or DataFrame with sparse values, rather than a ``SparseDataFrame`` (:issue:`25702`).

.. ipython:: python

    df = pd.DataFrame({"A": pd.SparseArray([0, 1])})

*Previous Behavior:*

.. code-block:: ipython

    In [2]: type(pd.concat([df, df]))
    pandas.core.sparse.frame.SparseDataFrame

*New Behavior:*

.. ipython:: python

    type(pd.concat([df, df]))


This now matches the existing behavior of :func:`concat` on ``Series`` with sparse values.
:func:`concat` will continue to return a ``SparseDataFrame`` when all the values
are instances of ``SparseDataFrame``.

This change also affects routines using :func:`concat` internally, like :func:`get_dummies`,
which now returns a :class:`DataFrame` in all cases (previously a ``SparseDataFrame`` was
returned if all the columns were dummy encoded, and a :class:`DataFrame` otherwise).
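
For illustration, a minimal sketch of the new ``get_dummies`` behaviour (assuming pandas 0.25.x; the dtypes mirror the ``SparseDtype('uint8', 0)`` used in the test added below):

    import pandas as pd

    df = pd.DataFrame({"A": [1, 2]})
    dummies = pd.get_dummies(df, columns=["A"], sparse=True)

    # Every column is dummy-encoded, yet the result is a plain DataFrame whose
    # columns hold sparse values rather than a SparseDataFrame.
    print(type(dummies))    # <class 'pandas.core.frame.DataFrame'>
    print(dummies.dtypes)   # A_1, A_2 -> Sparse[uint8, 0]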

Providing any ``SparseSeries`` or ``SparseDataFrame`` to :func:`concat` will
cause a ``SparseSeries`` or ``SparseDataFrame`` to be returned, as before.
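
As a minimal sketch of both cases described above (assuming pandas 0.25.x, where the ``SparseDataFrame`` subclass still exists):

    import pandas as pd

    # A regular DataFrame holding sparse values: concat keeps the sparse dtype
    # but returns a plain DataFrame.
    df = pd.DataFrame({"A": pd.SparseArray([0, 1])})
    print(type(pd.concat([df, df])))        # pandas.core.frame.DataFrame
    print(pd.concat([df, df]).dtypes["A"])  # Sparse[int64, 0]

    # Actual SparseDataFrame inputs still come back as a SparseDataFrame,
    # as before (construction may emit a FutureWarning in 0.25.0, where the
    # subclass is deprecated).
    sdf = pd.SparseDataFrame({"A": [0, 1]})
    print(type(pd.concat([sdf, sdf])))      # pandas.core.sparse.frame.SparseDataFrame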


.. _whatsnew_0250.api_breaking.deps:

Increased minimum versions for dependencies
3 changes: 1 addition & 2 deletions pandas/core/dtypes/concat.py
@@ -89,8 +89,7 @@ def _get_frame_result_type(result, objs):
"""

     if (result.blocks and (
-            all(is_sparse(b) for b in result.blocks) or
-            all(isinstance(obj, ABCSparseDataFrame) for obj in objs))):
+            any(isinstance(obj, ABCSparseDataFrame) for obj in objs))):
         from pandas.core.sparse.api import SparseDataFrame
         return SparseDataFrame
     else:
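
A rough sketch of what the ``all`` to ``any`` change means for mixed inputs (hypothetical example, assuming pandas 0.25.x):

    import pandas as pd

    sparse_df = pd.SparseDataFrame({"A": [0, 1]})
    dense_df = pd.DataFrame({"A": [2, 3]})

    # With the `any` check, one SparseDataFrame among the inputs is enough for
    # concat to hand back a SparseDataFrame, matching the whatsnew note above.
    print(type(pd.concat([sparse_df, dense_df])))   # expected: SparseDataFrame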
8 changes: 8 additions & 0 deletions pandas/core/groupby/generic.py
@@ -40,6 +40,7 @@
 import pandas.core.indexes.base as ibase
 from pandas.core.internals import BlockManager, make_block
 from pandas.core.series import Series
+from pandas.core.sparse.frame import SparseDataFrame
 
 from pandas.plotting._core import boxplot_frame_groupby

@@ -198,9 +199,16 @@ def aggregate(self, arg, *args, **kwargs):
                 assert not args and not kwargs
                 result = self._aggregate_multiple_funcs(
                     [arg], _level=_level, _axis=self.axis)
+
                 result.columns = Index(
                     result.columns.levels[0],
                     name=self._selected_obj.columns.name)
+
+                if isinstance(self.obj, SparseDataFrame):
+                    # Backwards compat for groupby.agg() with sparse
+                    # values. concat no longer converts DataFrame[Sparse]
+                    # to SparseDataFrame, so we do it here.
+                    result = SparseDataFrame(result._data)
             except Exception:
                 result = self._aggregate_generic(arg, *args, **kwargs)
 
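
For context, a hedged sketch of the conversion this shim performs: ``concat`` now hands back a plain DataFrame with sparse values, and the compat path rewraps its internal BlockManager (``result._data``) into a ``SparseDataFrame`` (illustrative only, assuming pandas 0.25.x):

    import pandas as pd

    # concat on sparse values now yields a regular DataFrame with sparse columns...
    part = pd.DataFrame({"A": pd.SparseArray([0, 1])})
    combined = pd.concat([part, part])
    print(type(combined))                        # pandas.core.frame.DataFrame

    # ...so the groupby.agg() compat code rewraps the BlockManager, exactly as
    # `SparseDataFrame(result._data)` does in the diff above.
    rewrapped = pd.SparseDataFrame(combined._data)
    print(type(rewrapped))                       # pandas.core.sparse.frame.SparseDataFrame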
10 changes: 10 additions & 0 deletions pandas/tests/reshape/test_reshape.py
@@ -577,6 +577,16 @@ def test_get_dummies_duplicate_columns(self, df):

         tm.assert_frame_equal(result, expected)
 
+    def test_get_dummies_all_sparse(self):
+        df = pd.DataFrame({"A": [1, 2]})
+        result = pd.get_dummies(df, columns=['A'], sparse=True)
+        dtype = SparseDtype('uint8', 0)
+        expected = pd.DataFrame({
+            'A_1': SparseArray([1, 0], dtype=dtype),
+            'A_2': SparseArray([0, 1], dtype=dtype),
+        })
+        tm.assert_frame_equal(result, expected)
+
 
 class TestCategoricalReshape(object):
 