ENH: Integer NA Extension Array #21160

Merged
merged 23 commits into from
Jul 20, 2018
58 changes: 58 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -13,6 +13,7 @@ v0.24.0 (Month XX, 2018)
New features
~~~~~~~~~~~~


- ``ExcelWriter`` now accepts ``mode`` as a keyword argument, enabling append to existing workbooks when using the ``openpyxl`` engine (:issue:`3441`)

.. _whatsnew_0240.enhancements.extension_array_operators:
@@ -31,6 +32,62 @@ See the :ref:`ExtensionArray Operator Support
<extending.extension.operator>` documentation section for details on both
ways of adding operator support.

.. _whatsnew_0240.enhancements.intna:

Optional Integer NA Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pandas has gained the ability to hold integer dtypes with missing values. This long-requested feature is enabled through the use of :ref:`extension types <extending.extension-types>`.
Here is an example of the usage.

We can construct a ``Series`` with the specified dtype. The dtype string ``Int64`` is a pandas ``ExtensionDtype``. A list or array that uses the traditional missing value
marker ``np.nan`` will be inferred as integer dtype. The display of the ``Series`` will also use ``NaN`` to indicate missing values in string outputs. (:issue:`20700`, :issue:`20747`)

.. ipython:: python

s = pd.Series([1, 2, np.nan], dtype='Int64')
s


Operations on these dtypes will propagate ``NaN`` as other pandas operations do.

.. ipython:: python

# arithmetic
s + 1

# comparison
s == 1

# indexing
s.iloc[1:3]

# operate with other dtypes
s + s.iloc[1:3].astype('Int8')

# coerce when needed
s + 0.01

These dtypes can operate as part of of ``DataFrame``.
Review comment (Member): "of of " -> "of a"


.. ipython:: python

df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
df
df.dtypes


These dtypes can be merged, reshaped, and cast.

.. ipython:: python

pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
df['A'].astype(float)

.. warning::

The Integer NA support currently uses the captilized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.
Review comment (Member): captilized -> capitalized?
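
The behaviour described in this whatsnew section can be checked with a short script. The following is a minimal sketch, assuming a pandas release that ships the nullable integer dtype (0.24 or later). The variable names are illustrative, and the printed missing-value marker differs between releases, so the sketch inspects missing values with `isna()` rather than relying on the display.

```python
import numpy as np
import pandas as pd

# Build a nullable integer Series; np.nan marks the missing entry.
s = pd.Series([1, 2, np.nan], dtype="Int64")

# Arithmetic keeps the integer dtype and propagates the missing value.
plus_one = s + 1

# Adding a float coerces the result to a floating dtype.
as_float = s + 0.01

# The dtype survives being placed in a DataFrame and concatenated.
df = pd.DataFrame({"A": s, "B": [1, 1, 3], "C": list("aab")})
combined = pd.concat([df[["A"]], df[["B", "C"]]], axis=1)

print(s.dtype, plus_one.dtype, as_float.dtype)
print(s.isna().tolist())
print(combined.dtypes)
```

Only column `A` carries the nullable dtype here; `B` and `C` keep whatever dtypes pandas infers for them.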


.. _whatsnew_0240.enhancements.read_html:

``read_html`` Enhancements
@@ -256,6 +313,7 @@ Previous Behavior:
ExtensionType Changes
^^^^^^^^^^^^^^^^^^^^^

- ``ExtensionArray`` has gained the abstract method ``.dropna()`` (:issue:`21185`)
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instantiate a registered ``DecimalDtype``; furthermore
the ``ExtensionDtype`` has gained the method ``construct_array_type`` (:issue:`21185`)
- The ``ExtensionArray`` constructor, ``_from_sequence``, now takes the keyword arg ``copy=False`` (:issue:`21185`)
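
The second bullet above says a dtype can be instantiated from its string name and can report its array class. A minimal sketch of that lookup follows, assuming a pandas release in which `Int64` is a registered extension dtype and `pd.Int64Dtype` is public; both are assumptions about the installed version, not something this diff shows.

```python
import pandas as pd
from pandas.api.types import pandas_dtype

# Resolve the string alias to a dtype instance via the public helper.
dtype = pandas_dtype("Int64")

# Ask the dtype which array class stores its values.
array_type = dtype.construct_array_type()

print(type(dtype).__name__)
print(array_type.__name__)
```

The same string can be passed as `dtype="Int64"` to `pd.Series`, which is how the whatsnew example earlier in this diff uses it.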
3 changes: 3 additions & 0 deletions pandas/core/arrays/__init__.py
@@ -1,7 +1,10 @@
from .base import (ExtensionArray, # noqa
ExtensionOpsMixin,
ExtensionScalarOpsMixin)
from .categorical import Categorical # noqa
from .datetimes import DatetimeArrayMixin # noqa
from .interval import IntervalArray # noqa
from .period import PeriodArrayMixin # noqa
from .timedeltas import TimedeltaArrayMixin # noqa
from .integer import ( # noqa
IntegerArray, to_integer_array)
Review comment (Member): What was the goal of exposing to_integer_array here? Is it used somewhere else?

15 changes: 9 additions & 6 deletions pandas/core/arrays/base.py
@@ -12,8 +12,8 @@
from pandas.errors import AbstractMethodError
from pandas.compat.numpy import function as nv
from pandas.compat import set_function_name, PY3
- from pandas.core.dtypes.common import is_list_like
from pandas.core import ops
+ from pandas.core.dtypes.common import is_list_like

_not_implemented_message = "{} does not implement {}."

@@ -88,16 +88,19 @@ class ExtensionArray(object):
# Constructors
# ------------------------------------------------------------------------
@classmethod
- def _from_sequence(cls, scalars, copy=False):
+ def _from_sequence(cls, scalars, dtype=None, copy=False):
"""Construct a new ExtensionArray from a sequence of scalars.

Parameters
----------
scalars : Sequence
Each element will be an instance of the scalar type for this
array, ``cls.dtype.type``.
+ dtype : dtype, optional
+ Construct for this particular dtype. This should be a Dtype
+ compatible with the ExtensionArray.
copy : boolean, default False
- if True, copy the underlying data
+ If True, copy the underlying data.
Returns
-------
ExtensionArray
@@ -378,7 +381,7 @@ def fillna(self, value=None, method=None, limit=None):
func = pad_1d if method == 'pad' else backfill_1d
new_values = func(self.astype(object), limit=limit,
mask=mask)
- new_values = self._from_sequence(new_values)
+ new_values = self._from_sequence(new_values, dtype=self.dtype)
else:
# fill with value
new_values = self.copy()
@@ -407,7 +410,7 @@ def unique(self):
from pandas import unique

uniques = unique(self.astype(object))
- return self._from_sequence(uniques)
+ return self._from_sequence(uniques, dtype=self.dtype)

def _values_for_factorize(self):
# type: () -> Tuple[ndarray, Any]
@@ -559,7 +562,7 @@ def take(self, indices, allow_fill=False, fill_value=None):

result = take(data, indices, fill_value=fill_value,
allow_fill=allow_fill)
- return self._from_sequence(result)
+ return self._from_sequence(result, dtype=self.dtype)
"""
# Implementer note: The `fill_value` parameter should be a user-facing
# value, an instance of self.dtype.type. When passed `fill_value=None`,
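
The recurring change in this file forwards `dtype=self.dtype` whenever an array rebuilds itself through `_from_sequence`. The sketch below illustrates one plausible reason this matters, using a toy class written only for this note: `TinyIntArray`, its string dtypes, and `unique_without_dtype` are hypothetical and are not part of pandas. It assumes a constructor that falls back to the widest integer type when no dtype is given.

```python
class TinyIntArray:
    """Toy stand-in for an ExtensionArray; NOT the pandas implementation."""

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype  # plain strings such as "Int8" or "Int64"

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        # `copy` is accepted for signature parity; list() always copies here.
        scalars = list(scalars)
        # Without an explicit dtype the constructor has to guess; this toy
        # always guesses the widest type.
        if dtype is None:
            dtype = "Int64"
        return cls(scalars, dtype)

    def unique_without_dtype(self):
        # Old pattern: the result dtype is re-inferred from the scalars.
        return self._from_sequence(dict.fromkeys(self.values))

    def unique(self):
        # New pattern from the diff: forward self.dtype explicitly.
        return self._from_sequence(dict.fromkeys(self.values),
                                   dtype=self.dtype)


arr = TinyIntArray([1, 1, 2], dtype="Int8")
print(arr.unique_without_dtype().dtype)
print(arr.unique().dtype)
```

With the dtype forwarded, an `Int8` array stays `Int8` after `unique()`; without it, the toy constructor widens the result.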
4 changes: 2 additions & 2 deletions pandas/core/arrays/categorical.py
@@ -488,8 +488,8 @@ def _constructor(self):
return Categorical

@classmethod
- def _from_sequence(cls, scalars):
- return Categorical(scalars)
+ def _from_sequence(cls, scalars, dtype=None, copy=False):
+ return Categorical(scalars, dtype=dtype)

def copy(self):
""" Copy constructor. """
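
For `Categorical`, passing `dtype` through to the constructor lets a rebuilt array keep its declared categories and ordering instead of re-inferring them from the values. A minimal sketch using only the public constructor follows; the category names are made up for illustration, and the private `_from_sequence` hook is deliberately not called because its exact signature may differ between pandas versions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# An ordered dtype whose categories include a value absent from the data.
dtype = CategoricalDtype(categories=["low", "mid", "high"], ordered=True)

# Without dtype, categories are inferred from the values that are present.
inferred = pd.Categorical(["high", "low"])

# With dtype, the declared categories and ordering are preserved.
explicit = pd.Categorical(["high", "low"], dtype=dtype)

print(list(inferred.categories), inferred.ordered)
print(list(explicit.categories), explicit.ordered)
```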