[REVIEW] Upgrade pandas to 1.2 #7375

Merged
merged 46 commits on Feb 26, 2021
Changes from 39 commits
Commits
46 commits
155f10f
fix issues with updating to latest pandas
galipremsagar Feb 12, 2021
0ec247e
Merge remote-tracking branch 'upstream/branch-0.19' into 7367
galipremsagar Feb 12, 2021
454ecf5
remove xfails and fix issues
galipremsagar Feb 12, 2021
a1a928d
Merge remote-tracking branch 'upstream/branch-0.19' into 7367
galipremsagar Feb 18, 2021
303c77d
fix isin and misc tests
galipremsagar Feb 22, 2021
18d1fb3
remove redundant code
galipremsagar Feb 22, 2021
b727253
Merge remote-tracking branch 'upstream/branch-0.19' into 7367
galipremsagar Feb 22, 2021
01afece
fix more issues
galipremsagar Feb 22, 2021
691d154
Merge remote-tracking branch 'upstream/branch-0.19' into 7367
galipremsagar Feb 23, 2021
c7c47b5
fix lots of deprecated warnings
galipremsagar Feb 23, 2021
d106b79
Merge remote-tracking branch 'upstream/branch-0.19' into 7367
galipremsagar Feb 23, 2021
aea3313
fix multiple warnings
galipremsagar Feb 23, 2021
9fdbfe7
unpin pandas
galipremsagar Feb 23, 2021
27a782b
cleanup
galipremsagar Feb 23, 2021
3cde2ef
cleanup
galipremsagar Feb 23, 2021
9a3b51a
copyright
galipremsagar Feb 23, 2021
2f8fe18
Merge remote-tracking branch 'upstream/branch-0.19' into 7367
galipremsagar Feb 23, 2021
7a534b0
pin pandas upper bound version
galipremsagar Feb 24, 2021
81d9b5d
use only minor version
galipremsagar Feb 24, 2021
14e8c0e
Merge remote-tracking branch 'upstream/branch-0.19' into 7367
galipremsagar Feb 24, 2021
c5b83a2
use functools for finding union
galipremsagar Feb 24, 2021
5e6855d
add utility for creating a pandas series and refactor imports in test…
galipremsagar Feb 24, 2021
ea61733
remove is_scalar check
galipremsagar Feb 24, 2021
d8ca966
version all pytest xfails
galipremsagar Feb 24, 2021
8d079f0
add check_order flag
galipremsagar Feb 24, 2021
d8ff534
remove version for cudf apis
galipremsagar Feb 24, 2021
a0637b9
make importing cudf uniform in pytests
galipremsagar Feb 24, 2021
b63ae03
refactor imports to be uniform and less confusing
galipremsagar Feb 24, 2021
c3c3e68
remove versioning of cudf api call
galipremsagar Feb 24, 2021
992b483
Update python/cudf/cudf/tests/test_setitem.py
galipremsagar Feb 24, 2021
355e192
remove double validation
galipremsagar Feb 24, 2021
3942cf1
Merge branch '7367' of https://github.com/galipremsagar/cudf into 7367
galipremsagar Feb 24, 2021
8d06667
move datetime / duration isin logic to a common utility
galipremsagar Feb 24, 2021
032378d
Merge remote-tracking branch 'upstream/branch-0.19' into 7367
galipremsagar Feb 25, 2021
dd842f3
add atol
galipremsagar Feb 25, 2021
9fe44cd
rename internal api
galipremsagar Feb 25, 2021
da1a3a3
fix categorical setitem and allow np.nan into categories
galipremsagar Feb 26, 2021
e70686f
add nan setitem test
galipremsagar Feb 26, 2021
39ba07a
make null checks and to_pandas code flow more efficient
galipremsagar Feb 26, 2021
2cc496d
fix repr
galipremsagar Feb 26, 2021
0bd3bba
fix typo
galipremsagar Feb 26, 2021
3d44f5f
fix typo
galipremsagar Feb 26, 2021
c1c2d96
update index code
galipremsagar Feb 26, 2021
19ae2f6
Merge remote-tracking branch 'upstream/branch-0.19' into 7367
galipremsagar Feb 26, 2021
ae1b8c6
add packaging conda install
galipremsagar Feb 26, 2021
416bc92
Merge branch 'branch-0.19' into 7367
galipremsagar Feb 26, 2021
2 changes: 1 addition & 1 deletion conda/environments/cudf_dev_cuda10.1.yml
@@ -17,7 +17,7 @@ dependencies:
- python>=3.6,<3.8
- numba>=0.49.0,!=0.51.0
- numpy
- pandas>=1.0,<1.2.0dev0
- pandas>=1.0,<1.3.0dev0
- pyarrow=1.0.1
- fastavro>=0.22.9
- notebook>=0.5.0
2 changes: 1 addition & 1 deletion conda/environments/cudf_dev_cuda10.2.yml
@@ -17,7 +17,7 @@ dependencies:
- python>=3.6,<3.8
- numba>=0.49,!=0.51.0
- numpy
- pandas>=1.0,<1.2.0dev0
- pandas>=1.0,<1.3.0dev0
- pyarrow=1.0.1
- fastavro>=0.22.9
- notebook>=0.5.0
2 changes: 1 addition & 1 deletion conda/environments/cudf_dev_cuda11.0.yml
@@ -17,7 +17,7 @@ dependencies:
- python>=3.6,<3.8
- numba>=0.49,!=0.51.0
- numpy
- pandas>=1.0,<1.2.0dev0
- pandas>=1.0,<1.3.0dev0
- pyarrow=1.0.1
- fastavro>=0.22.9
- notebook>=0.5.0
4 changes: 2 additions & 2 deletions conda/recipes/cudf/meta.yaml
@@ -1,4 +1,4 @@
# Copyright (c) 2018, NVIDIA CORPORATION.
# Copyright (c) 2018-2021, NVIDIA CORPORATION.

{% set version = environ.get('GIT_DESCRIBE_TAG', '0.0.0.dev').lstrip('v') + environ.get('VERSION_SUFFIX', '') %}
{% set minor_version = version.split('.')[0] + '.' + version.split('.')[1] %}
@@ -35,7 +35,7 @@ requirements:
- protobuf
- python
- typing_extensions
- pandas >=1.0,<1.2.0dev0
- pandas >=1.0,<1.3.0dev0
- cupy >7.1.0,<9.0.0a0
- numba >=0.49.0
- numpy
3 changes: 2 additions & 1 deletion python/cudf/cudf/core/_compat.py
@@ -1,8 +1,9 @@
# Copyright (c) 2020, NVIDIA CORPORATION.
# Copyright (c) 2020-2021, NVIDIA CORPORATION.

import pandas as pd
from packaging import version

PANDAS_VERSION = version.parse(pd.__version__)
PANDAS_GE_100 = PANDAS_VERSION >= version.parse("1.0")
PANDAS_GE_110 = PANDAS_VERSION >= version.parse("1.1")
PANDAS_GE_120 = PANDAS_VERSION >= version.parse("1.2")
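For context, `packaging` orders pre-release tags below their final release, which is why the conda pins above use an upper bound like `<1.3.0dev0` rather than `<1.3.0`. A minimal sketch (the `"1.2.0"` literal stands in for `pd.__version__`):

```python
from packaging import version

# Stand-in for pd.__version__; cudf compares the parsed version against
# minor-release markers to gate behavior per pandas version.
PANDAS_VERSION = version.parse("1.2.0")
PANDAS_GE_120 = PANDAS_VERSION >= version.parse("1.2")
print(PANDAS_GE_120)  # True

# Dev builds sort below the final release, so `<1.3.0dev0` excludes
# 1.3 development snapshots as well as 1.3.0 itself.
print(version.parse("1.3.0dev0") < version.parse("1.3.0"))  # True
```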
87 changes: 83 additions & 4 deletions python/cudf/cudf/core/column/categorical.py
@@ -9,6 +9,7 @@
Dict,
Mapping,
Optional,
Sequence,
Tuple,
Union,
cast,
@@ -867,6 +868,15 @@ def set_base_data(self, value):
else:
super().set_base_data(value)

def _process_values_for_isin(
self, values: Sequence
) -> Tuple[ColumnBase, ColumnBase]:
lhs = self
# We need to convert values to same type as self,
# hence passing dtype=self.dtype
rhs = cudf.core.column.as_column(values, dtype=self.dtype)
return lhs, rhs

def set_base_mask(self, value: Optional[Buffer]):
super().set_base_mask(value)
self._codes = None
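The categorical override coerces `values` to the column's own dtype before matching. The intended semantics mirror pandas, where `isin` matches against the categories and anything unrepresentable is simply `False` — a small illustrative sketch using pandas itself:

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"]))
# "z" is not a category, so it can never match; no error is raised.
print(s.isin(["a", "z"]).tolist())  # [True, False, True]
```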
@@ -936,6 +946,21 @@ def unary_operator(self, unaryop: str):
)

def __setitem__(self, key, value):
if cudf.utils.dtypes.is_scalar(
value
) and cudf._lib.scalar._is_null_host_scalar(value):
to_add_categories = 0
else:
to_add_categories = len(
cudf.Index(value).difference(self.categories)
)

if to_add_categories > 0:
raise ValueError(
"Cannot setitem on a Categorical with a new "
"category, set the categories first"
)

if cudf.utils.dtypes.is_scalar(value):
value = self._encode(value) if value is not None else value
else:
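The guard added above mirrors pandas semantics: assigning a value that is not an existing category fails, while assigning a null is always allowed. A hedged sketch of the equivalent pandas behavior (pandas 1.x raises `ValueError`; newer versions raise `TypeError`, hence the broad except):

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"]))
try:
    s[0] = "c"  # "c" is not an existing category
except (ValueError, TypeError) as exc:
    print(f"rejected: {type(exc).__name__}")

s[0] = None  # nulls are fine: they become missing values, not categories
print(pd.isna(s[0]))  # True
```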
@@ -1046,11 +1071,27 @@ def __cuda_array_interface__(self) -> Mapping[str, Any]:
def to_pandas(
self, index: ColumnLike = None, nullable: bool = False, **kwargs
) -> pd.Series:
signed_dtype = min_signed_type(len(self.categories))
codes = self.cat().codes.astype(signed_dtype).fillna(-1).to_array()
categories = self.categories.to_pandas()

if self.categories.dtype.kind == "f":
new_mask = bools_to_mask(self.notnull())
col = column.build_categorical_column(
categories=self.dtype.categories._values,
codes=column.as_column(
self.codes.base_data, dtype=self.codes.dtype
),
mask=new_mask,
ordered=self.dtype.ordered,
offset=self.offset,
size=self.size,
)
else:
col = self

signed_dtype = min_signed_type(len(col.categories))
codes = col.cat().codes.astype(signed_dtype).fillna(-1).to_array()
categories = col.categories.dropna(drop_nan=True).to_pandas()
data = pd.Categorical.from_codes(
codes, categories=categories, ordered=self.ordered
codes, categories=categories, ordered=col.ordered
)
return pd.Series(data, index=index)
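The path above ultimately builds a `pd.Categorical` from signed codes in which `-1` marks a null. A small sketch of that mechanism; `min_signed_type` here is a hypothetical stand-in for cudf's utility of the same name:

```python
import numpy as np
import pandas as pd

def min_signed_type(n):
    # Hypothetical stand-in: smallest signed integer dtype that holds n codes.
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        if np.iinfo(dtype).max >= n:
            return dtype
    raise ValueError("too many categories for int64 codes")

codes = np.array([0, 1, -1], dtype=min_signed_type(2))  # -1 encodes a null
cat = pd.Categorical.from_codes(codes, categories=["a", "b"])
print(cat.isna().tolist())  # [False, False, True]
```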

@@ -1180,6 +1221,38 @@ def find_and_replace(
ordered=self.dtype.ordered,
)

def isnull(self) -> ColumnBase:
"""
Identify missing values in a CategoricalColumn.
"""
result = libcudf.unary.is_null(self)

if self.categories.dtype.kind == "f":
# Need to consider `np.nan` values in case
# of an underlying float column
categories = libcudf.unary.is_nan(self.categories)
if categories.any():
code = self._encode(np.nan)
result = result | (self.codes == cudf.Scalar(code))

return result

def notnull(self) -> ColumnBase:
"""
Identify non-missing values in a CategoricalColumn.
"""
result = libcudf.unary.is_valid(self)

if self.categories.dtype.kind == "f":
# Need to consider `np.nan` values in case
# of an underlying float column
categories = libcudf.unary.is_nan(self.categories)
if categories.any():
code = self._encode(np.nan)
result = result & (self.codes != cudf.Scalar(code))

return result
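The float-category special case exists because pandas never stores `NaN` as a category — it is always a missing value — so a cudf category equal to `np.nan` must be reported as null for round-trips to agree. For instance, in pandas:

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1.0, np.nan, 2.0])
print(list(cat.categories))  # [1.0, 2.0] -- NaN is never a category
print(cat.isna().tolist())   # [False, True, False]
```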

def fillna(
self, fill_value: Any = None, method: Any = None, dtype: Dtype = None
) -> CategoricalColumn:
@@ -1204,6 +1277,12 @@ def fillna(
raise ValueError(err_msg) from err
else:
fill_value = column.as_column(fill_value, nan_as_null=False)
if isinstance(fill_value, CategoricalColumn):
if self.dtype != fill_value.dtype:
raise ValueError(
"Cannot set a Categorical with another, "
"without identical categories"
)
# TODO: only required if fill_value has a subset of the
# categories:
fill_value = fill_value.cat()._set_categories(
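The new dtype check matches pandas, where filling a categorical with a value outside its categories is rejected. A hedged pandas sketch (the exception type varies across pandas versions):

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", None, "b"]))
print(s.fillna("a").tolist())  # ['a', 'a', 'b']

try:
    s.fillna("z")  # "z" is not among the categories
except (ValueError, TypeError) as exc:
    print(f"rejected: {type(exc).__name__}")
```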
95 changes: 54 additions & 41 deletions python/cudf/cudf/core/column/column.py
@@ -1,4 +1,5 @@
# Copyright (c) 2018-2021, NVIDIA CORPORATION.

from __future__ import annotations

import builtins
@@ -49,12 +50,12 @@
get_time_unit,
is_categorical_dtype,
is_decimal_dtype,
is_interval_dtype,
is_list_dtype,
is_numerical_dtype,
is_scalar,
is_string_dtype,
is_struct_dtype,
is_interval_dtype,
min_signed_type,
min_unsigned_type,
np_to_pa_dtype,
@@ -848,55 +849,65 @@ def isin(self, values: Sequence) -> ColumnBase:
-------
result: Column
Column of booleans indicating if each element is in values.

Raises
------
TypeError
    If values is a string
"""
if is_scalar(values):
raise TypeError(
"only list-like objects are allowed to be passed "
f"to isin(), you passed a [{type(values).__name__}]"
)

lhs = self
rhs = None

try:
# We need to convert values to same type as self,
# hence passing dtype=self.dtype
rhs = as_column(values, dtype=self.dtype)

# Short-circuit if rhs is all null.
if lhs.null_count == 0 and (rhs.null_count == len(rhs)):
return full(len(self), False, dtype="bool")
lhs, rhs = self._process_values_for_isin(values)
res = lhs._isin_earlystop(rhs)
if res is not None:
return res
except ValueError:
# pandas functionally returns all False when cleansing via
# typecasting fails
return full(len(self), False, dtype="bool")

# If categorical, combine categories first
if is_categorical_dtype(lhs):
lhs_cats = lhs.cat().categories._values
rhs_cats = rhs.cat().categories._values

if not np.issubdtype(rhs_cats.dtype, lhs_cats.dtype):
# If they're not the same dtype, short-circuit if the values
# list doesn't have any nulls. If it does have nulls, make
# the values list a Categorical with a single null
if not rhs.has_nulls:
return full(len(self), False, dtype="bool")
rhs = as_column(pd.Categorical.from_codes([-1], categories=[]))
rhs = rhs.cat().set_categories(lhs_cats).astype(self.dtype)

ldf = cudf.DataFrame({"x": lhs, "orig_order": arange(len(lhs))})
res = lhs._obtain_isin_result(rhs)

return res

def _process_values_for_isin(
self, values: Sequence
) -> Tuple[ColumnBase, ColumnBase]:
"""
Helper function for `isin` that pre-processes `values` based on `self`.
"""
lhs = self
rhs = as_column(values, nan_as_null=False)
if lhs.null_count == len(lhs):
lhs = lhs.astype(rhs.dtype)
elif rhs.null_count == len(rhs):
rhs = rhs.astype(lhs.dtype)
return lhs, rhs

def _isin_earlystop(self, rhs: ColumnBase) -> Union[ColumnBase, None]:
"""
Helper function for `isin` that checks whether an early-stop
result can be returned without performing the merge.
"""
if self.dtype != rhs.dtype:
if self.null_count and rhs.null_count:
return self.isna()
else:
return cudf.core.column.full(len(self), False, dtype="bool")
elif self.null_count == 0 and (rhs.null_count == len(rhs)):
return cudf.core.column.full(len(self), False, dtype="bool")
else:
return None
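In plain terms: with mismatched dtypes only nulls can possibly match, and an all-null `rhs` against a null-free `lhs` can never match. A pure-Python sketch of that decision table (names and return markers are illustrative, not cudf API):

```python
def isin_earlystop(same_dtype, lhs_nulls, rhs_nulls, rhs_len):
    """Return a shortcut answer, or None to fall through to the merge."""
    if not same_dtype:
        # Across incompatible dtypes only null == null can match.
        return "lhs.isna()" if lhs_nulls and rhs_nulls else "all-False"
    if lhs_nulls == 0 and rhs_nulls == rhs_len:
        return "all-False"  # rhs is entirely null, lhs has no nulls
    return None

print(isin_earlystop(True, 0, 3, 3))   # all-False
print(isin_earlystop(False, 2, 1, 4))  # lhs.isna()
```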

def _obtain_isin_result(self, rhs: ColumnBase) -> ColumnBase:
"""
Helper function for `isin` that merges `self` and `rhs`
to determine which values of `rhs` exist in `self`.
"""
ldf = cudf.DataFrame({"x": self, "orig_order": arange(len(self))})
rdf = cudf.DataFrame(
{"x": rhs, "bool": full(len(rhs), True, dtype="bool")}
)
res = ldf.merge(rdf, on="x", how="left").sort_values(by="orig_order")
res = res.drop_duplicates(subset="orig_order", ignore_index=True)
res = res._data["bool"].fillna(False)

return res
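The merge trick is easy to demonstrate with pandas itself: left-join the column against the probe values so rows that found a partner get `True`, use `orig_order` to restore input order, and collapse multiple matches per row with `drop_duplicates`:

```python
import pandas as pd

lhs = pd.DataFrame({"x": [3, 1, 2, 1], "orig_order": range(4)})
rhs = pd.DataFrame({"x": [1, 9], "bool": True})

res = lhs.merge(rhs, on="x", how="left").sort_values(by="orig_order")
res = res.drop_duplicates(subset="orig_order", ignore_index=True)
print(res["bool"].fillna(False).tolist())  # [False, True, False, True]
```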

def as_mask(self) -> Buffer:
@@ -1052,14 +1063,14 @@ def as_categorical_column(self, dtype, **kwargs) -> ColumnBase:

# columns include null index in factorization; remove:
if self.has_nulls:
cats = cats.dropna()
cats = cats._column.dropna(drop_nan=False)
min_type = min_unsigned_type(len(cats), 8)
labels = labels - 1
if np.dtype(min_type).itemsize < labels.dtype.itemsize:
labels = labels.astype(min_type)

return build_categorical_column(
categories=cats._column,
categories=cats,
codes=labels._column,
mask=self.mask,
ordered=ordered,
@@ -1250,7 +1261,7 @@ def sum(
def product(
self, skipna: bool = None, dtype: Dtype = None, min_count: int = 0
):
raise TypeError(f"cannot perform prod with type {self.dtype}")
raise TypeError(f"cannot perform product with type {self.dtype}")

def mean(self, skipna: bool = None, dtype: Dtype = None):
raise TypeError(f"cannot perform mean with type {self.dtype}")
@@ -1262,7 +1273,7 @@ def var(self, skipna: bool = None, ddof=1, dtype: Dtype = np.float64):
raise TypeError(f"cannot perform var with type {self.dtype}")

def kurtosis(self, skipna: bool = None):
raise TypeError(f"cannot perform kurt with type {self.dtype}")
raise TypeError(f"cannot perform kurtosis with type {self.dtype}")

def skew(self, skipna: bool = None):
raise TypeError(f"cannot perform skew with type {self.dtype}")
@@ -2066,9 +2077,11 @@ def _construct_array(
arbitrary = cupy.asarray(arbitrary, dtype=dtype)
except (TypeError, ValueError):
native_dtype = dtype
if dtype is None and pd.api.types.infer_dtype(arbitrary) in (
"mixed",
"mixed-integer",
if (
dtype is None
and not cudf._lib.scalar._is_null_host_scalar(arbitrary)
and pd.api.types.infer_dtype(arbitrary)
in ("mixed", "mixed-integer",)
):
native_dtype = "object"
arbitrary = np.asarray(
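The `infer_dtype` check above decides when cudf must fall back to `object` dtype; a quick look at what it reports for a few inputs:

```python
import pandas as pd

# Mixed Python objects trigger the object-dtype fallback:
print(pd.api.types.infer_dtype([1, "a"]))    # 'mixed-integer'
# Homogeneous inputs infer a concrete kind and skip the fallback:
print(pd.api.types.infer_dtype(["a", "b"]))  # 'string'
print(pd.api.types.infer_dtype([1, 2]))      # 'integer'
```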