Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle TimestampType separately when convert to pandas' dtype. #798

Merged
merged 2 commits into from
Sep 19, 2019

Conversation

ueshin
Copy link
Collaborator

@ueshin ueshin commented Sep 19, 2019

When the initialization of pandas in pyarrow is not done yet, it can't convert pa.TimestampType to pandas' dtype. In that case, the following example raises an error:

from datetime import datetime
import databricks.koalas as ks

kdf = ks.DataFrame({'t': [datetime(2019, 1, 1, 0, 0, 0), datetime(2019, 1, 2, 0, 0, 0), datetime(2019, 1, 3, 0, 0, 0)]})
kdf[kdf['t'] != kdf['t']]
>>> kdf[kdf['t'] != kdf['t']]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/frame.py", line 6646, in __repr__
    pdf = self.head(max_display_count + 1)._to_internal_pandas()
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/frame.py", line 6639, in _to_internal_pandas
    return self._internal.pandas_df
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/utils.py", line 338, in _lazy_property
    setattr(self, attr_name, fn(self))
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/internal.py", line 638, in pandas_df
    for field in sdf.schema})
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/internal.py", line 638, in <dictcomp>
    for field in sdf.schema})
  File "pyarrow/types.pxi", line 404, in pyarrow.lib.TimestampType.to_pandas_dtype
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 625, in make_datetimetz
    return _pandas_api.datetimetz_type('ns', tz=tz)
TypeError: 'NoneType' object is not callable

We know the dtype should be np.dtype('datetime64[ns]'), so we don't need to rely on pyarrow's implementation.

>>> kdf[kdf['t'] != kdf['t']]
Empty DataFrame
Columns: [t]
Index: []

Resolves #772.

@ueshin
Copy link
Collaborator Author

ueshin commented Sep 19, 2019

FYI: Seems like here is a bug:

https://github.com/apache/arrow/blob/master/python/pyarrow/pandas-shim.pxi#L150-L151

Maybe we need to add self._check_import().

@ueshin
Copy link
Collaborator Author

ueshin commented Sep 19, 2019

also cc @harupy and @itholic

@softagram-bot
Copy link

Softagram Impact Report for pull/798 (head commit: 39117cf)

⭐ Change Overview

Showing the changed files, dependency changes and the impact - click for full size
(Open in Softagram Desktop for full details)

📄 Full report

Impact Report explained. Give feedback on this report to [email protected]

datetime(2019, 1, 3, 0, 0, 0)]})
kdf = ks.from_pandas(pdf)
self.assert_eq(kdf[kdf['t'] != kdf['t']], pdf[pdf['t'] != pdf['t']])
self.assert_eq(kdf[kdf['t'] != kdf['t']].dtypes, pdf[pdf['t'] != pdf['t']].dtypes)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this test doesn't reproduce the error in the description since pyarrow is properly initialized while running other tests.

@codecov-io
Copy link

codecov-io commented Sep 19, 2019

Codecov Report

Merging #798 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #798      +/-   ##
==========================================
+ Coverage   94.28%   94.28%   +<.01%     
==========================================
  Files          32       32              
  Lines        5773     5775       +2     
==========================================
+ Hits         5443     5445       +2     
  Misses        330      330
Impacted Files Coverage Δ
databricks/koalas/typedef.py 83.12% <100%> (+0.43%) ⬆️
databricks/koalas/base.py 94.8% <100%> (-0.07%) ⬇️
databricks/koalas/internal.py 95.85% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 52685f7...39117cf. Read the comment docs.

@HyukjinKwon HyukjinKwon merged commit 1f552bf into databricks:master Sep 19, 2019
@ueshin ueshin deleted the timestamptype branch September 19, 2019 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filtering rows throws error when result is empty
4 participants