Add a basic support for the categorical type. #2064

Merged: ueshin merged 23 commits into databricks:master on Mar 13, 2021

Collaborator

@ueshin ueshin commented Feb 25, 2021

Experimental.

Add basic support for the categorical type.

>>> s = ks.Series(list("abbccc"), dtype="category")
>>> s
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')
>>> s.cat.codes
0    0
1    1
2    1
3    2
4    2
5    2
dtype: int8

>>> idx = ks.CategoricalIndex(list("abbccc"))
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'],
                 categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.codes
Int64Index([0, 1, 1, 2, 2, 2], dtype='int64')
>>> idx.categories
Index(['a', 'b', 'c'], dtype='object')

Currently, only type conversion to and from the categorical type with astype is supported; other operations are not supported yet.
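
A minimal sketch of the astype round trip, written against plain pandas as a stand-in because Koalas mirrors the pandas API. Swapping `pd` for `ks` is an assumption based on the examples above and has not been verified here against a Spark session.

```python
import pandas as pd

# Plain pandas stands in for Koalas here (assumption: the Koalas API matches).
s = pd.Series(list("abbccc"))

# Convert to the categorical type.
cat = s.astype("category")
dtype_name = str(cat.dtype)                  # name of the resulting dtype
codes = cat.cat.codes.tolist()               # integer code for each row
categories = cat.cat.categories.tolist()     # distinct values, sorted

# Convert back from the categorical type to plain strings.
back = cat.astype(str)
round_trip_ok = back.tolist() == s.tolist()  # True when no value was lost
```

The codes index into the categories list, so the original values can always be recovered from the pair.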

@ueshin
Collaborator Author

ueshin commented Feb 25, 2021

This is based on #2061.

@codecov-io

codecov-io commented Feb 25, 2021

Codecov Report

Merging #2064 (acc140d) into master (557fb77) will decrease coverage by 0.05%.
The diff coverage is 91.78%.

@@            Coverage Diff             @@
##           master    #2064      +/-   ##
==========================================
- Coverage   95.27%   95.21%   -0.06%     
==========================================
  Files          57       60       +3     
  Lines       13257    13460     +203     
==========================================
+ Hits        12630    12816     +186     
- Misses        627      644      +17     
Impacted Files                                    Coverage Δ
databricks/koalas/categorical.py                  71.05% <71.05%> (ø)
databricks/koalas/base.py                         96.89% <90.62%> (-0.43%) ⬇️
databricks/koalas/indexes/category.py             91.66% <91.66%> (ø)
databricks/koalas/__init__.py                     92.10% <100.00%> (+0.10%) ⬆️
databricks/koalas/indexes/base.py                 97.31% <100.00%> (+0.01%) ⬆️
databricks/koalas/internal.py                     96.67% <100.00%> (+0.06%) ⬆️
databricks/koalas/missing/indexes.py              100.00% <100.00%> (ø)
databricks/koalas/series.py                       96.86% <100.00%> (+<0.01%) ⬆️
databricks/koalas/tests/indexes/test_base.py      100.00% <100.00%> (ø)
databricks/koalas/tests/indexes/test_category.py  100.00% <100.00%> (ø)
... and 5 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ueshin ueshin marked this pull request as ready for review March 12, 2021 03:12
Member

@HyukjinKwon HyukjinKwon left a comment

Looks fine from a cursory look

if len(categories) == 0:
    scol = F.lit(-1)
else:
    kvs = list(
Member


Maybe we could think about having a util for these codes ...

Collaborator Author

Yep, I'm going to do some refactoring in the next PR.
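
For illustration only, the kind of utility discussed in this thread could look roughly like the plain-Python sketch below. It is not the Spark column logic from this PR (the `else` branch of the reviewed snippet is truncated above). The name `categories_to_codes` is hypothetical, and mapping unknown values to -1 is an assumption borrowed from the pandas convention for missing values.

```python
from typing import Hashable, List, Sequence


def categories_to_codes(
    values: Sequence[Hashable], categories: Sequence[Hashable]
) -> List[int]:
    """Hypothetical helper: map each value to its position in `categories`.

    Values that are not in `categories` get the sentinel code -1 (assumption).
    """
    if len(categories) == 0:
        # No categories at all: every row gets -1, mirroring the
        # `scol = F.lit(-1)` branch of the reviewed snippet.
        return [-1] * len(values)
    # Build a value -> code lookup once, then translate every row.
    lookup = {category: code for code, category in enumerate(categories)}
    return [lookup.get(value, -1) for value in values]


codes = categories_to_codes(list("abbccc"), ["a", "b", "c"])
```

Keeping the mapping in one function would let both the Series and the Index conversion paths share it, which is presumably the point of the suggestion.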

@ueshin
Collaborator Author

ueshin commented Mar 13, 2021

Thanks! Let me merge this now. Please feel free to leave comments.

@ueshin ueshin merged commit 2fe8796 into databricks:master Mar 13, 2021
@ueshin ueshin deleted the categorical branch March 13, 2021 01:12