Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a config to perform natural ordering when head. #1231

Merged
merged 2 commits into from
Jan 28, 2020

Conversation

ueshin
Copy link
Collaborator

@ueshin ueshin commented Jan 27, 2020

Koalas doesn't guarantee the row ordering, so head could return some rows from distributed partitions and the result is not deterministic, which might confuse users. (See #411, #1064, or #1219)

>>> import databricks.koalas as ks
>>> spark = ks.utils.default_session({'spark.master': 'local-cluster[4, 2, 1024]'})
>>> spark.conf.set('spark.sql.execution.arrow.enabled', True)
>>> kdf = ks.DataFrame({'a': range(10)})
>>> pdf = kdf.to_pandas().head(3)
>>> for _ in range(100):
...     assert kdf.head(3).to_pandas().equals(pdf), 'NOT equals.'
... else:
...     print('equals.')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
AssertionError: NOT equals.

If compute.ordered_head is set to True, Koalas performs natural ordering beforehand, but it will cause a performance overhead.

>>> ks.options.compute.ordered_head = True
>>>
>>> for _ in range(100):
...     assert kdf.head(3).to_pandas().equals(pdf), 'NOT equals.'
... else:
...     print('equals.')
...
equals.

I don't think we should enable it by default due to the overhead, but it will help users debug.

@codecov-io
Copy link

codecov-io commented Jan 27, 2020

Codecov Report

Merging #1231 into master will increase coverage by 0.25%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1231      +/-   ##
==========================================
+ Coverage   95.18%   95.43%   +0.25%     
==========================================
  Files          35       35              
  Lines        7265     7740     +475     
==========================================
+ Hits         6915     7387     +472     
- Misses        350      353       +3
Impacted Files Coverage Δ
databricks/koalas/config.py 98.98% <ø> (ø) ⬆️
databricks/koalas/frame.py 97.48% <80%> (+0.51%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77da5b9...76a7f7a. Read the comment docs.

@HyukjinKwon
Copy link
Member

Yeah, I think adding this configuration wouldn't hurt.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth enough to try.

@ueshin
Copy link
Collaborator Author

ueshin commented Jan 28, 2020

Thanks! I'd merge this for now. Please feel free to leave comments if any.

@ueshin ueshin merged commit dd41b7f into databricks:master Jan 28, 2020
@ueshin ueshin deleted the head_ordering branch January 28, 2020 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants