Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Review] Correct the sampling range when sampling with replacement #6884

Merged
merged 10 commits into from
Dec 17, 2020

Conversation

ChrisJar
Copy link
Contributor

@ChrisJar ChrisJar commented Dec 2, 2020

This corrects an issue with the sampling range used when replacement=True. Before, it sampled the range 0 through num_rows meaning it could sample num_rows even though it's one position out of bounds. This caused sample to return values not present in the original DataFrame.

I also created exceptions for sampling on empty DataFrames that match pandas, as well as an exception for sampling when axis=1 and replace=True as cudf does not support DataFrames with duplicate columns.

This closes #6532 and #6882

@ChrisJar ChrisJar requested review from a team as code owners December 2, 2020 21:38
@ChrisJar ChrisJar closed this Dec 2, 2020
@ChrisJar ChrisJar changed the title [WIPCorrect the sampling range when sampling with replacement [WIP] Correct the sampling range when sampling with replacement Dec 2, 2020
@ChrisJar ChrisJar reopened this Dec 2, 2020
@codecov
Copy link

codecov bot commented Dec 7, 2020

Codecov Report

Merging #6884 (d5212f5) into branch-0.18 (6d1b076) will increase coverage by 1.10%.
The diff coverage is 62.50%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #6884      +/-   ##
===============================================
+ Coverage        82.01%   83.12%   +1.10%     
===============================================
  Files               96       96              
  Lines            16340    17873    +1533     
===============================================
+ Hits             13402    14857    +1455     
- Misses            2938     3016      +78     
Impacted Files Coverage Δ
python/cudf/cudf/core/frame.py 89.08% <62.50%> (-1.27%) ⬇️
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 93.62% <0.00%> (+0.29%) ⬆️
python/dask_cudf/dask_cudf/core.py 74.01% <0.00%> (+0.32%) ⬆️
python/cudf/cudf/core/indexing.py 96.35% <0.00%> (+0.72%) ⬆️
python/cudf/cudf/core/series.py 91.98% <0.00%> (+0.88%) ⬆️
python/cudf/cudf/core/column/string.py 87.48% <0.00%> (+0.89%) ⬆️
python/cudf/cudf/core/column/timedelta.py 90.97% <0.00%> (+1.43%) ⬆️
python/dask_cudf/dask_cudf/io/csv.py 96.96% <0.00%> (+1.44%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d1b076...d5212f5. Read the comment docs.

@ChrisJar ChrisJar changed the title [WIP] Correct the sampling range when sampling with replacement [Review] Correct the sampling range when sampling with replacement Dec 7, 2020
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C++ fix is fine. Just need clarification on the Python exception messages.

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
@harrism harrism added 3 - Ready for Review Ready for review by team bug Something isn't working Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Dec 8, 2020
@harrism harrism removed the 3 - Ready for Review Ready for review by team label Dec 9, 2020
@kkraus14
Copy link
Collaborator

rerun tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] DataFrame.sample() produces unexpected results
3 participants