Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linear Interpolation of nans via cupy #8767

Merged
merged 27 commits into from
Aug 10, 2021

Conversation

brandon-b-miller
Copy link
Contributor

@brandon-b-miller brandon-b-miller commented Jul 19, 2021

Adds Series and DataFrame level functions for linear interpolation of missing values, built around CuPy's interp method.

Pandas interpolate API supports somewhat varied functionality for filling NaNs. It currently does not work for actual <NA> values - pandas issue here.. That said one might expect both kinds of missing data to be treated equally for the purposes of interpolation, and this PR does that.

While cp.interp is great for getting us off the ground, but only supports linear interpolation and its results aren't exactly what pandas produces. In particular pandas will not fill NaNs at the start of the series, because the default value of limit_direction is forward and the default limit is None which from my experimentation means 'unlimited'. This means that that despite this, the NaNs at the end WILL get filled. This means we need to actually figure out where the first NaN is and mask out that part of the series with NaNs.

Closes #8685.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Jul 19, 2021
@brandon-b-miller brandon-b-miller added feature request New feature or request non-breaking Non-breaking change labels Jul 22, 2021
@brandon-b-miller brandon-b-miller marked this pull request as ready for review July 23, 2021 17:02
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner July 23, 2021 17:02
method in {"index", "values"}
and not self.index.is_monotonic_increasing
):
warnings.warn("Unsorted Index...")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this? What should we put here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As usual, pandas seems OK with some pretty nonsensical cases:

In [83]: pd.Series([2, None, 4, None, 2], index=[1, 2, 3, 2, 1]).interpolate('values')
Out[83]:
1    2.0
2    3.0
3    4.0
2    3.0
1    2.0
dtype: float64

@codecov
Copy link

codecov bot commented Jul 23, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@29b5f9a). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head d293c43 differs from pull request most recent head 94cc6da. Consider uploading reports for the commit 94cc6da to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #8767   +/-   ##
===============================================
  Coverage                ?   10.57%           
===============================================
  Files                   ?      116           
  Lines                   ?    19050           
  Branches                ?        0           
===============================================
  Hits                    ?     2015           
  Misses                  ?    17035           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 29b5f9a...94cc6da. Read the comment docs.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't 100% sure whether you were looking for a review yet or not since this is marked WIP but you posted questions, so I just went through and left some (hopefully helpful) comments.

python/cudf/cudf/core/algorithms.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/algorithms.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/algorithms.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/algorithms.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/algorithms.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/algorithms.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/algorithms.py Outdated Show resolved Hide resolved
@brandon-b-miller brandon-b-miller changed the title [WIP] Linear Interpolation of nans via cupy Linear Interpolation of nans via cupy Jul 28, 2021
@brandon-b-miller brandon-b-miller added the 3 - Ready for Review Ready for review by team label Jul 28, 2021
@brandon-b-miller
Copy link
Contributor Author

@vyasr @shwina this look good to go?

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, I started this review this morning then forgot to submit. A few more changes to be made, but it's nearly there.

Comment on lines 1510 to 1514
if not isinstance(data._index, cudf.RangeIndex):
# that which was once sorted, now is not
result = result._gather(perm_sort.argsort())

return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not isinstance(data._index, cudf.RangeIndex):
# that which was once sorted, now is not
result = result._gather(perm_sort.argsort())
return result
return result if isinstance(data._index, cudf.RangeIndex) else result._gather(perm_sort.argsort())

I assume this will need a line break somewhere to make black happy.

result = interpolator(col, index=data._index)
columns[colname] = result

result = self.__class__(ColumnAccessor(columns), index=data._index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use _from_data instead.

col = col.astype("float64").fillna(np.nan)

# Interpolation methods may or may not need the index
result = interpolator(col, index=data._index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
result = interpolator(col, index=data._index)
columns[colname] = interpolator(col, index=data._index)

Github won't let me highlight two lines here because there's a previous comment I guess, but you'll also need to delete the subsequent line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is left over from a lot of pdb-ing around :(

)

data = self
columns = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move this to just below the interpolator =... line so it's closer to where it's used.

@pytest.mark.parametrize("method", ["linear"])
@pytest.mark.parametrize("axis", [0])
def test_interpolate_dataframe(data, method, axis):
# doesn't seem to work with NAs just yet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an issue still?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this with a more descriptive comment, nullable dtypes don't interpolate in pandas yet as there are some bugs it seems, our impl treats nulls and nans the same.

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
@brandon-b-miller
Copy link
Contributor Author

anything left need updating here @vyasr @shwina ?

@vyasr vyasr self-requested a review August 6, 2021 22:01
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on my end.

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good and neat! Great work, @brandon-b-miller!

@brandon-b-miller
Copy link
Contributor Author

rerun tests

@brandon-b-miller brandon-b-miller added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Aug 10, 2021
@brandon-b-miller
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b1c2dd4 into rapidsai:branch-21.10 Aug 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Linear interpolation of missing values
3 participants