Linear Interpolation of `nan`s via `cupy` #8767

brandon-b-miller · 2021-07-19T12:50:00Z

Adds Series and DataFrame level functions for linear interpolation of missing values, built around CuPy's interp method.

The pandas interpolate API supports fairly varied functionality for filling NaNs. It currently does not work for actual <NA> values (there is an open pandas issue for this). That said, one might expect both kinds of missing data to be treated equally for the purposes of interpolation, and this PR does that.

While cp.interp is great for getting us off the ground, it only supports linear interpolation, and its results aren't exactly what pandas produces. In particular, pandas will not fill NaNs at the start of the series, because the default value of limit_direction is forward. The default limit is None, which from my experimentation means 'unlimited', so the NaNs at the end WILL get filled. This means we need to figure out where the leading run of NaNs ends and mask that part of the series back to NaN.
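The masking step described above can be sketched as follows. This is only an illustration under stated assumptions: NumPy's `np.interp` stands in for `cp.interp` so the snippet runs without a GPU, and `interpolate_forward` is a hypothetical helper name, not the function added in this PR.

```python
import numpy as np


def interpolate_forward(values):
    """Linearly fill NaNs, leaving any leading NaNs untouched (hypothetical helper)."""
    values = np.asarray(values, dtype="float64")
    positions = np.arange(len(values))
    valid = ~np.isnan(values)
    if not valid.any():
        # Nothing to interpolate from, so return the input unchanged.
        return values.copy()

    # np.interp clamps to the edge values, so leading AND trailing NaNs get filled.
    filled = np.interp(positions, positions[valid], values[valid])

    # Mask the leading run back to NaN, mimicking limit_direction="forward".
    first_valid = int(np.argmax(valid))
    filled[:first_valid] = np.nan
    return filled


result = interpolate_forward([np.nan, 1.0, np.nan, 3.0, np.nan])
```

Here the leading NaN stays NaN, the interior gap is filled linearly, and the trailing NaN takes the last valid value.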

Closes #8685.

brandon-b-miller · 2021-07-23T17:03:24Z

python/cudf/cudf/core/dataframe.py

+            method in {"index", "values"}
+            and not self.index.is_monotonic_increasing
+        ):
+            warnings.warn("Unsorted Index...")


Should we do this? What should we put here?

As usual, pandas seems OK with some pretty nonsensical cases:

In [83]: pd.Series([2, None, 4, None, 2], index=[1, 2, 3, 2, 1]).interpolate('values')
Out[83]:
1    2.0
2    3.0
3    4.0
2    3.0
1    2.0
dtype: float64
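For discussion, the check in the diff above could look roughly like the sketch below. It uses NumPy in place of a cuDF index, `warn_if_unsorted` is a hypothetical name, and the warning text is a placeholder only, since the actual message is exactly what is being asked about here.

```python
import warnings

import numpy as np


def warn_if_unsorted(index_values, method):
    """Warn when an index-based method meets a non-monotonic index (hypothetical helper)."""
    index_values = np.asarray(index_values)
    # Equivalent in spirit to Index.is_monotonic_increasing (non-strict).
    is_monotonic_increasing = bool(np.all(index_values[1:] >= index_values[:-1]))
    if method in {"index", "values"} and not is_monotonic_increasing:
        # Placeholder message, not the wording used in the PR.
        warnings.warn("interpolating over an unsorted index (placeholder message)")
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warned_unsorted = warn_if_unsorted([1, 2, 3, 2, 1], "values")
    warned_sorted = warn_if_unsorted([1, 2, 3], "values")
```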


codecov · 2021-07-23T18:27:39Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@29b5f9a).
The diff coverage is n/a.

❗ Current head d293c43 differs from pull request most recent head 94cc6da. Consider uploading reports for the commit 94cc6da to get more accurate results

@@               Coverage Diff               @@
##             branch-21.10    #8767   +/-   ##
===============================================
  Coverage                ?   10.57%           
===============================================
  Files                   ?      116           
  Lines                   ?    19050           
  Branches                ?        0           
===============================================
  Hits                    ?     2015           
  Misses                  ?    17035           
  Partials                ?        0


vyasr

I wasn't 100% sure whether you were looking for a review yet or not since this is marked WIP but you posted questions, so I just went through and left some (hopefully helpful) comments.


brandon-b-miller · 2021-08-05T14:18:53Z

@vyasr @shwina this look good to go?

vyasr

Apologies, I started this review this morning then forgot to submit. A few more changes to be made, but it's nearly there.

vyasr · 2021-08-05T16:29:48Z

python/cudf/cudf/core/frame.py

+        if not isinstance(data._index, cudf.RangeIndex):
+            # that which was once sorted, now is not
+            result = result._gather(perm_sort.argsort())
+
+        return result


Suggested change

-        if not isinstance(data._index, cudf.RangeIndex):
-            # that which was once sorted, now is not
-            result = result._gather(perm_sort.argsort())
-
-        return result
+        return result if isinstance(data._index, cudf.RangeIndex) else result._gather(perm_sort.argsort())

I assume this will need a line break somewhere to make black happy.
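As background for the `perm_sort.argsort()` gather in this snippet: the argsort of a sorting permutation is its inverse, so gathering with it restores the original row order. A minimal sketch of that idea, assuming NumPy fancy indexing as a stand-in for cuDF's `_gather` and using made-up data:

```python
import numpy as np

# Hypothetical data: an unsorted index and a column with one missing value.
index = np.array([4, 1, 3, 2])
values = np.array([40.0, 10.0, np.nan, 20.0])

# Sort by index so interpolation sees monotonically increasing x-coordinates.
perm_sort = np.argsort(index)
sorted_index = index[perm_sort]
sorted_values = values[perm_sort]

# Fill the gap using the index values as x-coordinates.
valid = ~np.isnan(sorted_values)
filled_sorted = np.interp(sorted_index, sorted_index[valid], sorted_values[valid])

# The argsort of the sorting permutation is its inverse, so this gather
# puts every row back where the caller originally had it.
restored = filled_sorted[np.argsort(perm_sort)]
```

For a `RangeIndex` the data is already in order, which is why the un-sorting gather can be skipped in that case.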

vyasr · 2021-08-05T16:29:51Z

python/cudf/cudf/core/frame.py

+            result = interpolator(col, index=data._index)
+            columns[colname] = result
+
+        result = self.__class__(ColumnAccessor(columns), index=data._index)


Let's use _from_data instead.

vyasr · 2021-08-05T16:30:00Z

python/cudf/cudf/core/frame.py

+                col = col.astype("float64").fillna(np.nan)
+
+            # Interpolation methods may or may not need the index
+            result = interpolator(col, index=data._index)


Suggested change

-            result = interpolator(col, index=data._index)
+            columns[colname] = interpolator(col, index=data._index)

GitHub won't let me highlight two lines here, I guess because there's a previous comment, but you'll also need to delete the subsequent line.

This pattern is left over from a lot of pdb-ing around :(

vyasr · 2021-08-05T16:30:10Z

python/cudf/cudf/core/frame.py

+            )
+
+        data = self
+        columns = {}


I'd move this to just below the interpolator =... line so it's closer to where it's used.

vyasr · 2021-08-05T23:53:25Z

python/cudf/cudf/tests/test_interpolate.py

+@pytest.mark.parametrize("method", ["linear"])
+@pytest.mark.parametrize("axis", [0])
+def test_interpolate_dataframe(data, method, axis):
+    # doesn't seem to work with NAs just yet


Is this an issue still?

Updated this with a more descriptive comment. Nullable dtypes don't interpolate in pandas yet, apparently because of some bugs; our implementation treats nulls and NaNs the same.
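Treating nulls and NaNs the same amounts to converting both to NaN before interpolating, which is what the `col.astype("float64").fillna(np.nan)` line in the diff earlier in this review does. A rough NumPy illustration, where `None` stands in for a cuDF null and `to_float_with_nan` is a hypothetical helper:

```python
import numpy as np


def to_float_with_nan(values):
    """Cast to float64, mapping None (standing in for a null) to NaN."""
    return np.array(
        [np.nan if v is None else float(v) for v in values], dtype="float64"
    )


with_nulls = to_float_with_nan([1, None, 3])
with_nans = to_float_with_nan([1.0, float("nan"), 3.0])

# After the conversion both inputs carry the same missing-value mask,
# so an interpolation routine cannot tell them apart.
same_mask = np.array_equal(np.isnan(with_nulls), np.isnan(with_nans))
```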


brandon-b-miller · 2021-08-06T21:58:24Z

anything left need updating here @vyasr @shwina ?

vyasr

Looks good on my end.

shwina

Looks really good and neat! Great work, @brandon-b-miller!

brandon-b-miller · 2021-08-09T21:06:15Z

rerun tests

brandon-b-miller · 2021-08-10T02:01:35Z

@gpucibot merge

brandon-b-miller added 5 commits July 13, 2021 08:39

very basic stuff

6b97e6e

forgot test

676388b

move things to frame

d625c30

Merge branch 'branch-21.08' into fea-linear-interp

42b7311

updates

c89d938

github-actions bot added the Python Affects Python cuDF API. label Jul 19, 2021

brandon-b-miller added 6 commits July 19, 2021 06:32

sig and docstring updates

5a4e720

updates

c17cd4f

progress

c16f2b3

refactoring

fe56bb1

test index and values methods

a681616

forgot the index

98608a9

brandon-b-miller added feature request New feature or request non-breaking Non-breaking change labels Jul 22, 2021

brandon-b-miller added 2 commits July 23, 2021 08:39

style

143c798

remove unnecessary older changes

81ffee1

brandon-b-miller marked this pull request as ready for review July 23, 2021 17:02

brandon-b-miller requested a review from a team as a code owner July 23, 2021 17:02

brandon-b-miller requested review from vyasr and shwina July 23, 2021 17:02


vyasr reviewed Jul 28, 2021


brandon-b-miller changed the title ~~[WIP] Linear Interpolation of nans via cupy~~ Linear Interpolation of nans via cupy Jul 28, 2021

brandon-b-miller and others added 4 commits July 28, 2021 06:07

directly add and test unsorted index case

f859d0e

....but dont do it for RangeIndex based data

71272a9

Apply suggestions from code review

52e431a

Co-authored-by: Vyas Ramasubramani <[email protected]>

Merge branch 'fea-linear-interp' of github.com:brandon-b-miller/cudf …

4fc0978

…into fea-linear-interp

brandon-b-miller added 4 commits July 28, 2021 06:30

fix minor bugs

088618e

address reviews

4785a56

more reviews

ed6cb81

just expose interpolate directly

b85edc1

brandon-b-miller added the 3 - Ready for Review Ready for review by team label Jul 28, 2021

style

82c4f1e

vyasr mentioned this pull request Jul 28, 2021

Refactor Python factories and remove usage of Table for libcudf output handling #8687

Merged

brandon-b-miller added 3 commits August 2, 2021 08:26

Merge branch 'branch-21.10' into fea-linear-interp

bb31ab0

address last review comment

b486c8b

merge

0f29b34

vyasr requested changes Aug 6, 2021


brandon-b-miller and others added 2 commits August 6, 2021 05:39

address review

296eddc

Update python/cudf/cudf/core/frame.py

94cc6da

Co-authored-by: Vyas Ramasubramani <[email protected]>

vyasr self-requested a review August 6, 2021 22:01

vyasr approved these changes Aug 6, 2021


shwina approved these changes Aug 9, 2021


brandon-b-miller added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Aug 10, 2021

rapids-bot bot merged commit b1c2dd4 into rapidsai:branch-21.10 Aug 10, 2021
