Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Fix rolling-window count for null input #6344

Merged
merged 20 commits into from
Oct 13, 2020

Conversation

mythrocks
Copy link
Contributor

Closes #6343.

Fixes COUNT_ALL, COUNT_VALID for window functions.

In rolling_window() operations, COUNT_VALID/COUNT_ALL should only return null rows if the min_periods requirement is not satisfied. For all other cases, the count produced must be valid, even if the input row is null.

As it currently stands, the COUNT* rolling_window() operation returns null if even one of its input rows is null. That behaviour, while correct for aggregations like SUM, is incorrect for COUNT.

This commit should fix the problem.

@mythrocks mythrocks requested a review from a team as a code owner September 29, 2020 06:00
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@mythrocks mythrocks changed the title Fix rolling-window count for null input [WIP] Fix rolling-window count for null input Sep 29, 2020
@mythrocks mythrocks self-assigned this Sep 29, 2020
@harrism harrism added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. labels Sep 29, 2020
@revans2
Copy link
Contributor

revans2 commented Sep 29, 2020

This looks like it would fix the issue for Spark where we cannot do a COUNT on a column for a rolling window.

@kkraus14 is this going to mess anything up for python side

@mythrocks
Copy link
Contributor Author

(Apologies for the force-push. I thought I'd get the fix for the code-break in before the review begins.)

@mythrocks
Copy link
Contributor Author

@kkraus14 is this going to mess anything up for python side

A more complete accounting of the problem is available at #6343, in case the description here is too terse.

@kkraus14
Copy link
Collaborator

@kkraus14 is this going to mess anything up for python side

cc @shwina @brandon-b-miller who worked on rolling windows

@brandon-b-miller
Copy link
Contributor

I don't think we rely on this operation producing a particular result internally for anything on the python side, so I don't think anything will end up happening except for the associated changes propagating to the user facing version of this API.

I think this might address #5580 and is related to pandas changes discussed in pandas-dev/pandas#34466 and the issues linked to therein. Honestly, when @shwina and I looked at this a few months back, we had a hard time parsing out the logic behind the pandas behavior. It seemed like no matter how we preprocessed min_periods and win_size with count, there was always a set of cases where we got a different answer than pandas and it was very hard to pin down what was happening.

I'm curious to see what happens if we build this branch and remove the xfails which mark those tests right now - I'll build this branch this afternoon and try it and report back here.

@mythrocks
Copy link
Contributor Author

mythrocks commented Sep 29, 2020

I think this might address #5580 and is related to pandas changes discussed in pandas-dev/pandas#34466

Thank you for the link, @brandon-b-miller. Your assessment is accurate.
As a side-note, was anyone else confused with pandas-dev's nomenclature regarding the "element above"?

When your window size is an even number, with center=True you actually include 1 more element above the anchor than below the anchor.

(He means the preceding row, right? This was an informative discussion.) Regardless, it appears that behaviour of count in pandas < 1.0 was incorrect as well.

@mythrocks
Copy link
Contributor Author

Some of the python tests seem to be failing.

=========================== short test summary info ============================
FAILED cudf/tests/test_rolling.py::test_rolling_with_offset[count] - Assertio...
= 1 failed, 53475 passed, 2206 skipped, 954 xfailed, 4040 xpassed, 55395 warnings in 3978.62s (1:06:18) =

Is there an easy way to run a specific Python test? I suspect these tests will need their expected output modified.

@brandon-b-miller
Copy link
Contributor

You can use py.test test_rolling.py -k test_rolling_with_offset[count] to run just that one test

@mythrocks
Copy link
Contributor Author

The failure is quite interesting. Reproduced here in full, to save on browser tabs:

agg = 'count'
    @pytest.mark.parametrize(
        "agg", ["sum", pytest.param("min"), pytest.param("max"), "mean", "count"]
    )
    def test_rolling_with_offset(agg):
        psr = pd.Series(
            [1, 2, 4, 4, np.nan, 9],
            index=[
                pd.Timestamp("20190101 09:00:00"),
                pd.Timestamp("20190101 09:00:01"),
                pd.Timestamp("20190101 09:00:02"),
                pd.Timestamp("20190101 09:00:04"),
                pd.Timestamp("20190101 09:00:07"),
                pd.Timestamp("20190101 09:00:08"),
            ],
        )
        gsr = cudf.from_pandas(psr)
>       assert_eq(
            getattr(psr.rolling("2s"), agg)().fillna(-1),
            getattr(gsr.rolling("2s"), agg)().fillna(-1),
            check_dtype=False,
        )

cudf/tests/test_rolling.py:125: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/_libs/testing.pyx:67: in pandas._libs.testing.assert_almost_equal
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   AssertionError: Series are different
E   
E   Series values are different (16.66667 %)
E   [index]: [2019-01-01T09:00:00.000000000, 2019-01-01T09:00:01.000000000, 2019-01-01T09:00:02.000000000, 2019-01-01T09:00:04.000000000, 2019-01-01T09:00:07.000000000, 2019-01-01T09:00:08.000000000]
E   [left]:  [1.0, 2.0, 2.0, 1.0, -1.0, 1.0]
E   [right]: [1, 2, 2, 1, 0, 1]

For some odd reason, the fifth input causes a nan to be produced, (and replaced with -1.0). I'm poring through the pandas documentation now, to see if I can make sense of this. On the face of it, it would appear that in pandas, the nan rows cause a nan output, but only in the indexed column case.

@mythrocks
Copy link
Contributor Author

I suspect that the Pandas series rolling-count min_periods fix has not been extended to "indexed" columns (using time-range based windows). I have requested the author for clarification on this behaviour.

If my hunch is right, we might need to move this test to xfail for the moment.

@brandon-b-miller
Copy link
Contributor

There's some tests in test_rolling.py that are marked as xfail right now due to these issues. I tried removing the marks, but I'm still seeing failures. Here's an example. Any idea what might be happening?

>>> import pandas as pd
>>> import cudf
>>> psr = pd.Series([1,1,1,1])
>>> gsr = cudf.Series([1,1,1,1])
>>> psr.rolling(2,2,True).count()
0    NaN
1    2.0
2    2.0
3    2.0
dtype: float64
>>> gsr.rolling(2,2,True).count()
0    1
1    2
2    2
3    2
dtype: int32
>>> pd.__version__
'1.1.2'

@mythrocks
Copy link
Contributor Author

Any idea what might be happening?

On the face of it, my code seems to be producing the incorrect result. I will investigate.

@brandon-b-miller
Copy link
Contributor

Connected offline and found this to be part of the issue https://github.com/rapidsai/cudf/blob/branch-0.16/python/cudf/cudf/core/window/rolling.py#L233-L234

@mythrocks
Copy link
Contributor Author

@brandon-b-miller is right, of course. cudf::rolling_window() seems to be working correctly, as tested with a temporarily added test-case:

42: [ RUN      ] UntypedRollingTest.TestMinPeriods
42: Input vector: 1,1,1,1
42: count(preceding=2, following=0, min_periods=2) : NULL,2,2,2
42: count(preceding=2, following=0, min_periods=1) : 1,2,2,2
42: count(preceding=2, following=0, min_periods=0) : 1,2,2,2
42: [       OK ] UntypedRollingTest.TestMinPeriods (0 ms)

I'll post a correction for the Python tests.

@mythrocks
Copy link
Contributor Author

Drat. Removing the min_periods hack from rolling.py breaks test_rolling_groupby_*. This will need a more delicate hand.

@brandon-b-miller
Copy link
Contributor

I'll look into this today.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Couple small things. My review can replace @trxcllnt 's if he is busy.

cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_rolling.py Show resolved Hide resolved
1. Switched explicit for-loop to thrust::count_if().
2. Fixed spelling in rolling-window python tests.
@mythrocks mythrocks removed the request for review from trxcllnt October 6, 2020 05:10
@mythrocks
Copy link
Contributor Author

@harrism, thank you for your review and suggestions. Please let me know if this PR is looking alright to go in. (I'm a little nervous about rebasing #6277 in time for tomorrow.)

cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
@mythrocks
Copy link
Contributor Author

rerun tests

@mythrocks
Copy link
Contributor Author

@shwina, might I please bug you to have a look at the Python bits, at your earliest convenience?

@mythrocks
Copy link
Contributor Author

Thanks for the review, @harrism!

1. Added test for expected values for rolling_window count
2. Fixed convention for parametrized agg values
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @mythrocks!

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 12, 2020
@harrism harrism merged commit 4d5db1f into rapidsai:branch-0.16 Oct 13, 2020
@mythrocks
Copy link
Contributor Author

Thanks for the review, @shwina (and everyone). Thanks for merging this, @harrism!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Incorrect result for COUNT_VALID/COUNT_ALL in rolling_window
9 participants