In rolling_window() operations, COUNT_VALID/COUNT_ALL should only return null rows if the min_periods requirement is not satisfied. For all other cases, the count produced must be valid, even if the input row is null.
As it currently stands, a COUNT* rolling_window() operation returns null if even one of its input rows is null. That behaviour, while correct for aggregations like SUM, is incorrect for COUNT.
For example, consider a vector with all nulls:
[null, null, null, null, null]
COUNT_ALL with window (preceding=2, following=1, min_periods=1) should yield [2, 3, 3, 3, 2], not [null, null, null, null, null].
COUNT_ALL with window (preceding=2, following=1, min_periods=3) should yield [null, 3, 3, 3, null], not [null, null, null, null, null], because min_periods is not met at either end of the vector.
COUNT_VALID with window (preceding=2, following=1, min_periods=1) should yield [0, 0, 0, 0, 0], not [null, null, null, null, null].
COUNT_VALID with window (preceding=2, following=1, min_periods=3) should yield [null, 0, 0, 0, null], not [null, null, null, null, null], because min_periods is not met at either end of the vector.
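The four expected results above can be checked against a small reference model. The sketch below is plain Python, not the libcudf API; the function name `rolling_count` and its signature are hypothetical. It assumes that `preceding` includes the current row and that `min_periods` is compared against the number of rows in the window rather than the number of valid rows, both of which are inferred from the expected values listed above.

```python
from typing import List, Optional, Sequence


def rolling_count(
    values: Sequence[Optional[int]],
    preceding: int,
    following: int,
    min_periods: int,
    count_valid: bool,
) -> List[Optional[int]]:
    """Hypothetical reference model for COUNT_ALL / COUNT_VALID over a rolling window.

    Assumptions (inferred from the expected outputs, not from the library source):
    - `preceding` includes the current row.
    - `min_periods` is checked against the window's row count, nulls included.
    """
    n = len(values)
    result: List[Optional[int]] = []
    for i in range(n):
        # Clamp the window to the bounds of the input.
        start = max(0, i - preceding + 1)
        stop = min(n, i + following + 1)
        window_size = stop - start
        if window_size < min_periods:
            # Only an unmet min_periods produces a null output row.
            result.append(None)
        elif count_valid:
            # COUNT_VALID: count the non-null rows; may legitimately be 0.
            result.append(sum(v is not None for v in values[start:stop]))
        else:
            # COUNT_ALL: count every row in the window, null or not.
            result.append(window_size)
    return result


all_nulls = [None, None, None, None, None]
count_all_min1 = rolling_count(all_nulls, 2, 1, 1, count_valid=False)
count_all_min3 = rolling_count(all_nulls, 2, 1, 3, count_valid=False)
count_valid_min1 = rolling_count(all_nulls, 2, 1, 1, count_valid=True)
count_valid_min3 = rolling_count(all_nulls, 2, 1, 3, count_valid=True)
```

Under these assumptions, a null input row never makes the output null by itself; only a window shorter than `min_periods` does.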
The following test should illustrate the expected results, and serve to reproduce the erroneous output:
Closes #6343. Fixes COUNT_ALL, COUNT_VALID for window functions.
It would be good to have a Pandas perspective on this assessment.