Removed `.isel` for `DatasetRolling.construct` consistent rolling behavior #7578
`.isel` causes `DatasetRolling.construct` to behave inconsistently with `DataArrayRolling.construct` when `stride` > 1.
…inconsistent-behavior Removed `.isel` for `DatasetRolling.construct` consistent rolling behavior.
Good spot. The fix seems good to me. |
Sorry for the late reply... For some reason, when you add a dimension coordinate to the Dataset, it behaved as expected before and is now wrong with this PR. E.g. use these two test datasets:

```python
import xarray as xr

ds1 = xr.Dataset({"x": ("t", range(20))})  # no dimension coordinate
ds2 = xr.Dataset({"x": ("t", range(20))}, {"t": range(20)})  # with dimension coordinate "t"

print("Dataset rolling: ds1")
print(ds1.rolling(t=4).construct("w", stride=2).x.shape)  # wrong in main
print("DataArray rolling: ds1")
print(ds1.x.rolling(t=4).construct("w", stride=2).shape)
print("Dataset rolling: ds2")
print(ds2.rolling(t=4).construct("w", stride=2).x.shape)  # wrong in this PR
print("DataArray rolling: ds2")
print(ds2.x.rolling(t=4).construct("w", stride=2).shape)
```

The unit tests use datasets with a dimension coordinate, therefore this error was never spotted. |
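For readers without xarray at hand, the shape mismatch above can be illustrated with a minimal pure-Python sketch. It is not xarray's implementation: `construct_windows` is a hypothetical helper that assumes trailing (non-centered) windows padded with NaN at the start, and the "second stride" only mimics the effect of a redundant trailing `isel` on a dataset without a dimension coordinate.

```python
def construct_windows(values, window, stride=1):
    """Build one trailing window per position, NaN-padded at the start,
    then keep every `stride`-th window (hypothetical helper)."""
    nan = float("nan")
    padded = [nan] * (window - 1) + list(values)
    windows = [padded[i:i + window] for i in range(len(values))]
    return windows[::stride]


values = list(range(20))

# Striding once, as the DataArray path does: 20 positions -> 10 windows.
once = construct_windows(values, window=4, stride=2)

# Striding a second time on the already strided result halves it again.
twice = once[::2]

print(len(once), len(once[0]))    # 10 4
print(len(twice), len(twice[0]))  # 5 4
```

The second result is the kind of "falls short" output the comment marks as wrong in main for `ds1`.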
After digging a bit more, the problematic line is the return one: The `isel` at the end is supposed to remove the inserted NaNs again. So I think we have to find an intermediate solution: remove the `isel` and adapt what we pass to `coords`. |
Nvmd. I have added another test with more dimensions and 2D coordinates.
Does anyone know how to align here properly? Coords do not have an `isel`, otherwise one could simply apply the stride there as well. |
For now the approach is to stride the original dataset and then extract the coords from there. Ofc, this also strides the dataset variables, which are then not used, so it is unnecessary computation. However, this approach is already much faster and more memory-efficient than the previous approach. |
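What "striding the coordinates" has to achieve, including for 2D coordinates whose dimensions differ from the data variables, can be sketched with a toy model. This is an assumption-laden illustration, not xarray code: coordinates are modelled as a plain dict of `name -> (dims, nested list)`, and `stride_coords` is a hypothetical helper. In the PR itself the stride is applied at the dataset level and the coords are extracted afterwards.

```python
def stride_coords(coords, dim, stride):
    """Return a copy of `coords` with every coordinate that has `dim`
    strided along it; other coordinates are left unchanged (toy model)."""

    def take(data, dims):
        if not dims:  # scalar coordinate: nothing to stride
            return data
        if dims[0] == dim:
            data = data[::stride]
        if len(dims) == 1:
            return list(data)
        return [take(row, dims[1:]) for row in data]

    return {name: (dims, take(data, dims)) for name, (dims, data) in coords.items()}


coords = {
    "t": (("t",), [0, 1, 2, 3, 4, 5]),
    "y": (("y",), [0, 1]),
    # 2D coordinate that also depends on a dimension other than "t"
    "lat": (("y", "t"), [[0, 1, 2, 3, 4, 5], [10, 11, 12, 13, 14, 15]]),
}

strided = stride_coords(coords, dim="t", stride=2)
print(strided["t"][1])    # [0, 2, 4]
print(strided["lat"][1])  # [[0, 2, 4], [10, 12, 14]]
print(strided["y"][1])    # [0, 1]
```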
My apologies for the very late reply. I had tons of backlog until I saw this pop up in my mailbox.
Thank you for digging this out. I was dumbfounded when looking at this particular line; I hadn't thought about the NaN case back then.
I've ended up with something similar but a little bit different in my own internal repository. I've found that it's a bit more efficient and more practical to just create a class of Virtual rolling coordinate then accessing the data by asking the virtual coordinate to provide me a But it would turn your question of how to not stride over |
Not sure I understand what you mean. The current approach only temporarily strides the dataset, including its coords, and then extracts those coords. Unfortunately the Coordinates class does not support indexing, so we have to do it at the dataset level. I think it should not add too much overhead because it is an index-based lookup. The main difference between this approach and what you did is that it supports coordinates that have different dimensions than the data variables (see the new test). |
Actually we could add a peakmem asv benchmark for this and see how much more memory-efficient it is. |
Ok, it went from 141MB to 196MB... Does anyone have any idea why? |
Sorry for the confusion, I meant that I ended up writing something on top of After recalling what I did in March and what you've done to fix my PR, it seems that we both ended up with a similar solution on this topic, apart from some minor caveats. I agree that your suggested change is already memory efficient (and keeps the codebase simple to understand).
I think this is within expectations? The original behavior causes the result to fall short by a large margin (see my issue at #7021). Now that this PR fixes it, the number of result windows should be larger (and thus the memory footprint too) when running the benchmark against the mainline branch. Thank you for your helpful feedback! |
Your behavior was for Datasets without a dimension coordinate (a coordinate that has the same name as the dimension); the benchmark uses a Dataset with one (otherwise I cannot compare the results correctly). The behavior for such Datasets was correct before, because the created arrays were first strided, then extended to full length again with the missing data filled with NaNs, and then strided again (this was the `isel` at the end). So I expected the extending-to-full part to consume more memory, but according to the benchmark it apparently does not. Anyway, I think this PR is a good addition because it fixes a bug, which is far more important than performance.
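The stride, extend, stride sequence described above can be sketched in plain Python. This is only a model of the described behavior under the assumption that alignment on the dimension coordinate refills the dropped positions with NaN; `stride_extend_stride` is a hypothetical helper, not an xarray function.

```python
NAN = float("nan")


def stride_extend_stride(values, stride):
    """Model of the old path for a Dataset with a dimension coordinate."""
    full_index = list(range(len(values)))
    # 1) first stride: keep every `stride`-th value together with its label
    strided = {i: values[i] for i in full_index[::stride]}
    # 2) extend to the full index again; dropped positions become NaN
    refilled = [strided.get(i, NAN) for i in full_index]
    # 3) trailing stride: removes exactly the NaNs inserted in step 2
    return refilled[::stride]


values = [10 * i for i in range(8)]
print(stride_extend_stride(values, 2))  # [0, 20, 40, 60]
print(values[::2])                      # [0, 20, 40, 60]
```

The final result matches a single stride, which is why the old code was correct for this case, while the intermediate `refilled` list is as long as the unstrided input.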
You're welcome! |
I just increased the dimensions as well in CI and now we get:
So it got much better? This seems strange... |
Just to note: it was still almost impossible to run `construct` with stride > 1 on a large dataset even with the first commit in this PR (which only fixed the behavior I wanted), because it ended up not creating a view but actually allocating memory (according to memray) for the rolling windows. I thought that this was intended behavior, which led to my reply suggesting the virtual lookup approach I used elsewhere. My assumption is that something weird is going on in the extending-to-full part, since `reduce` doesn't suffer from the same issue. I might help dig into this later, but it doesn't seem to be an immediate issue. Anyway, somehow this ends up fixing both the behavior bug and the performance. Thank you everyone. |
Thanks for starting this :) Your insight was helpful in figuring out what was going wrong. Just a tip: Do future PRs on a branch in your forked repo and not on the main branch. Since we squash commits, your history will diverge from xarray's main branch and you will have to force push. Doing this in a branch prevents this :) Let's wait a day or two and then merge :) |
LGTM!
`Dataset(...).isel(...)` at the return caused `DatasetRolling.construct` behavior to be inconsistent with `DataArrayRolling.construct` when `stride` > 1, without any benefits. The bug was reported in #7021.
`DatasetRolling.construct` and `DataArrayRolling.construct` with stride > 1. #7021
`whats-new.rst`