-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reimplemented incremental io #501
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #501 +/- ##
==========================================
- Coverage 92.56% 92.12% -0.44%
==========================================
Files 42 42
Lines 6078 6429 +351
==========================================
+ Hits 5626 5923 +297
- Misses 452 506 +54
|
writing metadata incrementally
@ArneDefauw @aeisenbarth tagging you because you opened at some point a issue regarding incremental IO. In this PR incremental IO is implemented, happy to receive feedback in case you want to play around with it😊 I will make a notebook to showcase it, but in short to save an element spatialdata/tests/io/test_readwrite.py Line 159 in 87dd1a8
Please note that those strategies are not guarantee to work in various scenarios, including multi-threaded application, network storages, etc. So please use with care. |
Currently the whole table needs to be replaced and the whole table needs to be stored in-memory, but recent progress in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @LucaMarconato ! I left a review with some minor points below. I think it looks good, but i didn't have time for a super in depth review. Given that this is a big change, I think the approval should be given by somebody who can look more closely.
Thanks for the review @kevinyamauchi. I will have a look at this PR later today as well. |
Thanks for the quick response and fix! I've tested the incremental io for my use case, and up to now everything seems to works as expected, except one thing. If I follow the approach suggested here: spatialdata/tests/io/test_readwrite.py Line 159 in 87dd1a8
I get a ValueError when I load my SpatialData object back from the zarr store and try to overwrite it: The fix was to first delete the attribute from the SpatialData object, and then remove the element on disk. Minimal example below of a typical workflow in my image processing pipelines:
I think what the unit test you refered to lacks is the reading back from the zarr store, after an element is written to a zarr store. In version 0.0.15 of SpatialData, when sdata.add_image(...) was executed, it was not necessary to read back from the zarr store. I understand that current implementation allows for more control, but the inplace update of the SpatialData object was kinda convenient. Edit: |
Co-authored-by: Kevin Yamauchi <[email protected]>
Thank you @ArneDefauw for trying the code and for the explanation, I will now look into your PR.
The reason why we refactored this part is because with |
* test read write on disk * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * improved tests for workarounds for incremental io * fixed tests * improved comment --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Luca Marconato <[email protected]>
Thanks for the reviews. I addressed the points from @kevinyamauchi and from @ArneDefauw (in particular I merged his PR here). @giovp, when you have time could you also please give a pass to this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor things
Co-authored-by: Giovanni Palla <[email protected]>
Hi @LucaMarconato , spatialdata/tests/io/test_readwrite.py Line 136 in 582622f
Workaround 1 , looks rather safe in most scenarios. If I understand correctly it covers following scenario: How I would usually work is: The latter feels less dangerous, and looks pretty standard in image processing pipelines, e.g. tuning of hyper parameters for image cleaning or segmentation. In the latter case, I guess the following would be sufficient:
|
Yes, I agree that the approach that you described is generally a good practice when processing data and safe, since the original data is not modified. The use cases that I described are instead for the case in which the data itself is replaced. I think I should add in the comments that this approach should be avoided when possible, and clarify that the workaround that I described are only if really needed. |
Thank you for this PR, I am using it right now. One question: would it be possible to pass a list of strings onto |
@namsaraeva thanks for the suggestion, it's indeed handier to have list of names. I have added the support for this for |
personally I don't see any blockers currently for this PR. |
@melonora I wanted to check this PR before merging: https://github.com/scverse/spatialdata/pull/525/files. I will do this this weekend. |
Closes #186
Closes #496
Closes #498
Support for incremental io operations.
New features:
SpatialData
object is createdRobustness:
SpatialData
object and "self-contained" elements. Useful for the user to understand the implications of file backing__repr__()
Testing:
Other:
This PR also sets the basis for (not implemented here) the ability to load in-memory objects that are Dask-backed.