
reimplemented incremental io #501

Merged
merged 28 commits into main from feature/incremental_io on Jun 10, 2024

Conversation

@LucaMarconato (Member) commented Mar 21, 2024

Closes #186
Closes #496
Closes #498

Support for incremental io operations.

New features:

  • ability to save additional elements to disk after the SpatialData object is created
  • ability to remove from disk previously saved objects
  • ability to see which elements are present only in-memory and not in the Zarr store, and vice versa
  • refactored saving of metadata:
    • transformations
    • consolidated metadata
    • laid the groundwork (not yet implemented, e.g. empty tests and TODOs noting what's missing) for the other metadata: table.uns['spatialdata_attrs'], points.attrs['spatialdata_attrs'], and the OMERO metadata for image channel names
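The in-memory vs. Zarr-store comparison in the list above boils down to a set difference over element names; a minimal pure-Python sketch of the idea (the function name and signature are illustrative, not the actual spatialdata API):

```python
def compare_elements(in_memory, on_disk):
    """Report which element names exist only in memory, only on disk, or in both.

    Sketch of the comparison concept; `in_memory` and `on_disk` are any
    iterables of element names.
    """
    in_memory, on_disk = set(in_memory), set(on_disk)
    return {
        "only_in_memory": in_memory - on_disk,  # not yet written to the store
        "only_on_disk": on_disk - in_memory,    # saved, but not loaded/present in memory
        "in_both": in_memory & on_disk,
    }
```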

Robustness:

  • refactored write function to make it more robust
  • improved error messages for the users, with actionable advice
  • new concept of "self-contained" SpatialData object and "self-contained" elements. Useful for the user to understand the implications of file backing
  • added info on Dask-backed files for non "self-contained" elements to __repr__()

Testing:

  • improved existing tests for io
  • extensive testing for modular io
  • improved testing for the comparison of metadata after io and after a deepcopy

Other:

This PR also sets the basis for (not implemented here) the ability to load in-memory objects that are Dask-backed.

codecov bot commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 86.05263% with 53 lines in your changes missing coverage. Please review.

Project coverage is 92.12%. Comparing base (62a6440) to head (2585216).
Report is 79 commits behind head on main.

Files with missing lines Patch % Lines
src/spatialdata/_core/spatialdata.py 87.62% 38 Missing ⚠️
src/spatialdata/transformations/operations.py 53.33% 7 Missing ⚠️
src/spatialdata/_io/_utils.py 75.00% 6 Missing ⚠️
src/spatialdata/models/_utils.py 75.00% 1 Missing ⚠️
src/spatialdata/testing.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #501      +/-   ##
==========================================
- Coverage   92.56%   92.12%   -0.44%     
==========================================
  Files          42       42              
  Lines        6078     6429     +351     
==========================================
+ Hits         5626     5923     +297     
- Misses        452      506      +54     
Files with missing lines Coverage Δ
src/spatialdata/_core/_deepcopy.py 98.41% <100.00%> (+0.02%) ⬆️
src/spatialdata/_core/_elements.py 92.47% <100.00%> (+0.80%) ⬆️
src/spatialdata/_io/io_zarr.py 88.37% <100.00%> (ø)
src/spatialdata/dataloader/datasets.py 90.68% <ø> (ø)
src/spatialdata/models/models.py 87.69% <100.00%> (+0.16%) ⬆️
src/spatialdata/models/_utils.py 91.79% <75.00%> (+0.18%) ⬆️
src/spatialdata/testing.py 98.24% <87.50%> (-1.76%) ⬇️
src/spatialdata/_io/_utils.py 88.52% <75.00%> (-2.33%) ⬇️
src/spatialdata/transformations/operations.py 89.94% <53.33%> (-2.75%) ⬇️
src/spatialdata/_core/spatialdata.py 90.78% <87.62%> (-1.44%) ⬇️

... and 6 files with indirect coverage changes

@LucaMarconato changed the title from "implemented incremental io; tests missing" to "reimplemented incremental io" on Mar 22, 2024
@LucaMarconato (Member, Author) commented Mar 23, 2024

@ArneDefauw @aeisenbarth tagging you because you each at some point opened an issue regarding incremental IO. This PR implements incremental IO; happy to receive feedback in case you want to play around with it 😊

I will make a notebook to showcase it, but in short, to save an element (labels, table, etc.) you can use the new sdata.write_element('element_name'). If the element already exists in storage, an exception will be raised. You can work around the exception, for instance with the strategies shown here:

# workaround 1, mostly safe (untested for Windows platform, network drives, multi-threaded

Please note that those strategies are not guaranteed to work in all scenarios (multi-threaded applications, network storage, etc.), so please use them with care.
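The general "write elsewhere, then swap" idea behind such workarounds can be illustrated with plain directories; this is a generic sketch, not the spatialdata code, and it shares the same caveats (untested for Windows, network drives, multi-threaded writers):

```python
import os
import shutil
import tempfile


def replace_dir_by_swapping(target: str, write_fn) -> None:
    """Write replacement data next to `target`, then swap it into place.

    `write_fn(path)` must create `path` with the new contents. The old data
    is only removed after the replacement has been fully written, so a crash
    during writing leaves the original intact.
    """
    staging = tempfile.mkdtemp(dir=os.path.dirname(target))
    new_path = os.path.join(staging, "new")
    write_fn(new_path)             # 1. write the replacement first
    shutil.rmtree(target)          # 2. only then drop the old data
    shutil.move(new_path, target)  # 3. move the new data into place
    os.rmdir(staging)              # clean up the empty staging directory
```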

@LucaMarconato (Member, Author) commented

Currently the whole table needs to be replaced and kept in-memory, but recent progress in anndata + dask will also be used in spatialdata to allow lazy loading and the replacement of particular parts (like adding a single obs column). This PR cleans up the previous code and is a step in that direction.

@kevinyamauchi (Collaborator) left a comment

Thanks @LucaMarconato! I left a review with some minor points below. I think it looks good, but I didn't have time for a very in-depth review. Given that this is a big change, I think the approval should be given by somebody who can look more closely.

Review threads (resolved) on:
  • src/spatialdata/_utils.py
  • src/spatialdata/_io/_utils.py
  • src/spatialdata/_core/spatialdata.py (two threads)
@melonora (Collaborator) commented

> Thanks @LucaMarconato! I left a review with some minor points below. I think it looks good, but I didn't have time for a very in-depth review. Given that this is a big change, I think the approval should be given by somebody who can look more closely.

Thanks for the review @kevinyamauchi. I will have a look at this PR later today as well.

@ArneDefauw (Contributor) commented Mar 25, 2024

> @ArneDefauw @aeisenbarth tagging you because you each at some point opened an issue regarding incremental IO. This PR implements incremental IO; happy to receive feedback in case you want to play around with it 😊
>
> I will make a notebook to showcase it, but in short, to save an element (labels, table, etc.) you can use the new sdata.write_element('element_name'). If the element already exists in storage, an exception will be raised. You can work around the exception, for instance with the strategies shown here:
>
> # workaround 1, mostly safe (untested for Windows platform, network drives, multi-threaded
>
> Please note that those strategies are not guaranteed to work in all scenarios (multi-threaded applications, network storage, etc.), so please use them with care.

Thanks for the quick response and fix!

I've tested the incremental io for my use case, and so far everything seems to work as expected, except for one thing. If I follow the approach suggested here:

# workaround 1, mostly safe (untested for Windows platform, network drives, multi-threaded

I get a ValueError when I load my SpatialData object back from the zarr store and try to overwrite it:
ValueError: The file path specified is a parent directory of one or more files used for backing for one or more elements in the SpatialData object. Deleting the data would corrupt the SpatialData object.

The fix was to first delete the attribute from the SpatialData object, and then remove the element on disk. Minimal example below of a typical workflow in my image processing pipelines:

import os

import dask.array as da
import spatialdata
from spatialdata import SpatialData, read_zarr

img_layer = "test_image"
path = "."  # directory in which the Zarr store is created

sdata = SpatialData()
sdata.write(os.path.join(path, "sdata.zarr"))

dummy_array = da.random.random(size=(1, 10000, 10000), chunks=(1, 1000, 1000))
se = spatialdata.models.Image2DModel.parse(data=dummy_array)

sdata.images[img_layer] = se

if sdata.is_backed():
    sdata.write_element(img_layer, overwrite=True)

# need to read back from the Zarr store, otherwise the graph in the in-memory
# sdata would not be executed
sdata = read_zarr(sdata.path)

# now overwrite; here I needed to first delete the attribute
element_type = sdata._element_type_from_element_name(img_layer)
del getattr(sdata, element_type)[img_layer]
# then delete the element on disk
if sdata.is_backed():
    sdata.delete_element_from_disk(img_layer)

sdata.images[img_layer] = se

if sdata.is_backed():
    sdata.write_element(img_layer, overwrite=True)

sdata = read_zarr(sdata.path)
I think what the unit test you referred to lacks is reading back from the Zarr store after an element is written to it.

In version 0.0.15 of SpatialData, when sdata.add_image(...) was executed, it was not necessary to read back from the zarr store. I understand that the current implementation allows for more control, but the in-place update of the SpatialData object was kind of convenient.

Edit:
I added a pull request, to illustrate the issue a little bit more: #515

@LucaMarconato (Member, Author) commented

Thank you @ArneDefauw for trying the code and for the explanation, I will now look into your PR.

> In version 0.0.15 of SpatialData, when sdata.add_image(...) was executed, it was not necessary to read back from the zarr store. I understand that the current implementation allows for more control, but the in-place update of the SpatialData object was kind of convenient.

The reason why we refactored this part is that with add_image(), if the user had an in-memory image and wrote it to disk, the image would then immediately be lazily loaded from the store. This is good and ergonomic if the image needs to be written only once, but if the user tried to write the image again (for instance in a notebook, where a cell may get manually executed twice), it would have led to an error.
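The failure mode described above can be illustrated with plain files: once an object is lazily backed by the store, overwriting that store in place would destroy the very data still needed to compute the new contents. A generic stdlib sketch of the idea, not spatialdata code:

```python
import os
import tempfile

# a "store" file and a lazy reader that holds only the path, not the data
store = os.path.join(tempfile.mkdtemp(), "element.bin")
with open(store, "wb") as f:
    f.write(b"original pixels")


def lazy_read() -> bytes:
    # contents are materialized only when requested, like a Dask-backed array
    with open(store, "rb") as f:
        return f.read()


# a safe second write must materialize the data (or write elsewhere and swap)
# before overwriting the store that still backs it
materialized = lazy_read()
with open(store, "wb") as f:
    f.write(materialized[::-1])  # now overwriting the store is safe
```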

* test read write on disk

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* improved tests for workarounds for incremental io

* fixed tests

* improved comment

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Luca Marconato <[email protected]>
@LucaMarconato (Member, Author) commented Mar 27, 2024

Thanks for the reviews. I addressed the points from @kevinyamauchi and from @ArneDefauw (in particular, I merged his PR here). @giovp, when you have time, could you also please give this a pass?

@giovp (Member) left a comment

minor things

Review threads on:
  • tests/io/test_readwrite.py
  • src/spatialdata/_core/spatialdata.py (four threads)
@ArneDefauw (Contributor) commented Mar 29, 2024

> Thank you @ArneDefauw for trying the code and for the explanation, I will now look into your PR.
>
> > In version 0.0.15 of SpatialData, when sdata.add_image(...) was executed, it was not necessary to read back from the zarr store. I understand that the current implementation allows for more control, but the in-place update of the SpatialData object was kind of convenient.
>
> The reason why we refactored this part is that with add_image(), if the user had an in-memory image and wrote it to disk, the image would then immediately be lazily loaded from the store. This is good and ergonomic if the image needs to be written only once, but if the user tried to write the image again (for instance in a notebook, where a cell may get manually executed twice), it would have led to an error.

Hi @LucaMarconato,
thanks for the reply and the fixes! I've tested your suggestions (test_incremental_io_on_disk()) for my use cases and everything seems to work fine (for images, labels, points, and shapes).

Workaround 1 looks rather safe in most scenarios. If I understand correctly, it covers the following scenario: having "x" in sdata, doing something on "x" (i.e., defining a dask graph), and then writing to "x".

How I would usually work is: having "x" in sdata, doing something on "x", and writing to "y" (where "y" already exists).

The latter feels less dangerous, and looks pretty standard in image processing pipelines, e.g. tuning of hyperparameters for image cleaning or segmentation.

In the latter case, I guess the following would be sufficient:


arr = sdata["x"].data
arr = arr * 2
spatial_element = spatialdata.models.Image2DModel.parse(arr)
del sdata["y"]
sdata.delete_element_from_disk("y")
sdata["y"] = spatial_element
sdata.write_element("y")
sdata = read_zarr(sdata.path)

@LucaMarconato (Member, Author) commented

Yes, I agree that the approach you described is generally good practice when processing data, and safe, since the original data is not modified.

The use cases that I described are instead for when the data itself is replaced. I think I should add to the comments that this approach should be avoided when possible, and clarify that the workarounds I described should be used only if really needed.

@namsaraeva (Contributor) commented

Thank you for this PR, I am using it right now. One question: would it be possible to pass a list of strings to write_element() instead of just one element name? @LucaMarconato

@LucaMarconato (Member, Author) commented

@namsaraeva thanks for the suggestion, it is indeed handier to have a list of names. I have added support for this to both write_element() and delete_element_from_disk().
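Accepting either a single name or a list of names is commonly handled by normalizing the argument up front; a sketch of that pattern (the helper name is illustrative, not the actual spatialdata implementation):

```python
def normalize_element_names(element_name):
    """Accept a single element name (str) or an iterable of names; return a list.

    The str check must come first, since a str is itself iterable and would
    otherwise be split into single characters.
    """
    if isinstance(element_name, str):
        return [element_name]
    return list(element_name)
```

The downstream write/delete logic can then loop over the returned list regardless of how the caller passed the names.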

@melonora (Collaborator) commented

personally I don't see any blockers currently for this PR.

@LucaMarconato (Member, Author) commented

> personally I don't see any blockers currently for this PR.

@melonora I wanted to check this PR before merging: https://github.com/scverse/spatialdata/pull/525/files. I will do this this weekend.

@LucaMarconato LucaMarconato merged commit 137e1e0 into main Jun 10, 2024
5 of 7 checks passed
@LucaMarconato LucaMarconato deleted the feature/incremental_io branch June 10, 2024 15:10