
fix for to_polygons when using processes instead of threads in dask #756

Merged · 1 commit · Dec 10, 2024

Conversation

ArneDefauw
Contributor

I noticed that when configuring Dask to use 'processes' instead of 'threads', spatialdata.to_polygons fails due to this line:

the result of dask.compute() is lost when using processes, which is expected behaviour: with the multiprocessing scheduler, tasks run in separate worker processes that do not share memory with the parent, so results produced only by in-place mutation never reach it.

Note that using 'processes' instead of 'threads' considerably speeds up vectorizing labels for large masks, because the function we parallelize does not release the GIL, so threads cannot actually run it concurrently.
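To illustrate the failure mode, here is a minimal stdlib sketch (not spatialdata's actual code; the shared list and the `append_double` helper are hypothetical): in-place mutations made by worker threads are visible to the parent, while the same mutations made by worker processes are not.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

results = []  # shared state mutated by the workers

def append_double(x):
    results.append(x * 2)  # side effect only; nothing is returned

def run(executor_cls):
    results.clear()
    with executor_cls(max_workers=2) as ex:
        list(ex.map(append_double, range(4)))  # drain the iterator
    return sorted(results)

if __name__ == "__main__":
    print(run(ThreadPoolExecutor))   # [0, 2, 4, 6]: threads share the parent's memory
    print(run(ProcessPoolExecutor))  # []: each worker mutated its own copy
```

The same asymmetry applies to Dask's threaded and multiprocessing schedulers; the robust pattern is to return values from tasks and collect whatever dask.compute() hands back, rather than relying on side effects.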

Example:

import os
from pathlib import Path

import pooch
from pooch import Pooch

from spatialdata import read_zarr

BASE_URL = "https://objectstor.vib.be/spatial-hackathon-public/sparrow/public_datasets"

def _get_registry(path: str | Path | None = None) -> Pooch:
    return pooch.create(
        path=pooch.os_cache("sparrow") if path is None else path,
        base_url=BASE_URL,
        version="0.0.1",
        registry={
            "transcriptomics/vizgen/mouse/_sdata_2D.zarr.zip": "e1f36061e97e74ad131eb709ca678658829dc4385a444923ef74835e783d63bc",
        },
    )

registry = _get_registry(path=None)  # set `path` to download the data somewhere else
unzip_path = registry.fetch("transcriptomics/vizgen/mouse/_sdata_2D.zarr.zip", processor=pooch.Unzip())
sdata = read_zarr(os.path.commonpath(unzip_path))
sdata.path = None
import dask
from spatialdata import to_polygons

dask.config.set(scheduler="processes")
gdf = to_polygons(sdata["segmentation_mask_full"])
# finishes in around 3m locally on a Mac M2

dask.config.set(scheduler="threads")
gdf = to_polygons(sdata["segmentation_mask_full"])
# finishes in around 8m locally on a Mac M2

"segmentation_mask_full" contains the masks from a MERSCOPE experiment, around 300k labels.


codecov bot commented Nov 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.89%. Comparing base (27bb4a7) to head (d7c91a1).
Report is 37 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #756      +/-   ##
==========================================
- Coverage   91.89%   91.89%   -0.01%     
==========================================
  Files          45       45              
  Lines        6919     6918       -1     
==========================================
- Hits         6358     6357       -1     
  Misses        561      561              
Files with missing lines                        Coverage Δ
src/spatialdata/_core/operations/vectorize.py   93.75% <100.00%> (-0.04%) ⬇️

@LucaMarconato
Member

@ArneDefauw I am trying this, it looks great, thank you!

Only one question. Why do you set sdata.path = None in your code example?

@ArneDefauw
Contributor Author

> @ArneDefauw I am trying this, it looks great, thank you!
>
> Only one question. Why do you set sdata.path = None in your code example?

Hi @LucaMarconato , thanks for having a look!

I typically set sdata.path to None in unit tests so that the .zarr store in the registry is not overwritten, but it is unrelated to this fix.

@LucaMarconato LucaMarconato merged commit eb1d713 into scverse:main Dec 10, 2024
8 checks passed