Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when writing to disk a dask-backed xarray #219

Closed
LucaMarconato opened this issue Aug 22, 2022 · 18 comments
Closed

Error when writing to disk a dask-backed xarray #219

LucaMarconato opened this issue Aug 22, 2022 · 18 comments

Comments

@LucaMarconato
Copy link
Contributor

LucaMarconato commented Aug 22, 2022

The following code example produces an error on the last line.

The code does the following:

  1. creates a 3D array (cyx axes) and writes it to zarr
  2. loads it from zarr
  3. writes to another file what has been loaded
##
import os.path
import numpy as np
import zarr
import shutil
from ome_zarr.writer import write_image
from ome_zarr.io import parse_url
from ome_zarr.io import ZarrLocation
from ome_zarr.reader import Multiscales, Reader
from ome_zarr.format import CurrentFormat
import dask.array.core

im = np.random.normal(size=(3, 100, 100))
fmt = CurrentFormat()

## write
def write_to_zarr(im, f):
    if os.path.isdir(f):
        shutil.rmtree(f)
    store = parse_url(f, mode="w").store
    group = zarr.group(store=store)
    write_image(im, group, axes=["c", "x", "y"], fmt=fmt)

write_to_zarr(im, "debug0.zarr")

## read
loc = ZarrLocation("debug0.zarr")
reader = Reader(loc)()
nodes = list(reader)
assert len(nodes) == 1
node = nodes[0]
im_read = node.load(Multiscales).array(resolution="0", version=fmt.version)

## write again (error)
write_to_zarr(im_read, "debug1.zarr")

##
import xarray as xr
xr.DataArray(im).chunks

The error is the following:

Traceback (most recent call last):
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3335, in run_ast_nodes
    code = compiler(mod, cell_name, mode)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/codeop.py", line 143, in __call__
    codeob = compile(source, filename, symbol, self.flags, True)
  File "<ipython-input-3-61b5ce34eaa1>", line 1
SyntaxError: 'return' outside function
Traceback (most recent call last):
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/hierarchy.py", line 800, in _write_op
    return f(*args, **kwargs)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/hierarchy.py", line 965, in _create_dataset_nosync
    a = array(data, store=self._store, path=path, chunk_store=self._chunk_store,
  **File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/creation.py", line 380, in array
    z[...] = data**
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/core.py", line 1353, in __setitem__
    self.set_basic_selection(pure_selection, value, fields=fields)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/core.py", line 1448, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/core.py", line 1746, in _set_basic_selection_nd
    indexer = BasicIndexer(selection, self)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/indexing.py", line 342, in __init__
    dim_indexer = SliceDimIndexer(dim_sel, dim_len, dim_chunk_len)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/indexing.py", line 176, in __init__
    self.nchunks = ceildiv(self.dim_len, self.dim_chunk_len)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/indexing.py", line 160, in ceildiv
    return math.ceil(a / b)
TypeError: unsupported operand type(s) for /: 'int' and 'list'
python-BaseException

Investigating the cause of the problem I found what follows. In the first write_image() call, the im array, which is a np.ndarray, is converted to a zarr.core.Array in creation.py (see the bold line in the stack trace). More precisely, first an empty object is initialized and then it is filled with data.

# creation.py
    # instantiate array
    z = create(**kwargs)

    # fill with data
    z[...] = data

After the instantiation line, the object z has an attribute chunks which is (3, 100, 100). So far so good.

When instead write_image() is called for the second time, the im array is a dask.array.core.Array, and after the intantiation line above the attribute z.chunks is ([3], [100], [100]). This triggers the error that I am reporting at the line that follows (filling z with data).

From a performance point of view, what my code is asking is to take data that is on disk and write it somewhere else on disk, without the need to load the content in memory. So a workaround would be that I modify the write_to_zarr() function so that if the im data that I want to copy is a dask.array.core.Array, I instead manually do a copy at the file level, without using the ome-zarr-py APIs.

Any advice in how to proceed?

@will-moore
Copy link
Member

Hi, apologies for the delay...
I wonder if you could try this installing ome-zarr-py from the branch at #192
since I'm hoping that will do what you want.

@LucaMarconato
Copy link
Contributor Author

I have tried that branch and the code now works, thank you!

@LucaMarconato
Copy link
Contributor Author

Sorry to reopen already, I actually noticed a problem. There is no error in writing the files and the output looks the same, but what is written is slightly different in the .zarray.

Here is the first chunk of the image, first written from an in-memory array and then from the delayed representation. As you can see the "compressor" key is different.

macbook@MBP-di-macbook 0 % cat tmp.zarr/image_0/0/.zarray 
{
    "chunks": [
        3,
        100,
        100
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dimension_separator": "/",
    "dtype": "<f8",
    "fill_value": 0.0,
    "filters": null,
    "order": "C",
    "shape": [
        3,
        100,
        100
    ],
    "zarr_format": 2
}% 
macbook@MBP-di-macbook 0 % cat tmp2.zarr/image_0/0/.zarray 
{
    "chunks": [
        3,
        100,
        100
    ],
    "compressor": null,
    "dimension_separator": "/",
    "dtype": "<f8",
    "fill_value": 0.0,
    "filters": null,
    "order": "C",
    "shape": [
        3,
        100,
        100
    ],
    "zarr_format": 2
}%                    

@LucaMarconato LucaMarconato reopened this Aug 30, 2022
@LucaMarconato
Copy link
Contributor Author

I found a more serious bug. I disabled the compression passing storage_options={"compressor": None} to write_image, and now I noticed that the coordinate transformations stored in .zattrs don't match. See the output below:

Screen Shot 2022-08-30 at 14 37 49

@sbesson
Copy link
Member

sbesson commented Aug 30, 2022

Thanks for the feedback @LucaMarconato, I wonder whether you are encountering the same issue that was attempted to be fixed in #197. As this branch is currently conflicting with the inclusion of the dask writing code, could you try updating the corresponding location in the writer code to use storage_options.copy()?

@LucaMarconato
Copy link
Contributor Author

I changed the write_image() line to

write_image(im, group, axes=["c", "x", "y"], fmt=fmt, storage_options={"compressor": None}.copy())

but the coordinate transformations remain the same as in the screenshot above.

@will-moore
Copy link
Member

@LucaMarconato could you try #220 which should fix the coordinateTransforms issue above?

@will-moore
Copy link
Member

I don't know the reason for the compressor being different. I can only assume different default behaviour between group.create_dataset() for numpy data and dask.to_zarr().

@will-moore
Copy link
Member

I tried the fix in #197 but it doesn't change the default behaviour in the example code above (when no compressor is specified)

@LucaMarconato
Copy link
Contributor Author

LucaMarconato commented Aug 31, 2022

I tried #220 and it fixes the coordinateTransformations entries, which now coincide. But some of the binary data is not identical!

(ome) macbook@MBP-di-macbook temp % diff debug0.zarr/0/0/0/0 debug1.zarr/0/0/0/0 
(ome) macbook@MBP-di-macbook temp % diff debug0.zarr/1/0/0/0 debug1.zarr/1/0/0/0
Binary files debug0.zarr/1/0/0/0 and debug1.zarr/1/0/0/0 differ

I have tried both passing {'compressor': compressor} with compressor=None and compressor=Blosc(cname="zstd", clevel=5, shuffle=Blosc.NOSHUFFLE) but the bug remains. So I think this is not due to the compressor.

@LucaMarconato
Copy link
Contributor Author

Since the full resolution image (0) is identical maybe this is due to small numerical unavoidable errors when downscaling into the smaller levels of the image pyramid. I will now check if this is the case.

@LucaMarconato
Copy link
Contributor Author

LucaMarconato commented Aug 31, 2022

No, that's not the case. Those numbers are very different, but they have mostly the same sign, so I suspect it is something related to computing the downscaled versions with different chunks but that are mostly overlapping, maybe a few pixels off.
image

@will-moore
Copy link
Member

I think this is now failing due to the bug fixed in #197.
The storage_options are not passed down to subsequent levels after the first level is written.
With that fix, and updating your script to use:

write_image(im, group, axes=["c", "x", "y"], fmt=fmt, storage_options={'compressor': None})

I now get no difference between downscaled chunks.
Hope you see the same?

@will-moore
Copy link
Member

Instead of testing with PR #197 (which has merge conflicts), can you try with #221 where I've fixed the merge conflict on top of master branch.

@LucaMarconato
Copy link
Contributor Author

Thanks, I have tried with #221 but I still get different chunks. Here is the code that I use, updated from above. I also check for numerical approximation errors (it's not the cause).

##
import os.path
import shutil

import dask.array.core
import numpy as np
import zarr
from ome_zarr.format import CurrentFormat
from ome_zarr.io import ZarrLocation, parse_url
from ome_zarr.reader import Multiscales, Reader
from ome_zarr.writer import write_image
import filecmp

im = np.random.normal(size=(3, 100, 100))
fmt = CurrentFormat()

# write


def write_to_zarr(im, f):
    if os.path.isdir(f):
        shutil.rmtree(f)
    store = parse_url(f, mode="w").store
    group = zarr.group(store=store)
    if isinstance(im, np.ndarray) or isinstance(im, dask.array.core.Array):
        write_image(im, group, axes=["c", "x", "y"], fmt=fmt, storage_options={"compressor": None})
    else:
        raise ValueError("the array to write must be a numpy array or a dask array")


write_to_zarr(im, "debug0.zarr")

# read
loc = ZarrLocation("debug0.zarr")
reader = Reader(loc)()
nodes = list(reader)
assert len(nodes) == 1
node = nodes[0]
im_read = node.load(Multiscales).array(resolution="0", version=fmt.version)

# write again (error)
write_to_zarr(im_read, "debug1.zarr")

# from stackoverflow
class dircmp(filecmp.dircmp):  # type: ignore[type-arg]
    """
    Compare the content of dir1 and dir2. In contrast with filecmp.dircmp, this
    subclass compares the content of files with the same path.
    """

    def phase3(self) -> None:
        """
        Find out differences between common files.
        Ensure we are using content comparison with shallow=False.
        """
        fcomp = filecmp.cmpfiles(self.left, self.right, self.common_files, shallow=False)
        self.same_files, self.diff_files, self.funny_files = fcomp


def are_directories_identical(dir1, dir2):
    compared = dircmp(dir1, dir2)
    if compared.left_only or compared.right_only or compared.diff_files or compared.funny_files:
        return False

    for subdir in compared.common_dirs:
        if not are_directories_identical(
            os.path.join(dir1, subdir),
            os.path.join(dir2, subdir),
        ):
            return False
    return True

# inspect the content of the files to manually check for numerical approximation errors
# if the output matrices are not close up to the machine precision, then the difference is not due to numerical errors
loc = ZarrLocation('debug0.zarr')
reader = Reader(loc)()
nodes = list(reader)
assert len(nodes) == 1
node = nodes[0]
im_read = node.load(Multiscales).array(resolution="1", version=CurrentFormat().version)

loc = ZarrLocation('debug1.zarr')
reader = Reader(loc)()
nodes = list(reader)
assert len(nodes) == 1
node = nodes[0]
im_read2 = node.load(Multiscales).array(resolution="1", version=CurrentFormat().version)
print(im_read[:5, :5, 0].compute())
print(im_read2[:5, :5, 0].compute())

##
assert are_directories_identical("debug0.zarr", "debug1.zarr")

@will-moore
Copy link
Member

Thanks for that...
Sorry, I realised that in asking you to test with #221 above, this omitted #220, which fixes another resize issue (so the .zattrs is different).
With both of those branches merged, the assertion fails because the resized chunks are different. e.g.

debug0.zarr/1/0/0 debug1.zarr/1/0/0

It's hard to "see" what's going on with the random pixel data, so I tried using a sample generated with...

ome_zarr create debug0.zarr --method=astronaut

Then I tweaked your script to be able to use this (not overwrite if debug0.zarr exists. Also added some printing to show which files fail the assert are_directories_identical().
And I commented out assert len(nodes) == 1 where needed, since the sample image has 'labels' nodes etc.

full script
import os.path
import shutil

import dask.array.core
import numpy as np
import zarr
from ome_zarr.format import CurrentFormat
from ome_zarr.io import ZarrLocation, parse_url
from ome_zarr.reader import Multiscales, Reader
from ome_zarr.writer import write_image
import filecmp

im = np.random.normal(size=(3, 100, 100))
fmt = CurrentFormat()

# write
input_img = "debug0.zarr"
output_img = "debug1.zarr"

def write_to_zarr(im, f):
    if os.path.isdir(f):
        shutil.rmtree(f)
    store = parse_url(f, mode="w").store
    group = zarr.group(store=store)
    if isinstance(im, np.ndarray) or isinstance(im, dask.array.core.Array):
        write_image(im, group, axes=["c", "x", "y"], fmt=fmt, storage_options={"compressor": None})
    else:
        raise ValueError("the array to write must be a numpy array or a dask array")

# allow use of existing data if it exists...
if not os.path.isdir(input_img):
    write_to_zarr(im, input_img)

# read
loc = ZarrLocation(input_img)
reader = Reader(loc)()
nodes = list(reader)
# assert len(nodes) == 1
node = nodes[0]
im_read = node.load(Multiscales).array(resolution="0", version=fmt.version)


# write again (error)
write_to_zarr(im_read, output_img)

# from stackoverflow
class dircmp(filecmp.dircmp):  # type: ignore[type-arg]
    """
    Compare the content of dir1 and dir2. In contrast with filecmp.dircmp, this
    subclass compares the content of files with the same path.
    """

    def phase3(self) -> None:
        """
        Find out differences between common files.
        Ensure we are using content comparison with shallow=False.
        """
        fcomp = filecmp.cmpfiles(self.left, self.right, self.common_files, shallow=False)
        self.same_files, self.diff_files, self.funny_files = fcomp


def are_directories_identical(dir1, dir2):
    compared = dircmp(dir1, dir2)
    if compared.left_only or compared.right_only or compared.diff_files or compared.funny_files:
        print('compared.left_only', compared.left_only)
        print('compared.right_only', compared.right_only)
        print('compared.diff_files', compared.diff_files)
        print('compared.funny_files', compared.funny_files)
        return False

    for subdir in compared.common_dirs:
        if not are_directories_identical(
            os.path.join(dir1, subdir),
            os.path.join(dir2, subdir),
        ):
            print("DIFF", os.path.join(dir1, subdir), os.path.join(dir2, subdir))
            return False
    return True

# inspect the content of the files to manually check for numerical approximation errors
# if the output matrices are not close up to the machine precision, then the difference is not due to numerical errors
loc = ZarrLocation(input_img)
reader = Reader(loc)()
nodes = list(reader)
# assert len(nodes) == 1
node = nodes[0]
im_read = node.load(Multiscales).array(resolution="1", version=CurrentFormat().version)

loc = ZarrLocation(output_img)
reader = Reader(loc)()
nodes = list(reader)
# assert len(nodes) == 1
node = nodes[0]
im_read2 = node.load(Multiscales).array(resolution="1", version=CurrentFormat().version)
print(im_read[:5, :5, 0].compute())
print(im_read2[:5, :5, 0].compute())

##
assert are_directories_identical(input_img, output_img)

Running the script.... the assert are_directories_identical() fails due to the presence of labels, and the .zattrs is different because the original contains extra omero channel information. But this is easier to compare the mis-match pixel values...

$ python write_dask_test.py 
[[144 214 235 235 228]
 [141 206 228 226 222]
 [143 206 224 226 217]]
[[146 204 232 234 228]
 [140 197 225 227 222]
 [147 196 222 224 217]]
compared.left_only ['labels']
compared.right_only []
compared.diff_files ['.zattrs']
compared.funny_files []
Traceback (most recent call last):
  File "/Users/wmoore/Desktop/python-scripts/zarr_scripts/write_dask_test2.py", line 99, in <module>
    assert are_directories_identical(input_img, output_img)
AssertionError

To visually compare them I opened the lower resolution array in napari...
Choosing NOT to use ome-zarr-py to read. Either npe2 or builtins works fine.

$ napari debug0.zarr/1

Then I dragged the dir debug1.zarr/1 onto the napri window to open that in the same way.
By colouring one of these layers red and one green and using additive blending of the top layer, you get a yellow layer where the pixel values are equal.
But if you zoom in you can see where there is a difference, where some pixels are more red or green than yellow.

Screenshot 2022-09-05 at 16 33 02

I'm not entirely sure why this is happening.
As far as I can see, both the dask and non-dask resizing are using skimage.transform.resize to do the underlying resizing.
The non-dask implementation is doing this with a 2D plane at a time, whereas the dask resize is using da.map_blocks to iterate through the chunks, applying the resize.

In order to dig deeper, we'd need a test that directly compares the resize methods themselves, comparing resize of different shaped arrays etc.
It might be possible to get the same pixel values in resized data via the 2 different methods, but it might not. I'm not familiar enough with zarr/dask to know without more work.

Hopefully that's possible, but if they don't get any closer in their output, how much of a problem is that?
I'm assuming that downsampling is mostly (exclusively?) used for visualisation rather than analysis, so the slight differences we're seeing here might not be a deal-breaker?

Anyway, thanks for highlighting these differences. It's good to know, and I'll see if I can understand what's going on a bit better...

@LucaMarconato
Copy link
Contributor Author

Perfect thanks a lot for the detailed explanation! I have modified our tests so that we compare only the full resolution image, since those small differences should not be a problem for visualization of downstream tasks based on the lower resolution images.

LucaMarconato added a commit to scverse/spatialdata that referenced this issue Sep 5, 2022
@will-moore
Copy link
Member

Dask writing support now released in ome-zarr-py 0.6.0.
If there are any new issues with that, please open them in another issue. Closing this one for now...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants