
[Bug]: Iterative write for compound reference datasets takes 5ever #144

Closed
3 tasks done
sneakers-the-rat opened this issue Dec 7, 2023 · 1 comment · Fixed by #146
Labels
category: bug, category: enhancement, priority: medium
Milestone
Next Release

Comments

@sneakers-the-rat
Contributor

What happened?

hello hello! checking out how you're structuring the zarr export so I can mimic it, and on the first try I wasn't able to complete an export: after ~10m or so I had only written ~6MB of ~3GB, so I got to profiling. I searched for an existing issue but didn't find one; sorry if this has been raised already.

The problem is here:

for j, item in enumerate(data):
    new_item = list(item)
    for i in refs:
        new_item[i] = self.__get_ref(item[i], export_source=export_source)
    dset[j] = new_item

where each row has to be separately serialized on write.
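
To illustrate that the cost comes from the per-row writes themselves, here's a standalone sketch (not hdmf_zarr code) that fills an object-dtype zarr array row by row and then again with a single bulk assignment. The shapes, codec choice, and row contents are made up for illustration, and it assumes the zarr v2 API:

import time
import numpy as np
import zarr
from numcodecs import JSON

# Illustrative data: 500 rows of 3 mixed-type fields (values are made up)
n_rows, n_cols = 500, 3
rows = [[j, "ref-%d" % j, None] for j in range(n_rows)]

# Per-row assignment: every dset[j] = ... re-encodes and rewrites the chunk
slow = zarr.open("demo_iterative.zarr", mode="w",
                 shape=(n_rows, n_cols), dtype=object, object_codec=JSON())
t0 = time.time()
for j, row in enumerate(rows):
    slow[j] = row
print("iterative write:", time.time() - t0)

# Bulk assignment: build an object-dtype ndarray in memory, then write once
fast = zarr.open("demo_bulk.zarr", mode="w",
                 shape=(n_rows, n_cols), dtype=object, object_codec=JSON())
arr = np.empty((n_rows, n_cols), dtype=object)
arr[:] = rows
t0 = time.time()
fast[...] = arr
print("bulk write:", time.time() - t0)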

I see this behavior elsewhere, like in __list_fill__, so it seems to be a general problem with writing element-by-element into object-type zarr arrays. When I tried to collect all the new items in a list and write them in one go with dset[...] = new_items, I got a shape error (and couldn't fix it by setting the shape correctly in require_dataset).

What did work, though, is doing this:

if len(refs) > 0:
    dset = parent.require_dataset(name,
                  shape=(len(data), len(data[0])),  # 2-D: one row per item, one column per field
                  dtype=object,
                  object_codec=self.__codec_cls(),
                  **options['io_settings'])
    self._written_builders.set_written(builder)  # record that the builder has been written
    dset.attrs['zarr_dtype'] = type_str

    # resolve all references in memory first ...
    new_items = []
    for j, item in enumerate(data):
        new_item = list(item)
        for i in refs:
            new_item[i] = self.__get_ref(item[i], export_source=export_source)
        new_items.append(new_item)

    # ... then issue a single bulk write instead of one write per row
    # (np is numpy, already imported at module level)
    dset[...] = np.array(new_items)

The only difference that I could see is that rather than returning a python list when accessing the data again, it returned an (object-typed) numpy array.

The difference in timing is considerable (in my case, tool-breaking, which is why I'm raising this as a bug rather than a feature request). Converting this file, which has 9 compound datasets with references, and putting a timer around the above block gives:

Array write (s)          Iterative write (s)
0.03479409217834473      1.675
0.010181903839111328     0.205
0.19667983055114746      62.746
1.0213229656219482       1808.36
2.000258207321167        (dnf, it's been 30m and i need to go home)
0.31616997718811035      ...
0.0024509429931640625    ...
0.3125300407409668       ...
0.01106715202331543      ...

I only spent ~20m or so profiling and debugging, so this doesn't replicate the current top-level behavior exactly (it creates an object array rather than a list of lists), but imo it might be better: the result has the same shape as the source dataset (so, in this case, an n x 3 array) and can be indexed as such, whereas the current implementation can't index by column, since the array is just a one-dimensional array of serialized python lists. It also looks like it could be improved further still by using a numpy structured dtype (zarr seems to be able to handle that, at least from the docs, but the require_dataset method doesn't like it; see the sketch below), but hopefully this is useful as a first pass.
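
For reference, a minimal standalone sketch of zarr's structured-dtype support (not hdmf_zarr code; the field names are made up, and the reference column would presumably still need some encoding such as a fixed-width string, since object references can't live directly in a fixed-width field):

import numpy as np
import zarr

# Made-up compound layout: two integer fields plus a fixed-width string field
dt = np.dtype([("idx_start", "<i8"), ("count", "<i8"), ("target", "S64")])
data = np.zeros(5, dtype=dt)
data["idx_start"] = np.arange(5)
data["target"] = b"/acquisition/example"   # placeholder value, not a real reference

z = zarr.array(data)        # zarr (v2) accepts the structured dtype directly
print(z[0])                 # a single row, returned as a numpy void scalar
print(z[:]["count"])        # read back and select one field by name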

The "steps to reproduce" below is just what I put in a test script and ran cProfile on to diagnose:

Steps to Reproduce

from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

in_file = "/location/of/data/nwb/sub-738651046_ses-760693773.nwb"
out_file = "./test2.zarr"

with NWBHDF5IO(in_file, 'r', load_namespaces=True) as read_io:
    with NWBZarrIO(out_file, mode="w") as export_io:
        export_io.export(src_io=read_io, write_args={'link_data':False})

Traceback

(not really a traceback kinda problem)

Operating System

macOS

Python Executable

Python

Python Version

3.11

Package Versions

environment_for_issue.txt

Code of Conduct

@oruebel added the category: bug, category: enhancement, and priority: medium labels Dec 7, 2023
@oruebel added this to the Next Release milestone Dec 7, 2023
@oruebel
Contributor

oruebel commented Dec 7, 2023

Thanks for the detailed issue.

Just for background, the reason the Zarr backend needs to inspect all the values here is because Zarr does not natively support object references. So to work around this issue we represent references as JSON objects to allow us to describe and resolve references.
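
For anyone following along, a purely illustrative sketch of the idea; the key names and values below are made up and may not match the actual schema hdmf_zarr writes:

# A compound row where the third column holds a reference, encoded as a plain
# JSON-serializable dict so the object codec can store it (illustrative only)
row = [
    0,                                               # e.g. idx_start
    10,                                              # e.g. count
    {"source": ".", "path": "/acquisition/example"}  # the "reference"
]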

Reducing the number of write operations by generating the whole dataset in memory first and then writing it in a single write call (rather than writing each element one at a time) is definitely good practice. In most files the number of references is relatively small, but in this case it seems there are a lot of references, and if they are in compound datasets (e.g., a TimeSeriesReferenceVectorData) then the problem is probably further exacerbated. We'll take a look, or if you want, feel free to create a PR with your fix that we can work on together.
