
[Bug]: Iterative write for compound reference datasets takes 5ever #144

Closed
3 tasks done
sneakers-the-rat opened this issue Dec 7, 2023 · 1 comment · Fixed by #146
Labels
category: bug, category: enhancement, priority: medium
Milestone
Next Release

Comments

@sneakers-the-rat
Contributor

What happened?

hello hello! checking out how you're structuring the zarr export so I can mimic it, and on the first try I wasn't able to complete an export: after ~10m or so I had only written ~6MB of ~3GB, so I got to profiling. I searched for an existing issue but didn't find one; sorry if this has been raised already.

The problem is here:

for j, item in enumerate(data):
    new_item = list(item)
    for i in refs:
        new_item[i] = self.__get_ref(item[i], export_source=export_source)
    dset[j] = new_item

where each row has to be separately serialized on write.
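
To illustrate that the cost comes from the per-row writes themselves, here's a standalone sketch (not hdmf_zarr code) that fills an object-dtype zarr array row by row and then again with a single bulk assignment. The shapes, codec choice, and row contents are made up for illustration, and it assumes the zarr v2 API:

import time
import numpy as np
import zarr
from numcodecs import JSON

# Illustrative data: 500 rows of 3 mixed-type fields (values are made up)
n_rows, n_cols = 500, 3
rows = [[j, "ref-%d" % j, None] for j in range(n_rows)]

# Per-row assignment: every dset[j] = ... re-encodes and rewrites the chunk
slow = zarr.open("demo_iterative.zarr", mode="w",
                 shape=(n_rows, n_cols), dtype=object, object_codec=JSON())
t0 = time.time()
for j, row in enumerate(rows):
    slow[j] = row
print("iterative write:", time.time() - t0)

# Bulk assignment: build an object-dtype ndarray in memory, then write once
fast = zarr.open("demo_bulk.zarr", mode="w",
                 shape=(n_rows, n_cols), dtype=object, object_codec=JSON())
arr = np.empty((n_rows, n_cols), dtype=object)
arr[:] = rows
t0 = time.time()
fast[...] = arr
print("bulk write:", time.time() - t0)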

I see this behavior elsewhere, like in __list_fill__, so it seems to be a general problem with writing element-by-element into object-type zarr arrays. When I tried to collect all the new items in a list and write them in one go with dset[...] = new_items, I got a shape error (and couldn't fix it by setting the shape correctly in require_dataset).

What did work, though, is doing this:

if len(refs) > 0:
    dset = parent.require_dataset(name,
                  shape=(len(data), len(data[0])),  # 2-D: one row per item, one column per field
                  dtype=object,
                  object_codec=self.__codec_cls(),
                  **options['io_settings'])
    self._written_builders.set_written(builder)  # record that the builder has been written
    dset.attrs['zarr_dtype'] = type_str

    # resolve all references in memory first ...
    new_items = []
    for j, item in enumerate(data):
        new_item = list(item)
        for i in refs:
            new_item[i] = self.__get_ref(item[i], export_source=export_source)
        new_items.append(new_item)

    # ... then issue a single bulk write instead of one write per row
    # (np is numpy, already imported at module level)
    dset[...] = np.array(new_items)

The only difference that I could see is that rather than returning a python list when accessing the data again, it returned an (object-typed) numpy array.

The difference in timing is considerable (in my case, tool-breaking, which is why I'm raising this as a bug rather than a feature request). Converting this file, which has 9 compound datasets with references, and putting a timer around the above block gives:

Array write (s)          Iterative write (s)
0.03479409217834473      1.675
0.010181903839111328     0.205
0.19667983055114746      62.746
1.0213229656219482       1808.36
2.000258207321167        (dnf, it's been 30m and i need to go home)
0.31616997718811035      ...
0.0024509429931640625    ...
0.3125300407409668       ...
0.01106715202331543      ...

I only spent ~20m or so profiling and debugging, so this doesn't replicate the current top-level behavior exactly (it creates an object array rather than a list of lists), but imo it might be better: the result has the same shape as the source dataset (so, in this case, an n x 3 array) and can be indexed as such, whereas the current implementation can't index by column, since the array is just a one-dimensional array of serialized python lists. It also looks like it could be improved further still by using a numpy structured dtype (zarr seems to be able to handle that, at least from the docs, but the require_dataset method doesn't like it; see the sketch below), but hopefully this is useful as a first pass.
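
For reference, a minimal standalone sketch of zarr's structured-dtype support (not hdmf_zarr code; the field names are made up, and the reference column would presumably still need some encoding such as a fixed-width string, since object references can't live directly in a fixed-width field):

import numpy as np
import zarr

# Made-up compound layout: two integer fields plus a fixed-width string field
dt = np.dtype([("idx_start", "<i8"), ("count", "<i8"), ("target", "S64")])
data = np.zeros(5, dtype=dt)
data["idx_start"] = np.arange(5)
data["target"] = b"/acquisition/example"   # placeholder value, not a real reference

z = zarr.array(data)        # zarr (v2) accepts the structured dtype directly
print(z[0])                 # a single row, returned as a numpy void scalar
print(z[:]["count"])        # read back and select one field by name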

The "steps to reproduce" below is just what I put in a test script and ran cProfile on to diagnose:

Steps to Reproduce

from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

in_file = "/location/of/data/nwb/sub-738651046_ses-760693773.nwb"
out_file = "./test2.zarr"

with NWBHDF5IO(in_file, 'r', load_namespaces=True) as read_io:
    with NWBZarrIO(out_file, mode="w") as export_io:
        export_io.export(src_io=read_io, write_args={'link_data':False})

Traceback

(not really a traceback kinda problem)

Operating System

macOS

Python Executable

Python

Python Version

3.11

Package Versions

environment_for_issue.txt

Code of Conduct

@oruebel added the category: bug, category: enhancement, and priority: medium labels Dec 7, 2023
@oruebel added this to the Next Release milestone Dec 7, 2023
@oruebel
Contributor

oruebel commented Dec 7, 2023

Thanks for the detailed issue.

Just for background, the reason the Zarr backend needs to inspect all the values here is because Zarr does not natively support object references. So to work around this issue we represent references as JSON objects to allow us to describe and resolve references.
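
For anyone following along, a purely illustrative sketch of the idea; the key names and values below are made up and may not match the actual schema hdmf_zarr writes:

# A compound row where the third column holds a reference, encoded as a plain
# JSON-serializable dict so the object codec can store it (illustrative only)
row = [
    0,                                               # e.g. idx_start
    10,                                              # e.g. count
    {"source": ".", "path": "/acquisition/example"}  # the "reference"
]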

Reducing the number of write operations by generating the whole dataset in memory first and then writing it in a single write call (rather than writing each element one at a time) is definitely good practice. In most files the number of references is relatively small, but in this case it seems there are a lot of references, and if they are in compound datasets (e.g., a TimeSeriesReferenceVectorData) then the problem is probably further exacerbated. We'll take a look, or if you want, feel free to create a PR with your fix that we can work on together.
