[Bug]: Iterative write for compound reference datasets takes 5ever #144
Labels
- category: bug (errors in the code or code behavior)
- category: enhancement (improvements of code or code behavior)
- priority: medium (non-critical problem and/or affecting only a small set of users)
What happened?
hello hello! I was checking out how you're structuring the zarr export so I can mimic it, and my first try wasn't able to complete an export: after ~10m or so I had only written ~6MB of ~3GB, so I got to profiling. I searched for an existing issue but didn't find one; sorry if this has been raised already.
The problem is in hdmf-zarr/src/hdmf_zarr/backend.py, lines 1021 to 1026 (at commit 9c3b4be), where each row has to be separately serialized on write.
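Roughly, the pattern at those lines is this (a sketch based on the description above, not the exact backend.py source; `parent`, `data`, `name`, `refs`, and `get_ref` are placeholder names):

```python
import numcodecs

# Sketch of the per-row write pattern (placeholder names): the dataset is
# created as a 1-D object array, then each row is resolved, serialized,
# and written one at a time.
dset = parent.require_dataset(
    name,
    shape=(len(data),),
    dtype=object,
    object_codec=numcodecs.Pickle(),
)
for j, item in enumerate(data):
    new_item = list(item)
    for i in refs:                  # columns holding object references
        new_item[i] = get_ref(item[i])
    dset[j] = new_item              # one zarr write, and one serialization, per row
```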
I see this behavior elsewhere, like in __list_fill__, so it seems like this is a general problem with write-indexing into object-type zarr arrays. When I tried to collect all the new items in a list and write them like dset[...] = new_items, I got a shape error (and couldn't fix it by setting the shape correctly in the require_dataset call).

What did work, though, is doing this:
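(A sketch of the approach, with the same placeholder names as above: collect the resolved rows first, create the dataset with the full 2-D shape, and write everything once as an object-typed numpy array.)

```python
import numpy as np
import numcodecs

# Sketch of the working approach: resolve all references up front,
# then write the whole table in a single bulk assignment.
new_items = []
for item in data:
    new_item = list(item)
    for i in refs:
        new_item[i] = get_ref(item[i])
    new_items.append(new_item)

dset = parent.require_dataset(
    name,
    shape=(len(new_items), len(new_items[0])),  # n x 3 in this case
    dtype=object,
    object_codec=numcodecs.Pickle(),
)
dset[...] = np.array(new_items, dtype=object)   # one write for everything
```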
The only difference I could see is that, rather than returning a Python list when accessing the data again, it returned an (object-typed) numpy array.
The difference in timing is considerable (in my case, tool-breaking, which is why I'm raising this as a bug rather than a feature request). When converting this file, which has 9 compound datasets with references, I put a timer around the block above, and the per-row writes were slow enough that the export effectively never finished.
I only spent ~20m or so profiling and debugging, so this doesn't replicate the current top-level behavior exactly (it creates an object array rather than a list of lists), but imo it might be better: the result is the same shape as the source dataset (so, in this case, an n x 3 array) and can be indexed as such, whereas the current implementation can't index by column, since the array is just a one-dimensional array of serialized Python lists. It also looks like it could be improved further still by using a numpy structured dtype (zarr seems to be able to handle that, at least from the docs, but the require_dataset method doesn't like it), etc., but hopefully this is useful as a first pass.
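For comparison, plain zarr does accept a numpy structured dtype directly, which would keep per-column access without object serialization (illustrative sketch only; the field names are made up):

```python
import numpy as np
import zarr

# Illustrative only: a structured dtype stored directly in zarr.
dt = np.dtype([("a", "i8"), ("b", "i8"), ("c", "i8")])  # made-up fields
z = zarr.empty(5, dtype=dt)
z[...] = np.array([(i, i + 1, i + 2) for i in range(5)], dtype=dt)
row = z[0]   # a numpy structured scalar; row["a"] gives the column value
```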
The "Steps to Reproduce" below is just what I put in a test script and ran cProfile on to diagnose:

Steps to Reproduce
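(A minimal standalone sketch of the effect the script measured, using plain zarr with placeholder data rather than the original hdmf-zarr conversion:)

```python
import time

import numpy as np
import zarr
from numcodecs import Pickle

# Minimal benchmark of per-row vs. bulk writes into object-dtype zarr
# arrays, mimicking compound rows where one column holds a reference.
n = 5_000
rows = [[i, f"ref-{i}", i * 2] for i in range(n)]

# Per-row writes, as in the current code path.
z1 = zarr.empty(n, dtype=object, object_codec=Pickle())
t0 = time.perf_counter()
for j, row in enumerate(rows):
    z1[j] = row
print(f"per-row writes: {time.perf_counter() - t0:.2f}s")

# One bulk write into a 2-D object array.
z2 = zarr.empty((n, 3), dtype=object, object_codec=Pickle())
t0 = time.perf_counter()
z2[...] = np.array(rows, dtype=object)
print(f"bulk write:     {time.perf_counter() - t0:.2f}s")
```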
Traceback
Operating System
macOS
Python Executable
Python
Python Version
3.11
Package Versions
environment_for_issue.txt