Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Cannot overwrite existing HDF5 .nwb file with mode="w" #117

Closed
3 tasks done
CodyCBakerPhD opened this issue Aug 28, 2023 · 5 comments · Fixed by #229
Closed
3 tasks done

[Bug]: Cannot overwrite existing HDF5 .nwb file with mode="w" #117

CodyCBakerPhD opened this issue Aug 28, 2023 · 5 comments · Fixed by #229
Assignees
Labels
category: bug errors in the code or code behavior priority: medium non-critical problem and/or affecting only a small set of users topic: storage issues related to storage schema and formats
Milestone

Comments

@CodyCBakerPhD
Copy link
Contributor

What happened?

Ran into this during testing on NeuroConv

Steps to Reproduce

from pynwb import NWBHDF5IO
from hdmf_zarr import NWBZarrIO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.base import mock_TimeSeries

with NWBHDF5IO(path="C:/Users/Raven/Downloads/test_hdf5.nwb", mode="w") as io:
    nwbfile = mock_NWBFile()
    nwbfile.add_acquisition(mock_TimeSeries())
    io.write(nwbfile)

with NWBZarrIO(path="C:/Users/Raven/Downloads/test_hdf5.nwb", mode="w") as io:
    nwbfile = mock_NWBFile()
    nwbfile.add_acquisition(mock_TimeSeries())
    io.write(nwbfile)

Traceback

Traceback (most recent call last):

  Cell In[8], line 1
    with NWBZarrIO(path="C:/Users/Raven/Downloads/test_hdf5.nwb", mode="w") as io:

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf\utils.py:644 in func_call
    return func(args[0], **pargs)

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf_zarr\nwb.py:52 in __init__
    super(NWBZarrIO, self).__init__(path,

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf\utils.py:644 in func_call
    return func(args[0], **pargs)

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf_zarr\backend.py:124 in __init__
    super().__init__(manager, source=source_path)

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf\utils.py:644 in func_call
    return func(args[0], **pargs)

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf\backends\io.py:41 in __init__
    self.open()

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf_zarr\backend.py:156 in open
    self.__file = zarr.open(store=self.path,

  File ~\AppData\Roaming\Python\Python310\site-packages\zarr\convenience.py:102 in open
    _store: BaseStore = normalize_store_arg(

  File ~\AppData\Roaming\Python\Python310\site-packages\zarr\storage.py:197 in normalize_store_arg
    return normalize_store(store, storage_options, mode)

  File ~\AppData\Roaming\Python\Python310\site-packages\zarr\storage.py:177 in _normalize_store_arg_v2
    return DirectoryStore(store)

  File ~\AppData\Roaming\Python\Python310\site-packages\zarr\storage.py:1062 in __init__
    raise FSPathExistNotDir(path)

FSPathExistNotDir: path exists but is not a directory: %r

Operating System

Windows

Python Executable

Conda

Python Version

3.10

Package Versions

No response

Code of Conduct

@oruebel
Copy link
Contributor

oruebel commented Aug 28, 2023

Thanks for issue and helpful example. Using the same file with different backends (here creating test_hdf5.nwb with HDF5 and then overwriting with Zarr) is not possible in most cases. I think the best we can do here is to check if the existing file is readable with the I/O backend and raise a ValueError to indicate that the existing file uses a different backend. What do you think?

@CodyCBakerPhD
Copy link
Contributor Author

I don't plan on using the same file with a different backend; the mode="w" to me, indicates the behavior 'delete if existing' which shouldn't matter if it is one backend or another

@CodyCBakerPhD
Copy link
Contributor Author

Example: I tried writing my NWB file using DANDI filename convention ./sub-{subject_id}_ses-{session_id}.nwb first using the HDF5 backend, then decided I wanted to switch to Zarr

Slightly annoying if I have to manually delete the file file if it exists before writing it again; I've always utilized the auto-overwrite of mode="w" to achieve this

@oruebel
Copy link
Contributor

oruebel commented Aug 28, 2023

the mode="w" to me, indicates the behavior 'delete if existing' which shouldn't matter if it is one backend or another

In this case we would need to manually delete the file if it from a different backend before opening it. I.e., the logic would need to be something like

if self.mode == 'w':
  if os.path.exists(filename) and not self.can_read(filename):
      if os.path.isdir(filename):
          shutil.rmtree(filename)
      else:
         os.remove(filename)

This would probably need to be done in all backends (HDF5 and Zarr). Since this is deleting files that are not from the particular backend, I think it would be best to make this configurable, e.g., by specifing and additional parameter, e.g., force_overwrite=True, to make sure that this is something the user explicitly decides to do and avoid users accidentally deleting files.

@rly
Copy link
Contributor

rly commented Aug 28, 2023

This is ultimately an issue with Zarr (and h5py) -- both libraries raise an error and do not overwrite if the filepath is of a different type than expected (see below). But we can improve on that user experience. I think @oruebel 's suggested fix above is good. By the way, the TAB is thinking of recommending using the ".nwb.zarr" suffix for NWB Zarr files. Alessio/AIND is already doing that.

When Zarr tries to write to a directory that exists as a non-zarr directory (OK):

>>> import zarr
>>> import os
>>> os.mkdir("temp")
>>> z1 = zarr.open("temp", mode="w", shape=(2, 2), dtype="i4")
>>> z1[:] = 42  # creates zarr structure with data

When Zarr tries to write to a directory that exists as a file (error):

>>> import zarr
>>> with open("temp.txt", "w"):
...     pass
... 
>>> zarr.open("temp.txt", "w")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/zarr/convenience.py", line 96, in open
    _store: BaseStore = normalize_store_arg(
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/zarr/storage.py", line 181, in normalize_store_arg
    return normalize_store(store, storage_options, mode)
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/zarr/storage.py", line 163, in _normalize_store_arg_v2
    return DirectoryStore(store)
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/zarr/storage.py", line 1036, in __init__
    raise FSPathExistNotDir(path)
zarr.errors.FSPathExistNotDir: path exists but is not a directory: %r

When h5py tries to write to a file that exists as a non-hdf5 file (OK):

>>> import h5py
>>> import os
>>> os.mkdir("temp")
>>> with open("temp.txt", "w"):
...     pass
... 
>>> with h5py.File("temp.txt", "w"):  # creates empty hdf5 file
...     pass
... 

When h5py tries to write to a file that exists as a directory (error):

>>> import h5py
>>> import os
>>> os.mkdir("temp")
>>> h5py.File("temp", "w")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/h5py/_hl/files.py", line 567, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/h5py/_hl/files.py", line 237, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 126, in h5py.h5f.create
IsADirectoryError: [Errno 21] Unable to create file (unable to open file: name = 'temp', errno = 21, error message = 'Is a directory', flags = 13, o_flags = 602)

@mavaylon1 mavaylon1 added category: bug errors in the code or code behavior priority: medium non-critical problem and/or affecting only a small set of users topic: storage issues related to storage schema and formats labels Apr 16, 2024
@mavaylon1 mavaylon1 self-assigned this Apr 16, 2024
@mavaylon1 mavaylon1 added this to the 0.8.0 milestone Apr 16, 2024
@rly rly closed this as completed in #229 Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: bug errors in the code or code behavior priority: medium non-critical problem and/or affecting only a small set of users topic: storage issues related to storage schema and formats
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants