
Deleting bins from scool on append leaves dead space in file #451

Open
bskubi opened this issue Jan 8, 2025 · 4 comments

Comments

@bskubi

bskubi commented Jan 8, 2025

In the HDF5 format, deleting groups or datasets does not free the space they occupied in the file.

I notice that in the create_scool implementation, when the function is called in append mode, it always deletes the original bins group and recreates it, even if the submitted bins are exactly the same. Calling create_scool multiple times with the same bins but an empty pixels dict turns the previous bins datasets into inaccessible dead space within the file, so the file grows larger and larger even though the actual groups and data within it are unchanged.

@srinitha709

This issue arises from how HDF5 handles space when deleting or modifying groups and datasets. When you delete a group or dataset in an HDF5 file, the space it occupied is not automatically returned; it is marked as free but remains part of the file. Repeated deletions and additions therefore bloat the file, as you've observed with create_scool in append mode. Two mitigations:

Use compression: when adding new data, consider enabling HDF5 dataset compression to reduce the amount of data written, even though dead space is still not reclaimed.

Consider defragmentation: if file size is a significant concern and space isn't being freed, use a tool like h5repack (from the HDF5 tools) to rewrite the file and reclaim unused space.
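For reference, a sketch of the h5repack workaround (the file names here are placeholders; h5repack ships with the standard HDF5 tools):

```shell
# Rewrite only the live objects into a fresh file, dropping dead space.
h5repack bloated.scool compacted.scool

# Optionally apply gzip compression to datasets while repacking.
h5repack -f GZIP=6 bloated.scool compacted.scool
```

Note that h5repack writes a new file rather than compacting in place, so you need enough disk space for both copies during the operation.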

@bskubi
Author

bskubi commented Jan 10, 2025 via email

@nvictus
Member

nvictus commented Jan 11, 2025

Thanks for raising this issue @bskubi.

If I understand correctly, you are calling create_scool multiple times on the same file in append mode to add new single cell maps to an existing dataset. That wasn't an anticipated usage pattern, so I see how this issue arose. The contributor who wrote create_scool encoded the input as a dictionary of cell names to pixel dataframes, fully materialized, which is not suitable for large datasets. Enabling the pattern you are using without deleting and recreating the bin table would be a workaround.

Additionally (or alternatively), I've long been meaning to allow an iterator as input, to support incremental appending of single cell maps. Would this kind of solution fit your workflow? Something like:

def contact_map_iter():
    for cell_id in cell_ids:
        pixels = ...  # fetch or generate the pixels dataframe for this cell
        yield cell_id, pixels

create_scool("foo.scool", bins, contact_map_iter())

> I also think it would be useful if the create() method that create_scool calls did not validate that the bins dataframe contains the chrom, start, and end columns, since it does not actually use them.

Good point! Perhaps optional cell-specific bin columns could be rolled into the iterator solution.

@bskubi
Author

bskubi commented Jan 13, 2025 via email
