Deleting bins from scool on append leaves dead space in file #451
This issue arises from how HDF5 handles space when deleting or modifying groups and datasets. When you delete a group or dataset in an HDF5 file, the space occupied by those objects is not automatically freed. Instead, it is marked as free space but remains part of the file. This can cause file size bloat when repeated deletions and additions occur, as you've observed with the create_scool function in append mode.
Use file compression: when adding new data, consider using compression to save space. Some HDF5 libraries support automatic compression, reducing the file size even if some space is not reclaimed.
Consider defragmentation: if the file size is a significant concern and space isn't being freed, consider using a tool like h5repack (for HDF5 files) to compact the file and reclaim unused space.
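This behavior can be demonstrated directly with h5py (a minimal sketch, independent of cooler; by default HDF5 does not persist free-space tracking across file sessions, so space freed by a delete cannot be reused after the file is reopened):

```python
import os

import h5py
import numpy as np

path = "dead_space_demo.h5"

with h5py.File(path, "w") as f:
    f.create_dataset("bins", data=np.zeros(100_000))
size_before = os.path.getsize(path)

with h5py.File(path, "a") as f:
    del f["bins"]  # marks the space as free; the file does not shrink

# After reopening, the freed block is forgotten, so the rewritten
# dataset is allocated at the end of the file and the file grows.
with h5py.File(path, "a") as f:
    f.create_dataset("bins", data=np.zeros(100_000))
size_after = os.path.getsize(path)

print(size_before, size_after)
```

Despite the file ending up with identical contents, `size_after` is larger than `size_before`, which is exactly the bloat pattern reported here.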
Hi, I'm aware of this. It would be easy to avoid causing these issues for users of the method in the first place by adding a simple check that verifies the submitted bins are identical to the existing bins, and skips deleting and rewriting them if so. Otherwise, users who append to a file many times iteratively may end up having to repack a large .scool file, which could be slow, and which this adjustment to the create_scool method would make unnecessary.
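The suggested check might look something like the following sketch (bins_unchanged is a hypothetical helper, not part of cooler's API):

```python
import pandas as pd


def bins_unchanged(existing: pd.DataFrame, submitted: pd.DataFrame) -> bool:
    """Hypothetical check create_scool could perform in append mode:
    return True when the submitted bins match the bins already stored,
    so the delete-and-rewrite of the bins group can be skipped."""
    if list(existing.columns) != list(submitted.columns):
        return False
    return existing.reset_index(drop=True).equals(
        submitted.reset_index(drop=True)
    )


bins = pd.DataFrame(
    {"chrom": ["chr1", "chr1"], "start": [0, 1000], "end": [1000, 2000]}
)
print(bins_unchanged(bins, bins.copy()))  # True: skip the rewrite
```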
I also think it would be useful if the create() method that create_scool calls did not validate that the bins dataframe contains the chrom, start, and end columns, since it does not actually use them. Otherwise, if bin columns besides chrom, start, and end are to be passed for a large collection of cells, redundant and unused copies of these columns must be supplied in distinct dataframes, one per cell, which can use substantial memory. The workaround I'm currently using is to set these columns to None for all but the first cell passed in; I think relaxing this check would make for a nicer interface.
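The workaround described here could look roughly like this sketch (the column names and the shape of the per-cell bins dict are assumptions based on this discussion, not cooler's documented API):

```python
import pandas as pd

# Shared genomic coordinates, needed only once.
chrom_bins = pd.DataFrame(
    {"chrom": ["chr1"] * 3, "start": [0, 1000, 2000], "end": [1000, 2000, 3000]}
)

cell_ids = ["cell1", "cell2", "cell3"]
bins_per_cell = {}
for i, cell_id in enumerate(cell_ids):
    per_cell = pd.DataFrame({"weight": [1.0, 1.0, 1.0]})
    if i == 0:
        # Only the first cell carries the real chrom/start/end columns.
        per_cell = pd.concat([chrom_bins, per_cell], axis=1)
    else:
        # Placeholder columns satisfy the validation check without
        # duplicating the genomic coordinates in memory.
        per_cell[["chrom", "start", "end"]] = None
    bins_per_cell[cell_id] = per_cell
```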
Just food for thought, since based on the design of the code I wasn't sure
if you were aware of these somewhat hidden issues! Thanks.
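A minimal h5py sketch of the compression suggestion, assuming h5py is available; gzip is the filter built into every HDF5 installation. Note that compression shrinks newly written datasets but does not reclaim dead space already in the file; for that, h5repack rewrites the file into a compact copy (e.g. `h5repack bloated.scool compact.scool`).

```python
import os

import h5py
import numpy as np

data = np.zeros(1_000_000, dtype="int64")  # highly compressible demo data

with h5py.File("uncompressed.h5", "w") as f:
    f.create_dataset("pixels", data=data)

with h5py.File("compressed.h5", "w") as f:
    # gzip works in every HDF5 build; shuffle often improves ratios.
    f.create_dataset("pixels", data=data, compression="gzip", shuffle=True)

plain = os.path.getsize("uncompressed.h5")
packed = os.path.getsize("compressed.h5")
print(plain, packed)
```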
Thanks for raising this issue @bskubi. If I understand correctly, you are calling create_scool multiple times on the same file in append mode to add new single cell maps to an existing dataset. That wasn't an anticipated usage pattern, so I see how this issue arose. The contributor who wrote create_scool encoded the input as a dictionary of cell names to pixel dataframes, fully materialized, which is not suitable for large datasets. Enabling the pattern you are using without deleting and recreating the bin table would be a workaround.
Additionally (or alternatively), I've long been meaning to allow an iterator as input, to support incremental appending of single cell maps. Would this kind of solution fit your workflow? Something like:

def contact_map_iter():
    for cell_id in cell_ids:
        pixels = <fetch or generate pixels ... >
        yield cell_id, pixels

create_scool("foo.scool", bins, contact_map_iter())
Good point! Perhaps optional cell-specific bin columns could be rolled into the iterator solution.
The major issue in the current design as I see it is actually with the bins
argument, not the pixels argument. The pixels argument already allows
passing an iterator that generates pixels dataframes on the fly for the
associated cell.
By contrast, the bins dict requires that the values all be pandas dataframes, so the bins info for the entire dataset must be loaded into memory, including any weights or other per-cell bins-associated values.
The change I think would be best is something like this:
1. Allow passing a single bins dataframe via the bins argument, setting the chrom/start/end values of the shared bins for all cells.
2. Allow passing an iterator that generates (cell id, pixels iterator, cell-specific bins iterator) tuples. create_scool would then iterate through the pixels iterator, appending to the per-cell pixels dataset, then iterate through the bins iterator, appending to the per-cell bins dataset.
Does that make sense?
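The proposed input format might look like the following sketch (scool_input_iter and the chunk contents are illustrative assumptions, not an existing cooler interface):

```python
import pandas as pd


def scool_input_iter(cell_ids):
    """Yield one (cell_id, pixels_iterator, bins_iterator) triple per
    cell, so no per-cell table has to be fully materialized up front."""
    for cell_id in cell_ids:
        # Each inner iterator yields dataframe chunks that would be
        # appended to the corresponding per-cell dataset.
        pixels_chunks = iter(
            [pd.DataFrame({"bin1_id": [0], "bin2_id": [1], "count": [3]})]
        )
        bins_chunks = iter([pd.DataFrame({"weight": [1.0, 1.0]})])
        yield cell_id, pixels_chunks, bins_chunks


cell_id, pixels_it, bins_it = next(scool_input_iter(["cell1"]))
```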
In the HDF5 format, deleting groups or datasets does not free the corresponding space in the file.
I notice that in the create_scool implementation, if the function is called in append mode, it always deletes the original bins group and recreates it, even if the submitted bins are exactly the same. Calling create_scool multiple times with the same bins, but an empty pixels dict, converts the previous bins datasets into inaccessible dead space within the file, making the file larger and larger even though the actual groups and data within the file are the same.