Replies: 8 comments 2 replies
-
Another comparison.
Version failing. Code:
Effect:
Version working. It seems that enabling default compression makes it possible to write correct files larger than 2 GB. Code:
Effect:
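If I read the comparison correctly, the working setup relies on uproot's default compression while the failing one writes uncompressed data; that difference is controlled by the compression argument of uproot.recreate. A minimal sketch of the two calls (file names are illustrative):

```python
import uproot

# "Version failing": compression disabled explicitly
with uproot.recreate("output_uncompressed.root", compression=None) as fout:
    ...  # write the tree here

# "Version working": uproot's default compression (ZLIB level 1),
# which could also be passed explicitly as compression=uproot.ZLIB(1)
with uproot.recreate("output_compressed.root") as fout:
    ...  # write the same tree here
```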
-
The simplest code to reproduce the problem (path to the output file can be adjusted):
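A minimal sketch of such a reproduction, assuming an uncompressed output file and a single branch just over 2 GB (the path and array size here are only illustrative):

```python
import numpy as np
import uproot

# ~2.4 GB of float64, just above the 2**31 - 1 byte limit of a TKey size field
data = {"x": np.ones(300_000_000, dtype=np.float64)}

with uproot.recreate("/tmp/large.root", compression=None) as fout:
    fout["tree"] = data   # one TBasket per branch -> struct.error on the key sizes
```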
This snippet gives me the error:
-
Interesting, I've tried as well the same simple code but with compression enabled:

from pathlib import Path
import numpy as np
import uproot

ntuple_path = Path('file.root')
data_size = 1000_000_000
data_dict = {
    "x": np.ones(data_size, dtype=np.float64),
}
with uproot.recreate(ntuple_path) as fout:
    fout["tree"] = data_dict

The code took 20 min to run on my cluster, used ~40 GB of RAM and crashed with:

Traceback (most recent call last):
File "/memfs/7685922/bug.py", line 10, in <module>
fout["tree"] = data_dict
~~~~^^^^^^^^
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/writable.py", line 984, in __setitem__
self.update({where: what})
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/writable.py", line 1555, in update
uproot.writing.identify.add_to_directory(v, name, directory, streamers)
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/identify.py", line 152, in add_to_directory
tree.extend(data)
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/writable.py", line 1834, in extend
self._cascading.extend(self._file, self._file.sink, data)
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/_cascadetree.py", line 816, in extend
totbytes, zipbytes, location = self.write_np_basket(
^^^^^^^^^^^^^^^^^^^^^
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/_cascadetree.py", line 1399, in write_np_basket
uproot.reading._key_format_big.pack(
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
-
I've been meaning to get back to this. Maybe we could add an error message, but the ROOT format itself does not allow TBaskets to be bigger than 2 GB, because the TKey that precedes each TBasket stores its byte counts as 32-bit signed integers. See the key packing in

uproot5/src/uproot/writing/_cascadetree.py, lines 1399 to 1408 in 94c085b

and the definition of _key_format_big (line 2196 in 94c085b). Here's the ROOT definition of a TKey: https://root.cern.ch/doc/master/classTKey.html#ab2e59bcc49663466e74286cabd3d42c1 in which the sizes are likewise 32-bit integers.

Do you know that

file["tree_name"] = {"branch": branch_data}

writes all of the data as a single TBasket per branch? To write more than one TBasket, use

file["tree_name"] = {"branch": first_basket}
file["tree_name"].extend({"branch": second_basket})
file["tree_name"].extend({"branch": third_basket})
...

It can't be an interface that takes all of the data in one call, because the TBasket data might not fit in memory, especially if you have many TBranches (each with one TBasket). This interface is documented here.

In most files, ROOT TBaskets tend to be too small: they tend to be on the order of kilobytes, when it would be more efficient to read if they were megabytes. If you ask ROOT to make big TBaskets, on the order of megabytes or bigger, it just doesn't do it; there seems to be some internal limit. Uproot does exactly what you ask, and you were asking for gigabyte-sized TBaskets. Even if you didn't run into the 2 GB limit, I wonder if ROOT would be able to read them. Since it prevents the writing of TBaskets that large, I wouldn't be surprised if there's an implicit assumption in the reading code. Did you ever write 1 GB TBaskets and then read them back in ROOT?

About this issue, I think I can close it, because the format simply doesn't accept integers of that size, and most likely you intended to write multiple TBaskets with uproot.WritableTree.extend.
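A runnable sketch of that pattern, chunking an in-memory NumPy array into fixed-size TBaskets (the sizes, path, and branch name are illustrative):

```python
import numpy as np
import uproot

data = np.arange(50_000_000, dtype=np.float64)   # ~400 MB in total
chunk = 1_000_000                                 # ~8 MB uncompressed per TBasket

with uproot.recreate("/tmp/chunked.root") as fout:
    # The first assignment creates the TTree and fixes the branch types.
    fout["tree"] = {"x": data[:chunk]}
    # Each subsequent extend call appends one more TBasket per branch.
    for start in range(chunk, len(data), chunk):
        fout["tree"].extend({"x": data[start:start + chunk]})
```

The same pattern works when the chunks are produced lazily (for example, read from another file), so the full dataset never has to fit in memory.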
-
@jpivarski thanks for your detailed explanation. Indeed, the uproot documentation is great on those aspects; it just needs a careful reading. Can you suggest some methods for inspecting basket size and checking the exact limits? I am playing with uproot as a tool to convert large HDF files into something that can then be inspected online using JSROOT. The ROOT ntuples are transferred to an S3 filesystem provided by our supercomputing center (ACK Cyfronet in Krakow). As S3 provides a nice way to share the files via URL, JSROOT can load them directly; I exploit the partial-read-over-HTTP feature there (root-project/jsroot#284). I've played a bit with basket size, and for my reading use case the optimum is about 1,000,000 (10^6) rows/entries per basket. This gives the fastest loading time in JSROOT. My tree has ~20 branches, mostly of 64-bit floats. For a small benchmark I took the same HDF file and generated two ROOT files, one with 100k entries per basket and one with 1000k entries. You can play with them yourselves:
100k entries/basket
1000k entries/basket
In general I feel that having fewer HTTP requests of ~1 MB each gives the best performance; going down to 10k entries per basket slows JSROOT down even more.
The problem
Now the problem is the following: with 1,000,000 entries per basket I cannot process larger files.
When running this code I get an error with
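For scale, a back-of-the-envelope check of the basket sizes described above (assuming 64-bit floats and roughly 20 branches, as stated; the numbers are approximate):

```python
entries_per_basket = 1_000_000
bytes_per_entry = 8                 # float64
n_branches = 20

per_branch_basket = entries_per_basket * bytes_per_entry   # 8_000_000 bytes, ~8 MB uncompressed
per_extend_call = per_branch_basket * n_branches            # ~160 MB uncompressed across all branches
tkey_size_limit = 2**31 - 1                                 # per-TBasket limit from the 32-bit TKey fields
print(per_branch_basket, per_extend_call, tkey_size_limit)
```

So an individual 10^6-entry basket is far below the per-TBasket limit discussed above.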
-
@jpivarski I am not sure if a discussion on a closed issue is the best place? Should I convert this into a discussion (https://github.com/scikit-hep/uproot5/discussions)?
-
I wrote a small script to dump information on the baskets and on the contents of the ROOT files I've already generated. The optimal readout comes from the example file (I call it
The less optimal one is with another file (called
I was puzzled by the fact that I wasn't able to convert
I believe I may be hitting some limits in the ROOT I/O, but on the other hand the numbers are not that large (~1449 baskets in total, each a few MB in size). To test this problem I can provide links to the HDF file (https://s3p.cloud.cyfronet.pl/datarawlv2v4/20231204m2/4nA.slim.hdf) and a short script which translates HDF into ROOT using pyroot.
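A sketch of that kind of dump using uproot's TBranch properties (this is not the exact script used here; the file and tree names are placeholders):

```python
import uproot

with uproot.open("4nA.slim.root") as f:        # placeholder file name
    tree = f["tree"]                            # placeholder tree name
    print("entries:", tree.num_entries)
    for branch in tree.branches:
        offsets = branch.entry_offsets          # entry boundaries of the baskets
        entries_per_basket = [b - a for a, b in zip(offsets, offsets[1:])]
        print(
            branch.name,
            "baskets:", branch.num_baskets,
            "entries in first baskets:", entries_per_basket[:3],
            "compressed MB:", round(branch.compressed_bytes / 1e6, 2),
            "uncompressed MB:", round(branch.uncompressed_bytes / 1e6, 2),
        )
```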
-
I've managed to prepare a short example where an error occurs. The input file needed for the conversion can be wget'ed and converted using the scripts provided here. The recipe to reproduce the problem is the following:

wget https://s3p.cloud.cyfronet.pl/datarawlv2v4/20231204m2/4nA.slim.hdf
python -m venv venv
source venv/bin/activate
pip install uproot h5py click
python discussion1135.py 4nA.slim.hdf 4nA.slim.root 1>stdout.log 2>stderr.log

I just wanted to figure out whether the ROOT limits are so low that having few-MB baskets is not feasible, or whether I am hitting some other issue.
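For completeness, a hypothetical sketch of what such an HDF-to-ROOT conversion can look like with h5py and uproot (this is not the actual discussion1135.py; the tree name, the basket size, and the assumption that all top-level datasets are 1-D and equally long are mine):

```python
import h5py
import numpy as np
import uproot

entries_per_basket = 1_000_000   # assumed basket size

with h5py.File("4nA.slim.hdf", "r") as hdf, uproot.recreate("4nA.slim.root") as fout:
    # Assumes the columns are 1-D datasets at the top level of the HDF file.
    datasets = {name: hdf[name] for name in hdf.keys() if isinstance(hdf[name], h5py.Dataset)}
    n_entries = min(ds.shape[0] for ds in datasets.values())
    for start in range(0, n_entries, entries_per_basket):
        stop = min(start + entries_per_basket, n_entries)
        chunk = {name: np.asarray(ds[start:stop]) for name, ds in datasets.items()}
        if start == 0:
            fout["tree"] = chunk           # first chunk creates the TTree
        else:
            fout["tree"].extend(chunk)     # each call appends one TBasket per branch
```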
-
The problem
Trying to save an ntuple (TTree) with more than 2 GB of data and no compression fails with the following error:
The minimal code to reproduce:
Details
More details, with my original code from which the problem arose, are below. In the comments below I've also provided more examples.
I was trying to write a ROOT ntuple with the following code:
This works nicely as long as the files are small, say smaller than 2 GB.
When trying to save a larger file I get the following error:
I saw a similar error reported long ago here: scikit-hep/uproot3#462
Also, when looking at the source code of the extend method in class NTuple(CascadeNode), it seems that all calls to add_rblob are made with the big=False argument, which suggests that only 4-byte pointers are being used. See:
uproot5/src/uproot/writing/_cascadentuple.py
Line 779 in 8a42e7d
This is my uproot version: