Problems in the IO code #393

jmcarcell · 2023-03-21T11:42:43Z

When implementing support for RNTuple I found a few issues with some parts of the IO code that will have to be modified or worked around for RNTuple to work.

At least in one place a pointer that is supposed to point to the data points to an unique_ptr. This place is:

podio/python/templates/CollectionData.cc.jinja2

Line 94 in c7328d6

isSubsetColl ? nullptr : (void*)&m_data,
When converting buffers for writing, there is a three star cast: What this cast does is basically to return a pointer to vector.data() instead of a pointer to the vector itself. This also relies on the fact that implementations of std::vector save the pointer to the actual array in the first size_ptr bytes (if this wasn't the case the cast wouldn't even point to the data array):

podio/include/podio/CollectionBuffers.h

Line 40 in c7328d6

return *static_cast<std::vector<T>**>(raw);

How all of this works together is a mystery to me, as the TTree functions seem to be happy but the RNTuple ones don't. For the first point, I think we also probably rely on the fact that unique_ptr can have size zero. The first point can be easily worked around by saving another pointer to the actual data and using that one. Fixing any of those points will cause all the writers and readers to stop writing where they either crash or (the writers) don't crash but they don't write the data correctly.

The text was updated successfully, but these errors were encountered:

tmadlener · 2023-03-21T13:29:07Z

Let me try to bring some information on the original intent into this. I have to point out, that not all of this is ideally implemented yet, and there is still some historical baggage around that we didn't come around to fixing yet, because schema evolution would have touched similar code. I am working on getting closer to the ideas described below.

There are two types of collection buffers, which mainly differ by the ownership of the data they transport:

CollectionWriteBuffers only hold pointers to data that is owned by some collection.
CollectionReadBuffers own their data (although that is really not visible at all from the "interface").

The main idea is that since collections own their data, we simply need a convenient way to handle all necessary data for writing through one access point (instead of three like it was in the past). On the other hand when reading that, there is no collection yet to own the data, so we read all data into buffers and then make the collections constructible form these buffers (effectively "stealing" the data in the process).

Another important point is that while relation (references) and vector member information is somewhat easy to "standardize" across different data types, the data is really different for all data types, but I wanted only one class (instead of some inheritance structure), that is able to transport this, so data is a void* and we need to inject necessary type information via some other means and cast whenever necessary.

For ROOT we use some string manipulation and the fact that once a Branch is created it keeps this information.
For SIO we generate additional SIOBlocks (which is inheriting from sio::block, the basic building block for SIO based I/O). This allows us to inject the necessary information during code generation (see also below).

The CollectionReadBuffers also have two std::function members, that make it possible to type-erase some things, while still being able to inject necessary type information during code generation. The createCollection is the important bit here. We generate a std::function that fulfills the necessary interface, but that has all the type information still in place:

podio/python/templates/Collection.cc.jinja2

Lines 161 to 164 in c7328d6

    
           readBuffers.createCollection = [](podio::CollectionReadBuffers buffers, bool isSubsetColl) { 
        
               {{ collection_type }}Data data(buffers, isSubsetColl); 
        
               return std::make_unique<{{ collection_type }}>(std::move(data), isSubsetColl); 
        
           };

In the Frame based approach, the FrameDataT hands out CollectionReadBuffers and the Frame then simply calls buffers.createCollection on them, to get a collection from them.

Some more detailed comments

At least in one place a pointer that is supposed to point to the data points to an unique_ptr. This place is:

podio/python/templates/CollectionData.cc.jinja2

Line 94 in c7328d6

isSubsetColl ? nullptr : (void*)&m_data,

This is almost certainly a bug. The data field in the collection buffers should point to the vector<Data>. So I think making this a (void*) m_data.get() would be the right thing to do here.

When converting buffers for writing, there is a three star cast: What this cast does is basically to return a pointer to vector.data() instead of a pointer to the vector itself.

At least the intent of this was definitely to return a vector<Data>* not a pointer to vector.data(). This is also how it is used, e.g. in the SIO backend:

podio/python/templates/SIOBlock.cc.jinja2

Lines 62 to 65 in c7328d6

    
           auto* dataVec = podio::CollectionWriteBuffers::asVector<{{ class.full_type }}Data>(m_buffers.data); 
        
           unsigned size = dataVec->size() ; 
        
           device.data( size ) ; 
        
           podio::handlePODDataSIO( device ,  dataVec->data(), size ) ;

Also one of these casts seem to be necessary when reading via ROOT, as there is something going on with a void* when reading from a Branch that I honestly also haven't fully understood:

podio/src/ROOTFrameReader.cc

Line 133 in c7328d6

collBuffers.recast(collBuffers);

(this recast is essentially simply calling such the asVector with the correct type, using a std::function to type-erase the whole thing.)

See also:

podio/include/podio/CollectionBuffers.h

Lines 84 to 103 in c7328d6

    
           // This is a hacky workaround for the ROOT backend at the moment. There is 
        
           // probably a better solution, but I haven't found it yet. The problem is the 
        
           // following: 
        
           // 
        
           // When creating a pointer to a vector<T>, either via new or via 
        
           // TClass::New(), we get a void*, that can be cast back to a vector with 
        
           // 
        
           //     static_cast<vector<T>*>(raw); 
        
           // 
        
           // However, as soon as we pass that same void* to TBranch::SetAddress this no 
        
           // longer works and the actual cast has to be 
        
           // 
        
           //     *static_cast<vector<T>**>(raw); 
        
           // 
        
           // To make it possible to always use the first form, after we leave the Root 
        
           // parts of reading, this function is populated in the createBuffers call of each 
        
           // datatype where we have the necessary type information (from code 
        
           // generation) to do the second cast and assign the result of that to the data 
        
           // field again. 
        
           RecastFuncT recast{};

hegner · 2023-03-21T14:12:36Z

@jmcarcell - thanks for looking into all that!
Should we make a whiteboard session to explain the overall structure? because the design is not really optimal / we learnt a lot afterwards, we should take the RNTuple as a good chance to redo.
@tmadlener - we then document in tickets

jmcarcell mentioned this issue Mar 22, 2023

Add support for the new RNTuple format #395

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problems in the IO code #393

Problems in the IO code #393

jmcarcell commented Mar 21, 2023 •

edited

Loading

tmadlener commented Mar 21, 2023

hegner commented Mar 21, 2023

Problems in the IO code #393

Problems in the IO code #393

Comments

jmcarcell commented Mar 21, 2023 • edited Loading

tmadlener commented Mar 21, 2023

Some more detailed comments

hegner commented Mar 21, 2023

jmcarcell commented Mar 21, 2023 •

edited

Loading