Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems in the IO code #393

Open
jmcarcell opened this issue Mar 21, 2023 · 2 comments
Open

Problems in the IO code #393

jmcarcell opened this issue Mar 21, 2023 · 2 comments

Comments

@jmcarcell
Copy link
Member

jmcarcell commented Mar 21, 2023

When implementing support for RNTuple I found a few issues with some parts of the IO code that will have to be modified or worked around for RNTuple to work.

  • At least in one place a pointer that is supposed to point to the data points to an unique_ptr. This place is:
    isSubsetColl ? nullptr : (void*)&m_data,
  • When converting buffers for writing, there is a three star cast: What this cast does is basically to return a pointer to vector.data() instead of a pointer to the vector itself. This also relies on the fact that implementations of std::vector save the pointer to the actual array in the first size_ptr bytes (if this wasn't the case the cast wouldn't even point to the data array):
    return *static_cast<std::vector<T>**>(raw);

How all of this works together is a mystery to me, as the TTree functions seem to be happy but the RNTuple ones don't. For the first point, I think we also probably rely on the fact that unique_ptr can have size zero. The first point can be easily worked around by saving another pointer to the actual data and using that one. Fixing any of those points will cause all the writers and readers to stop writing where they either crash or (the writers) don't crash but they don't write the data correctly.

@tmadlener
Copy link
Collaborator

Let me try to bring some information on the original intent into this. I have to point out, that not all of this is ideally implemented yet, and there is still some historical baggage around that we didn't come around to fixing yet, because schema evolution would have touched similar code. I am working on getting closer to the ideas described below.

There are two types of collection buffers, which mainly differ by the ownership of the data they transport:

  • CollectionWriteBuffers only hold pointers to data that is owned by some collection.
  • CollectionReadBuffers own their data (although that is really not visible at all from the "interface").

The main idea is that since collections own their data, we simply need a convenient way to handle all necessary data for writing through one access point (instead of three like it was in the past). On the other hand when reading that, there is no collection yet to own the data, so we read all data into buffers and then make the collections constructible form these buffers (effectively "stealing" the data in the process).

Another important point is that while relation (references) and vector member information is somewhat easy to "standardize" across different data types, the data is really different for all data types, but I wanted only one class (instead of some inheritance structure), that is able to transport this, so data is a void* and we need to inject necessary type information via some other means and cast whenever necessary.

  • For ROOT we use some string manipulation and the fact that once a Branch is created it keeps this information.
  • For SIO we generate additional SIOBlocks (which is inheriting from sio::block, the basic building block for SIO based I/O). This allows us to inject the necessary information during code generation (see also below).

The CollectionReadBuffers also have two std::function members, that make it possible to type-erase some things, while still being able to inject necessary type information during code generation. The createCollection is the important bit here. We generate a std::function that fulfills the necessary interface, but that has all the type information still in place:

readBuffers.createCollection = [](podio::CollectionReadBuffers buffers, bool isSubsetColl) {
{{ collection_type }}Data data(buffers, isSubsetColl);
return std::make_unique<{{ collection_type }}>(std::move(data), isSubsetColl);
};

In the Frame based approach, the FrameDataT hands out CollectionReadBuffers and the Frame then simply calls buffers.createCollection on them, to get a collection from them.

Some more detailed comments

This is almost certainly a bug. The data field in the collection buffers should point to the vector<Data>. So I think making this a (void*) m_data.get() would be the right thing to do here.

  • When converting buffers for writing, there is a three star cast: What this cast does is basically to return a pointer to vector.data() instead of a pointer to the vector itself.

At least the intent of this was definitely to return a vector<Data>* not a pointer to vector.data(). This is also how it is used, e.g. in the SIO backend:

auto* dataVec = podio::CollectionWriteBuffers::asVector<{{ class.full_type }}Data>(m_buffers.data);
unsigned size = dataVec->size() ;
device.data( size ) ;
podio::handlePODDataSIO( device , dataVec->data(), size ) ;

Also one of these casts seem to be necessary when reading via ROOT, as there is something going on with a void* when reading from a Branch that I honestly also haven't fully understood:

collBuffers.recast(collBuffers);

(this recast is essentially simply calling such the asVector with the correct type, using a std::function to type-erase the whole thing.)

See also:

// This is a hacky workaround for the ROOT backend at the moment. There is
// probably a better solution, but I haven't found it yet. The problem is the
// following:
//
// When creating a pointer to a vector<T>, either via new or via
// TClass::New(), we get a void*, that can be cast back to a vector with
//
// static_cast<vector<T>*>(raw);
//
// However, as soon as we pass that same void* to TBranch::SetAddress this no
// longer works and the actual cast has to be
//
// *static_cast<vector<T>**>(raw);
//
// To make it possible to always use the first form, after we leave the Root
// parts of reading, this function is populated in the createBuffers call of each
// datatype where we have the necessary type information (from code
// generation) to do the second cast and assign the result of that to the data
// field again.
RecastFuncT recast{};

@hegner
Copy link
Collaborator

hegner commented Mar 21, 2023

@jmcarcell - thanks for looking into all that!
Should we make a whiteboard session to explain the overall structure? because the design is not really optimal / we learnt a lot afterwards, we should take the RNTuple as a good chance to redo.
@tmadlener - we then document in tickets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants