From 05d8694f2c8b7e009825c117724a2aeba4ca0766 Mon Sep 17 00:00:00 2001 From: tmadlener Date: Tue, 18 Jun 2024 10:21:03 +0200 Subject: [PATCH] Add documentation about storage details --- doc/figures/file_layout_root.svg | 433 +++++++++++++++++++++++++++++++ doc/figures/file_layout_sio.svg | 238 +++++++++++++++++ doc/index.rst | 1 + doc/storage_details.md | 100 +++++++ 4 files changed, 772 insertions(+) create mode 100644 doc/figures/file_layout_root.svg create mode 100644 doc/figures/file_layout_sio.svg create mode 100644 doc/storage_details.md diff --git a/doc/figures/file_layout_root.svg b/doc/figures/file_layout_root.svg new file mode 100644 index 000000000..49be84428 --- /dev/null +++ b/doc/figures/file_layout_root.svg @@ -0,0 +1,433 @@ + + + + + + + + + + + + + Frame data + + Frame data + + Frame data + + Frame data + + Frame Metadata + + Frame Metadata + + + podio Metadata + + Category1 +TTree/RNTModel + Category2 +TTree/RNTModel + podio_metadata +TTree/RNTWriter + + ROOT file + + + + + + + + + + + + + + + + + + + + + + diff --git a/doc/figures/file_layout_sio.svg b/doc/figures/file_layout_sio.svg new file mode 100644 index 000000000..02c5f99b4 --- /dev/null +++ b/doc/figures/file_layout_sio.svg @@ -0,0 +1,238 @@ + + + + + + + + + Frame Metadata + + Frame data + + Frame Metadata + + Frame data + + + + + + + + + podio Metadata + + + + sio Metadata + + SIO file + + diff --git a/doc/index.rst b/doc/index.rst index 5960b64c3..1be94b679 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -15,6 +15,7 @@ Welcome to PODIO's documentation! examples.md frame.md userdata.md + storage_details.md cmake.md advanced_topics.md templates.md diff --git a/doc/storage_details.md b/doc/storage_details.md new file mode 100644 index 000000000..4234fb500 --- /dev/null +++ b/doc/storage_details.md @@ -0,0 +1,100 @@ +# Storage details of files produced by podio + +This page describes some of the details of how podio stores data of generated +EDMs and related metadata on file for the different backends. + +```{warning} +We in general do not treat the file layout part of the public API for podio. +This means that if you rely on exact details of a given file layout we might +break things. This is especially true for the way we store *podio related metadata* +metadata, e.g. information on collection types, their schema version, or the +mapping of names to collection IDs. **However, we will limit breaking changes to +cases where it is absolutely necessary.** +``` + +## ROOT + +For ROOT we offer two different options, one based on `TTree`s and one based on +`RNTuple`s. The basic layout of files is largely the same for these two options, +some minor differences are present in the storage of podio related metadata. + +The overall structure of a podio generated ROOT file looks like this +- Each category gets its own `TTree` or `RNTupleModel`, where the name is the category name +- Each Frame of a category will be one entry in that. +- The Frame parameters are stored as part of each entry +- An additional `podio_metadata` is created to hold the podio related metadata. + +An overview of this schema looks like this + +ROOT file layout schematic + +```{note} +For a ROOT based backend all entries of a category have to have the same +(collection) contents. This content is defined by the **first entry that is +written for a category** +``` + +### Storage of Collection data + +Each collection (that is not a subset collection) is decomposed into the +following branches (or fields for `RNTuple`) +- One branch of `std::vector`, with the same name as the collection +- One branch (if any) of `std::vector` per `VectorMember` (of type `T`) with the name `__` +- One branch (if any) of `std::vector` per `OneToOneRelation` or per `OneToManyRelation` with the name `__` + +If the collection is a subset collection, then there will be only branch (or +field for `RNTuple`) of `std::vector` with the name `_objIdx` + +### Storage of Frame parameters + +The Frame parameters are internally handled via `podio::GenericParameters`. For +I/O they are decomposed into pairs of vectors containing the keys and values for +each of the supported generic parameters types. In practice this means that +there will be the following branches per supported type + +- One branch of `std::vector` holding all the keys with the name of + `GPKeys` (where type is one of `Int`, `Float`, `Double` or `String`) +- One branch of `std::vector>` holding all the values with the + name of `GPValues` (where type is one of `Int`, `Float`, `Double` or + `String` and `T` is one of `int`, `float`, `double` or `std::string`) + +### Storage of podio related metadata + +The podio related metadata, stored in the `podio_metadata` `TTree` (or +`RNTupleModel`) contains the following general information once per file + +- The version of podio that has been used to write this file +- The complete datamodel definitions for each datamodel that was encountered + when writing the file + +And the following information once per category +- The mapping of collection names to collection IDs +- The types of all the stored collections +- The schema version of all stored collections +- Which collections are stored as subset collections + +Here the `TTree` based and `RNTuple` based backends differ slighlty in the way +these data are stored exactly. The `TTree` based backend stores the data in a +slightly more structued way, taking advantage of ROOTs capabilities to stream +out more complex object, e.g. the `podio::CollectionIDTable` is streamed as a +whole. The `RNTuple` based backend on the other hand, destructures the +information into separate fields that run in parallel. + +## SIO + +The SIO based backend stores all Frames as separate (compressed) binary records. +Each record contains the Frame data and the corresponding podio related metadata +to interpret the record. Additionally, it stores one additional record at the +very beginning of the file containing the version of podio that has been used to +write this file. After all records containing Frame data another record is +written that contains some more podio related metadata, e.g. all datamodel +definitions that were encountered when writing the file. This record also +contains a `podio::SIOFileTOCRecordBlock`, which contains information about the +file positions of all Frame records that have been written, allowing for quick +(random) access when reading files again. To find the beginning of this record +the very last bits of the binary record are used to encode the start position of +this record. + +Schematically an SIO file written by podio looks like this + +SIO file layout schematic