Some BP5 Serialization documentation, mostly writer-side perspective. (
eisenhauer authored Oct 18, 2024
1 parent 98bbff3 commit f1febb7
Showing 1 changed file with 258 additions and 0 deletions.
258 changes: 258 additions & 0 deletions source/adios2/toolkit/format/bp5/BP5Base.h
@@ -19,6 +19,264 @@
#pragma warning(disable : 4250)
#endif

/*
* BP5 Metadata Marshalling is based upon FFS, which provides the
* ability to serialize a C-style pointer-based data structure
* (starting with a base struct) and to deserialize it in-place on
* the receiving side.
*
* Normally, in order to use FFS, an application must fully describe
* the base structure using an FMFieldList, where each element
* describes a field in the structure, including the field's name,
* basic type (integer, float, etc.), size and offset from the start
* of the structure. In "normal" scenarios, such as in SST, this is
* straightforward because we're describing a structure that exists
* at compile-time and all of those things are compile-time static.
* However, ADIOS metadata represents information about variables
* that we don't know about until run-time, so if we're going to use
* FFS here, things have to be a bit more dynamic. In particular,
* we'll represent ADIOS metadata with a "virtual" structure, one
* whose description we'll construct on the fly and which will only
* ever exist virtually, making up offsets as we go. We just have to
* be careful about keeping things aligned appropriately because we
* want this to land on the receiver and be appropriately aligned
* there. (Normally the compiler takes care of this, but this
* virtual structure is never seen by a compiler, so we're doing it.)
* The field name that we specify to FFS is also important because we
* use it to communicate a lot of information between writer and
* reader. While it always contains the variable name, it also
* encodes the variable type (local or global, atomic or array,
* compressed, derived, etc.). Because the variable name only
* appears in the metametadata (ffs format), this is a great place to
* put more static information about the variable, specifically
* anything that is fixed after definition and doesn't change on a
* per-timestep basis. More on names later.
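*
* The offset bookkeeping described above can be sketched roughly as
* follows. This is only an illustration: SketchField, AlignUp and
* AddVirtualField are hypothetical names (not FFS or BP5 API), and
* the sketch assumes each atomic field is aligned to its own size.
*
```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one field description; not the FFS FMField type.
struct SketchField
{
    std::string Name;
    std::size_t Size;
    std::size_t Offset;
};

// Round Offset up to the next multiple of Align (Align must be a power of two).
inline std::size_t AlignUp(std::size_t Offset, std::size_t Align)
{
    assert(Align != 0 && (Align & (Align - 1)) == 0);
    return (Offset + Align - 1) & ~(Align - 1);
}

// Append a field to the virtual struct and return the offset assigned to it.
// Assumption: natural alignment, i.e. a field is aligned to its own size.
inline std::size_t AddVirtualField(std::vector<SketchField> &Fields, std::size_t &StructSize,
                                   const std::string &Name, std::size_t Size)
{
    const std::size_t Offset = AlignUp(StructSize, Size);
    Fields.push_back({Name, Size, Offset});
    StructSize = Offset + Size;
    return Offset;
}
```
*
* Adding a 1-byte field, then an 8-byte field, then a 4-byte field
* to an empty struct yields offsets 0, 8 and 16 under this
* assumption, with padding inserted where a compiler would put it.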
*
* To accomplish managing the structure on the writer side, we
* principally track two things, the FMFieldList that represents the
* description of the virtual struct, and a malloc'd region where we
* build the virtual struct itself. While the description is
* interpreted by FFS, the most important thing for BP5 to remember
* is this field's offset because that's where the (meta)data will
* go. When we Marshal a simple atomic value (local or global), we
* calculate an appropriately aligned new offset in the buffer, add
* to the FMFieldList (maintained in Info.MetaFields on the writer)
* and copy the data into the virtual field at that offset in the
* buffer. On future timesteps, the field already exists, so we just
* use the offset and copy the data into the buffer. Arrays are a
* bit more complex, but let's start with the simple case. FFS
* supports substructures, i.e. fields which themselves are a
* structure, and we use that feature for all array representations.
* There are several things that may change on a per-timestep basis
* for arrays, including Shape, Count and Offset values (which are
* themselves arrays), and we also need to track the location of the
* related data block (offset in this rank's data segment). Except
* for Shape (which we assume is set for at least this timestep), all
* of these things are per-block.
*
* Back to FFS capabilities for a moment. FFS's pointer-based
* structures include dynamically-sized arrays, and the size of those
* arrays must be specified by an integer-typed field in that
* structure. There are three different array lengths required here.
* Shape is of length Dims (how many dimensions the array has),
* DataBlockLocation is of length BlockCount (how many blocks were
* written on this rank), and for Count and Offsets we must have
* those per-block, so the length is Dims*BlockCount. To satisfy
* FFS's constraints, that means we must have integer fields
* representing all three lengths in the array metadata struct, and
* we need pointers to the dynamic arrays representing Shape, Count,
* Offsets, and DataBlockLocation. These are the BASE_FIELDS below
* and the FFS FMField entries are BASE_FIELD_ENTRIES in BP5Base.cpp.
* While more complex array metadata entries are necessary, these
* must be the first fields in those structures. While there can't
* be a static struct declaration for all of the metadata, there is a
* static declaration for the array metadata substructure,
* MetaArrayRec below. Mostly you'll see this used like this:
*
* MetaArrayRec *MetaEntry = (MetaArrayRec *)((char *)(MetadataBuf) + Rec->MetaOffset);
*
* This gives us a nice way of accessing the key fields in an array's
* metadata entry.
*
* So, what about more complex arrays? All of our compression
* operators require the length of the compressed field as input to
* the uncompress operator. Generally we don't include data block
* length as part of metadata because it's easily calculated from the
* Count values and the length of the data type, but in order to
* support compression we have to communicate it from the writer to
* the reader so we can uncompress. Therefore every field with an
* operator has as its next field (after BASE_FIELDS) DataBlockSize.
* Like DataBlockLocation, this is per block (and so its FFS
* description also uses BlockCount). This arrangement is
* represented by the struct MetaArrayRecOperator below. Note that
* BP5 does not itself use the DataBlockSize in the metadata. The
* size of the compressed data is returned from the compression
* operator, and is used by BP5 to copy that data into the data
* block, but after that it is only passed to the Uncompress operator
* on the receiving side, so operators like MGard may choose to use
* this differently.
*
* The last case is arrays that also have Min/Max stats associated
* with them. Since this can be combined with operators, that gives
* us two more possible structs for array metadata, a plain array
* with Min/Max or an array with an operator and Min/Max; these are
* represented by the structs MetaArrayRecMM and
* MetaArrayRecOperatorMM below. Note that MinMax in that struct is
* a char*, but obviously the data type of Min/Max depends upon the
* element type of the array. How does that work? The actual size
* in bytes of the MinMax array is BlockCount * sizeof(array element)
* * 2, but in order to avoid introducing yet another integer-typed
* size value into the structure we've gone to some effort in order
* to leverage the existing BlockCount value. In particular, there
* are a number of FMField lists for the MM and OperatorMM arrays,
* each giving FFS a different element size for the MinMax Array.
* ADIOS types of size 1 use MetaArrayRecMM1List, those of size 2 use
* MetaArrayRecMM2List, etc., up to MetaArrayRecMM16List, which would
* be used by long double. Note that BP5 doesn't define or support
* MinMax for string, complex, or structure types.
*
* For each of the array variations above, when we add the field
* associated with that array to the metadata field list, we specify
* the appropriate FieldList in the FFS "field_type" value, and
* allocate space for the relevant structure in the virtual metadata
* struct we're building.
*
* We mentioned field names above. We actually encode a lot of
* information into the FFS field names, including the variable name,
* shape, element_size, ADIOS type, any operator that might be
* applied, the name of the substructure (if the array is a struct
* type), and even the expression that is to be used for derived
* variables. These are all encoded in different ways. For example,
* the basic shape of the variable is encoded in the three-letter
* prefix of the FFS field name: GlobalValue = "BPg", GlobalArray =
* "BPG", JoinedArray = "BPJ", LocalValue = "BPl", LocalArray = "BPL".
* The details of the encoding are buried in the logic, but the important
* bit is knowing that there's a lot of information there and some of
* it (like the expression) is base64 encoded to avoid having special
* characters in the FFS field name. From the BP5 point of view,
* anything that can be encoded in the field name is a good thing
* because it travels in the metametadata, not the metadata, so it
* only gets moved around if the field set changes.
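*
* As a small illustration of the prefix convention, the sketch below
* maps between a variable's shape and the three-letter prefix listed
* above. SketchShape, ShapePrefix and ParseShapePrefix are
* hypothetical names, and the rest of the field name is treated as
* opaque because its encoding is not described here.
*
```cpp
#include <string>

// Hypothetical enumeration of the five shapes named in the text.
enum class SketchShape
{
    GlobalValue,
    GlobalArray,
    JoinedArray,
    LocalValue,
    LocalArray
};

// Three-letter prefix for each shape, as listed in the text.
inline const char *ShapePrefix(SketchShape Shape)
{
    switch (Shape)
    {
    case SketchShape::GlobalValue:
        return "BPg";
    case SketchShape::GlobalArray:
        return "BPG";
    case SketchShape::JoinedArray:
        return "BPJ";
    case SketchShape::LocalValue:
        return "BPl";
    case SketchShape::LocalArray:
        return "BPL";
    }
    return "";
}

// Recover the shape from the first three characters of a field name.
// Returns false if the name does not start with a known prefix.
inline bool ParseShapePrefix(const std::string &FieldName, SketchShape &Out)
{
    static const SketchShape All[] = {SketchShape::GlobalValue, SketchShape::GlobalArray,
                                      SketchShape::JoinedArray, SketchShape::LocalValue,
                                      SketchShape::LocalArray};
    for (SketchShape Shape : All)
    {
        if (FieldName.compare(0, 3, ShapePrefix(Shape)) == 0)
        {
            Out = Shape;
            return true;
        }
    }
    return false;
}
```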
*
* Speaking of changes, there are some details that are omitted above
* to get the main points across, but let's talk about other details.
* First, when you put the first block of an array, we fill out the
* Dims field, init BlockCount to 1, DBCount (the Dims*BlockCount
* value) to Dims and then we malloc memory to hold a copy of the
* Shape, Count and Offset values. (We need to copy these anyway as
* part of serialization as they must be captured at the time of Put,
* so we can't, say, just reference the values in the VariableBase
* class.) For LocalArrays, the Shape value stays at a NULL pointer,
* as does the Start value. If there's another Put() after the first
* on that variable, we add 1 to BlockCount, increment DBCount by
* Dims, and realloc() the Count and Offset arrays so that we can add
* the new Count and Offset values after the ones that are already
* there. This means that the Count values for block 1 start at
* Count[Dims], for block 2 they start at Count[2*Dims], etc. At the
* end of the timestep after using FFSencode() to serialize the
* metadata, FMfree_var_rec_elements() is used to free() all these
* subarrays that we've malloc'd. It understands the structure of
* our entire Metadata structure, walks the field list and
* deallocates appropriately. Once this has been done, we can
* memset() the whole metadata structure back to zeros and we're
* ready to start again. (All pointers NULL and counts are zero.)
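*
* The multi-block bookkeeping above can be sketched like this. It is
* a simplified illustration, not the BP5 code: SketchArrayRec only
* borrows the field names used in the text, RecordPut and ResetRec
* are hypothetical helpers, and the cleanup here frees the subarrays
* directly rather than going through FMfree_var_rec_elements().
*
```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical, simplified array metadata record (field names follow the text).
struct SketchArrayRec
{
    std::size_t Dims;
    std::size_t BlockCount;
    std::size_t DBCount; // Dims * BlockCount
    std::size_t *Shape;
    std::size_t *Count;
    std::size_t *Offsets;
};

// Grow Dest from OldLen to OldLen + Dims entries and copy Src into the new tail.
// realloc(nullptr, n) behaves like malloc(n), so this also covers the first block.
inline std::size_t *AppendDims(std::size_t *Dest, std::size_t OldLen, const std::size_t *Src,
                               std::size_t Dims)
{
    auto *Grown =
        static_cast<std::size_t *>(std::realloc(Dest, (OldLen + Dims) * sizeof(std::size_t)));
    assert(Grown != nullptr);
    std::memcpy(Grown + OldLen, Src, Dims * sizeof(std::size_t));
    return Grown;
}

// Record one Put() of a block. Shape and Offsets may be nullptr (local arrays).
inline void RecordPut(SketchArrayRec &Rec, std::size_t Dims, const std::size_t *Shape,
                      const std::size_t *Count, const std::size_t *Offsets)
{
    assert(Dims > 0);
    if (Rec.BlockCount == 0)
    {
        Rec.Dims = Dims;
        if (Shape != nullptr)
            Rec.Shape = AppendDims(nullptr, 0, Shape, Dims);
    }
    assert(Rec.Dims == Dims);
    // Block N's values start at index N * Dims, which equals the old DBCount.
    Rec.Count = AppendDims(Rec.Count, Rec.DBCount, Count, Dims);
    if (Offsets != nullptr)
        Rec.Offsets = AppendDims(Rec.Offsets, Rec.DBCount, Offsets, Dims);
    Rec.BlockCount += 1;
    Rec.DBCount += Dims;
}

// End-of-step cleanup: free the copied subarrays and zero the record.
inline void ResetRec(SketchArrayRec &Rec)
{
    std::free(Rec.Shape);
    std::free(Rec.Count);
    std::free(Rec.Offsets);
    Rec = SketchArrayRec{};
}
```
*
* After two calls to RecordPut() on a 2-D variable, BlockCount is 2,
* DBCount is 4, and the second block's Count values sit at Count[2]
* and Count[3].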
*
* When we do start again with the next timestep, we don't start from
* scratch with a new Fieldlist and virtual structure, but instead
* try to reuse the old one. The anticipation is that step-based HPC
* applications are highly regular and the set of variables that are
* output on step N+1 are likely the same as what they output for
* step N. So when we get a Put() for a variable, we look up its
* entry in internal bookkeeping and if it has an entry in the
* structure we reuse it, putting the appropriate data in the virtual
* structure as described above. This is fine if we write the exact
* same set of variables in subsequent steps, but what if we don't?
* Well, if we write a new variable, then the procedure above
* happens, but we also take steps to make sure that we generate new
* MetaMetaData (i.e. re-register the format with FFS). We do this
* by setting the Info.MetaFormat value to NULL.
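*
* A rough sketch of this reuse-or-add logic follows. The names
* (SketchWriterState, FindOrAddField, EnsureFormat) are hypothetical,
* and the FormatValid flag merely stands in for a non-NULL
* Info.MetaFormat; the real code registers a format with FFS rather
* than flipping a boolean.
*
```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical writer-side bookkeeping for the virtual metadata struct.
struct SketchWriterState
{
    std::unordered_map<std::string, std::size_t> OffsetByName;
    std::size_t StructSize = 0;
    bool FormatValid = false; // stands in for Info.MetaFormat != NULL
};

// Return the offset of an existing field, or add a new one (Size must be > 0).
// Adding a field invalidates the format so new metametadata gets generated.
inline std::size_t FindOrAddField(SketchWriterState &State, const std::string &Name,
                                  std::size_t Size)
{
    auto It = State.OffsetByName.find(Name);
    if (It != State.OffsetByName.end())
        return It->second; // reuse: same offset as on earlier timesteps

    // Assumption: align each field to a multiple of its own size.
    const std::size_t Offset = (State.StructSize + Size - 1) / Size * Size;
    State.OffsetByName.emplace(Name, Offset);
    State.StructSize = Offset + Size;
    State.FormatValid = false;
    return Offset;
}

// Re-register the format only if the field set changed.
// Returns true when new metametadata would have to be produced.
inline bool EnsureFormat(SketchWriterState &State)
{
    if (State.FormatValid)
        return false;
    State.FormatValid = true;
    return true;
}
```
*
* In this sketch, a step that writes only previously seen variables
* leaves FormatValid untouched, so EnsureFormat() reports that no new
* metametadata is needed.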
*
* Handling a non-written variable is done differently. We don't
* really want to bear the cost of new MetaMetaData frequently
* (because MetaMetaData can be big), so instead we're willing to
* bear the costs of not using some of the data in the virtual
* structure. So if the app Puts an atomic variable on timestep N,
* but skips it on N+1, we essentially leave that fraction of the
* metadata buffer unused in N+1. It's transmitted or stored, but it
* doesn't contain anything useful. But the reader still needs to
* know that it wasn't written, so BP5 metadata carries with it a
* bitmap showing if a variable that is part of the metadata has
* actually been written and is valid. This bitmap, contained in the
* BitField[BitFieldCount] fields in the MetadataFieldList is the
* ultimate authority as to what has been written. Variables are
* assigned an index in order when they are first entered into
* metadata and if the bit at that index isn't set, that variable
* wasn't written on that timestep.
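*
* The written-variable bitmap can be illustrated with the sketch
* below. SketchWrittenBits is a hypothetical type, and the choice of
* 64-bit words is an assumption made for the example; only the idea
* of one bit per variable index comes from the text.
*
```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-timestep bitmap: bit N is set if variable index N was written.
struct SketchWrittenBits
{
    std::vector<std::uint64_t> BitField; // BitField.size() plays the role of BitFieldCount

    // Mark a variable as written, growing the bitmap if the index is new.
    void Set(std::size_t VarIndex)
    {
        const std::size_t Word = VarIndex / 64;
        if (Word >= BitField.size())
            BitField.resize(Word + 1, 0);
        BitField[Word] |= (std::uint64_t{1} << (VarIndex % 64));
    }

    // A variable beyond the end of the bitmap was never written on this step.
    bool Test(std::size_t VarIndex) const
    {
        const std::size_t Word = VarIndex / 64;
        return Word < BitField.size() && ((BitField[Word] >> (VarIndex % 64)) & 1u) != 0;
    }

    // Start of a new timestep: nothing has been written yet.
    void Clear() { std::fill(BitField.begin(), BitField.end(), std::uint64_t{0}); }
};
```
*
* A reader-side check in this style would consult Test(index) before
* trusting the bytes at that variable's offset in the metadata.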
*
* Now, this does bring up a vulnerability with BP5. If an
* application were to write a lot of variables on one step and then
* never use them again, we might end up with a big metadata block
* that mostly carried unused (junk) bytes. We have not yet run into
* this in a real application, so it isn't specifically handled. In
* an ideal world, one would look at the "occupancy rate" of
* metadata in EndStep() and make a decision that for either this
* timestep or the next, we'd start from scratch with an empty field
* list. There's a tradeoff here. Do this too often and we've got
* big MetaMetadata costs, do it too little and our metadata has a
* lot of useless bytes. Future work. Note that this is mostly a
* writer-side thing to fix/optimize. The reader will appropriately
* handle new metadata, including new metametadata.
*
* The stuff above applies to ADIOS variables, but attributes are
* always handled separately. In the initial FFS-marshalling
* implementation, Attributes, while separate, were handled very
* similarly to variables. That is, there was a field list and
* virtual structure maintained where we entered attributes much like
* the global and local values described above. There was
* metametadata generated for it and it was moved around like other
* metametadata blocks. This old way of doing things is still
* present in the code and gets used if MarshalAttribute is called by
* the engine. Engines that use this marshal all attributes in
* EndStep(), calling MarshalAttribute for all attributes and only
* doing this when some attribute has changed. The resulting
* Attribute data always contains *all* the current attribute values,
* a situation that works out well for engines like SST where readers
* might join after timestep 0. The SST writer can save the most
* recent Attribute data block and provide it to a newly-joined
* reader so that it has all available attributes.
*
* However, this encoding mechanism has some significant
* disadvantages in almost all situations. This separation of
* metametadata and metadata was designed for Variables, where the
* set of variables was likely to be reused without changes
* repeatedly. However, attributes aren't like that, particularly in
* the original situation where attributes once set can never change.
* Then we're only doing this when we add an attribute, we're always
* generating new MetaMetadata whenever we have a change, and
* MetaMetadata + Metadata size is always going to be bigger than
* some simpler encoding mechanism. So, the BP5 file engine now does
* things differently. It calls OnetimeMarshalAttribute() which uses
* a simpler FFS representation for attributes with the attribute
* "name" being part of the data, not part of the metametadata as it
* is with variables. This means that the metametadata never
* changes, so we don't have the same issues as with the prior
* approach. That metametadata struct (BP5AttrStruct) describes a
* relatively simple structure with two lists, one for attributes of
* any non-string type, and the other a list of string and
* array-of-string attributes. Generally we only want attributes to
* appear here when they change, so the BP5Writer calls
* OnetimeMarshalAttribute whenever it gets the NotifyEngineAttribute
* call (whenever an attribute changes). However, it also gets called
* in BeginStep if that step is the first ever called, because some
* attributes may have been defined before the engine was ever
* created. In a BP5 file, attribute blocks then only ever contain an
* attribute once, unless the attribute changes in which case it will
* appear again. This is not such a good situation for SST because
* of the late-coming-reader issue, so that still uses the old
* marshaling mechanism.
*
*/

namespace adios2
{
namespace format