Skip to content

Epoch Allocation And Scoping

Jonathan Lifflander edited this page Jan 12, 2021 · 14 revisions

The current epoch allocation/bit layout follows this pattern:

Basic epoch Layout:

   w-1 .............. w-h-1 ...............w-h-c-1 ....................0
   | <EpochHeader> ... | <EpochCategory> ... |      <SeqEpochID>       |

       where     h = epoch_header_num_bits (~2 bits)
                 c = epoch_category_num_bits (1 bit)
                 w = sizeof(EpochType) * 8 (i.e., 64-bit/int64_t field)
                 

Rooted epoch breakdown:

                                          w-h-c-1      w-h-c-n-1       0
                                             ^             ^           ^
                                             | .... n .... | ..........|
                                               <NodeType>  <SeqEpochID>

      where      n = sizeof(NodeType) (configurable, default 16 bits)

Thus, with default sizes:

  • a collective epoch has 61 bits for the sequential epoch ID;
  • a rooted epoch has 16 less bits or 45 bits for the sequential epoch ID (assuming number of nodes fits in 16-bits).

Current design:

Each EpochWindow manages allocation/deallocation for epochs of a particular type. The type is all the high bits minus the SeqEpochID (set to zero). EpochWindow uses IntegralSet to compactly store terminated epochs. As epochs complete out of order, the data structure grows in size, but it's generally very efficient.

Scoping Motivation

Correct collective epoch allocation relies on ordered execution across nodes/ranks. Specifically, within a particular type of epoch, the sequential epoch is repeatedly incremented by 1 each time a new epoch is allocated until a free epoch is found. This presents two problems:

  • Work units that create collective epochs must be ordered (even across "modules")
  • There is a potential race in terms of discovery of epoch termination and re-use. For this to be a problem, an application would have to process an insane number of epochs to wrap around. The conscious design choice here was to not use the first free epoch returned from the IntegralSet---instead, use the next one in the sequence until a free one is found. This means as long as the program is generally reasonable the race is highly improbable.