Buffer Organizer
The Buffer Organizer is the "corrector" half of our predictor/corrector model. It attempts to correct sub-optimal DPE placements by moving data among buffers.
- Management of hierarchical buffering space
- Data flushing
- Read acceleration
- Management of the data life cycle, or journey
- When is the blob in equilibrium?
- How do we eliminate unnecessary data movement?
We attempt to meet the above objectives via a `Blob` scoring system. Each `Blob` has two scores associated with it: the importance score and the access score.

The importance score is a real number in the range [0, 1], where 0 represents a `Blob` that is not important to the user (i.e., it will not be accessed) and 1 represents a `Blob` that will be accessed very frequently, very soon, or both. It is derived from:
- Blob Size
- Blob Name
- Recency of Blob access
- Frequency of Blob access
- User-supplied priority (this is only a hint, not a guarantee).
The access score is a real number in the range [0, 1], where 0 represents a `Blob` with the slowest access time (i.e., all of its `Buffer`s are in the slowest tier) and 1 represents a `Blob` with the quickest access time (all of its `Buffer`s are in the fastest tier). It is a function of:
- Bandwidth of the tiers that contain the `Blob`'s `Buffer`s.
- The `Blob`'s distribution (i.e., is all the data in a single `Buffer`, or is it spread out among multiple `Buffer`s on multiple nodes?).
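As a sketch of how such an access score might be computed, consider the size-weighted average bandwidth of a blob's buffers, normalized by the fastest tier's bandwidth. The `BufferInfo` struct and `AccessScore` function below are illustrative assumptions, not the Hermes API:

```cpp
#include <vector>

// Illustrative only: a Buffer's size and the bandwidth of the tier holding it.
struct BufferInfo {
  double size_bytes;
  double tier_bandwidth_mbps;
};

// Access score sketch: the blob's size-weighted average bandwidth,
// normalized by the fastest tier's bandwidth, so 1.0 means every byte
// lives in the fastest tier and values near 0 mean the slowest.
double AccessScore(const std::vector<BufferInfo> &buffers,
                   double max_tier_bandwidth_mbps) {
  double total_bytes = 0;
  double weighted_bandwidth = 0;
  for (const auto &b : buffers) {
    total_bytes += b.size_bytes;
    weighted_bandwidth += b.size_bytes * b.tier_bandwidth_mbps;
  }
  if (total_bytes == 0) return 0.0;
  return (weighted_bandwidth / total_bytes) / max_tier_bandwidth_mbps;
}
```

A blob split across tiers then lands between the two extremes, which gives the organizer a continuous signal rather than a binary "fast/slow" label.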
The goal of the `BufferOrganizer` is to ensure that each `Blob`'s access score is closely aligned with its importance score.

- If a `Blob`'s `Bucket` has a reference count of 0 (i.e., no process has an open handle to the `Bucket`), then the importance score should be 0. The score is only calculated once at least one process opens the `Bucket`.
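One way the importance-score factors and the reference-count rule could combine is sketched below. The weights, decay constants, and `BlobStats` fields are hypothetical; the real factor weighting is an open design choice:

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical inputs; field names and weights are illustrative.
struct BlobStats {
  int bucket_ref_count;         // open handles on the containing Bucket
  double seconds_since_access;  // recency
  double accesses_per_minute;   // frequency
  double user_priority;         // user-supplied hint in [0, 1]
};

// Importance score sketch: 0 when no process has the Bucket open,
// otherwise a blend of recency, frequency, and the user hint,
// clamped to [0, 1].
double ImportanceScore(const BlobStats &s) {
  if (s.bucket_ref_count == 0) return 0.0;
  double recency = std::exp(-s.seconds_since_access / 60.0);  // decays over ~1 min
  double frequency = std::min(1.0, s.accesses_per_minute / 100.0);
  double score = 0.4 * recency + 0.4 * frequency + 0.2 * s.user_priority;
  return std::clamp(score, 0.0, 1.0);
}
```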
All `BufferOrganizer` operations are implemented in terms of 3 simple operators:

- `MOVE(BufferID, TargetID)`
- `COPY(BufferID, TargetID)`
- `DELETE(BufferID)`
With these operators, we can build more complex tasks:
Move a `BufferID` from one set of `Target`s to another. This can be initiated by:

- The System (load balancing)
- The User (producer/consumer)
Move a set of `BufferID`s from one set of `Target`s to an unspecified location (which could even be swap space). This can be triggered by:
- Put (DPE)
- Get (Prefetcher)
- Thread that updates the `SystemViewState` (enforces a minimum capacity threshold passed in through the config).
- DPE?
- BO?
- Could theoretically reap performance benefits of collective IO operations, although I don't think we'll ever be able to capitalize on this because each rank must act independently and can't synchronize with the other ranks.
- Less stress on the PFS metadata server.
- Don't have to worry about reserving size for each rank.
- Don't have to worry about locking.
- We'll go with this for the initial implementation.
- Don't have to worry about locking or reserving size with respect to
the buffer organizer. However, since multiple ranks could
potentially write to the same swap file, we need to either
- Filter all swap traffic through the buffer organizer
- Synchronize all access to the file
- Won't overload the metadata servers as badly as file per rank.
The Buffer Organizer can be triggered in 3 ways:
Periodically; the period can be controlled by a configuration variable.
- If, for any reason, a client DPE places data to the swap target, it will also trigger the buffer organizer by adding an event to the buffer organizer's queue.
- We store the blob name, the offset into the swap target (for file-based targets), and the blob size.
- When the buffer organizer processes an event, it
  - Reads the blob from the swap target into memory.
  - Calls `Put` to place the blob into the hierarchy. If the `Put` fails, it tries again, up to `num_buffer_organizer_retries` (configurable) times.
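The event-processing steps above can be sketched as follows. The `SwapEvent` fields mirror what the trigger stores (blob name, offset, size); `read_from_swap` and `put` are stand-ins for the real swap I/O and Hermes `Put` call:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// What the swap trigger stores per event: blob name, offset into a
// file-based swap target, and blob size.
struct SwapEvent {
  std::string blob_name;
  size_t offset;
  size_t size;
};

// Sketch of event processing: read the blob back from swap, then retry
// Put up to a configurable number of times before giving up.
bool ProcessSwapEvent(
    const SwapEvent &event, int num_buffer_organizer_retries,
    const std::function<std::vector<char>(const SwapEvent &)> &read_from_swap,
    const std::function<bool(const std::string &,
                             const std::vector<char> &)> &put) {
  std::vector<char> data = read_from_swap(event);
  // One initial attempt plus up to num_buffer_organizer_retries retries.
  for (int attempt = 0; attempt <= num_buffer_organizer_retries; ++attempt) {
    if (put(event.blob_name, data)) return true;
  }
  return false;
}
```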
- Nothing is implemented yet.
- Should the BO constantly monitor the buffering hierarchy and attempt to maintain a set of rules (remaining capacity percentage, thresholds, etc.)?
- Should the BO simply carry out "orders" and not attempt to make its own decisions? If so, who gives the orders?
- Should the BO be available for other asynchronous tasks?
- (At least) 2 different priority lanes
- Node local and remote queues (but only for neighborhoods, not global queues).
- Need ability to restrict queue length
- RPC is used to route `BoTask`s to the appropriate Hermes core.
- The BO RPC server only has one function: `bool EnqueueBoTask(BoTask task, Priority priority);`
- Argobots pools
- High and low priorities
- Basic FIFO queue by default
- Completely customizable (e.g., could be a priority queue, min-heap, etc.)
- Argobots schedulers
- Takes tasks from the queues and runs them on OS threads as user level threads (basically coroutines).
- Completely customizable.
- By default, one scheduler is associated with a single execution stream (OS thread).
- Only take tasks from low priority queue if high priority queue is empty?
- Argobots execution streams
- Bound to a processing element (CPU core or hyperthread), and shouldn't be oversubscribed.
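Putting the queue notes together, a minimal sketch of the two priority lanes with a restricted queue length and the "low lane only when high is empty" pop policy might look like this (the `BoTask` and `Priority` types here are illustrative stand-ins; in the real design the lanes would be Argobots pools drained by a scheduler):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Illustrative stand-ins for the real BoTask and Priority types.
struct BoTask { uint64_t id; };
enum class Priority { kHigh, kLow };

// Two priority lanes with a bounded length: EnqueueBoTask rejects work
// when a lane is full, and Dequeue only drains the low-priority lane
// when the high-priority lane is empty.
class BoQueue {
 public:
  explicit BoQueue(size_t max_len) : max_len_(max_len) {}

  bool EnqueueBoTask(BoTask task, Priority priority) {
    auto &lane = (priority == Priority::kHigh) ? high_ : low_;
    if (lane.size() >= max_len_) return false;  // restrict queue length
    lane.push_back(task);
    return true;
  }

  // Pop policy: high lane first; low lane only when high is empty.
  std::optional<BoTask> Dequeue() {
    auto &lane = !high_.empty() ? high_ : low_;
    if (lane.empty()) return std::nullopt;
    BoTask task = lane.front();
    lane.pop_front();
    return task;
  }

 private:
  size_t max_len_;
  std::deque<BoTask> high_, low_;
};
```

This strict-priority policy starves the low lane under sustained high-priority load, which is one reason the question mark above (drain low only when high is empty?) is still open.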
- An importance score of 0 could be the signal to flush a `Blob` to the PFS.
- `StageIn` and `StageOut` APIs.
- Reverse "gravity" for read-heavy workloads: `Blob`s trickle up to higher tiers.
- Explicitly maintain `Target` capacity thresholds.
- Introduce horizontal movement if Topology threshold is exceeded.
- Test the difference between flushing tier by tier vs. skipping tiers. For example, a `Blob` moving from RAM to burst buffer could go through NVMe as an intermediate tier, or skip it altogether.