Buffer Organizer
The Buffer Organizer is the "corrector" half of our predictor/corrector model. It attempts to correct sub-optimal DPE placements by moving data among buffers. Its responsibilities include:
- Management of hierarchical buffering space
- Data flushing
- Read acceleration
- Management of the data life cycle, or journey

Two guiding questions:
- When is the blob in equilibrium?
- How do we eliminate unnecessary data movement?
We attempt to meet the above objectives via a Blob scoring system. Each Blob has two scores associated with it: an importance score and an access score.

The importance score is a real number in the range [0, 1], where 0 represents a Blob that is not important to the user (i.e., it will not be accessed) and 1 represents a Blob that will be accessed very frequently, very soon, or both. It is influenced by the following factors:
- Blob Size
- Blob Name
- Recency of Blob access
- Frequency of Blob access
- Reference count of the Blob's Bucket (how many processes currently have the Bucket open).
- Number of times the Blob was linked to a VBucket.
- User-supplied priority (this is only a hint, not a guarantee).
The access score is a real number in the range [0, 1], where 0 represents a Blob with the slowest access time (i.e., all its Buffers are in the slowest tier) and 1 represents a Blob with the quickest access time (all its Buffers are in the fastest tier). It depends on the following factors:
- Bandwidth of the tiers that contain the Blob's Buffers.
- The Blob's distribution (is all the data in a single Buffer, or is it spread out among multiple Buffers on multiple nodes?).
The goal of the BufferOrganizer is to ensure that each Blob's access score is closely aligned with its importance score.
- If a Blob's Bucket has a reference count of 0 (i.e., no process has an open handle to the Bucket), then its importance score should be 0. The score is only calculated once at least one process opens the Bucket.
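
To make the scoring model concrete, here is a minimal sketch of how the two scores might be computed and compared. The struct fields, weights, and threshold are hypothetical illustrations, not the actual Hermes implementation.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical per-Blob statistics feeding the importance score.
struct BlobStats {
  double recency;        // [0, 1], 1 = accessed very recently
  double frequency;      // [0, 1], 1 = accessed very frequently
  int bucket_ref_count;  // processes with the Blob's Bucket open
  double user_priority;  // user-supplied hint in [0, 1]
};

// Importance score: an illustrative weighted sum of the factors above.
double ImportanceScore(const BlobStats &s) {
  if (s.bucket_ref_count == 0) {
    return 0.0;  // no open handles, so the Blob is not important (yet)
  }
  double score = 0.4 * s.recency + 0.4 * s.frequency + 0.2 * s.user_priority;
  return std::clamp(score, 0.0, 1.0);
}

// Access score: where the Blob's aggregate bandwidth falls between the
// slowest and fastest tiers. 0 = all Buffers in the slowest tier,
// 1 = all Buffers in the fastest tier.
double AccessScore(double blob_bw, double slowest_bw, double fastest_bw) {
  return (blob_bw - slowest_bw) / (fastest_bw - slowest_bw);
}

// The BufferOrganizer acts when the scores diverge by more than epsilon:
// access < importance suggests promoting Buffers to faster tiers;
// access > importance suggests demoting them to free up fast space.
bool NeedsOrganization(const BlobStats &s, double access, double epsilon) {
  return std::fabs(ImportanceScore(s) - access) > epsilon;
}
```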
All BufferOrganizer operations are implemented in terms of three simple operators (sketched below):
- MOVE(BufferID, TargetID)
- COPY(BufferID, TargetID)
- DELETE(BufferID)
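
Expressed as data, the operators might look like the following hypothetical sketch (BufferID and TargetID are stand-ins for the real Hermes identifier types):

```cpp
#include <cstdint>

// Hypothetical stand-ins for Hermes' buffer and target identifiers.
using BufferID = uint64_t;
using TargetID = uint64_t;

// The three primitive operators; everything the BufferOrganizer does
// reduces to a sequence of these.
enum class BoOp { kMove, kCopy, kDelete };

struct BoTask {
  BoOp op;
  BufferID buf;
  TargetID target;  // ignored for kDelete
};
```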
With these operators, we can build more complex tasks:
Move a BufferID from one set of Targets to another. This can be initiated by:
- The System (load balancing)
- The User (producer/consumer)
Move a set of BufferIds from one set of Targets to an unspecified location (could even be swap space), as sketched after this list. This can be initiated by:
- Put (DPE)
- Get (Prefetcher)
- The thread that updates the SystemViewState (enforces a minimum capacity threshold passed in through the config).
  - Open question: is this thread the responsibility of the DPE or the BO?
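
For example, the second task could decompose into primitive operations as in this hypothetical sketch, where PickTarget stands in for whatever policy chooses the destination (possibly the swap target):

```cpp
#include <cstdint>
#include <vector>

// Types repeated from the previous sketch for self-containment.
using BufferID = uint64_t;
using TargetID = uint64_t;
enum class BoOp { kMove, kCopy, kDelete };
struct BoTask { BoOp op; BufferID buf; TargetID target; };

constexpr TargetID kSwapTarget = 0;  // hypothetical reserved swap target

// Stand-in policy: choose a destination for a buffer, falling back to swap.
TargetID PickTarget(BufferID buf) {
  (void)buf;
  return kSwapTarget;  // placeholder: a real policy would consult capacities
}

// "Move a set of BufferIds from one set of Targets to an unspecified
// location" expressed as a batch of primitive MOVE operations.
std::vector<BoTask> MoveToUnspecified(const std::vector<BufferID> &bufs) {
  std::vector<BoTask> tasks;
  for (BufferID buf : bufs) {
    tasks.push_back({BoOp::kMove, buf, PickTarget(buf)});
  }
  return tasks;
}
```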
When GetBuffers
fails (because constraints can't be met or we are out
of buffering space), we send blobs to Swap Space. We reserve a
special Buffering Target for this purpose called the Swap
Target. This special target is never considered by a DPE as a
buffering target. It is only meant as a "dumping ground" for blobs that
don't fit in our buffering space. It will usually be backed by a parallel file system, but it could also be backed by AWS or any other storage. From an API perspective, a blob in swap space is no different
from a blob elsewhere in the hierarchy. You can Get it, ask for its
metadata, Delete it, etc.
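
The fallback path might look like the following sketch. Only GetBuffers is named in this document; every other name and signature here is a hypothetical stub for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Blob {
  std::string name;
  std::vector<char> data;
};

// Stubs for illustration only; not the real Hermes API.
bool GetBuffers(size_t size) { (void)size; return false; }  // out of space
void WriteToBuffers(const Blob &b) { (void)b; }
size_t WriteToSwapTarget(const Blob &b) { (void)b; return 0; }  // offset
void EnqueueBoEvent(const std::string &name, size_t offset, size_t size) {
  (void)name; (void)offset; (void)size;
}

// Put path: if GetBuffers can't satisfy the placement, fall back to the
// swap target and notify the buffer organizer. Either way the blob stays
// fully usable through the normal API (Get, metadata queries, Delete).
void PutBlob(const Blob &blob) {
  if (GetBuffers(blob.data.size())) {
    WriteToBuffers(blob);
    return;
  }
  size_t offset = WriteToSwapTarget(blob);
  EnqueueBoEvent(blob.name, offset, blob.data.size());
}
```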
- For now we'll assume that the swap target is backed by a parallel file system.
- We'll keep one swap file per node, assuming we stick with one buffer organizer per node. We'll go with this for the initial implementation.
  - Could theoretically reap the performance benefits of collective I/O operations, although we will likely never be able to capitalize on this because each rank must act independently and can't synchronize with the other ranks.
  - Less stress on the PFS metadata server; won't overload it as badly as a file per rank.
  - Don't have to worry about locking or reserving size for each rank with respect to the buffer organizer. However, since multiple ranks could potentially write to the same swap file, we need to either
    - filter all swap traffic through the buffer organizer, or
    - synchronize all access to the file.
- Another option we considered is using existing buffers as the swap space:
  - (+) Can reuse a lot of code paths.
  - (−) Have to decide sizes ahead of time.
  - (−) Cuts into our RAM.
  - (−) Might run out of buffers.
The Buffer Organizer can be triggered in three ways:
- Periodically; the period can be controlled by a configuration variable.
- If, for any reason, a client DPE places data to the swap target, it will also trigger the buffer organizer by adding an event to the buffer organizer's queue.
  - We store the blob name, the offset into the swap target (for file-based targets), and the blob size.
- When the buffer organizer processes an event, it (see the sketch below):
  - Reads the blob from the swap target into memory.
  - Calls Put to place the blob into the hierarchy. If the Put fails, it tries again, up to `num_buffer_organizer_retries` (configurable) times.
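
A sketch of handling a single event follows; the names are hypothetical, and Put here is a stub standing in for the real Hermes call.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// What we store per swap event: blob name, offset into the (file-based)
// swap target, and blob size.
struct SwapEvent {
  std::string blob_name;
  size_t offset;
  size_t size;
};

// Stubs for illustration; real code would read from the swap file at
// ev.offset and call the actual Hermes Put.
std::vector<char> ReadFromSwap(const SwapEvent &ev) {
  return std::vector<char>(ev.size);
}
bool Put(const std::string &name, const std::vector<char> &data) {
  (void)name; (void)data;
  return true;
}

// Read the blob back into memory, then attempt Put, retrying up to
// num_buffer_organizer_retries additional times. If every attempt fails,
// the blob simply remains in swap and can be retried on a later pass.
bool ProcessSwapEvent(const SwapEvent &ev, int num_buffer_organizer_retries) {
  std::vector<char> data = ReadFromSwap(ev);
  for (int i = 0; i <= num_buffer_organizer_retries; ++i) {
    if (Put(ev.blob_name, data)) {
      return true;
    }
  }
  return false;
}
```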
- Nothing is implemented yet.
- Should the BO constantly monitor the buffering hierarchy and attempt to maintain a set of rules (remaining capacity percentage, thresholds, etc.)?
- Should the BO simply carry out "orders" and not attempt to make its own decisions? If so, who gives the orders?
- Should the BO be available for other asynchronous tasks?
- (At least) two different priority lanes.
- Node-local and remote queues (but only for neighborhoods, not global queues).
- Need the ability to restrict queue length.
- RPC is used to route BoTasks to the appropriate Hermes core.
- The BO RPC server only has one function: `bool EnqueueBoTask(BoTask task, Priority priority);`
- Argobots pools (see the sketch after this list)
  - High and low priorities.
  - Basic FIFO queue by default.
  - Completely customizable (e.g., could be a priority queue, min-heap, etc.).
- Argobots schedulers
  - Take tasks from the queues and run them on OS threads as user-level threads (basically coroutines).
  - Completely customizable.
  - By default, one scheduler is associated with a single execution stream (OS thread).
  - Only take tasks from the low-priority queue if the high-priority queue is empty?
- Argobots execution streams
  - Bound to a processing element (CPU core or hyperthread), and shouldn't be oversubscribed.
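
Below is a minimal Argobots sketch of this arrangement, assuming the predefined priority scheduler (ABT_SCHED_PRIO), which drains lower-index pools first. It illustrates the design above rather than actual Hermes code; error checking is omitted.

```cpp
#include <abt.h>
#include <cstdio>

// A task body; in Hermes this would execute a MOVE/COPY/DELETE sequence.
static void BoTaskFn(void *arg) {
  std::printf("running BO task %ld\n", reinterpret_cast<long>(arg));
}

int main(int argc, char **argv) {
  ABT_init(argc, argv);

  // Two basic FIFO pools: index 0 = high priority, index 1 = low priority.
  ABT_pool pools[2];
  ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC, ABT_TRUE,
                        &pools[0]);
  ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC, ABT_TRUE,
                        &pools[1]);

  // One execution stream (bound to an OS thread) whose priority scheduler
  // only pulls from the low-priority pool when the high-priority pool is
  // empty.
  ABT_xstream xstream;
  ABT_xstream_create_basic(ABT_SCHED_PRIO, 2, pools, ABT_SCHED_CONFIG_NULL,
                           &xstream);

  // EnqueueBoTask(task, priority) would boil down to a thread create into
  // the pool selected by the priority argument.
  ABT_thread tasks[2];
  ABT_thread_create(pools[1], BoTaskFn, reinterpret_cast<void *>(2L),
                    ABT_THREAD_ATTR_NULL, &tasks[1]);  // low priority
  ABT_thread_create(pools[0], BoTaskFn, reinterpret_cast<void *>(1L),
                    ABT_THREAD_ATTR_NULL, &tasks[0]);  // high priority

  for (int i = 0; i < 2; ++i) ABT_thread_free(&tasks[i]);
  ABT_xstream_join(xstream);
  ABT_xstream_free(&xstream);
  ABT_finalize();
  return 0;
}
```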