Buffer Organizer
The Buffer Organizer is the "corrector" half of our predictor/corrector model. It attempts to correct sub-optimal DPE placements by moving data among buffers.
- Management of hierarchical buffering space
- Data flushing
- Read acceleration
- Management of the data life cycle, or journey
- When is the blob in equilibrium?
- How do we eliminate unnecessary data movement?
We attempt to meet the above objectives via a Blob scoring system. Each Blob has two different scores associated with it: the importance score and the access score.

The importance score is a real number in the range [0, 1], where 0 represents a Blob that is not important to the user (i.e., it will not be accessed) and 1 represents a Blob that will be accessed very frequently, very soon, or both. The importance score is influenced by factors such as:
- Blob Size
- Blob Name
- Recency of Blob access
- Frequency of Blob access
- Reference count of the Blob's Bucket (how many processes currently have the Bucket open).
- Number of times the Blob was linked to a VBucket.
- User-supplied priority (this is only a hint, not a guarantee).
The access score is a real number in the range [0, 1], where 0 represents a Blob with the slowest access time (i.e., all its Buffers are in the slowest tier) and 1 represents a Blob with the quickest access time (all its Buffers are in the fastest tier). The access score is determined by:
- Bandwidth of the tiers that contain the Blob's Buffers.
- The Blob's distribution (is all the data in a single Buffer, or is it spread out among multiple Buffers on multiple nodes?).
The goal of the BufferOrganizer is to ensure that each Blob's access score is closely aligned with its importance score.
All BufferOrganizer operations are implemented in terms of 3 simple operators:
- MOVE(BufferID, TargetID)
- COPY(BufferID, TargetID)
- DELETE(BufferID)
With these operators, we can build more complex tasks:
Move a BufferID from one set of Targets to another. This can be requested by:
- The System (load balancing)
- The User (producer/consumer)
Move a set of BufferIDs from one set of Targets to an unspecified location (could even be swap space). This can be requested by:
- Put (DPE)
- Get (Prefetcher)
- Thread that updates the SystemViewState (enforces a minimum capacity threshold passed in through the config).
- DPE?
- BO?
When GetBuffers
fails (because constraints can't be met or we are out
of buffering space), we send blobs to Swap Space. We reserve a
special Buffering Target for this purpose called the Swap
Target. This special target is never considered by a DPE as a
buffering target. It is only meant as a "dumping ground" for blobs that
don't fit in our buffering space. It will usually be backed by a
parallel file system, but could also be backed by AWS, or any other
storage. From an API perspective, a blob in swap space is no different
from a blob elsewhere in the hierarchy. You can Get it, ask for its
metadata, Delete it, etc.
- For now we'll assume that the swap target is backed by a parallel file system.
- We'll keep one swap file per node, assuming we stick with one buffer organizer per node.
- Could theoretically reap performance benefits of collective IO operations, although I don't think we'll ever be able to capitalize on this because each rank must act independently and can't synchronize with the other ranks.
- Less stress on the PFS metadata server.
- Don't have to worry about reserving size for each rank.
- Don't have to worry about locking.
- We'll go with this for the initial implementation.
- Don't have to worry about locking or reserving size with respect to
the buffer organizer. However, since multiple ranks could
potentially write to the same swap file, we need to either
- Filter all swap traffic through the buffer organizer
- Synchronize all access to the file
- Won't overload the metadata servers as badly as file-per-rank.
- + Can reuse a lot of code paths.
- - Have to decide sizes ahead of time.
- - Cuts into our RAM.
- - Might run out of buffers.
The Buffer Organizer can be triggered in 3 ways:
- Periodically; the period can be controlled by a configuration variable.
- If, for any reason, a client DPE places data to the swap target, it will also trigger the buffer organizer by adding an event to the buffer organizer's queue.
- We store the blob name, the offset into the swap target (for file-based targets), and the blob size.
- When the buffer organizer processes an event, it:
  - Reads the blob from the swap target into memory.
  - Calls Put to place the blob into the hierarchy. If the Put fails, it tries again, up to `num_buffer_organizer_retries` (configurable) times.
- Nothing is implemented yet.
- Should the BO constantly monitor the buffering hierarchy and attempt to maintain a set of rules (remaining capacity percentage, thresholds, etc.)?
- Should the BO simply carry out "orders" and not attempt to make its own decisions? If so, who gives the orders?
- Should the BO be available for other asynchronous tasks?
- (At least) 2 different priority lanes
- Node local and remote queues (but only for neighborhoods, not global queues).
- Need ability to restrict queue length
- RPC is used to route BoTasks to the appropriate Hermes core.
- The BO RPC server only has one function: `bool EnqueueBoTask(BoTask task, Priority priority);`
- Argobots pools
- High and low priorities
- Basic FIFO queue by default
- Completely customizable (e.g., could be a priority queue, min-heap, etc.)
- Argobots schedulers
- Takes tasks from the queues and runs them on OS threads as user level threads (basically coroutines).
- Completely customizable.
- By default, one scheduler is associated with a single execution stream (OS thread).
- Only take tasks from low priority queue if high priority queue is empty?
- Argobots execution streams
- Bound to a processing element (CPU core or hyperthread), and shouldn't be oversubscribed.