This document presents a high-level design (HLD) of scalable resource management interfaces for Motr.
The main purposes of this document are:
- To be inspected by M0 architects and peer designers to ascertain that the high-level design is aligned with M0 architecture and other designs, and contains no defects.
- To be a source of material for Active Reviews of Intermediate Design (ARID) and the detailed-level design (DLD) of the same component.
- To serve as a design reference document.
The intended audience of this document consists of M0 customers, architects, designers, and developers.
Motr functionality, both internal and external, is often specified in terms of resources.
- A resource is part of the system or its environment for which a notion of ownership is well-defined. Resource ownership is used for two purposes:
  - Concurrency control: resource owners can manipulate the resource, and the ownership transfer protocol ensures that owners do not step on each other. That is, resources provide a traditional distributed locking mechanism.
  - Replication control: the owner can create a (local) copy of a resource. The ownership transfer protocol, with the help of version numbers, guarantees that multiple replicas are re-integrated correctly. That is, resources provide a cache coherency mechanism. A global cluster-wide cache management policy can be implemented on top of resources.
- A resource owner uses the resource via a usage credit (also called a resource credit, or simply a credit, as context permits). For example, a client might have a credit for read-only, write-only, or read-write access to a certain extent in a file. An owner is granted credit to use a resource.
- A usage credit granted to an owner is held (or pinned) when its existence is necessary for the correctness of ongoing resource usage. For example, a lock on a data extent must be held while an IO operation is going on, and a meta-data lock on a directory must be held while a new file is created in the directory. Otherwise, the granted credit is cached.
- A resource belongs to a specific resource type, which determines resource semantics.
- A conflict occurs on an attempt to use a resource with a credit that is incompatible with an already granted credit. Conflicts are resolved by a conflict resolution policy specific to the resource type in question.
- To acquire a resource usage credit, a prospective owner enqueues a resource acquisition request to a resource owner.
- An owner can relinquish its usage credits by sending a resource cancel request to another resource owner, which assumes the relinquished credits.
- A usage credit can be associated with a lease, which is a time interval for which the credit is granted. The usage credit is automatically canceled at the end of the lease. A lease can be renewed.
- One possible conflict resolution policy would revoke all already granted conflicting credits before granting the new credit. Revocation is effected by sending conflict call-backs to the owners of the credits. The owners are expected to react by canceling their cached credits.
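The distinction between held and cached credits, together with leases, can be illustrated with a small model. This is only a sketch under stated assumptions: the `Credit` class, its fields, and the `can_cancel` helper are hypothetical names chosen for illustration and are not part of the Motr interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Credit:
    """Hypothetical model of a usage credit (illustrative names only)."""
    description: str                      # e.g. "read [0, 4096)"
    lease_expiry: Optional[float] = None  # absolute time; None means no lease
    pins: int = 0                         # ongoing operations relying on the credit

    def hold(self) -> None:
        """Pin the credit for the duration of an operation."""
        self.pins += 1

    def release(self) -> None:
        """Unpin the credit; with no pins left it becomes cached again."""
        assert self.pins > 0, "release without a matching hold"
        self.pins -= 1

    @property
    def is_held(self) -> bool:
        return self.pins > 0

    @property
    def is_cached(self) -> bool:
        return self.pins == 0

    def can_cancel(self) -> bool:
        """A conflict call-back can be honored at once only for a cached credit."""
        return self.is_cached

    def expired(self, now: float) -> bool:
        """A leased credit is treated as canceled once its lease ends."""
        return self.lease_expiry is not None and now >= self.lease_expiry

    def renew(self, now: float, duration: float) -> None:
        """Extend the lease by `duration` starting from `now`."""
        self.lease_expiry = now + duration
```

In this model a credit pinned by an IO operation reports `can_cancel() == False`, so a conflict call-back would have to wait until the operation calls `release()`.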
- [R.M0.LAYOUT.LAYID.RESOURCE]: layids are handled as a distributed resource (similarly to fids).
- [R.M0.RESOURCE]: scalable hierarchical resource allocation is supported.
- [R.M0.RESOURCE.CACHEABLE]: resources can be cached by clients.
- [R.M0.RESOURCE.HIERARCICAL]: resources are distributed hierarchically.
- [R.M0.RESOURCE.CALLBACK-REVOKE]: scalable call-back and revocation model: revocation can span multiple nodes, each owning a part of a resource.
- [R.M0.RESOURCE.RECLAIM]: unused resources are reclaimed from users.
Additional Requirements
- [r.resource.enqueue.async]: a resource can be enqueued asynchronously.
- [r.resource.ordering]: a total ordering of all resources is defined. Resources are enqueued according to the ordering, thus avoiding deadlocks.
- [r.resource.persistent]: a record of resource usage credit acquisition can be persistent (e.g., for disconnected operation).
- [r.resource.conversion]: a resource usage credit can be converted into another usage credit.
- [r.resource.adaptive]: dynamic switch into a lockless mode.
- [r.resource.revocation-partial]: part of a granted resource usage credit can be revoked.
- [r.resource.sublet]: an owner can grant usage credits to further owners, thus organizing a hierarchy of owners.
- [r.resource.separate]: resource management is separate from actual resource placement. For example, locks on file data extents are distributed by a locking service that is separate from data servers.
- [r.resource.open-file]: an open file is a resource (with a special property that this resource can be revoked until the owner closes the file).
- [r.resource.lock]: a distributed lock is a resource.
- [r.resource.resource-count]: a count of resource usage credit granted to a particular owner is a resource.
- [r.resource.grant]: free storage space is a resource.
- [r.resource.quota]: storage quota is a resource.
- [r.resource.memory]: server memory is a resource.
- [r.resource.cpu-cycles]: server cpu-cycles are a resource.
- [r.resource.fid]: file identifier is a resource.
- [r.resource.inode-number]: file inode number is a resource.
- [r.resource.network-bandwidth]: network bandwidth is a resource.
- [r.resource.storage-bandwidth]: storage bandwidth is a resource.
- [r.resource.cluster-configuration]: cluster configuration is a resource.
- [r.resource.power]: (electrical) power consumed by a device is a resource.
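The deadlock-avoidance argument behind [r.resource.ordering] can be illustrated as follows. This is a minimal sketch, assuming each resource carries an integer `rid` that defines the total order; the `Resource` class and both helper functions are hypothetical, and local thread locks stand in for distributed credits.

```python
import threading


class Resource:
    """Hypothetical resource with a position in an assumed total order."""

    def __init__(self, rid: int):
        self.rid = rid
        self.lock = threading.Lock()


def acquire_in_order(resources):
    """Acquire all resources in ascending `rid` order.

    Because every caller uses the same order, no two callers can each
    hold a resource that the other is waiting for.
    """
    ordered = sorted(resources, key=lambda r: r.rid)
    for r in ordered:
        r.lock.acquire()
    return ordered


def release_all(ordered):
    """Release in reverse acquisition order."""
    for r in reversed(ordered):
        r.lock.release()
```

A caller that needs resources 2 and 1 passes them in any order; `acquire_in_order` always takes 1 first.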
- hierarchical resource names. Resource name assignment can be simplified by introducing variable length resource identifiers.
- conflict-free schedules: no observable conflicts. Before a resource usage credit is canceled, the owner must re-integrate all changes made to the local copy of the resource. Conflicting usage credits can be granted only after all changes are re-integrated. Yet, the ordering between actual re-integration network requests and cancellation requests can be arbitrary, subject to server-side NRS policy.
- resource management code is split into two parts:
  - generic code that implements functionality independent of a particular resource type (request queuing, resource ordering, etc.).
  - per-resource type code that implements type-specific functionality (conflict resolution, etc.).
- an important distinction with a more traditional design (as exemplified by the Vax Cluster or Lustre distributed lock managers) is that there is no strict separation of rôles between "resource manager" and "resource user": the same resource owner can request usage credits from and grant usage credits to other resource owners. This reflects the more dynamic nature of Motr resource control flow, with its hierarchical and peer-to-peer caches.
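The split between generic and per-resource-type code described above might look like the following. This is a sketch under stated assumptions: `ResourceType`, `LockType`, and `RequestQueue` are hypothetical names, and the read/write conflict rule is a toy policy chosen only for illustration.

```python
from abc import ABC, abstractmethod
from collections import deque


class ResourceType(ABC):
    """Per-resource-type hooks (hypothetical interface)."""

    @abstractmethod
    def conflicts(self, requested, granted) -> bool:
        """Return True if `requested` cannot coexist with `granted`."""


class LockType(ResourceType):
    """Toy lock type: only read/read is compatible."""

    def conflicts(self, requested, granted) -> bool:
        return requested == "write" or granted == "write"


class RequestQueue:
    """Generic code: queues requests and delegates conflict checks to the type."""

    def __init__(self, rtype: ResourceType):
        self.rtype = rtype
        self.granted = []
        self.pending = deque()

    def _blocked(self, credit) -> bool:
        return any(self.rtype.conflicts(credit, g) for g in self.granted)

    def request(self, credit) -> bool:
        """Grant the credit if possible, otherwise enqueue it."""
        if self._blocked(credit):
            self.pending.append(credit)
            return False
        self.granted.append(credit)
        return True

    def cancel(self, credit) -> None:
        """Drop a granted credit and retry pending requests in FIFO order."""
        self.granted.remove(credit)
        still_pending = deque()
        while self.pending:
            c = self.pending.popleft()
            if self._blocked(c):
                still_pending.append(c)
            else:
                self.granted.append(c)
        self.pending = still_pending
```

Only `LockType.conflicts` knows about lock modes; `RequestQueue` would work unchanged with any other `ResourceType` subclass.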
The external resource management interface is centered around the following data types:
- a resource type
- a resource owner
- a usage credit
- a request for resource usage credit.
The following sequence diagram illustrates the interaction between resource users, resource owners, and resource servers.
Here a solid arrow represents a (local) function call, and a dashed arrow represents a potentially remote call.
The external resource management interface consists of the following calls:
- credit_get(resource_owner, resource_credit_description, notify_callback): obtains the specified resource usage credit. If no matching credit is granted to the owner, the credit acquisition request is enqueued to the primary resource owner, if any. This call is asynchronous and signals completion through some synchronization mechanism (e.g., a condition variable). The call outcome can be one of:
  - success: a credit matching the description is granted;
  - denied: the usage credit cannot be granted. The user is not allowed to cache the resource and must use no-cache operation mode;
  - error: some other error, e.g., a communication failure, occurred.

  Several additional flags that modify call behavior can be specified:
  - non-block-local: deny immediately if no matching credit is granted (i.e., don't enqueue).
  - non-block-remote: deny if no matching credit is granted to the primary owner (i.e., don't resolve conflicts).

  On successful completion, the granted credit is held. notify_callback is invoked by the resource manager when the cached resource credit has to be revoked to satisfy a conflict resolution or some other policy.
- credit_put(resource_credit): releases a held credit.
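The calling convention of credit_get and credit_put can be sketched as below. This is an illustrative model, not the Motr API: the `Owner`, `Outcome`, and `Flags` names are hypothetical, matching is reduced to string equality, a `Future` stands in for the unspecified synchronization mechanism, and conflict resolution at the primary owner is not modelled (so a request the primary cannot satisfy directly is always denied, as if non-block-remote were set).

```python
from concurrent.futures import Future
from enum import Enum, Flag


class Outcome(Enum):
    SUCCESS = 1
    DENIED = 2
    ERROR = 3


class Flags(Flag):
    NON_BLOCK_LOCAL = 1   # deny at once if no matching local credit
    NON_BLOCK_REMOTE = 2  # deny if the primary owner has no matching credit


class Owner:
    """Hypothetical resource owner; credits are plain description strings."""

    def __init__(self, credits=(), primary=None):
        self.granted = set(credits)  # credits granted to this owner
        self.pins = {}               # description -> number of holders
        self.callbacks = {}          # description -> notify_callback
        self.primary = primary

    def credit_get(self, description, notify_callback, flags=Flags(0)) -> Future:
        done = Future()
        if description in self.granted:
            outcome = Outcome.SUCCESS
        elif Flags.NON_BLOCK_LOCAL in flags or self.primary is None:
            outcome = Outcome.DENIED
        elif description in self.primary.granted:
            # The primary owner hands its matching credit over to this owner.
            self.primary.granted.discard(description)
            self.granted.add(description)
            outcome = Outcome.SUCCESS
        else:
            # Assumption: conflict resolution is not modelled in this sketch.
            outcome = Outcome.DENIED
        if outcome is Outcome.SUCCESS:
            # On success the granted credit is held.
            self.pins[description] = self.pins.get(description, 0) + 1
            self.callbacks[description] = notify_callback
        done.set_result(outcome)
        return done

    def credit_put(self, description) -> None:
        """Release a held credit; it stays granted (cached) afterwards."""
        assert self.pins.get(description, 0) > 0, "credit is not held"
        self.pins[description] -= 1

    def is_held(self, description) -> bool:
        return self.pins.get(description, 0) > 0
```

A caller waits on the returned future, uses the resource while the credit is held, and then calls `credit_put`; the credit remains cached until it is canceled or revoked.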
A resource owner maintains:
- an owned resource usage credit description. The exact representation of this is up to the resource type. This is the description of the resource credits that are held by this owner at the moment. Examples:
  - for the (meta-data) inode resource type: the credit description is a lock mode.
  - for the quota resource type: the credit description is a quota amount assigned to the owner (a node, typically).
  - for a component data object: the credit description is a collection of locked extents together with their lock modes. This collection could be maintained either as a list or as a more sophisticated data structure (e.g., an interval tree).
- a queue of granted resource usage credits. This is a queue of triples (credit, owner, lease) that this owner granted to other owners. Granted credits no longer belong to this owner;
- a queue of incoming pending credits. This is a queue of incoming requests for usage credits, which were sent to this resource owner and are not yet granted, due to whatever circumstances (unresolved conflict, long-term resource scheduling decision, etc.);
- a queue of outgoing pending credits. This is a queue of usage credits that users asked this resource owner to obtain, but that are not yet obtained.
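The state listed above maps naturally onto a small record type. The following is a sketch only: `OwnerState` and `sublet_quota` are hypothetical names, and the quota resource type (where the credit description is an amount) is used because it is the simplest of the examples.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Optional, Tuple


@dataclass
class OwnerState:
    """Hypothetical per-owner state; the type of `owned` is up to the resource type."""
    owned: Any                                   # owned credit description
    granted: Deque[Tuple[Any, str, Optional[float]]] = field(default_factory=deque)
    incoming: Deque[Any] = field(default_factory=deque)  # requests not yet granted
    outgoing: Deque[Any] = field(default_factory=deque)  # credits not yet obtained


def sublet_quota(state: OwnerState, amount: int, to: str,
                 lease: Optional[float] = None) -> bool:
    """Grant `amount` of quota to owner `to`, or queue the request."""
    if amount > state.owned:
        # Cannot be granted now: keep it as an incoming pending request.
        state.incoming.append((amount, to))
        return False
    # Granted credits no longer belong to this owner.
    state.owned -= amount
    state.granted.append((amount, to, lease))
    return True
```

For an extent-lock resource type, `owned` would instead be a collection of extents with lock modes, and the grant step would carve extents rather than subtract amounts.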
- [R.M0.LAYOUT.LAYID.RESOURCE], [r.resource.fid], [r.resource.inode-number]: layout, file and other identifiers are implemented as a special resource type. These identifiers must be globally unique. A typical identifier allocator operates as follows:
  - originally, a dedicated "management" node runs a resource owner that owns all identifiers (i.e., owns the [0, 0xffffffffffffffff] extent in the identifier name-space).
  - when a server runs short on identifiers (including the time when the server starts up for the first time), it enqueues a credit request to the management node. The credit description is simply the number of identifiers to grant. The management node's resource owner finds a not-yet-granted extent of suitable size and returns it to the server's resource owner.
  - depending on identifier usage, clients can similarly request identifier extents from the servers;
  - there is no conflict resolution policy.
  - identifiers can be canceled voluntarily: e.g., an inode number is canceled when the file is deleted, and the fid range is canceled when a client disconnects or is evicted.
- [R.M0.RESOURCE], [R.M0.RESOURCE.HIERARCICAL]: resource owners can enqueue credit requests to other ("master") owners and at the same time bestow credits to "slave" owners. This forms a hierarchy of owners allowing scalable resource management across the cluster.
- [R.M0.RESOURCE.CACHEABLE]: it is up to the resource type to provide a conflict resolution policy such that an owner can safely use cached resources while it possesses corresponding usage credits.
- [R.M0.RESOURCE.CALLBACK-REVOKE]: scalable call-back and revocation model: revocation can span multiple nodes, each owning a part of a resource.
- [R.M0.RESOURCE.RECLAIM]: a resource owner can voluntarily cancel a cached usage credit.
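The identifier allocator described above, with a management node at the top and servers and clients requesting extents from the level above, can be sketched as follows. The `IdOwner` class, its `batch` parameter, and the first-fit search are assumptions made for illustration; voluntary cancellation of identifiers is not modelled.

```python
class IdOwner:
    """Hypothetical owner of identifier extents (inclusive [start, end] pairs)."""

    def __init__(self, extents=(), upstream=None, batch=1024):
        self.free = list(extents)  # not-yet-granted extents owned here
        self.upstream = upstream   # owner to ask when running short
        self.batch = batch         # minimum number of identifiers to request

    def grant(self, count):
        """Return a not-yet-granted extent holding exactly `count` identifiers."""
        for i, (start, end) in enumerate(self.free):
            if end - start + 1 >= count:
                if start + count - 1 == end:
                    del self.free[i]
                else:
                    self.free[i] = (start + count, end)
                return (start, start + count - 1)
        if self.upstream is None:
            raise RuntimeError("identifier space exhausted")
        # Running short: request a larger extent from the owner above.
        self.free.append(self.upstream.grant(max(count, self.batch)))
        return self.grant(count)

    def allocate(self):
        """Allocate a single identifier."""
        return self.grant(1)[0]
```

With a management node owning `(0, 0xffffffffffffffff)`, a server with `batch=1000`, and a client with `batch=10`, the client's first `allocate()` pulls 1000 identifiers into the server and 10 into the client. Since extents are only ever carved from not-yet-granted space, no conflict resolution is needed.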
Additional requirements are:
- [r.resource.enqueue.async]: the credit_get entry point is asynchronous by definition.
- [r.resource.ordering]: a total ordering of all resources is defined. Resources are enqueued according to the ordering, thus avoiding deadlocks.
- [r.resource.persistent]: a record of resource usage credit acquisition can be persistent (e.g., for disconnected operation).
- [r.resource.conversion]: a resource usage credit can be converted into another usage credit.
- [r.resource.adaptive]: dynamic switch into a lockless mode.
- [r.resource.revocation-partial]: part of a granted resource usage credit can be revoked.
- [r.resource.sublet]: an owner can grant usage credits to further owners, thus organizing a hierarchy of owners.
- [r.resource.separate]: resource management is separate from actual resource placement. For example, locks on file data extents are distributed by a locking service that is separate from data servers.
- [r.resource.open-file]: an open file is a resource (with a special property that this resource can be revoked until the owner closes the file).
- [r.resource.lock]: a distributed lock is a resource.
- [r.resource.resource-count]: a count of resource usage credit granted to a particular owner is a resource.
- [r.resource.grant]: free storage space is a resource.
- [r.resource.quota]: storage quota is a resource.
- [r.resource.memory]: server memory is a resource.
- [r.resource.cpu-cycles]: server cpu-cycles are a resource.
- [r.resource.network-bandwidth]: network bandwidth is a resource.
- [r.resource.storage-bandwidth]: storage bandwidth is a resource.
- [r.resource.cluster-configuration]: cluster configuration is a resource.
- [r.resource.power]: (electrical) power consumed by a device is a resource.
Implementations of these methods are provided by each resource type.
See examples below:
- matches(credit_description0, credit_description1) method: this method returns true if a credit with description credit_description0 is implied by a credit with description credit_description1. For example, extent lock L0 matches extent lock L1 if L0's extent is part of L1's extent and L0's lock mode is compatible with L1's lock mode. More generally, for lock-type resources, matching is the same as lock compatibility.
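For extent locks, the matches method could be sketched as below. The `ExtentLock` type and the `MODE_COVERED_BY` table are assumptions: the text leaves the mode relation to the resource type, and this sketch assumes that a read request is covered by either a read or a write lock, while a write request is covered only by a write lock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtentLock:
    """Hypothetical extent lock description over the half-open range [start, end)."""
    start: int
    end: int
    mode: str  # "read" or "write"


# Assumed mode relation: which owned modes cover a wanted mode.
MODE_COVERED_BY = {
    "read": {"read", "write"},
    "write": {"write"},
}


def matches(wanted: ExtentLock, owned: ExtentLock) -> bool:
    """Return True if `wanted` is implied by `owned`.

    The wanted extent must lie inside the owned extent, and the owned
    lock mode must cover the wanted mode.
    """
    inside = owned.start <= wanted.start and wanted.end <= owned.end
    return inside and owned.mode in MODE_COVERED_BY[wanted.mode]
```

A resource type with different lock modes would supply its own table or a different comparison entirely.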
credit_get(owner, credit_description)
- if matches(credit_description, owner.credit_description)