mctpd: Add support for endpoint recovery

DSP0236 v1.3.1[1] 8.17.6 defines a process for reclaiming dynamically assigned EIDs from endpoint devices that have been removed from the network. The fundamental process is captured in the following paragraphs:

> - The bus owner shall wait at least `Treclaim` seconds before reassigning a given EID (where `Treclaim` is specified in the physical transport binding specification for the medium used to access the endpoint).
>
> - Reclaimed EIDs shall only be reassigned after all unused EIDs in the EID pool have been assigned to endpoints. Optionally, additional robustness can be achieved if the bus owner maintains a short FIFO list of reclaimed EIDs (and their associated physical addresses) and allocates the older EIDs first.
>
> - A bus owner shall confirm that an endpoint has been removed by attempting to access it after `Treclaim` has expired. It can do this by issuing a `Get Endpoint ID` command to the endpoint to verify that the endpoint is still non-responsive. It is recommended that this be done at least three times, with a delay of at least `1/2 * Treclaim` between tries if possible. If the endpoint continues to be non-responsive, it can be assumed that it is safe to return its EID to the pool of EIDs available for assignment.

[1]: https://www.dmtf.org/sites/default/files/standards/documents/DSP0236_1.3.1.pdf

The extract is framed with the perspective of reclaiming an unused EID but we must consider the broader case of unresponsiveness - we don't yet understand whether an EID should even be reclaimed. However, we can use the mechanisms outlined - polling using `Get Endpoint ID` - to achieve this understanding.

We wish to avoid continuous polling of devices by `mctpd` for a few reasons:

1. Interleaving of commands to simple endpoints that may not cope
2. The increase in bus utilisation increases the probability of contention and loss of arbitration

Given this, daemons using MCTP as a transport need some way to request that mctpd start polling the device to recover it or reclaim its EID.

The strategy implemented adds a `.Recover` method and a `.Connectivity` property to the `au.com.CodeConstruct.MCTP.Endpoint` interface. `.Recover` takes no arguments, produces no result, and returns immediately. `.Connectivity` takes one of two values:

- `Available`
- `Degraded`

When `.Recover` is invoked `mctpd` transitions `.Connectivity` to `Degraded`, then queries the device with `Get Endpoint ID`. `.Recover` responds on D-Bus after `Get Endpoint ID` has been issued and before a response is received.

A valid response received from a `Get Endpoint ID` query transmitted inside `Treclaim` leads to `.Connectivity` transitioning to `Available`.

If a response is not received to a `Get Endpoint ID` query issued inside `Treclaim` then `mctpd` removes the endpoint's D-Bus object.

```mermaid
stateDiagram-v2
    [*] --> Available
    Available --> Degraded
    Degraded --> Available
    Degraded --> [*]
```

The impact of changes to `.Connectivity` can be divided across two classes of client application:

1. The application invoking `.Recover` on a given endpoint `A`
2. Applications communicating with `A` that have not entered a state where recovery was considered necessary

For an application to invoke `.Recover` it must already consider the endpoint unresponsive, therefore it seems reasonable to assume it won't continue communicating with the endpoint unless the recovery succeeds. As such there's no action required when `.Connectivity` transitions to `Degraded`. However, if recovery succeeds, the transition of `.Connectivity` from `Degraded` to `Available` provides the signal to restart communicating with the endpoint.

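The `.Connectivity` lifecycle described above can be illustrated with a small model. This is only a sketch of the documented transitions, not the `mctpd` implementation (which is written in C): the `Endpoint` class, its `recover` and `on_poll_result` methods, and the `published` flag are hypothetical names, and the `Treclaim` polling is collapsed into a single boolean outcome.

```python
from enum import Enum


class Connectivity(Enum):
    AVAILABLE = "Available"
    DEGRADED = "Degraded"


class Endpoint:
    """Toy model of one endpoint's D-Bus object (hypothetical, for illustration)."""

    def __init__(self):
        self.connectivity = Connectivity.AVAILABLE
        # False once the endpoint's D-Bus object has been removed.
        self.published = True

    def recover(self):
        """Model of `.Recover`: mark the endpoint Degraded and return at once.

        Polling with `Get Endpoint ID` is assumed to proceed asynchronously.
        Calling this while already Degraded changes nothing.
        """
        if self.published:
            self.connectivity = Connectivity.DEGRADED

    def on_poll_result(self, responded):
        """Apply the outcome of the `Get Endpoint ID` queries made inside Treclaim."""
        if not self.published or self.connectivity is not Connectivity.DEGRADED:
            return
        if responded:
            # A valid response returns the endpoint to service.
            self.connectivity = Connectivity.AVAILABLE
        else:
            # No response inside Treclaim: the D-Bus object is removed.
            self.published = False


ep = Endpoint()
ep.recover()
ep.on_poll_result(responded=True)
print(ep.connectivity.value)  # prints "Available"
```

In this model a second `recover()` call while `Degraded` is a no-op, which matches the expectation that redundant `.Recover` invocations have no significant consequences.
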
For applications in the second class it is likely the case that they haven't themselves invoked `.Recover` because they are yet to observe a communication failure with the endpoint. As such there's also no requirement for action when `.Connectivity` transitions to `Degraded` on behalf of another application. However, it may be the case that communication failures are subsequently observed. It's not necessary to invoke `.Recover` if `.Connectivity` is already in `Degraded`, though there should also be no significant consequences if it occurs. If no communication failures are observed while `.Connectivity` is in `Degraded` then there's also no action required if it transitions to `Available`.

Signed-off-by: Andrew Jeffery <[email protected]>