mctpd: Add support for endpoint recovery
DSP0236 v1.3.1[1] 8.17.6 defines a process for reclaiming dynamically
assigned EIDs from endpoint devices that have been removed from the
network. The fundamental process is captured in the following paragraphs:

> - The bus owner shall wait at least `Treclaim` seconds before
>   reassigning a given EID (where `Treclaim` is specified in the
>   physical transport binding specification for the medium used to
>   access the endpoint).
>
> - Reclaimed EIDs shall only be reassigned after all unused EIDs in the
>   EID pool have been assigned to endpoints. Optionally, additional
>   robustness can be achieved if the bus owner maintains a short FIFO
>   list of reclaimed EIDs (and their associated physical addresses) and
>   allocates the older EIDs first.
>
> - A bus owner shall confirm that an endpoint has been removed by
>   attempting to access it after `Treclaim` has expired. It can do this
>   by issuing a `Get Endpoint ID` command to the endpoint to verify
>   that the endpoint is still non-responsive. It is recommended that
>   this be done at least three times, with a delay of at least `1/2 *
>   Treclaim` between tries if possible. If the endpoint continues to be
>   non-responsive, it can be assumed that it is safe to return its EID
>   to the pool of EIDs available for assignment.

[1]: https://www.dmtf.org/sites/default/files/standards/documents/DSP0236_1.3.1.pdf
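
The allocation order in the second quoted point can be sketched roughly as
follows. This is only an illustration of the spec text, not code from
`mctpd`: the `EidPool` class and its method names are hypothetical, and the
`Treclaim` wait is not modelled.

```python
from collections import deque


class EidPool:
    """Hypothetical allocator: unused EIDs first, then reclaimed, oldest first."""

    def __init__(self, eids):
        self.unused = deque(eids)  # EIDs that have never been assigned
        self.reclaimed = deque()   # FIFO of (eid, physical_address) pairs

    def reclaim(self, eid, phys_addr):
        # Remember the EID along with the physical address it was bound to.
        self.reclaimed.append((eid, phys_addr))

    def allocate(self):
        # Reclaimed EIDs are handed out only once the unused pool is empty.
        if self.unused:
            return self.unused.popleft()
        if self.reclaimed:
            eid, _phys_addr = self.reclaimed.popleft()
            return eid
        raise LookupError("EID pool exhausted")
```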

The extract is framed from the perspective of reclaiming an unused EID,
but we must consider the broader case of unresponsiveness: when an
endpoint stops responding we don't yet know whether its EID should be
reclaimed at all. However, we can use the mechanism outlined, polling
with `Get Endpoint ID`, to find out.
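
As a rough illustration of the polling procedure the extract describes
(wait for `Treclaim`, then query up to three times with at least
`Treclaim / 2` between tries), consider the sketch below. It is not the
`mctpd` implementation: `confirm_removed`, the `get_endpoint_id` callable
and the injectable `sleep` are assumptions made for the example.

```python
import time


def confirm_removed(get_endpoint_id, t_reclaim, tries=3, sleep=time.sleep):
    """Return True if the endpoint stayed silent for every query.

    get_endpoint_id is a hypothetical callable that returns True when the
    endpoint answers a Get Endpoint ID query and False when it times out.
    """
    # The spec only allows confirmation once Treclaim has expired.
    sleep(t_reclaim)
    for attempt in range(tries):
        if get_endpoint_id():
            return False  # the endpoint responded, so its EID is still in use
        if attempt < tries - 1:
            # Space the retries by at least half of Treclaim.
            sleep(t_reclaim / 2)
    return True
```

Passing `sleep` as a parameter keeps the sketch testable without real
delays; a daemon would use timers on its event loop rather than blocking.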

We wish to avoid continuous polling of devices by `mctpd` for two
reasons:

1. Polling interleaves extra commands with other traffic to simple
   endpoints that may not cope
2. The increase in bus utilisation increases the probability of
   contention and loss of arbitration

Given this, daemons using MCTP as a transport need some way to request
that mctpd start polling the device to recover it or reclaim its EID.

The strategy implemented adds a `.Recover` method and a `.Connectivity`
property to the `au.com.CodeConstruct.MCTP.Endpoint` interface.
`.Recover` takes no arguments, produces no result, and returns
immediately. `.Connectivity` takes one of two values:

- `Available`
- `Degraded`

When `.Recover` is invoked, `mctpd` transitions `.Connectivity` to
`Degraded`, then queries the device with `Get Endpoint ID`. `.Recover`
responds on D-Bus after `Get Endpoint ID` has been issued but before a
response is received. A valid response to a `Get Endpoint ID` query
transmitted inside `Treclaim` transitions `.Connectivity` back to
`Available`. If no response is received to a `Get Endpoint ID` query
issued inside `Treclaim`, `mctpd` removes the endpoint's D-Bus object.

```mermaid
stateDiagram-v2
  [*] --> Available
  Available --> Degraded
  Degraded --> Available
  Degraded --> [*]
```
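
The transitions in the diagram can be modelled with a small state machine.
This is a minimal sketch for illustration only: the `Endpoint` class, its
method names and the `published` flag are hypothetical and do not mirror
the C implementation in `mctpd`.

```python
from enum import Enum


class Connectivity(Enum):
    AVAILABLE = "Available"
    DEGRADED = "Degraded"


class Endpoint:
    """Hypothetical model of one endpoint's D-Bus object."""

    def __init__(self):
        self.connectivity = Connectivity.AVAILABLE
        self.published = True  # whether the D-Bus object still exists

    def recover(self):
        # .Recover: mark the endpoint Degraded; polling starts from here.
        if self.published:
            self.connectivity = Connectivity.DEGRADED

    def on_response(self):
        # A valid Get Endpoint ID response arrived inside Treclaim.
        if self.published and self.connectivity is Connectivity.DEGRADED:
            self.connectivity = Connectivity.AVAILABLE

    def on_reclaim_expired(self):
        # No response inside Treclaim: the D-Bus object is removed.
        if self.connectivity is Connectivity.DEGRADED:
            self.published = False
```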

The impact of changes to `.Connectivity` can be divided across two
classes of client application:

1. The application invoking `.Recover` on a given endpoint `A`
2. Applications communicating with `A` that have not entered a state
   where recovery was considered necessary

For an application to invoke `.Recover` it must already consider the
endpoint unresponsive, therefore it seems reasonable to assume it won't
continue communicating with the endpoint unless the recovery succeeds.
As such there's no action required when `.Connectivity` transitions to
`Degraded`. However, if recovery succeeds, the transition of
`.Connectivity` from `Degraded` to `Available` provides the signal to
restart communicating with the endpoint.

For applications in the second class it is likely the case that they
haven't themselves invoked `.Recover` because they are yet to observe a
communication failure with the endpoint. As such there's also no
requirement for action when `.Connectivity` transitions to `Degraded` on
behalf of another application. However, it may be the case that
communication failures are subsequently observed. It's not necessary to
invoke `.Recover` if `.Connectivity` is already `Degraded`, though
there should be no significant consequences if it is invoked anyway. If
no communication failures are observed while `.Connectivity` is
`Degraded` then there's also no action required if it transitions to
`Available`.
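
The client-side reasoning above reduces to a small decision rule, sketched
here under stated assumptions: `client_action` and its return values are
hypothetical names for illustration, and `saw_failure` stands for whether
this particular client has itself observed a communication failure with
the endpoint.

```python
def client_action(old, new, saw_failure):
    """Suggest what a client does when .Connectivity changes.

    old and new are "Available" or "Degraded"; saw_failure records whether
    this client has observed a communication failure with the endpoint.
    """
    if old == "Degraded" and new == "Available" and saw_failure:
        return "resume"  # recovery succeeded: restart communication
    return "none"  # every other transition requires no action
```

Removal of the endpoint's D-Bus object after a failed recovery is a
separate event and is not covered by this sketch.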

Signed-off-by: Andrew Jeffery <[email protected]>
amboar committed Jan 10, 2024
1 parent 6aa4b05 commit 7ec2f8d
Showing 3 changed files with 714 additions and 0 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## Unreleased

### Added

1. mctpd: Add support for endpoint recovery

### Changed

1. dbus interface: the NetworkID field is now a `u` rather than an `i`, to