Currently we monitor agent instance health through each process's "OpCode", which mostly boils down to whether the process is running or not. (See enum.) This value is included in the agent heartbeat, and the registry monitors it and streams it to a feed.
In the model where processes are long-lived and deal with hardware problems by simply attempting reconnection, whether the process is running or not is not sufficiently informative. Such a process could record its status in session.data somehow ... but for generic propagation of that information we should standardize how that is done and how to get that information out.
Continuing from the discussion on the dev call yesterday, a proposal is:
- Add "DEGRADED" to the OpCode enum.
- Standardize on session.data["degraded_at"] = unix_timestamp to mark running sessions as degraded.
- In OpSession, add a function "set_degraded(degraded [bool])" to mark / clear the degraded state (which just updates session.data).
- In the OpSession.op_code property, if status == "running" and data["degraded_at"] > 0, return the value "DEGRADED".
Individual agents and processes will need to manually implement the use of degraded, if it is applicable to how they deal with errors. Alarms configured for such agents will need to be updated to map the "degraded" state to be as bad as "not running".
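To make the proposal concrete, here is a minimal sketch of how the pieces could fit together. It is a toy stand-in, not the real OpSession: the enum members and their values, the string statuses, and the choice to keep the first timestamp on repeated calls are all assumptions for illustration.

```python
import enum
import time


class OpCode(enum.Enum):
    # Simplified stand-in enum; member names and values are assumptions.
    STARTING = 1
    RUNNING = 2
    DEGRADED = 3
    STOPPED = 4


class OpSession:
    """Toy session holding only the pieces the proposal touches."""

    def __init__(self):
        self.status = "starting"
        self.data = {}

    def set_degraded(self, degraded):
        """Mark (True) or clear (False) the degraded state via session.data."""
        if degraded:
            # Assumption: keep the original timestamp if already degraded.
            self.data.setdefault("degraded_at", time.time())
        else:
            self.data.pop("degraded_at", None)

    @property
    def op_code(self):
        if self.status == "running":
            # A running session with a positive degraded_at reports DEGRADED.
            if self.data.get("degraded_at", 0) > 0:
                return OpCode.DEGRADED
            return OpCode.RUNNING
        if self.status == "starting":
            return OpCode.STARTING
        return OpCode.STOPPED
```

A process that loses its hardware connection would call session.set_degraded(True) when reconnection starts failing and session.set_degraded(False) once it recovers; anything reading op_code then sees DEGRADED or RUNNING without inspecting session.data itself.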
While we're in there, might be nice to have processes automagically transition out of "starting" state when the process code is run, rather than relying on agent code to set_status('running') manually.
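One possible shape for that automatic transition, sketched with a toy session and a hypothetical launch() helper (neither is the real framework code, and the final "done" status is just a placeholder):

```python
class Session:
    """Toy stand-in for a session object; not the real OpSession."""

    def __init__(self):
        self.status = "starting"

    def set_status(self, status):
        self.status = status


def launch(session, process_func, *args, **kwargs):
    """Hypothetical launcher that moves 'starting' to 'running' before the process code runs."""
    if session.status == "starting":
        session.set_status("running")
    try:
        return process_func(session, *args, **kwargs)
    finally:
        # Placeholder terminal status for the sketch.
        session.set_status("done")


def acq(session):
    # The process body no longer calls set_status('running') itself.
    return session.status
```

With this arrangement the agent's process function only needs to report exceptional states, such as degraded, and the launcher owns the routine transitions.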