watchdog
The watchdog is used to ensure that the system never locks up.
- Our kernels are configured with CONFIG_WATCHDOG_NOWAYOUT. Normally, stopping the watchdog process also stops the hardware timer, which disables the watchdog function. With this option the hardware timer keeps running, so even accidentally stopping the watchdog process will never lead to a locked-up system.
- (in progress) During a reboot, the watchdog process is stopped first, to make sure that no later step in the reboot process can lead to a system hang. https://github.com/victronenergy/venus/issues/312
- The watchdog is stopped under certain special conditions, to make sure the system resets (dbus daemon, anything else?)
- The watchdog is started early, to also watch the system during boot.
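The effect of CONFIG_WATCHDOG_NOWAYOUT can be illustrated with the standard Linux watchdog device interface. The following is a minimal sketch, not the code of the watchdog process itself: the function names are hypothetical, and it assumes a driver that supports the "magic close" feature. Any write to the device restarts the timer, and writing the character 'V' just before close() normally asks the driver to disable the timer; with CONFIG_WATCHDOG_NOWAYOUT that request is ignored.

```c
#include <unistd.h>

/* Pet the watchdog: any write to the device restarts the hardware timer. */
int wd_keepalive(int fd)
{
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/*
 * Ask for a clean stop: the "magic close" sequence is the character 'V'
 * followed by close(). With CONFIG_WATCHDOG_NOWAYOUT the kernel ignores
 * this request and the timer keeps running.
 * Returns 0 if both the write and the close succeeded, -1 otherwise.
 */
int wd_magic_close(int fd)
{
    int ok = write(fd, "V", 1) == 1;

    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would open the watchdog device (typically /dev/watchdog) for writing and call wd_keepalive() at an interval shorter than the hardware timeout.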
It is configured to:
- watch free memory as well as the average system load.
- (attempt to) write the reason for stopping to a file, for VRM. See below.
- (attempt to) write the process list to /data/log/watchdog_processlist.txt
- (in progress) check that connectivity to the VRM servers is up. https://github.com/victronenergy/venus/issues/287
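To make the load check concrete, here is a minimal sketch of how a 1-minute load average could be tested against a limit. It is an illustration only and not the watchdog project's implementation: check_load and max_load1 are hypothetical names, and the threshold used in the usage note is an arbitrary example, not the value configured on Venus. The two return codes are the ones listed from watch_err.h further down.

```c
#include <stdio.h>

#define EMAXLOAD 253 /* load average too high (value from watch_err.h) */
#define ENOLOAD  251 /* /proc/loadavg contains no data (value from watch_err.h) */

/*
 * Check the 1-minute load average in a line read from /proc/loadavg,
 * for example "0.42 0.36 0.30 1/123 4567".
 * max_load1 is a hypothetical threshold chosen by the caller.
 * Returns 0 when the load is acceptable, otherwise an error code.
 */
int check_load(const char *loadavg_line, double max_load1)
{
    double load1;

    /* An empty or unparsable line counts as "no data". */
    if (loadavg_line == NULL || sscanf(loadavg_line, "%lf", &load1) != 1)
        return ENOLOAD;
    return load1 > max_load1 ? EMAXLOAD : 0;
}
```

With an example limit of 6.0, the line "7.50 3.00 1.00 2/99 100" yields EMAXLOAD, while an empty line yields ENOLOAD.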
In the VRM Portal we want to store the reason for a (re)boot, for diagnostic purposes.
Be aware that there are two levels that can generate the reason for a reset. First there is watchdog, the userland process, and then there is the watchdog unit in the microprocessor, controlled by the kernel driver.
If the watchdog process decided to reboot, we are interested in its reason; otherwise we are interested in the data in the microprocessor register.
The code we want to send to VRM is selected here, and written to /tmp/last_boot_type.
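The selection rule described above can be sketched as follows. This is not the actual Venus code: the function names are hypothetical, a watchdog reason of 0 is assumed to mean "the watchdog process did not trigger the reboot", and the file is assumed to hold the code as decimal text.

```c
#include <stdio.h>

/*
 * Pick the code to report: the watchdog process's reason when it triggered
 * the reboot, otherwise the reset cause from the microprocessor register.
 * Assumption: watchdog_reason == 0 means the process did not trigger it.
 */
int select_boot_type(int watchdog_reason, int hw_reset_cause)
{
    return watchdog_reason != 0 ? watchdog_reason : hw_reset_cause;
}

/* Write the selected code as decimal text to path, e.g. /tmp/last_boot_type. */
int write_boot_type(const char *path, int code)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%d\n", code) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, select_boot_type(253, 1) returns 253 (the watchdog's own reason wins), while select_boot_type(0, 1) falls back to the hardware value 1.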
Vrmlogger reads the code, sends it, and then writes -3 to the file, meaning "tmp file already read". Seeing -3 on VRM usually means that vrmlogger has restarted, either on purpose or because it crashed. If it can't read the file, it reports -2.
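The read-then-mark behaviour can be sketched like this. Vrmlogger's own code is not reproduced here; this is only a C illustration of the -3 and -2 conventions described above, with hypothetical names and the assumption that the file holds the code as decimal text.

```c
#include <stdio.h>

#define BOOT_TYPE_UNREADABLE   (-2) /* file could not be read */
#define BOOT_TYPE_ALREADY_READ (-3) /* marker left behind after reading */

/*
 * Read the boot type code from path, then overwrite the file with -3 so
 * that a later read (for example after a restart of the reader) sees the
 * "already read" marker instead of the original code.
 */
int read_and_mark_boot_type(const char *path)
{
    int code;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return BOOT_TYPE_UNREADABLE;
    if (fscanf(f, "%d", &code) != 1) {
        fclose(f);
        return BOOT_TYPE_UNREADABLE;
    }
    fclose(f);

    /* Best effort: if the marker cannot be written, still return the code. */
    f = fopen(path, "w");
    if (f != NULL) {
        fprintf(f, "%d\n", BOOT_TYPE_ALREADY_READ);
        fclose(f);
    }
    return code;
}
```

Calling the function twice on a file containing 253 returns 253 the first time and -3 the second time, which matches the restart case described above.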
From watch_err.h in the software watchdog project (note, 255 -> 30253
#define EREBOOT 255 /* unconditional reboot (255 = -1 as unsigned 8-bit) */
#define ERESET 254 /* unconditional hard reset */
#define EMAXLOAD 253 /* load average too high */
#define ETOOHOT 252 /* too hot inside */
#define ENOLOAD 251 /* /proc/loadavg contains no data */
#define ENOCHANGE 250 /* file wasn't changed in the given interval */
#define EINVMEM 249 /* /proc/meminfo contains invalid data */
#define ECHKILL 248 /* child was killed by signal */
#define ETOOLONG 247 /* child didn't return in time */
#define EUSERVALUE 246 /* reserved for user error code */
#define EDONTKNOW 245 /* unknown, not "no error" (i.e. success) but implies test still running */
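The comment on EREBOOT ("255 = -1 as unsigned 8-bit") refers to ordinary modulo-256 truncation. A small sketch of that arithmetic, using a hypothetical helper name:

```c
#include <stdint.h>

/* Truncate a code to 8 bits: negative values wrap around modulo 256. */
unsigned int as_uint8(int code)
{
    return (uint8_t)code;
}
```

Here as_uint8(-1) gives 255 (EREBOOT) and as_uint8(-2) gives 254 (ERESET), while values already in the range 0 to 255 are unchanged.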
and some extra info from errorcodes.c:
case ENOERR: str = "no error"; break;
case EREBOOT: str = "unconditional reboot requested"; break;
case ERESET: str = "unconditional hard reset requested"; break;
case EMAXLOAD: str = "load average too high"; break;
case ETOOHOT: str = "too hot"; break;
case ENOLOAD: str = "loadavg contains no data"; break;
case ENOCHANGE: str = "file was not changed in the given interval"; break;
case EINVMEM: str = "meminfo contains invalid data"; break;
case ECHKILL: str = "child process was killed by signal"; break;
case ETOOLONG: str = "child process did not return in time"; break;
case EUSERVALUE: str = "user-reserved code"; break;
case EDONTKNOW: str = "unknown (neither good nor bad)"; break;
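For reference, the two excerpts can be combined into one self-contained lookup function. This is a sketch assembled from the lines quoted above, not the actual errorcodes.c: the function name wd_reason_str and the fallback string are hypothetical, and ENOERR is assumed to be 0 because its definition is not part of the excerpt.

```c
#define ENOERR     0   /* assumption: not shown in the excerpt above */
#define EREBOOT    255
#define ERESET     254
#define EMAXLOAD   253
#define ETOOHOT    252
#define ENOLOAD    251
#define ENOCHANGE  250
#define EINVMEM    249
#define ECHKILL    248
#define ETOOLONG   247
#define EUSERVALUE 246
#define EDONTKNOW  245

/* Map a watchdog reason code to the description used in errorcodes.c. */
const char *wd_reason_str(int code)
{
    switch (code) {
    case ENOERR:     return "no error";
    case EREBOOT:    return "unconditional reboot requested";
    case ERESET:     return "unconditional hard reset requested";
    case EMAXLOAD:   return "load average too high";
    case ETOOHOT:    return "too hot";
    case ENOLOAD:    return "loadavg contains no data";
    case ENOCHANGE:  return "file was not changed in the given interval";
    case EINVMEM:    return "meminfo contains invalid data";
    case ECHKILL:    return "child process was killed by signal";
    case ETOOLONG:   return "child process did not return in time";
    case EUSERVALUE: return "user-reserved code";
    case EDONTKNOW:  return "unknown (neither good nor bad)";
    default:         return "unrecognized code"; /* hypothetical fallback */
    }
}
```

A diagnostics tool could call wd_reason_str(253) to turn a stored code into "load average too high".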
Getting a list from the database:
select nameEnum, count(*) as nr_of_sites, min(l.secondsAgo), max(l.secondsAgo) from vwLastLogData l where idDataAttribute = 237 group by nameEnum ;