mpvader edited this page Aug 17, 2018 · 35 revisions

The watchdog is used to ensure that the system never locks up.

Be aware that there are several things that could all be called a watchdog: the watchdog counter in the silicon/SoC, the watchdog-driver in the kernel, and the watchdog-process in userland.

On the boards hosting a nanopi there is also an external watchdog, added because a bad power supply can wedge the entire SoC, including its watchdog.

Details on how it is configured

  • Our kernels are configured with CONFIG_WATCHDOG_NOWAYOUT. This ensures that even accidentally stopping the watchdog-process will never lead to a locked-up system.
  • (in progress) During a reboot, the watchdog-process is stopped first, to make sure that no later step in the reboot process can lead to a system hang. https://github.com/victronenergy/venus/issues/312
  • The watchdog-process is stopped/killed under certain special conditions, to make sure the system resets (dbus daemon, anything else?)
  • The watchdog is started early, to watch over the system already during boot.
  • The watchdog-process is configured to
    • watch free memory as well as the average system load.
    • (attempt to) write the reason for stopping to a file, for VRM. See below.
    • (attempt to) write the process list to /data/log/watchdog_processlist.txt
    • (in progress) check that connectivity to the VRM servers is up. https://github.com/victronenergy/venus/issues/287
  • The watchdog-process will append a line to /data/wtmp in case of a repair.
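
The load and memory checks from the list above can be sketched roughly as follows. This is an illustration in Python, not the watchdog-process's actual code: the function names and both thresholds are hypothetical, and using MemFree as the memory measure is an assumption. The parsing follows the standard /proc/loadavg and /proc/meminfo formats, and the returned values reuse EMAXLOAD (253, from watch_err.h) and ENOMEM (errno 12 on Linux).

```python
EMAXLOAD = 253  # "load average too high", from watch_err.h
ENOMEM = 12     # standard Linux errno for "out of memory"


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(f) for f in fields[:3])


def parse_mem_free_kb(text):
    """Return the MemFree value (in kB) from /proc/meminfo text."""
    for line in text.splitlines():
        if line.startswith("MemFree:"):
            return int(line.split()[1])
    raise ValueError("MemFree not found")


def check_system(loadavg_text, meminfo_text, max_load1=20.0, min_free_kb=2048):
    """Return 0 if healthy, else the code of the first failed check.

    max_load1 and min_free_kb are hypothetical thresholds for this sketch.
    """
    load1, _, _ = parse_loadavg(loadavg_text)
    if load1 > max_load1:
        return EMAXLOAD
    if parse_mem_free_kb(meminfo_text) < min_free_kb:
        return ENOMEM
    return 0
```

Passing the file contents in as text keeps the checks easy to try out on a desktop machine with sample data.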

Files used for reporting

When logged in to a system that rebooted, you might well see this (TODO: update this once fixed; that 53 is wrong):

root@ccgx:~# cat /tmp/last_boot_type
-3
53
root@ccgx:~# cat /tmp/last_boot_type.orig 
30253

The process list written by the watchdog-process is in /data/log/watchdog_processlist.txt.

And then there is also /data/wtmp; see the man page.

Reporting to VRM

In the VRM Portal we want to store the reason for a (re)boot, for diagnostic purposes.

If the watchdog-process decided to reboot, we are interested in its reason, and otherwise we are interested in the data in the microprocessor register.

The code we want to go to VRM is selected here, and written to /tmp/last_boot_type.
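
As a rough illustration of that selection, a minimal sketch in Python follows. It is not the code linked above: the function name, the way the watchdog reason is passed in, and the use of OSError for a failed register read are all assumptions made for the example. Only the -1 code comes from the table further down.

```python
def select_boot_type(watchdog_reason, read_register):
    """Pick the code to report to VRM.

    watchdog_reason: code left by the watchdog-process, or None if it did
                     not trigger the reboot (how it is passed on is assumed).
    read_register:   callable returning the hardware boot type; assumed to
                     raise OSError when the register cannot be read.
    """
    if watchdog_reason is not None:
        # The watchdog-process decided to reboot: its reason wins.
        return watchdog_reason
    try:
        return read_register()
    except OSError:
        return -1  # "Reading watchdog register failed"
```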

Vrmlogger reads the code, sends it, and then writes -3 at the top of the file, meaning the tmp file has already been read. Seeing -3 on VRM usually means that vrmlogger has restarted, either on purpose or because it crashed. If it can't read the file, it reports -2 to VRM.
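
The read-then-mark behaviour can be sketched as below. This is a hedged illustration in Python, not vrmlogger's actual source: the function name is hypothetical, and treating an empty or non-numeric file as a read failure is an assumption.

```python
ALREADY_READ = -3  # "tmp file already read, vrmlogger restarted!?"
READ_FAILED = -2   # "Reading tmp file failed"


def consume_boot_type(path="/tmp/last_boot_type"):
    """Return the boot type code to send, then mark the file as read."""
    try:
        with open(path) as f:
            content = f.read()
        code = int(content.splitlines()[0])
    except (OSError, ValueError, IndexError):
        return READ_FAILED
    if code != ALREADY_READ:
        try:
            # Put -3 on top so a restarted reader reports -3, not the old code.
            with open(path, "w") as f:
                f.write(f"{ALREADY_READ}\n{content}")
        except OSError:
            pass  # marking is best effort in this sketch
    return code
```

Calling it twice on the same file returns the real code the first time and -3 the second time, which is the pattern that shows up on VRM after a vrmlogger restart.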

Boot type codes as sent to VRM

Software originated, hence available on all machines:

code | description | origin
--- | --- | ---
-3 | tmp file already read, vrmlogger restarted!? | vrmlogger
-2 | Reading tmp file failed | vrmlogger
-1 | Reading watchdog register failed | get_boot_type.c
29997 | Max load avg exceeded | watchdog-process
30012 | watchdog-ENOMEM | watchdog-process
30253 | watchdog-EMAXLOAD | watchdog-process

Hardware related:

code description ccgx beaglebone nanopi CANvu500
0 Old kernel, no boottype support x x x
1 Cold boot or reboot x
2 Unreproducible reset on CCGX x
3 Reset button x
4 Cold boot x
5 Reboot command x
17 Watchdog reboot x x

More details

From watch_err.h in the software watchdog project (note: 253 -> 30253):
#define EREBOOT		255	/* unconditional reboot (255 = -1 as unsigned 8-bit) */
#define ERESET		254	/* unconditional hard reset */
#define EMAXLOAD	253	/* load average too high */
#define ETOOHOT		252	/* too hot inside */
#define ENOLOAD		251	/* /proc/loadavg contains no data */
#define ENOCHANGE	250	/* file wasn't changed in the given interval */
#define EINVMEM		249	/* /proc/meminfo contains invalid data */
#define ECHKILL		248	/* child was killed by signal */
#define ETOOLONG	247	/* child didn't return in time */
#define EUSERVALUE	246	/* reserved for user error code */
#define EDONTKNOW	245	/* unknown, not "no error" (i.e. success) but implies test still running */
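
For reading codes back on the VRM side, a small decoder can be sketched like this. It is an illustration only: the function name is hypothetical, and the 30000 offset is an assumption inferred from the table entries 30253 (EMAXLOAD, 253) and 30012 (ENOMEM, errno 12), not something taken from the watchdog sources.

```python
import errno

WATCHDOG_OFFSET = 30000  # assumption, inferred from 253 -> 30253

# Names copied from the watch_err.h excerpt.
WATCH_ERR = {
    255: "EREBOOT", 254: "ERESET", 253: "EMAXLOAD", 252: "ETOOHOT",
    251: "ENOLOAD", 250: "ENOCHANGE", 249: "EINVMEM", 248: "ECHKILL",
    247: "ETOOLONG", 246: "EUSERVALUE", 245: "EDONTKNOW",
}


def describe_boot_type(code):
    """Return a readable name for a boot type code as sent to VRM."""
    if code < WATCHDOG_OFFSET:
        return str(code)  # hardware or vrmlogger code, left unchanged here
    reason = code - WATCHDOG_OFFSET
    if reason in WATCH_ERR:
        name = WATCH_ERR[reason]
    else:
        # Fall back to the platform's errno names, e.g. 12 -> ENOMEM.
        name = errno.errorcode.get(reason, str(reason))
    return "watchdog-" + name
```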

Getting a list from the database:

select valueEnum, nameEnum, count(*) as nr_of_sites,
       min(l.secondsAgo) / 60 as min_minutes_ago,
       max(l.secondsAgo) / 60 as max_minutes_ago
from vwLastLogData l
where idDataAttribute = 237
group by valueEnum;