Linux Kernel Watchdog
The Linux kernel watchdog monitors whether a system is still running. Its purpose is to automatically reboot a system that has hung because of an unrecoverable software error. The watchdog module is specific to the hardware or chip being used. Personal computer users rarely need a watchdog because they can reset the system manually. It is useful, however, for mission-critical systems that must be able to reboot themselves without human intervention, such as servers at a remote location or embedded equipment on a spacecraft that needs automatic hardware reset capabilities.
Warning: Proceed with Caution
A misconfigured watchdog can cause problems such as:
- Endless reboot loop
- File corruption due to hard reset
- Unpredictable random reboots
For that reason, avoid testing the Linux kernel watchdog on live servers.
On the hardware side, the watchdog is a timer that expires after a predetermined period. The watchdog software periodically refreshes this timer. If the software stops refreshing it, the timer expires and performs a hardware reset of the device. For the watchdog timer to work, the motherboard manufacturer has to make use of the chip’s watchdog functionality. The manufacturer’s documentation is often unclear about whether this was done, in which case you have to test it yourself.
Also, you need the right watchdog kernel module to be loaded in your Linux system. Different chips use different modules. For example:
- Intel chipsets might use the “iTCO_wdt” module
- HP hardware might use “hpwdt”
- IBM mainframes might use “vmwatchdog”
- Xen VM might use “xen_wdt”
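One way to see which of these drivers is actually loaded is to filter the kernel's module list. The following is a minimal sketch, not a definitive check: the function name list_watchdog_modules and the name pattern are assumptions made for illustration, and a driver built directly into the kernel does not appear in the module list at all.

```shell
# List loaded kernel modules whose names look like watchdog drivers.
# Reads /proc/modules by default (the same data lsmod displays); the optional
# argument lets you point it at a saved copy instead.
list_watchdog_modules() {
    modules_file=${1:-/proc/modules}
    # The first field of each line is the module name.
    # The pattern is a guess at common naming, not an exhaustive list.
    awk '{ print $1 }' "$modules_file" | grep -E -i 'wdt|watchdog|softdog'
}
```

Calling list_watchdog_modules with no argument inspects the running system. No output and a non-zero exit status mean that no loaded module matched the pattern.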
After the module is loaded, check for /dev/watchdog on the Linux system. If this file is present, the watchdog kernel device driver or module has been loaded. Once the device is in use, software on the system has to keep writing to /dev/watchdog periodically, which is called “kicking” or “feeding” the watchdog. If the system fails to feed the watchdog, it is hard reset once the timeout expires.
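The check-and-feed cycle can be sketched in a few lines of shell. This is an illustration under stated assumptions, not a replacement for the watchdog daemon: the function names, the 10-second default interval and the fixed number of feeds are all hypothetical, and the interval must be shorter than the driver's timeout. The device stays open on one file descriptor for the whole loop, and the final V asks the driver to stop the timer, which only works if the kernel allows the watchdog to be stopped.

```shell
# Return success if the given path exists and is a character device.
watchdog_device_present() {
    [ -c "${1:-/dev/watchdog}" ]
}

# Feed the watchdog COUNT times, INTERVAL seconds apart, then ask it to stop.
# The device path is mandatory so the watchdog is never armed by accident.
feed_watchdog() {
    dev=${1:?usage: feed_watchdog DEVICE [INTERVAL] [COUNT]}
    interval=${2:-10}
    count=${3:-3}
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3        # any write refreshes the timer
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V' >&3            # magic character: request a clean stop
    } 3>"$dev"                    # fd 3 stays open for the whole group
}
```

On a test machine, as root and with the watchdog daemon stopped, feed_watchdog /dev/watchdog 10 6 would feed the device for about a minute and then release it. Try it on a regular file first to see what gets written.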
The watchdog daemon opens the device and provides the refreshes needed to keep the system from resetting. It can also test process table space, memory usage, file accessibility, system load, file table overflow, reachability of an IP address by ping, network interface traffic, temperature, running processes and more. If any of these tests fails, the daemon causes a shutdown.
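The daemon reads its checks from a configuration file, usually /etc/watchdog.conf. The sketch below writes a small sample file that enables a few of the tests mentioned here. The function name write_sample_watchdog_conf is hypothetical, and every value (the thresholds, the IP address, the interface name, the monitored file) is a placeholder to adapt. Check the watchdog.conf manual page for your version before relying on any option.

```shell
# Write a sample watchdog daemon configuration to the given path.
# All values below are placeholders, not recommendations.
write_sample_watchdog_conf() {
    conf=${1:?usage: write_sample_watchdog_conf PATH}
    cat > "$conf" <<'EOF'
# Device the daemon opens and keeps refreshing
watchdog-device = /dev/watchdog
# Seconds between two writes to the device
interval = 10
# Act if the 1-minute load average exceeds this value
max-load-1 = 24
# Minimum amount of free memory, in pages
min-memory = 1
# Address that must answer ping
ping = 192.168.1.1
# Interface that must show traffic
interface = eth0
# File that must remain accessible
file = /var/log/messages
EOF
}
```

Writing to a scratch path first, for example write_sample_watchdog_conf /tmp/watchdog.conf, lets you review the result before copying anything into the real configuration.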
Starting and Stopping Watchdog
The watchdog daemon should start at boot time and put itself in the background. You can check whether it is running by looking for it in the process list.
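A process-list check might look like the sketch below. It assumes the daemon's process name is exactly watchdog, which may differ on your distribution, and the helper name is_process_running is hypothetical. It matches the name exactly so that similarly named kernel threads are not counted.

```shell
# Return success if a process with exactly this command name is running.
is_process_running() {
    name=${1:-watchdog}
    if command -v pgrep >/dev/null 2>&1; then
        pgrep -x "$name" >/dev/null
    else
        # Fallback: scan the command names exposed under /proc.
        for f in /proc/[0-9]*/comm; do
            if [ "$(cat "$f" 2>/dev/null)" = "$name" ]; then
                return 0
            fi
        done
        return 1
    fi
}

if is_process_running watchdog; then
    echo "watchdog daemon is running"
else
    echo "watchdog daemon is not running"
fi
```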
If the kernel was not compiled with CONFIG_WATCHDOG_NOWAYOUT, closing /dev/watchdog properly does not cause a reboot. To close it properly, write the character V to /dev/watchdog and then close the file. This should stop the watchdog.
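Both steps can be sketched as small shell helpers. The function names are hypothetical, and the default config path /boot/config-$(uname -r) is an assumption that holds on many distributions but not all. Individual drivers may also expose their own nowayout setting, which this sketch does not inspect.

```shell
# Return success if the given kernel config file enables NOWAYOUT.
nowayout_enabled() {
    config=${1:-/boot/config-$(uname -r)}
    grep -q '^CONFIG_WATCHDOG_NOWAYOUT=y' "$config"
}

# Write the magic character V and close the device in one step.
# The device path is mandatory so nothing is opened by accident.
stop_watchdog() {
    dev=${1:?usage: stop_watchdog DEVICE}
    # The redirection opens the device, printf writes V, and the shell
    # closes the file as soon as the command finishes.
    printf 'V' > "$dev"
}
```

Because the device can normally be opened by only one process at a time, stop_watchdog /dev/watchdog is likely to fail with a busy error while the daemon still holds the device. Stop the daemon first.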
Testing the Watchdog
To test whether the hardware watchdog is working, open /dev/watchdog for writing from a root shell, for example by redirecting keyboard input into it.
Then press “Enter” twice and wait. The prompt will not come back. After a while, depending on your kernel’s settings, the system should perform a hard reboot.
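One way to run this test is sketched below. It assumes cat is used to copy keyboard input into the device; the function name starve_watchdog is hypothetical. Run it only on a machine you can afford to have reset.

```shell
# Arm the watchdog and then stop feeding it. DANGER: on a real watchdog
# device this is expected to end in a hard reset.
# The device path is mandatory so nothing is armed by accident.
starve_watchdog() {
    dev=${1:?usage: starve_watchdog DEVICE}
    # cat copies standard input to the device. Each Enter writes one newline,
    # and after that nothing refreshes the timer any more.
    cat >> "$dev"
}
```

From a root shell, starve_watchdog /dev/watchdog followed by two presses of Enter reproduces the test described here. Pointing the function at a regular file instead is a harmless way to see what gets written.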