How to create a watchdog for a systemd service

Create a software watchdog for a systemd service.

I will create a software watchdog for the Redis service that checks for the Redis socket file and reboots the container if the socket goes missing.

Install Python bindings for systemd to get logging functionality.

$ sudo apt install python3-systemd

Ensure that the Redis service is running.

$ sudo systemctl status redis
● redis-server.service - Advanced key-value store
     Loaded: loaded (/lib/systemd/system/redis-server.service; enabled; vendor >
     Active: active (running) since Sun 2021-09-12 12:00:53 UTC; 43s ago
       Docs: http://redis.io/documentation,
             man:redis-server(1)
   Main PID: 984 (redis-server)
     Status: "Ready to accept connections"
      Tasks: 5 (limit: 2311)
     Memory: 9.8M
        CPU: 143ms
     CGroup: /system.slice/redis-server.service
             └─984 /usr/bin/redis-server 127.0.0.1:6379

Sep 12 12:00:53 bullseye systemd[1]: Starting Advanced key-value store...
Sep 12 12:00:53 bullseye systemd[1]: Started Advanced key-value store.

Ensure that the Redis socket exists, as it will be used to monitor the service.

$ ls -l /var/run/redis/redis-server.sock 
srwx------ 1 redis redis 0 Sep 12 19:42 /var/run/redis/redis-server.sock
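
The same check can be scripted. This is a minimal sketch, assuming the socket path from the listing above; `is_unix_socket` is a hypothetical helper name and is not part of the watchdog script. It relies on `stat.S_ISSOCK`, because a Unix socket is not a regular file and `os.path.isfile` returns False for it.

```python
import os
import stat


def is_unix_socket(path):
    """Return True if path exists and is a Unix domain socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path or insufficient permissions to stat it
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    # Path taken from the listing above
    print(is_unix_socket("/var/run/redis/redis-server.sock"))
```

The socket directory may be readable only by the redis user, so run the check with sufficient privileges.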

Create a monitoring script.

$ cat << 'EOF' | sudo tee /usr/local/sbin/redis-watchdog.py
#!/usr/bin/env python3

"""
Naive Redis watchdog service
"""

import os
import socket
import time
import sys
import signal
from systemd import journal

def log(message):
    """Log message to journal"""
    journal.send(message)


def get_notify_socket():
    """Connect to notify socket"""
    assert os.environ.get("NOTIFY_SOCKET", None) is not None, "NOTIFY_SOCKET needs to be defined"
    internal_notify_socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM | socket.SOCK_CLOEXEC)
    internal_notify_socket.connect(os.environ.get("NOTIFY_SOCKET", None))
    return internal_notify_socket


def get_watchdog_sec():
    """Get WATCHDOG_USEC, convert it to seconds and get ~0.75 of it (update watchdog every 3s if WATCHDOG_USEC is 5s)"""
    assert os.environ.get("WATCHDOG_USEC", None) is not None, "WATCHDOG_USEC needs to be defined"
    watchdog_update = int(int(os.environ.get("WATCHDOG_USEC", None)) / 1000000 * 0.75)
    # Never return less than 1 second
    return max(watchdog_update, 1)


def send_socket_message(message):
    """Send message over socket"""
    notify_socket = get_notify_socket()
    return notify_socket.send(message)


def socket_notify_ready():
    """Mark service as ready"""
    send_socket_message(b"STATUS=Ready")
    send_socket_message(b"READY=1")


def socket_notify_stop(*args):
    """Mark service as stopping and exit"""
    send_socket_message(b"STATUS=Stopping")
    send_socket_message(b"STOPPING=1")
    sys.exit(0)


def socket_notify_running():
    """Mark service as running"""
    send_socket_message(("STATUS=Running, checking every %is" % get_watchdog_sec()).encode('utf8'))


def watchdog_continue():
    """Update watchdog timestamp"""
    send_socket_message(b"WATCHDOG=1")


def watchdog_skip():
    log("Watchdog timestamp not updated.")


def check_condition():
    """Check condition - check if there is a redis socket"""
    # os.path.isfile() is False for sockets (they are not regular files), so use exists()
    return os.path.exists("/var/run/redis/redis-server.sock")


def wait():
    watchdog_sec = get_watchdog_sec()
    time.sleep(watchdog_sec)


if __name__ == "__main__":
    for send_signal in (signal.SIGINT, signal.SIGABRT, signal.SIGTERM):
        signal.signal(send_signal, socket_notify_stop)
    socket_notify_ready()

    socket_notify_running()
    while True:
        if check_condition():
            watchdog_continue()
        else:
            watchdog_skip()
        wait()
EOF

Ensure that proper permissions and ownership are applied.

$ sudo chmod 750 /usr/local/sbin/redis-watchdog.py
$ sudo chown root:root /usr/local/sbin/redis-watchdog.py

This script is designed to run as a systemd service, so it fails when started directly because NOTIFY_SOCKET is not set.

$ sudo python3 /usr/local/sbin/redis-watchdog.py
Traceback (most recent call last):
  File "/usr/local/sbin/redis-watchdog.py", line 85, in <module>
    socket_notify_ready()
  File "/usr/local/sbin/redis-watchdog.py", line 45, in socket_notify_ready
    send_socket_message(b"STATUS=Ready")
  File "/usr/local/sbin/redis-watchdog.py", line 39, in send_socket_message
    notify_socket = get_notify_socket()
  File "/usr/local/sbin/redis-watchdog.py", line 21, in get_notify_socket
    assert os.environ.get("NOTIFY_SOCKET", None) is not None, "NOTIFY_SOCKET needs to be defined"
AssertionError: NOTIFY_SOCKET needs to be defined
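
To try the notification code without systemd, a stand-in for the notify socket can be created. This is a minimal sketch, assuming a Linux host; `fake_notify_socket` and `notify` are hypothetical names, and the listener only mimics the receiving end of the datagram socket that systemd exposes through NOTIFY_SOCKET.

```python
import os
import socket
import tempfile


def fake_notify_socket(directory):
    """Bind a datagram socket that plays the role of systemd's notify socket."""
    path = os.path.join(directory, "notify.sock")
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    listener.bind(path)
    listener.settimeout(2)  # do not block forever if nothing arrives
    return listener, path


def notify(message):
    """Send one notification datagram to the socket named in NOTIFY_SOCKET."""
    address = os.environ["NOTIFY_SOCKET"]
    if address.startswith("@"):
        # A leading "@" denotes an abstract namespace socket
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as client:
        client.connect(address)
        client.send(message)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        listener, path = fake_notify_socket(tmp)
        os.environ["NOTIFY_SOCKET"] = path
        for message in (b"READY=1", b"WATCHDOG=1"):
            notify(message)
            print(listener.recv(1024))
        listener.close()
```

This only shows that the messages reach a listener. Nothing here enforces a timeout, which remains systemd's job.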

Create the systemd service. Note that the watchdog interval is set to 5 seconds, and that 5 failures within 2 minutes will result in a container reboot.

$ cat <<EOF | sudo tee /etc/systemd/system/redis-watchdog.service
[Unit]
Description=Naive Redis watchdog service
StartLimitIntervalSec=2min
StartLimitBurst=5
StartLimitAction=reboot

[Service]
Type=notify
ExecStart=/usr/local/sbin/redis-watchdog.py

Restart=always
RestartSec=1
WatchdogSec=5

[Install]
WantedBy=multi-user.target
EOF
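
The timing implied by this unit can be worked out with a little arithmetic. This is a rough sketch, assuming the 0.75 factor used by the monitoring script and that each failed cycle lasts about WatchdogSec plus RestartSec. The function names are hypothetical, and real timings vary by a second or so.

```python
def keepalive_interval(watchdog_usec, factor=0.75):
    """Seconds between WATCHDOG=1 messages, mirroring get_watchdog_sec()."""
    return max(int(watchdog_usec / 1_000_000 * factor), 1)


def seconds_until_start_limit(watchdog_sec, restart_sec, burst):
    """Approximate time for `burst` consecutive watchdog failures to occur."""
    return burst * (watchdog_sec + restart_sec)


if __name__ == "__main__":
    # WatchdogSec=5 is exported to the service as WATCHDOG_USEC=5000000
    print(keepalive_interval(5_000_000))        # 3
    # WatchdogSec=5, RestartSec=1, StartLimitBurst=5
    print(seconds_until_start_limit(5, 1, 5))   # 30
```

About thirty seconds is well inside the 2-minute window, so a persistently missing socket reaches the start limit quickly.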

Create a timer to start the watchdog 1 minute after boot.

$ cat <<EOF | sudo tee /etc/systemd/system/redis-watchdog.timer
[Unit]
Description=timer for naive Redis watchdog service

[Timer]
OnBootSec=1min


[Install]
WantedBy=timers.target
EOF

Reload systemd configuration.

$ sudo systemctl daemon-reload

Enable the service timer.

$ sudo systemctl enable redis-watchdog.timer 

Start watchdog now.

$ sudo systemctl start redis-watchdog.service

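Sample service logs from an emergency situation: the socket goes missing, the watchdog times out repeatedly, and the container reboots, which closes the SSH session.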

$ sudo journalctl -u redis-watchdog -f 
-- Boot dacd0c7096a24504bca6ab54e9f4a6c3 --
Sep 12 20:59:54 bullseye systemd[1]: Starting Naive Redis watchdog service...
Sep 12 20:59:54 bullseye systemd[1]: Started Naive Redis watchdog service.
Sep 12 21:00:06 bullseye python3[567]: Watchdog timestamp not updated.
Sep 12 21:00:09 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!
Sep 12 21:00:09 bullseye systemd[1]: redis-watchdog.service: Killing process 567 (python3) with signal SIGABRT.
Sep 12 21:00:09 bullseye systemd[1]: redis-watchdog.service: Failed with result 'watchdog'.
Sep 12 21:00:10 bullseye systemd[1]: redis-watchdog.service: Scheduled restart job, restart counter is at 1.
Sep 12 21:00:10 bullseye systemd[1]: Stopped Naive Redis watchdog service.
Sep 12 21:00:10 bullseye systemd[1]: Starting Naive Redis watchdog service...
Sep 12 21:00:10 bullseye systemd[1]: Started Naive Redis watchdog service.
Sep 12 21:00:10 bullseye python3[606]: Watchdog timestamp not updated.
Sep 12 21:00:13 bullseye python3[606]: Watchdog timestamp not updated.
Sep 12 21:00:15 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!
Sep 12 21:00:15 bullseye systemd[1]: redis-watchdog.service: Killing process 606 (python3) with signal SIGABRT.
Sep 12 21:00:15 bullseye systemd[1]: redis-watchdog.service: Failed with result 'watchdog'.
Sep 12 21:00:16 bullseye systemd[1]: redis-watchdog.service: Scheduled restart job, restart counter is at 2.
Sep 12 21:00:16 bullseye systemd[1]: Stopped Naive Redis watchdog service.
Sep 12 21:00:16 bullseye systemd[1]: Starting Naive Redis watchdog service...
Sep 12 21:00:16 bullseye systemd[1]: Started Naive Redis watchdog service.
Sep 12 21:00:16 bullseye python3[608]: Watchdog timestamp not updated.
Sep 12 21:00:19 bullseye python3[608]: Watchdog timestamp not updated.
Sep 12 21:00:22 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!
Sep 12 21:00:22 bullseye systemd[1]: redis-watchdog.service: Killing process 608 (python3) with signal SIGABRT.
Sep 12 21:00:22 bullseye systemd[1]: redis-watchdog.service: Failed with result 'watchdog'.
Sep 12 21:00:23 bullseye systemd[1]: redis-watchdog.service: Scheduled restart job, restart counter is at 3.
Sep 12 21:00:23 bullseye systemd[1]: Stopped Naive Redis watchdog service.
Sep 12 21:00:23 bullseye systemd[1]: Starting Naive Redis watchdog service...
Sep 12 21:00:23 bullseye systemd[1]: Started Naive Redis watchdog service.
Sep 12 21:00:23 bullseye python3[610]: Watchdog timestamp not updated.
Sep 12 21:00:26 bullseye python3[610]: Watchdog timestamp not updated.
Sep 12 21:00:28 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!
Sep 12 21:00:28 bullseye systemd[1]: redis-watchdog.service: Killing process 610 (python3) with signal SIGABRT.
Sep 12 21:00:28 bullseye systemd[1]: redis-watchdog.service: Failed with result 'watchdog'.
Sep 12 21:00:29 bullseye systemd[1]: redis-watchdog.service: Scheduled restart job, restart counter is at 4.
Sep 12 21:00:29 bullseye systemd[1]: Stopped Naive Redis watchdog service.
Sep 12 21:00:29 bullseye systemd[1]: Starting Naive Redis watchdog service...
Sep 12 21:00:29 bullseye systemd[1]: Started Naive Redis watchdog service.
Sep 12 21:00:29 bullseye python3[612]: Watchdog timestamp not updated.
Sep 12 21:00:32 bullseye python3[612]: Watchdog timestamp not updated.
Sep 12 21:00:35 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!
Sep 12 21:00:35 bullseye systemd[1]: redis-watchdog.service: Killing process 612 (python3) with signal SIGABRT.
Sep 12 21:00:35 bullseye systemd[1]: redis-watchdog.service: Failed with result 'watchdog'.
Connection to 127.0.0.1 closed by remote host.
Connection to 127.0.0.1 closed.
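
When reviewing such a log, it can help to count the watchdog timeouts and restarts. This is a minimal sketch that parses journal lines in the format shown above; the regular expressions and the `summarize` name are assumptions for illustration, not a journalctl feature.

```python
import re

TIMEOUT = re.compile(r"Watchdog timeout \(limit (\d+)s\)!")
RESTART = re.compile(r"restart counter is at (\d+)")


def summarize(lines):
    """Return (number of watchdog timeouts, highest restart counter seen)."""
    timeouts = 0
    restarts = 0
    for line in lines:
        if TIMEOUT.search(line):
            timeouts += 1
        match = RESTART.search(line)
        if match:
            restarts = max(restarts, int(match.group(1)))
    return timeouts, restarts


if __name__ == "__main__":
    sample = [
        "Sep 12 21:00:09 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!",
        "Sep 12 21:00:10 bullseye systemd[1]: redis-watchdog.service: Scheduled restart job, restart counter is at 1.",
        "Sep 12 21:00:15 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!",
        "Sep 12 21:00:16 bullseye systemd[1]: redis-watchdog.service: Scheduled restart job, restart counter is at 2.",
    ]
    print(summarize(sample))  # (2, 2)
```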