Create a software watchdog for a systemd service.
I will create a software watchdog for the Redis service that checks for the Redis socket file and restarts the container if the file goes missing.
Install Python bindings for systemd to get logging functionality.
$ sudo apt install python3-systemd
Ensure that the Redis service is running.
$ sudo systemctl status redis
● redis-server.service - Advanced key-value store
     Loaded: loaded (/lib/systemd/system/redis-server.service; enabled; vendor >
     Active: active (running) since Sun 2021-09-12 12:00:53 UTC; 43s ago
       Docs: http://redis.io/documentation,
             man:redis-server(1)
   Main PID: 984 (redis-server)
     Status: "Ready to accept connections"
      Tasks: 5 (limit: 2311)
     Memory: 9.8M
        CPU: 143ms
     CGroup: /system.slice/redis-server.service
             └─984 /usr/bin/redis-server 127.0.0.1:6379

Sep 12 12:00:53 bullseye systemd[1]: Starting Advanced key-value store...
Sep 12 12:00:53 bullseye systemd[1]: Started Advanced key-value store.
Ensure that the Redis socket exists, as it will be used to monitor this service.
$ ls -l /var/run/redis/redis-server.sock
srwx------ 1 redis redis 0 Sep 12 19:42 /var/run/redis/redis-server.sock
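The leading `s` in the listing marks the path as a Unix domain socket rather than a regular file, which matters for any check written in Python: `os.path.isfile` only reports regular files. The sketch below illustrates the difference. The helper name `is_unix_socket` and the temporary `demo.sock` path are hypothetical and exist only for this demonstration.

```python
import os
import socket
import stat
import tempfile


def is_unix_socket(path):
    """Return True if path exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Path is missing or cannot be inspected
        return False


if __name__ == "__main__":
    # Hypothetical demo socket in a temporary directory, not the real Redis socket
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.sock")
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind(demo_path)
        print("is_unix_socket:", is_unix_socket(demo_path))
        print("os.path.isfile:", os.path.isfile(demo_path))
        server.close()
```

On the real system the same helper could be pointed at /var/run/redis/redis-server.sock, provided the calling user is allowed to traverse the directory.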
Create a monitoring script.
$ cat << 'EOF' | sudo tee /usr/local/sbin/redis-watchdog.py
#!/usr/bin/env python3
"""
Naive Redis watchdog service
"""

import os
import signal
import socket
import stat
import sys
import time

from systemd import journal

REDIS_SOCKET = "/var/run/redis/redis-server.sock"


def log(message):
    """Log message to journal"""
    journal.send(message)


def get_notify_socket():
    """Connect to notify socket"""
    assert os.environ.get("NOTIFY_SOCKET") is not None, "NOTIFY_SOCKET needs to be defined"
    internal_notify_socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM | socket.SOCK_CLOEXEC)
    internal_notify_socket.connect(os.environ["NOTIFY_SOCKET"])
    return internal_notify_socket


def get_watchdog_sec():
    """Get WATCHDOG_USEC, convert it to seconds and take ~0.75 of it
    (update watchdog every 3s if WATCHDOG_USEC is 5s)"""
    assert os.environ.get("WATCHDOG_USEC") is not None, "WATCHDOG_USEC needs to be defined"
    watchdog_update = int(int(os.environ["WATCHDOG_USEC"]) / 1000000 * 0.75)
    # Never sleep for less than one second
    return max(watchdog_update, 1)


def send_socket_message(message):
    """Send message over socket and close the socket afterwards"""
    with get_notify_socket() as notify_socket:
        return notify_socket.send(message)


def socket_notify_ready():
    """Mark service as ready"""
    send_socket_message(b"STATUS=Ready")
    send_socket_message(b"READY=1")


def socket_notify_stop(*args):
    """Mark service as stopping and exit"""
    send_socket_message(b"STATUS=Stopping")
    send_socket_message(b"STOPPING=1")
    sys.exit(0)


def socket_notify_running():
    """Mark service as running"""
    send_socket_message(("STATUS=Running, checking every %is" % get_watchdog_sec()).encode("utf8"))


def watchdog_continue():
    """Update watchdog timestamp"""
    send_socket_message(b"WATCHDOG=1")


def watchdog_skip():
    """Log that the watchdog timestamp was deliberately not updated"""
    log("Watchdog timestamp not updated.")


def check_condition():
    """Check condition - check if there is a Redis socket.
    os.path.isfile() is False for sockets, so test the file type explicitly."""
    try:
        return stat.S_ISSOCK(os.stat(REDIS_SOCKET).st_mode)
    except OSError:
        return False


def wait():
    """Sleep until the next check"""
    watchdog_sec = get_watchdog_sec()
    time.sleep(watchdog_sec)


if __name__ == "__main__":
    for send_signal in (signal.SIGINT, signal.SIGABRT, signal.SIGTERM):
        signal.signal(send_signal, socket_notify_stop)
    socket_notify_ready()
    socket_notify_running()
    while True:
        if check_condition():
            watchdog_continue()
        else:
            watchdog_skip()
        wait()
EOF
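The script derives its sleep time from WATCHDOG_USEC, which systemd exports in microseconds. The arithmetic can be sketched in isolation as follows. The function name `ping_interval` and the sample values are illustrative assumptions, and the 0.75 factor is the one used by the script.

```python
def ping_interval(watchdog_usec, fraction=0.75):
    """Seconds to sleep between WATCHDOG=1 pings, never below one second."""
    seconds = int(int(watchdog_usec) / 1_000_000 * fraction)
    return max(seconds, 1)


if __name__ == "__main__":
    # Sample WATCHDOG_USEC values as strings, the way they arrive in the environment
    for usec in ("5000000", "1000000", "30000000"):
        print(usec, "->", ping_interval(usec), "s")
```

With WatchdogSec=5 this gives a 3 second interval. The sd_watchdog_enabled(3) manual page recommends pinging at half the timeout, so a fraction of 0.5 would leave more headroom. The 0.75 here is the script's own choice.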
Ensure that proper permissions and ownership are applied.
$ sudo chmod 750 /usr/local/sbin/redis-watchdog.py
$ sudo chown root:root /usr/local/sbin/redis-watchdog.py
This script is designed to run as a systemd service, so it fails when started directly because NOTIFY_SOCKET is not set.
$ python3 /usr/local/sbin/redis-watchdog.py
Traceback (most recent call last):
  File "/usr/local/sbin/redis-watchdog.py", line 85, in <module>
    socket_notify_ready()
  File "/usr/local/sbin/redis-watchdog.py", line 45, in socket_notify_ready
    send_socket_message(b"STATUS=Ready")
  File "/usr/local/sbin/redis-watchdog.py", line 39, in send_socket_message
    notify_socket = get_notify_socket()
  File "/usr/local/sbin/redis-watchdog.py", line 21, in get_notify_socket
    assert os.environ.get("NOTIFY_SOCKET", None) is not None, "NOTIFY_SOCKET needs to be defined"
AssertionError: NOTIFY_SOCKET needs to be defined
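For local experiments, a throwaway datagram socket can stand in for the service manager, since the notification protocol is just short text datagrams sent to the path in NOTIFY_SOCKET. The following is a minimal sketch under that assumption. The names `notify` and `demo` and the temporary `notify.sock` path are hypothetical, and the handling of a leading `@` (abstract namespace socket) is an addition that the script above does not include.

```python
import os
import socket
import tempfile


def notify(message):
    """Send one datagram to the socket named in NOTIFY_SOCKET."""
    path = os.environ["NOTIFY_SOCKET"]
    if path.startswith("@"):
        # A leading "@" denotes an abstract namespace socket
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        sock.send(message)


def demo():
    """Play the role of systemd: listen, then collect two notifications."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "notify.sock")
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        listener.bind(path)
        os.environ["NOTIFY_SOCKET"] = path
        try:
            notify(b"READY=1")
            notify(b"WATCHDOG=1")
            return [listener.recv(1024), listener.recv(1024)]
        finally:
            listener.close()
            os.environ.pop("NOTIFY_SOCKET", None)


if __name__ == "__main__":
    print(demo())
```

This only shows the message flow. It does not reproduce what systemd does with the messages, such as tracking the watchdog deadline.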
Create the systemd service. Note that the watchdog interval is set to 5 seconds, and that 5 failures within 2 minutes will reboot the container.
$ cat <<EOF | sudo tee /etc/systemd/system/redis-watchdog.service
[Unit]
Description=Naive Redis watchdog service

[Service]
Type=notify
ExecStart=/usr/local/sbin/redis-watchdog.py
Restart=always
RestartSec=1
WatchdogSec=5
StartLimitInterval=2min
StartLimitBurst=5
StartLimitAction=reboot

[Install]
WantedBy=multi-user.target
EOF
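To get a feel for how quickly these settings escalate, the timing can be estimated with simple arithmetic. This is a rough sketch, assuming the socket is already missing when the service starts, so each run lasts about WatchdogSec before it is killed and then waits RestartSec before the next start. The function name `seconds_to_start_limit` is hypothetical.

```python
def seconds_to_start_limit(watchdog_sec, restart_sec, burst):
    """Rough time until `burst` consecutive watchdog failures have occurred,
    assuming the service never sends WATCHDOG=1 after it starts."""
    return burst * watchdog_sec + (burst - 1) * restart_sec


if __name__ == "__main__":
    # Values taken from the unit file: WatchdogSec=5, RestartSec=1, StartLimitBurst=5
    estimate = seconds_to_start_limit(watchdog_sec=5, restart_sec=1, burst=5)
    window = 2 * 60  # StartLimitInterval=2min
    print(estimate, "seconds, inside the window:", estimate < window)
```

Under these assumptions the fifth failure arrives after roughly half a minute, well inside the 2 minute window, so a persistently missing socket leads to a reboot quickly.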
Create a timer to start the watchdog 1 minute after boot.
$ cat <<EOF | sudo tee /etc/systemd/system/redis-watchdog.timer
[Unit]
Description=timer for naive Redis watchdog service

[Timer]
OnBootSec=1min

[Install]
WantedBy=timers.target
EOF
Reload systemd configuration.
$ sudo systemctl daemon-reload
Enable the service timer.
$ sudo systemctl enable redis-watchdog.timer
Start the watchdog now.
$ sudo systemctl start redis-watchdog.service
Sample service logs from an emergency situation.
$ sudo journalctl -u redis-watchdog -f
-- Boot dacd0c7096a24504bca6ab54e9f4a6c3 --
Sep 12 20:59:54 bullseye systemd[1]: Starting Naive Redis watchdog service...
Sep 12 20:59:54 bullseye systemd[1]: Started Naive Redis watchdog service.
Sep 12 21:00:06 bullseye python3[567]: Watchdog timestamp not updated.
Sep 12 21:00:09 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!
Sep 12 21:00:09 bullseye systemd[1]: redis-watchdog.service: Killing process 567 (python3) with signal SIGABRT.
Sep 12 21:00:09 bullseye systemd[1]: redis-watchdog.service: Failed with result 'watchdog'.
Sep 12 21:00:10 bullseye systemd[1]: redis-watchdog.service: Scheduled restart job, restart counter is at 1.
Sep 12 21:00:10 bullseye systemd[1]: Stopped Naive Redis watchdog service.
Sep 12 21:00:10 bullseye systemd[1]: Starting Naive Redis watchdog service...
Sep 12 21:00:10 bullseye systemd[1]: Started Naive Redis watchdog service.
Sep 12 21:00:10 bullseye python3[606]: Watchdog timestamp not updated.
Sep 12 21:00:13 bullseye python3[606]: Watchdog timestamp not updated.
Sep 12 21:00:15 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!
Sep 12 21:00:15 bullseye systemd[1]: redis-watchdog.service: Killing process 606 (python3) with signal SIGABRT.
Sep 12 21:00:15 bullseye systemd[1]: redis-watchdog.service: Failed with result 'watchdog'.
Sep 12 21:00:16 bullseye systemd[1]: redis-watchdog.service: Scheduled restart job, restart counter is at 2.
Sep 12 21:00:16 bullseye systemd[1]: Stopped Naive Redis watchdog service.
Sep 12 21:00:16 bullseye systemd[1]: Starting Naive Redis watchdog service...
Sep 12 21:00:16 bullseye systemd[1]: Started Naive Redis watchdog service.
Sep 12 21:00:16 bullseye python3[608]: Watchdog timestamp not updated.
Sep 12 21:00:19 bullseye python3[608]: Watchdog timestamp not updated.
Sep 12 21:00:22 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!
Sep 12 21:00:22 bullseye systemd[1]: redis-watchdog.service: Killing process 608 (python3) with signal SIGABRT.
Sep 12 21:00:22 bullseye systemd[1]: redis-watchdog.service: Failed with result 'watchdog'.
Sep 12 21:00:23 bullseye systemd[1]: redis-watchdog.service: Scheduled restart job, restart counter is at 3.
Sep 12 21:00:23 bullseye systemd[1]: Stopped Naive Redis watchdog service.
Sep 12 21:00:23 bullseye systemd[1]: Starting Naive Redis watchdog service...
Sep 12 21:00:23 bullseye systemd[1]: Started Naive Redis watchdog service.
Sep 12 21:00:23 bullseye python3[610]: Watchdog timestamp not updated.
Sep 12 21:00:26 bullseye python3[610]: Watchdog timestamp not updated.
Sep 12 21:00:28 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!
Sep 12 21:00:28 bullseye systemd[1]: redis-watchdog.service: Killing process 610 (python3) with signal SIGABRT.
Sep 12 21:00:28 bullseye systemd[1]: redis-watchdog.service: Failed with result 'watchdog'.
Sep 12 21:00:29 bullseye systemd[1]: redis-watchdog.service: Scheduled restart job, restart counter is at 4.
Sep 12 21:00:29 bullseye systemd[1]: Stopped Naive Redis watchdog service.
Sep 12 21:00:29 bullseye systemd[1]: Starting Naive Redis watchdog service...
Sep 12 21:00:29 bullseye systemd[1]: Started Naive Redis watchdog service.
Sep 12 21:00:29 bullseye python3[612]: Watchdog timestamp not updated.
Sep 12 21:00:32 bullseye python3[612]: Watchdog timestamp not updated.
Sep 12 21:00:35 bullseye systemd[1]: redis-watchdog.service: Watchdog timeout (limit 5s)!
Sep 12 21:00:35 bullseye systemd[1]: redis-watchdog.service: Killing process 612 (python3) with signal SIGABRT.
Sep 12 21:00:35 bullseye systemd[1]: redis-watchdog.service: Failed with result 'watchdog'.
Connection to 127.0.0.1 closed by remote host.
Connection to 127.0.0.1 closed.