Homelab Health — Internal Checks Design
Status: design approved 2026-04-20. Ready to write implementation plan.
Resume Notes (for next session)
You and I brainstormed this design across one session. All design questions answered, all three design sections approved. The next step per the brainstorming skill is:
- This file exists — design committed (or ready to commit).
- Next action: invoke superpowers:writing-plans to turn this design into a step-by-step implementation plan.
- After the implementation plan is written, execute it (writing-plans → executing-plans).
Do NOT re-open any design decisions in the new session unless something here is obviously wrong; the decisions below are settled.
Test canary: when verifying the installed system end-to-end, break mediawiki (e.g. scale to 0 replicas), not the Ghost blogs. Ghosts are production, MediaWiki is expendable for a "does the alert fire" test.
Goals
Add a second layer of cluster health monitoring that runs on the K3s control node
and reports structural / semantic problems to NATS. The existing
k3s/scripts/check-health.sh (workstation-driven canary) stays in place unchanged.
Requirements, as given:
0. NATS itself up
1. MariaDB up
2. PostgreSQL up
3. Internal Ghost blog ports respond to HTTP correctly
4. All other services depending on MariaDB respond correctly
5. All services depending on PostgreSQL respond correctly
6. Something is listening at every NodePort
Plus implicit: standalone services (Vaultwarden, Garage, etc.) also get probed.
Output contract: publish NATS messages on subject homelab_health_issue with
JSON body:
{
  "component_name": "<str>",
  "issue_detail": "<str>",
  "detected_at": "<ISO8601 timestamp>",
  "root_cause": "<optional str>"
}
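A minimal sketch of how this payload could be modelled in Python. The to_bytes() helper and the choice to omit root_cause when it is unset are assumptions; the contract above only says the field is optional.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from datetime import datetime, timezone


def now_iso() -> str:
    """Current UTC time as an ISO 8601 string with second precision."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


@dataclass
class Issue:
    component_name: str
    issue_detail: str
    detected_at: str
    root_cause: str | None = None

    def to_dict(self) -> dict:
        payload = {
            "component_name": self.component_name,
            "issue_detail": self.issue_detail,
            "detected_at": self.detected_at,
        }
        # Assumption: "optional" means the key is left out when there is no hint.
        if self.root_cause is not None:
            payload["root_cause"] = self.root_cause
        return payload

    def to_bytes(self) -> bytes:
        """UTF-8 JSON body, ready to hand to a NATS publish call (hypothetical helper)."""
        return json.dumps(self.to_dict()).encode("utf-8")
```

If consumers prefer an explicit "root_cause": null, drop the if and always emit the key.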
Decisions (settled)
| Decision | Choice | Why |
|---|---|---|
| Where it runs | systemd timer on pve-control | Master K3s control node; kubectl locally; always on. |
| Language | Python 3 | User expertise; structured JSON; clean error handling. |
| HTTP probes | requests library | No subprocess per probe; in-process. |
| NATS publish | nats-py library | In-process; one cohesive Python process. |
| kubectl use | subprocess (kept for now) | Only two call sites; revisit later with kubernetes client. |
| DB auth for probes | sidestepped | Use kubectl exec <pod> -- pg_isready / mariadb-admin ping; no creds on pve-control. |
| Orchestration | Single script, one function per check category | Simple; matches "one function per check" ask. |
| Schedule | Every 10 minutes | User said no more frequent than that. |
| Deduplication | Stateless | Re-fires every tick while failing; consumer handles aggregation. |
| Healthy publishes | None | Silent when OK. Only problems on the wire. |
| Recovery events | None | Reports stop when fixed; absence = healthy. |
| Service config | JSON file (checks.json) | Pythonic; easy to edit/commit; lives alongside checker.py. |
| NodePort discovery | Live from kubectl get svc -A -o json | Source of truth is the cluster; no drift. |
| NATS-down fallback | stdout + non-zero exit | Workstation canary + systemctl status surface failures. Future leaf/LAN NATS fallback via env var hook (deferred). |
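The "DB auth for probes" decision could look roughly like the sketch below: find a pod by label, then run the probe command inside it with kubectl exec. The function name probe_database, the injected run parameter and the 15 s timeouts are assumptions, not part of the design. The db argument has the shape of one entry in the databases list of checks.json.

```python
from __future__ import annotations

import subprocess


def probe_database(db: dict, run=subprocess.run) -> str | None:
    """Return None when the in-pod probe succeeds, else a short failure description."""
    # Look up pods matching the label selector; names come back space separated.
    find = run(
        ["kubectl", "get", "pod", "-n", db["namespace"], "-l", db["pod_label"],
         "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True, text=True, timeout=15,
    )
    pods = find.stdout.split()
    if find.returncode != 0 or not pods:
        return f"no pod found for {db['pod_label']} in {db['namespace']}"

    # Run the probe inside the first pod, so no DB credentials live on the host.
    probe = run(
        ["kubectl", "exec", "-n", db["namespace"], pods[0], "--", *db["probe_cmd"]],
        capture_output=True, text=True, timeout=15,
    )
    if probe.returncode != 0:
        output = (probe.stderr or probe.stdout).strip()
        return f"{' '.join(db['probe_cmd'])} exited {probe.returncode}: {output}"
    return None
```

A subprocess.TimeoutExpired is deliberately not caught here; the per-check try/except described under Error handling would turn it into a meta-issue.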
Architecture
Deployment layout on pve-control:
/opt/homelab-health/
├── checker.py # Python entrypoint, one function per check
├── checks.json # service catalog + NATS/DB config
└── venv/ # virtualenv with nats-py, requests
/etc/systemd/system/
├── homelab-health.service
└── homelab-health.timer
Source of truth in repo:
k3s/health/
├── home_lab_health.md # this file
├── checker.py
├── checks.json
├── requirements.txt # nats-py, requests
├── install.sh # runs on pve-control, sets up venv + units
├── homelab-health.service
├── homelab-health.timer
└── tests/
    └── test_checks.py
Runtime flow each tick:
- Load checks.json.
- Connect to NATS with a 3s timeout. On failure: log loudly, still run checks, publish nothing, exit 1.
- Run each check function in sequence, each wrapped in try/except; exceptions in one check never stop the others (they become a healthcheck.<fn> meta-issue).
- Each check returns list[Issue]. Main loop aggregates.
- Log every issue to stdout (journal).
- For each issue, publish one NATS message to homelab_health_issue.
- Exit 0 if zero issues, 1 otherwise. systemctl status + journalctl give humans visibility.
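The flow above can be sketched as a single coroutine. This is an illustration under assumptions, not the planned checker.py: run_tick, the injected connect parameter and the use of plain dicts instead of Issue objects are hypothetical. nats.connect, publish and drain are real nats-py calls; allow_reconnect=False is added here so that a dead server fails fast instead of being retried.

```python
from __future__ import annotations

import asyncio
import json
from datetime import datetime, timezone


async def run_tick(cfg: dict, checks: list, connect=None) -> int:
    """One timer tick; returns the process exit code (0 healthy, 1 otherwise)."""
    if connect is None:
        import nats  # nats-py, imported lazily so the sketch loads without it
        connect = nats.connect

    nc = None
    try:
        nc = await connect(cfg["nats"]["url"], connect_timeout=3, allow_reconnect=False)
    except Exception as exc:
        print(f"NATS CONNECT FAILED: {type(exc).__name__}: {exc}", flush=True)

    issues: list[dict] = []
    for check in checks:
        try:
            # Checks block on kubectl/HTTP, so run them off the event loop.
            issues.extend(await asyncio.to_thread(check, cfg))
        except Exception as exc:
            issues.append({
                "component_name": f"healthcheck.{getattr(check, '__name__', 'unknown')}",
                "issue_detail": f"check function raised: {type(exc).__name__}: {exc}",
                "detected_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            })

    for issue in issues:
        line = json.dumps(issue)
        print(line, flush=True)  # stdout ends up in the journal
        if nc is not None:
            try:
                await nc.publish(cfg["nats"]["subject"], line.encode("utf-8"))
            except Exception as exc:
                print(f"publish failed: {type(exc).__name__}: {exc}", flush=True)

    if nc is not None:
        await nc.drain()  # flush buffered messages, then close the connection
    return 0 if nc is not None and not issues else 1
```

An entrypoint would call something like sys.exit(asyncio.run(run_tick(cfg, checks))). asyncio.to_thread needs Python 3.9 or newer.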
Config schema (checks.json)
{
  "nats": {
    "url": "nats://nats.default.svc.cluster.local:4222",
    "subject": "homelab_health_issue",
    "monitoring_nodeport": 32388
  },
  "databases": [
    {
      "name": "postgres",
      "namespace": "default",
      "pod_label": "app=postgres",
      "probe_cmd": ["pg_isready", "-U", "postgres"]
    },
    {
      "name": "mariadb",
      "namespace": "default",
      "pod_label": "app=mariadb",
      "probe_cmd": ["mariadb-admin", "ping", "--silent"]
    }
  ],
  "services": [
    {"name": "ghost1", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "ghost2", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "ghost3", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "mediawiki", "namespace": "default", "db": "mariadb",
     "probe_path": "/", "expected": [200, 302]},
    {"name": "forgejo", "namespace": "sjasoft", "db": "postgres",
     "probe_path": "/api/healthz", "expected": [200]},
    {"name": "authentik-server", "namespace": "default", "db": "postgres",
     "probe_path": "/-/health/live/", "expected": [200, 204]},
    {"name": "listmonk", "namespace": "default", "db": "postgres",
     "probe_path": "/api/health", "expected": [200]},
    {"name": "n8n", "namespace": "default", "db": "postgres",
     "probe_path": "/healthz", "expected": [200]},
    {"name": "mattermost", "namespace": "default", "db": "postgres",
     "probe_path": "/api/v4/system/ping", "expected": [200]},
    {"name": "vaultwarden", "namespace": "default", "db": null,
     "probe_path": "/alive", "expected": [200]},
    {"name": "garage", "namespace": "default", "db": null,
     "probe_path": "/health", "expected": [200]},
    {"name": "garage-webui", "namespace": "default", "db": null,
     "probe_path": "/", "expected": [200, 302]}
  ]
}
Probe URL resolution: at runtime, kubectl get svc -n <ns> <name> -o json →
extract .spec.ports[].nodePort → probe http://localhost:<nodeport><probe_path>.
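A rough sketch of that resolution step. The names nodeports_of and probe_url and the injected run parameter are hypothetical. Taking the first NodePort is also an assumption, since the design does not say how multi-port services are handled.

```python
from __future__ import annotations

import json
import subprocess


def nodeports_of(svc: dict) -> list[int]:
    """All nodePort values declared on one Service object (empty for ClusterIP)."""
    ports = svc.get("spec", {}).get("ports", [])
    return [p["nodePort"] for p in ports if "nodePort" in p]


def probe_url(svc_cfg: dict, run=subprocess.run) -> str:
    """Build the localhost probe URL for one services entry of checks.json."""
    done = run(
        ["kubectl", "get", "svc", "-n", svc_cfg["namespace"], svc_cfg["name"], "-o", "json"],
        capture_output=True, text=True, check=True, timeout=15,
    )
    ports = nodeports_of(json.loads(done.stdout))
    if not ports:
        raise LookupError(f"{svc_cfg['namespace']}/{svc_cfg['name']} exposes no NodePort")
    # Assumption: the first NodePort is the HTTP one.
    return f"http://localhost:{ports[0]}{svc_cfg['probe_path']}"
```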
Per-service silence: add "disabled": true to a service entry to skip it without
deleting it.
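One way the disabled flag and the db field could drive service selection. The function names and the category strings are hypothetical. Grouping the Ghost blogs by name prefix follows the check catalog.

```python
from __future__ import annotations


def enabled_services(cfg: dict) -> list[dict]:
    """Service entries that are not switched off with "disabled": true."""
    return [s for s in cfg.get("services", []) if not s.get("disabled", False)]


def select_services(cfg: dict, category: str) -> list[dict]:
    """Pick enabled services for one category: ghost, mariadb, postgres, standalone."""
    services = enabled_services(cfg)
    if category == "ghost":
        return [s for s in services if s["name"].startswith("ghost")]
    if category == "mariadb":
        # Ghost blogs get their own check, so they are excluded here.
        return [s for s in services
                if s["db"] == "mariadb" and not s["name"].startswith("ghost")]
    if category == "postgres":
        return [s for s in services if s["db"] == "postgres"]
    if category == "standalone":
        return [s for s in services if s["db"] is None]
    raise ValueError(f"unknown category: {category}")
```

JSON null loads as Python None, which is why the standalone branch tests "is None".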
Verify actual probe paths during implementation — the paths above are reasonable
defaults but each needs a quick curl sanity check. Specifically double-check:
Authentik (/-/health/live/ vs /-/health/ready/), Garage (root /health endpoint),
Vaultwarden (/alive returns 200 plain-text timestamp — confirmed), n8n (/healthz).
Check catalog
One function per requirement, sharing an internal probe_service(svc_cfg) helper.
| Function | Covers | Mechanism |
|---|---|---|
| check_nats() | #0 | kubectl exec NATS pod to run nats server check connection; fallback HTTP GET localhost:<monitoring_nodeport>/healthz |
| check_postgres() | #2 | kubectl exec postgres pod to run pg_isready -U postgres |
| check_mariadb() | #1 | kubectl exec mariadb pod to run mariadb-admin ping --silent |
| check_ghost_blogs() | #3 | probe_service for every service whose name starts with ghost |
| check_mariadb_dependents() | #4 | probe_service for every non-ghost service where db == "mariadb" |
| check_postgres_dependents() | #5 | probe_service for every service where db == "postgres" |
| check_standalone_services() | implicit | probe_service for every service where db == null |
| check_all_nodeports() | #6 | kubectl get svc -A -o json; for every nodePort, TCP connect localhost:<nodeport>; failure = nothing listening |
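The NodePort sweep in the last row might be sketched like this. all_nodeports and is_listening are hypothetical names. Skipping UDP ports is an assumption, made because a TCP connect cannot tell whether a UDP port is served.

```python
from __future__ import annotations

import socket


def all_nodeports(svc_list: dict) -> list[tuple[str, int]]:
    """(namespace/name, nodePort) pairs from `kubectl get svc -A -o json` output."""
    pairs = []
    for item in svc_list.get("items", []):
        meta = item["metadata"]
        for port in item.get("spec", {}).get("ports", []):
            # Kubernetes defaults a port's protocol to TCP when it is unset.
            if "nodePort" in port and port.get("protocol", "TCP") == "TCP":
                pairs.append((f"{meta['namespace']}/{meta['name']}", port["nodePort"]))
    return pairs


def is_listening(port: int, host: str = "localhost", timeout: float = 3.0) -> bool:
    """True when a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

check_all_nodeports() would then emit one issue per pair for which is_listening returns False.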
probe_service(svc): resolves NodePort via kubectl, calls
requests.get(f"http://localhost:{nodeport}{svc['probe_path']}", timeout=10),
compares status to expected, returns an Issue on mismatch or on exception.
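A simplified sketch of probe_service. It departs from the description in three labelled ways: the NodePort is passed in rather than resolved via kubectl, issues are plain dicts rather than Issue objects, and the HTTP call is injectable (http_get) for tests. It also passes allow_redirects=False, which is an assumption: requests follows redirects by default, so an expected 302 would otherwise never be observed as such.

```python
from __future__ import annotations

from datetime import datetime, timezone


def probe_service(svc: dict, nodeport: int, http_get=None) -> list[dict]:
    """Return [] when the service answers with an expected status, else one issue."""
    if http_get is None:
        import requests  # imported lazily so tests can run without it
        http_get = requests.get

    url = f"http://localhost:{nodeport}{svc['probe_path']}"
    try:
        status = http_get(url, timeout=10, allow_redirects=False).status_code
        detail = None if status in svc["expected"] else (
            f"GET {url} returned {status}, expected one of {svc['expected']}")
    except Exception as exc:  # connection refused, timeout, and similar
        detail = f"GET {url} failed: {type(exc).__name__}: {exc}"

    if detail is None:
        return []
    return [{
        "component_name": svc["name"],
        "issue_detail": detail,
        "detected_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }]
```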
Root-cause hints in payload: if check_mariadb() produced an issue this run,
any check_mariadb_dependents() failure gets "root_cause": "mariadb unreachable".
Same pattern for postgres. Decorative — consumers decide what to do with it.
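The hint could be applied with a small helper along these lines. add_root_cause is a hypothetical name, and keeping an already-set root cause is an assumption.

```python
from __future__ import annotations


def add_root_cause(dependent_issues: list[dict], db_name: str, db_failed: bool) -> list[dict]:
    """Return copies of the dependent issues, tagged when their database also failed."""
    if not db_failed:
        return [dict(issue) for issue in dependent_issues]
    hint = f"{db_name} unreachable"
    # Keep a more specific root cause if a check already set one.
    return [{**issue, "root_cause": issue.get("root_cause") or hint}
            for issue in dependent_issues]
```

For this to work the database checks have to run before their dependents, which matches the order of the check functions under Error handling.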
Error handling
def run_all_checks(cfg) -> list[Issue]:
    """Run every check; a crash in one check becomes a meta-issue, not an abort."""
    issues = []
    for fn in [check_nats, check_postgres, check_mariadb,
               check_ghost_blogs, check_mariadb_dependents,
               check_postgres_dependents, check_standalone_services,
               check_all_nodeports]:
        try:
            issues.extend(fn(cfg))
        except Exception as e:
            # Report the failing check itself so the remaining checks still run.
            issues.append(Issue(
                component_name=f"healthcheck.{fn.__name__}",
                issue_detail=f"check function raised: {type(e).__name__}: {e}",
                detected_at=now_iso(),
                root_cause="healthcheck bug or missing dependency"))
    return issues
- No single check can halt the pipeline.
- NATS connect failure is loud-logged; checks still run; individual publish failures are logged but don't stop the rest.
- Issue is a small dataclass; to_dict() serialises to the exact NATS payload schema.
Deployment
install.sh (run once on pve-control as samantha, with sudo where needed):
set -euo pipefail
sudo mkdir -p /opt/homelab-health
sudo rsync -a --delete ./ /opt/homelab-health/ --exclude=install.sh --exclude=tests
sudo chown -R samantha:samantha /opt/homelab-health
python3 -m venv /opt/homelab-health/venv
/opt/homelab-health/venv/bin/pip install -r /opt/homelab-health/requirements.txt
sudo cp homelab-health.service homelab-health.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now homelab-health.timer
homelab-health.service:
[Unit]
Description=Homelab internal health checks
After=network-online.target
[Service]
Type=oneshot
User=samantha
ExecStart=/opt/homelab-health/venv/bin/python /opt/homelab-health/checker.py
StandardOutput=journal
StandardError=journal
homelab-health.timer:
[Unit]
Description=Run homelab health checks every 10 minutes
[Timer]
OnCalendar=*:0/10
Persistent=true
[Install]
WantedBy=timers.target
Testing
Unit tests (tests/test_checks.py, pytest):
- Each check function takes a config object, so it is easily stubbed.
- probe_service accepts an injected HTTP client so tests don't hit real services.
- Mock subprocess.run for kubectl calls.
- Assert the exact Issue list returned for each failure shape.
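A sketch of what mocking subprocess.run might look like in tests/test_checks.py. kubectl_json is a hypothetical helper standing in for the real kubectl call sites. The tests use plain assert and try/except, so they run under pytest without importing it.

```python
from __future__ import annotations

import json
import subprocess
from unittest import mock


def kubectl_json(args: list[str]) -> dict:
    """Hypothetical helper: run kubectl with -o json and parse its stdout."""
    done = subprocess.run(["kubectl", *args, "-o", "json"],
                          capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


def test_kubectl_json_parses_stdout():
    fake = subprocess.CompletedProcess(args=[], returncode=0,
                                       stdout='{"items": []}', stderr="")
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert kubectl_json(["get", "svc", "-A"]) == {"items": []}
    # The helper should have invoked kubectl exactly once, with the JSON flag.
    run.assert_called_once()
    assert run.call_args.args[0] == ["kubectl", "get", "svc", "-A", "-o", "json"]


def test_kubectl_failure_propagates():
    err = subprocess.CalledProcessError(1, ["kubectl"])
    with mock.patch("subprocess.run", side_effect=err):
        try:
            kubectl_json(["get", "svc", "-A"])
        except subprocess.CalledProcessError:
            pass  # expected: the per-check try/except turns this into a meta-issue
        else:
            raise AssertionError("expected CalledProcessError")
```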
Manual smoke test — checker.py --dry-run logs all issues to stdout but skips
NATS publish. Run ad-hoc on pve-control during development.
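The --dry-run switch could be wired with argparse roughly as follows. parse_args and should_publish are hypothetical names.

```python
from __future__ import annotations

import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse checker.py command-line flags; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(
        prog="checker.py", description="Homelab internal health checks")
    parser.add_argument(
        "--dry-run", action="store_true",
        help="log issues to stdout but skip the NATS publish")
    return parser.parse_args(argv)


def should_publish(args: argparse.Namespace, nats_connected: bool) -> bool:
    """Publish only on a real run with a live NATS connection."""
    return nats_connected and not args.dry_run
```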
End-to-end verification after install:
- systemctl list-timers homelab-health.timer shows next fire time.
- Manually fire once: sudo systemctl start homelab-health.service.
- journalctl -u homelab-health -n 200 shows outcome.
- On workstation: nats sub homelab_health_issue (against the cluster NATS).
- Break mediawiki (kubectl scale deploy mediawiki -n default --replicas=0) and wait ≤10 min; expect a message on the subject, with component_name: "mediawiki".
- Restore (--replicas=1) and confirm alerts stop on the next tick.
Open items / future
- Leaf/LAN NATS fallback: add FALLBACK_NATS_URL env-var hook in checker.py (unused for now). When the leaf NATS comes online, publish there too on connect failure to primary.
- NATS auth: current assumption is local anonymous publish is allowed. If auth is added, introduce a nats.creds_path field in checks.json pointing at a creds file on pve-control.
- k8s Python client migration: replace the two remaining kubectl subprocess calls with the kubernetes library for a fully in-process script.
- Recovery events: if downstream consumers want a "resolved" signal, add a small local state file (JSON on disk) to detect transitions and publish recovery events.
- Per-namespace grouping: not needed now; if service list grows beyond ~25, reconsider organizing checks.json by namespace for readability.