Samantha Atkins 58bfd422d4 Add homelab internal health checker

Python checker runs on pve-control via systemd timer every 10 min,
publishes issues to NATS subject homelab_health_issue. Checks NATS,
Postgres, MariaDB, Ghost blogs, DB dependents, standalone services,
and every NodePort. Silent when healthy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-20 15:48:07 -04:00


Homelab Health — Internal Checks Design

Status: design approved 2026-04-20. Ready to write implementation plan.

Resume Notes (for next session)

You and I brainstormed this design in a single session. All design questions are answered and all three design sections are approved. Per the brainstorming skill, the next steps are:

This file exists — design committed (or ready to commit).
Next action: invoke superpowers:writing-plans to turn this design into a step-by-step implementation plan.
After the implementation plan is written, execute it (writing-plans → executing-plans).

Do NOT re-open any design decisions in the new session unless something here is obviously wrong; the decisions below are settled.

Test canary: when verifying the installed system end-to-end, break mediawiki (e.g. scale to 0 replicas), not the Ghost blogs. The Ghost blogs are production; MediaWiki is expendable for a "does the alert fire" test.

Goals

Add a second layer of cluster health monitoring that runs on a cluster node (pve-control, the K3s control node) and reports structural / semantic problems to NATS. The existing k3s/scripts/check-health.sh (workstation-driven canary) stays in place unchanged.

Requirements, as given:

0. NATS itself up
1. MariaDB up
2. PostgreSQL up
3. Internal Ghost blog ports respond to HTTP correctly
4. All other services depending on MariaDB respond correctly
5. All services depending on PostgreSQL respond correctly
6. Something is listening at every NodePort

Plus implicit: standalone services (Vaultwarden, Garage, etc.) also get probed.

Output contract: publish NATS messages on subject homelab_health_issue with JSON body:

{
  "component_name": "<str>",
  "issue_detail": "<str>",
  "detected_at": "<ISO8601 timestamp>",
  "root_cause": "<optional str>"
}

Decisions (settled)

| Decision | Choice | Why |
|---|---|---|
| Where it runs | systemd timer on pve-control | Master K3s control node; kubectl locally; always on. |
| Language | Python 3 | User expertise; structured JSON; clean error handling. |
| HTTP probes | `requests` library | No subprocess per probe; in-process. |
| NATS publish | `nats-py` library | In-process; one cohesive Python process. |
| kubectl use | subprocess (kept for now) | Only two call sites; revisit later with `kubernetes` client. |
| DB auth for probes | sidestepped | Use `kubectl exec <pod> -- pg_isready` / `mariadb-admin ping`; no creds on pve-control. |
| Orchestration | Single script, one function per check category | Simple; matches "one function per check" ask. |
| Schedule | Every 10 minutes | User said no more frequent than that. |
| Deduplication | Stateless | Re-fires every tick while failing; consumer handles aggregation. |
| Healthy publishes | None | Silent when OK. Only problems on the wire. |
| Recovery events | None | Reports stop when fixed; absence = healthy. |
| Service config | JSON file (`checks.json`) | Pythonic; easy to edit/commit; lives alongside `checker.py`. |
| NodePort discovery | Live from `kubectl get svc -A -o json` | Source of truth is the cluster; no drift. |
| NATS-down fallback | stdout + non-zero exit | Workstation canary + `systemctl status` surface failures. Future leaf/LAN NATS fallback via env var hook (deferred). |

Architecture

Deployment layout on pve-control:

/opt/homelab-health/
├── checker.py            # Python entrypoint, one function per check
├── checks.json           # service catalog + NATS/DB config
├── venv/                 # virtualenv with nats-py, requests
/etc/systemd/system/
├── homelab-health.service
└── homelab-health.timer

Source of truth in repo:

k3s/health/
├── home_lab_health.md         # this file
├── checker.py
├── checks.json
├── requirements.txt           # nats-py, requests
├── install.sh                 # runs on pve-control, sets up venv + units
├── homelab-health.service
├── homelab-health.timer
└── tests/
    └── test_checks.py

Runtime flow each tick:

Load checks.json.
Connect to NATS with a 3s timeout. On failure: log loudly, still run the checks, publish nothing, and exit 1.
Run each check function in sequence, each wrapped in try/except; exceptions in one check never stop the others (they become a healthcheck.<fn> meta-issue).
Each check returns list[Issue]. Main loop aggregates.
Log every issue to stdout (journal).
For each issue, publish one NATS message to homelab_health_issue.
Exit 0 if zero issues, 1 otherwise. systemctl status + journalctl give humans visibility.
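The flow above can be sketched as one function. This is a minimal illustration, not the actual checker.py: `run_tick`, the injected `connect` factory, and the injected `run_all_checks` callable are hypothetical names chosen so the flow can be exercised without a cluster. The sketch is synchronous for brevity; nats-py itself is asyncio-based, so the real publish path would presumably be `async`.

```python
import json
from typing import Callable


def run_tick(cfg: dict,
             connect: Callable[[str], Callable[[str, bytes], None]],
             run_all_checks: Callable[[dict], list],
             log: Callable[[str], None] = print) -> int:
    """One timer tick: run checks, log issues, publish when NATS is reachable.

    `connect` is a hypothetical factory returning a publish(subject, payload)
    callable, or raising on failure. It is assumed to apply the 3s timeout.
    Issues are plain dicts in the payload shape from the output contract.
    """
    publish = None
    nats_down = False
    try:
        publish = connect(cfg["nats"]["url"])
    except Exception as exc:
        nats_down = True
        log(f"NATS CONNECT FAILED: {type(exc).__name__}: {exc}")

    issues = run_all_checks(cfg)  # checks still run when NATS is down
    for issue in issues:
        log(json.dumps(issue))  # every issue reaches the journal
        if publish is None:
            continue
        try:
            publish(cfg["nats"]["subject"], json.dumps(issue).encode())
        except Exception as exc:  # one failed publish must not stop the rest
            log(f"publish failed: {type(exc).__name__}: {exc}")

    # Non-zero exit on any issue, or when NATS itself was unreachable.
    return 1 if (issues or nats_down) else 0
```

The return value is meant to be passed to `sys.exit()`, so that `systemctl status` shows the unit as failed whenever something was reported.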

Config schema (`checks.json`)

{
  "nats": {
    "url": "nats://nats.default.svc.cluster.local:4222",
    "subject": "homelab_health_issue",
    "monitoring_nodeport": 32388
  },
  "databases": [
    {
      "name": "postgres",
      "namespace": "default",
      "pod_label": "app=postgres",
      "probe_cmd": ["pg_isready", "-U", "postgres"]
    },
    {
      "name": "mariadb",
      "namespace": "default",
      "pod_label": "app=mariadb",
      "probe_cmd": ["mariadb-admin", "ping", "--silent"]
    }
  ],
  "services": [
    {"name": "ghost1", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "ghost2", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "ghost3", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "mediawiki", "namespace": "default", "db": "mariadb",
     "probe_path": "/", "expected": [200, 302]},
    {"name": "forgejo", "namespace": "sjasoft", "db": "postgres",
     "probe_path": "/api/healthz", "expected": [200]},
    {"name": "authentik-server", "namespace": "default", "db": "postgres",
     "probe_path": "/-/health/live/", "expected": [200, 204]},
    {"name": "listmonk", "namespace": "default", "db": "postgres",
     "probe_path": "/api/health", "expected": [200]},
    {"name": "n8n", "namespace": "default", "db": "postgres",
     "probe_path": "/healthz", "expected": [200]},
    {"name": "mattermost", "namespace": "default", "db": "postgres",
     "probe_path": "/api/v4/system/ping", "expected": [200]},
    {"name": "vaultwarden", "namespace": "default", "db": null,
     "probe_path": "/alive", "expected": [200]},
    {"name": "garage", "namespace": "default", "db": null,
     "probe_path": "/health", "expected": [200]},
    {"name": "garage-webui", "namespace": "default", "db": null,
     "probe_path": "/", "expected": [200, 302]}
  ]
}
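As a rough sketch of how one `databases` entry from the config above might be turned into kubectl calls: `probe_database` is a hypothetical name, the 15-second timeouts are assumptions, and looking the pod up by label with a `jsonpath` query is one possible approach rather than something this design prescribes. The runner is injectable so the function can be tested without a cluster.

```python
import subprocess
from typing import Callable, Optional


def probe_database(db: dict, run: Callable = subprocess.run) -> Optional[str]:
    """Return None when the probe succeeds, else a short failure description."""
    # Step 1: find a pod matching the configured label selector.
    find_pod = ["kubectl", "get", "pods", "-n", db["namespace"],
                "-l", db["pod_label"],
                "-o", "jsonpath={.items[0].metadata.name}"]
    found = run(find_pod, capture_output=True, text=True, timeout=15)
    pod = found.stdout.strip()
    if found.returncode != 0 or not pod:
        return f"no pod found for label {db['pod_label']}"

    # Step 2: run the configured probe command inside that pod.
    probe = ["kubectl", "exec", "-n", db["namespace"], pod, "--",
             *db["probe_cmd"]]
    result = run(probe, capture_output=True, text=True, timeout=15)
    if result.returncode != 0:
        return f"probe exited {result.returncode}: {result.stderr.strip()}"
    return None
```

Because the probe runs inside the database pod, no database credentials need to exist on pve-control.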

Probe URL resolution: at runtime, kubectl get svc -n <ns> <name> -o json → extract .spec.ports[].nodePort → probe http://localhost:<nodeport><probe_path>.
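A minimal sketch of that resolution step, operating on the already-parsed output of `kubectl get svc -n <ns> <name> -o json`. The helper names are hypothetical, and taking the first port that carries a `nodePort` is an assumption: the design does not say which port wins when a Service exposes several.

```python
from typing import Optional


def first_nodeport(svc: dict) -> Optional[int]:
    """Return the first nodePort found in a Service object, or None."""
    for port in svc.get("spec", {}).get("ports", []):
        if "nodePort" in port:
            return port["nodePort"]
    return None


def probe_url(svc: dict, probe_path: str) -> Optional[str]:
    """Build the localhost probe URL, or None if the Service has no NodePort."""
    nodeport = first_nodeport(svc)
    if nodeport is None:
        return None
    return f"http://localhost:{nodeport}{probe_path}"
```

A `None` result would most naturally be reported as an issue of its own, since a cataloged service without a NodePort cannot be probed this way.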

Per-service silence: add "disabled": true to a service entry to skip it without deleting it.

Verify actual probe paths during implementation — the paths above are reasonable defaults but each needs a quick curl sanity check. Specifically double-check: Authentik (/-/health/live/ vs /-/health/ready/), Garage (root /health endpoint), Vaultwarden (/alive returns 200 plain-text timestamp — confirmed), n8n (/healthz).

Check catalog

One function per requirement, sharing an internal probe_service(svc_cfg) helper.

| Function | Covers | Mechanism |
|---|---|---|
| `check_nats()` | #0 | `kubectl exec` NATS pod to run `nats server check connection`; fallback HTTP GET `localhost:<monitoring_nodeport>/healthz` |
| `check_postgres()` | #2 | `kubectl exec` postgres pod to run `pg_isready -U postgres` |
| `check_mariadb()` | #1 | `kubectl exec` mariadb pod to run `mariadb-admin ping --silent` |
| `check_ghost_blogs()` | #3 | `probe_service` for every service whose name starts with `ghost` |
| `check_mariadb_dependents()` | #4 | `probe_service` for every non-ghost service where `db == "mariadb"` |
| `check_postgres_dependents()` | #5 | `probe_service` for every service where `db == "postgres"` |
| `check_standalone_services()` | implicit | `probe_service` for every service where `db == null` |
| `check_all_nodeports()` | #6 | `kubectl get svc -A -o json`; for every `nodePort`, TCP connect `localhost:<nodeport>`; failure = nothing listening |
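The NodePort sweep could look roughly like the sketch below. `collect_nodeports` and `is_listening` are hypothetical helper names, the 3-second connect timeout is an assumption, and skipping non-TCP ports is also an assumption (a TCP connect cannot say anything about a UDP NodePort).

```python
import socket


def collect_nodeports(svc_list: dict) -> list:
    """Return (namespace/name, nodePort) pairs from `kubectl get svc -A -o json`."""
    found = []
    for item in svc_list.get("items", []):
        meta = item.get("metadata", {})
        label = f"{meta.get('namespace', '?')}/{meta.get('name', '?')}"
        for port in item.get("spec", {}).get("ports", []):
            if port.get("protocol", "TCP") != "TCP":
                continue  # assumption: only TCP ports can be checked this way
            if "nodePort" in port:
                found.append((label, port["nodePort"]))
    return found


def is_listening(port: int, host: str = "localhost",
                 timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Each pair for which `is_listening` returns False would become one issue, with the namespace/name label as a natural `component_name`.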

probe_service(svc): resolves NodePort via kubectl, calls requests.get(f"http://localhost:{nodeport}{svc['probe_path']}", timeout=10), compares status to expected, returns an Issue on mismatch or on exception.
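A hedged sketch of the comparison logic in `probe_service`. It departs from the description above in two ways, both for testability: the NodePort is passed in rather than resolved via kubectl, and the HTTP call is an injected `http_get` callable (in production that would be `requests.get`). The issue is returned as a plain dict in the payload shape; `now_iso` is assumed to produce a UTC ISO 8601 timestamp.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def now_iso() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def probe_service(svc: dict, nodeport: int,
                  http_get: Callable) -> Optional[dict]:
    """Return None when the service answers as expected, else an issue dict."""
    url = f"http://localhost:{nodeport}{svc['probe_path']}"
    try:
        status = http_get(url, timeout=10).status_code
    except Exception as exc:
        # Connection refused, timeout, DNS failure: all become an issue.
        detail = f"GET {url} failed: {type(exc).__name__}: {exc}"
    else:
        if status in svc["expected"]:
            return None
        detail = f"GET {url} returned {status}, expected one of {svc['expected']}"
    return {"component_name": svc["name"],
            "issue_detail": detail,
            "detected_at": now_iso()}
```

One detail worth checking during implementation: `requests.get` follows redirects by default, so a `302` listed in `expected` is only ever observed if the call passes `allow_redirects=False`.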

Root-cause hints in payload: if check_mariadb() produced an issue this run, any check_mariadb_dependents() failure gets "root_cause": "mariadb unreachable". Same pattern for postgres. Decorative — consumers decide what to do with it.
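The hinting rule is small enough to show as a pure function. This is an illustrative sketch with a hypothetical name (`add_root_cause_hints`), working on issues as plain dicts in the payload shape:

```python
def add_root_cause_hints(dependent_issues: list, db_name: str,
                         db_failed: bool) -> list:
    """Tag dependents' issues with a root cause when their database check failed."""
    if not db_failed:
        return dependent_issues
    # Copy each issue rather than mutating it, so callers keep the originals.
    return [{**issue, "root_cause": f"{db_name} unreachable"}
            for issue in dependent_issues]
```

For example, `add_root_cause_hints(issues, "mariadb", db_failed=True)` would be applied to the output of `check_mariadb_dependents()` when `check_mariadb()` reported anything during the same run.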

Error handling

def run_all_checks(cfg) -> list[Issue]:
    """Run every check in order; a crash in one check becomes a meta-issue."""
    issues = []
    for fn in [check_nats, check_postgres, check_mariadb,
               check_ghost_blogs, check_mariadb_dependents,
               check_postgres_dependents, check_standalone_services,
               check_all_nodeports]:
        try:
            issues.extend(fn(cfg))
        except Exception as e:
            # Report the failing check itself so the remaining checks still run.
            issues.append(Issue(
                component_name=f"healthcheck.{fn.__name__}",
                issue_detail=f"check function raised: {type(e).__name__}: {e}",
                detected_at=now_iso(),
                root_cause="healthcheck bug or missing dependency"))
    return issues

No single check can halt the pipeline.
NATS connect failure is loud-logged; checks still run; individual publish failures are logged but don't stop the rest.
Issue is a small dataclass; to_dict() serialises to the exact NATS payload schema.
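One plausible shape for that dataclass, with field names taken from the output contract. Dropping `root_cause` from the payload when it is unset is an assumption; the contract only says the field is optional.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Issue:
    component_name: str
    issue_detail: str
    detected_at: str                  # ISO 8601 timestamp
    root_cause: Optional[str] = None  # optional hint for consumers

    def to_dict(self) -> dict:
        """Serialise to the NATS payload schema."""
        data = asdict(self)
        if data["root_cause"] is None:
            del data["root_cause"]    # assumption: omit the field when unknown
        return data
```

Publishing then reduces to `json.dumps(issue.to_dict()).encode()`.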

Deployment

install.sh (run once on pve-control as samantha, with sudo where needed):

set -euo pipefail
sudo mkdir -p /opt/homelab-health
sudo rsync -a --delete --exclude=install.sh --exclude=tests --exclude=venv ./ /opt/homelab-health/
sudo chown -R samantha:samantha /opt/homelab-health
python3 -m venv /opt/homelab-health/venv
/opt/homelab-health/venv/bin/pip install -r /opt/homelab-health/requirements.txt
sudo cp homelab-health.service homelab-health.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now homelab-health.timer

homelab-health.service:

[Unit]
Description=Homelab internal health checks
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
User=samantha
ExecStart=/opt/homelab-health/venv/bin/python /opt/homelab-health/checker.py
StandardOutput=journal
StandardError=journal

homelab-health.timer:

[Unit]
Description=Run homelab health checks every 10 minutes

[Timer]
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target

Testing

Unit tests (tests/test_checks.py, pytest):

Each check function takes a config object — easily stubbed.
probe_service accepts an injected HTTP client so tests don't hit real services.
Mock subprocess.run for kubectl calls.
Assert the exact Issue list returned for each failure shape.
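As an illustration of the kubectl-mocking approach, here is a small pytest-style sketch. `get_services` is a hypothetical helper standing in for one of the two kubectl call sites; the real function names in checker.py may differ.

```python
import json
import subprocess
from unittest import mock


def get_services() -> dict:
    """Hypothetical helper: the cluster-wide Service list as parsed JSON."""
    out = subprocess.run(["kubectl", "get", "svc", "-A", "-o", "json"],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)


def test_get_services_parses_kubectl_output():
    # Stand in for kubectl with a canned, successful CompletedProcess.
    fake = subprocess.CompletedProcess(args=[], returncode=0,
                                       stdout='{"items": []}', stderr="")
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert get_services() == {"items": []}
    # The helper must have invoked kubectl, not some other command.
    assert run.call_args.args[0][:3] == ["kubectl", "get", "svc"]
```

The same pattern extends to failure shapes: returning a non-zero `CompletedProcess`, or setting `side_effect=subprocess.TimeoutExpired(...)`, lets a test assert on the resulting issue list.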

Manual smoke test — checker.py --dry-run logs all issues to stdout but skips NATS publish. Run ad-hoc on pve-control during development.
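The `--dry-run` switch could be wired up with `argparse` along these lines. `parse_args` is a hypothetical helper name, and the flag is the only option the design mentions:

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse checker.py command-line options."""
    parser = argparse.ArgumentParser(prog="checker.py")
    parser.add_argument("--dry-run", action="store_true",
                        help="log issues to stdout but skip the NATS publish")
    return parser.parse_args(argv)
```

With no arguments `args.dry_run` is False, so the systemd unit's `ExecStart` line needs no changes.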

End-to-end verification after install:

systemctl list-timers homelab-health.timer shows next fire time.
Manually fire once: sudo systemctl start homelab-health.service.
journalctl -u homelab-health -n 200 shows outcome.
On workstation: nats sub homelab_health_issue (against the cluster NATS).
Break mediawiki (kubectl scale deploy mediawiki -n default --replicas=0) and wait ≤10 min — expect a message on the subject, with component_name:"mediawiki".
Restore (--replicas=1) and confirm alerts stop on the next tick.

Open items / future

Leaf/LAN NATS fallback: add FALLBACK_NATS_URL env-var hook in checker.py (unused for now). Once the leaf NATS comes online, publish to it whenever the connection to the primary fails.
NATS auth: current assumption is local anonymous publish is allowed. If auth is added, introduce a nats.creds_path field in checks.json pointing at a creds file on pve-control.
k8s Python client migration: replace the two remaining kubectl subprocess calls with the kubernetes library for a fully in-process script.
Recovery events: if downstream consumers want a "resolved" signal, add a small local state file (JSON on disk) to detect transitions and publish recovery events.
Per-namespace grouping: not needed now; if service list grows beyond ~25, reconsider organizing checks.json by namespace for readability.
