homelab/k3s/health/home_lab_health.md
Samantha Atkins 58bfd422d4 Add homelab internal health checker
Python checker runs on pve-control via systemd timer every 10 min,
publishes issues to NATS subject homelab_health_issue. Checks NATS,
Postgres, MariaDB, Ghost blogs, DB dependents, standalone services,
and every NodePort. Silent when healthy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 15:48:07 -04:00

Homelab Health — Internal Checks Design

Status: design approved 2026-04-20. Ready to write implementation plan.


Resume Notes (for next session)

We brainstormed this design in a single session. All design questions are answered and all three design sections are approved. The next steps, per the brainstorming skill, are:

  1. This file exists — design committed (or ready to commit).
  2. Next action: invoke superpowers:writing-plans to turn this design into a step-by-step implementation plan.
  3. After the implementation plan is written, execute it (writing-plans → executing-plans).

Do NOT re-open any design decisions in the new session unless something here is obviously wrong; the decisions below are settled.

Test canary: when verifying the installed system end-to-end, break mediawiki (e.g. scale it to 0 replicas), not the Ghost blogs. The Ghost blogs are production; MediaWiki is expendable for a "does the alert fire" test.


Goals

Add a second layer of cluster health monitoring that runs on the K3s control node (pve-control) and reports structural / semantic problems to NATS. The existing k3s/scripts/check-health.sh (workstation-driven canary) stays in place unchanged.

Requirements, as given:

  1. NATS itself up
  2. MariaDB up
  3. PostgreSQL up
  4. Internal Ghost blog ports respond to HTTP correctly
  5. All other services depending on MariaDB respond correctly
  6. All services depending on PostgreSQL respond correctly
  7. Something is listening at every NodePort

Plus implicit: standalone services (Vaultwarden, Garage, etc.) also get probed.

Output contract: publish NATS messages on subject homelab_health_issue with JSON body:

{
  "component_name": "<str>",
  "issue_detail": "<str>",
  "detected_at": "<ISO8601 timestamp>",
  "root_cause": "<optional str>"
}

Decisions (settled)

| Decision | Choice | Why |
|---|---|---|
| Where it runs | systemd timer on pve-control | Master K3s control node; kubectl locally; always on. |
| Language | Python 3 | User expertise; structured JSON; clean error handling. |
| HTTP probes | requests library | No subprocess per probe; in-process. |
| NATS publish | nats-py library | In-process; one cohesive Python process. |
| kubectl use | subprocess (kept for now) | Only two call sites; revisit later with kubernetes client. |
| DB auth for probes | sidestepped | Use kubectl exec <pod> -- pg_isready / mariadb-admin ping; no creds on pve-control. |
| Orchestration | Single script, one function per check category | Simple; matches "one function per check" ask. |
| Schedule | Every 10 minutes | User said no more frequent than that. |
| Deduplication | Stateless | Re-fires every tick while failing; consumer handles aggregation. |
| Healthy publishes | None | Silent when OK. Only problems on the wire. |
| Recovery events | None | Reports stop when fixed; absence = healthy. |
| Service config | JSON file (checks.json) | Pythonic; easy to edit/commit; lives alongside checker.py. |
| NodePort discovery | Live from kubectl get svc -A -o json | Source of truth is the cluster; no drift. |
| NATS-down fallback | stdout + non-zero exit | Workstation canary + systemctl status surface failures. Future leaf/LAN NATS fallback via env var hook (deferred). |

Architecture

Deployment layout on pve-control:

/opt/homelab-health/
├── checker.py            # Python entrypoint, one function per check
├── checks.json           # service catalog + NATS/DB config
└── venv/                 # virtualenv with nats-py, requests
/etc/systemd/system/
├── homelab-health.service
└── homelab-health.timer

Source of truth in repo:

k3s/health/
├── home_lab_health.md         # this file
├── checker.py
├── checks.json
├── requirements.txt           # nats-py, requests
├── install.sh                 # runs on pve-control, sets up venv + units
├── homelab-health.service
├── homelab-health.timer
└── tests/
    └── test_checks.py

Runtime flow each tick:

  1. Load checks.json.
  2. Connect to NATS with a 3s timeout. On failure: log loud, still run checks, publish nothing, exit 1.
  3. Run each check function in sequence, each wrapped in try/except; exceptions in one check never stop the others (they become a healthcheck.<fn> meta-issue).
  4. Each check returns list[Issue]. Main loop aggregates.
  5. Log every issue to stdout (journal).
  6. For each issue, publish one NATS message to homelab_health_issue.
  7. Exit 0 if there are zero issues and NATS was reachable, 1 otherwise. systemctl status + journalctl give humans visibility.
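The runtime flow above can be sketched roughly as follows. This is a minimal illustration, not the committed checker.py: the names tick, connect_nats and run_checks are hypothetical, issues are represented as plain dicts already in payload shape, and the nats-py import is deferred so the flow can be exercised with a fake connection.

```python
import asyncio
import json


async def connect_nats(url: str):
    """Open a nats-py connection (import deferred so tests can inject a fake)."""
    import nats  # provided by the nats-py package

    return await nats.connect(url, connect_timeout=3)


async def tick(cfg: dict, run_checks, connect=connect_nats) -> int:
    """Run one timer tick and return the process exit code."""
    try:
        # Bound the whole connect attempt to 3 seconds.
        nc = await asyncio.wait_for(connect(cfg["nats"]["url"]), timeout=3)
    except Exception as exc:
        print(f"ERROR: NATS connect failed: {exc!r}")
        nc = None

    issues = run_checks(cfg)  # list of payload dicts
    for issue in issues:
        print(json.dumps(issue))  # stdout ends up in the journal

    if nc is None:
        return 1  # nothing was published; exit non-zero

    for issue in issues:
        try:
            payload = json.dumps(issue).encode("utf-8")
            await nc.publish(cfg["nats"]["subject"], payload)
        except Exception as exc:
            # A failed publish is logged but does not stop the rest.
            print(f"ERROR: publish failed for {issue['component_name']}: {exc!r}")

    await nc.drain()  # flush pending messages, then close
    return 1 if issues else 0
```

A caller would do something like sys.exit(asyncio.run(tick(cfg, run_checks))), where run_checks adapts run_all_checks to return payload dicts.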

Config schema (checks.json)

{
  "nats": {
    "url": "nats://nats.default.svc.cluster.local:4222",
    "subject": "homelab_health_issue",
    "monitoring_nodeport": 32388
  },
  "databases": [
    {
      "name": "postgres",
      "namespace": "default",
      "pod_label": "app=postgres",
      "probe_cmd": ["pg_isready", "-U", "postgres"]
    },
    {
      "name": "mariadb",
      "namespace": "default",
      "pod_label": "app=mariadb",
      "probe_cmd": ["mariadb-admin", "ping", "--silent"]
    }
  ],
  "services": [
    {"name": "ghost1", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "ghost2", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "ghost3", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "mediawiki", "namespace": "default", "db": "mariadb",
     "probe_path": "/", "expected": [200, 302]},
    {"name": "forgejo", "namespace": "sjasoft", "db": "postgres",
     "probe_path": "/api/healthz", "expected": [200]},
    {"name": "authentik-server", "namespace": "default", "db": "postgres",
     "probe_path": "/-/health/live/", "expected": [200, 204]},
    {"name": "listmonk", "namespace": "default", "db": "postgres",
     "probe_path": "/api/health", "expected": [200]},
    {"name": "n8n", "namespace": "default", "db": "postgres",
     "probe_path": "/healthz", "expected": [200]},
    {"name": "mattermost", "namespace": "default", "db": "postgres",
     "probe_path": "/api/v4/system/ping", "expected": [200]},
    {"name": "vaultwarden", "namespace": "default", "db": null,
     "probe_path": "/alive", "expected": [200]},
    {"name": "garage", "namespace": "default", "db": null,
     "probe_path": "/health", "expected": [200]},
    {"name": "garage-webui", "namespace": "default", "db": null,
     "probe_path": "/", "expected": [200, 302]}
  ]
}
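As an illustration of how a "databases" entry might be consumed, the sketch below looks up a pod by pod_label and runs probe_cmd inside it with kubectl exec. The function name probe_database, the injectable run parameter, and the 15-second timeouts are assumptions made for this example; picking the first matching pod is also an assumption.

```python
import subprocess
from typing import Optional


def probe_database(db: dict, run=subprocess.run) -> Optional[str]:
    """Return None if the probe succeeds, else a short failure description."""
    # Step 1: find pods matching the label selector from checks.json.
    find = run(
        ["kubectl", "get", "pods", "-n", db["namespace"], "-l", db["pod_label"],
         "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True, text=True, timeout=15,
    )
    pods = find.stdout.split()
    if find.returncode != 0 or not pods:
        return f"no pod found for label {db['pod_label']}"

    # Step 2: run the probe command inside the first matching pod.
    probe = run(
        ["kubectl", "exec", "-n", db["namespace"], pods[0], "--", *db["probe_cmd"]],
        capture_output=True, text=True, timeout=15,
    )
    if probe.returncode != 0:
        detail = (probe.stderr or probe.stdout).strip()
        return f"{db['probe_cmd'][0]} exited {probe.returncode}: {detail}"
    return None
```

A subprocess.TimeoutExpired raised here is deliberately not caught; a per-check try/except wrapper would turn it into a meta-issue.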

Probe URL resolution: at runtime, kubectl get svc -n <ns> <name> -o json → extract .spec.ports[].nodePort → probe http://localhost:<nodeport><probe_path>.
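A small sketch of that resolution step, operating on the JSON text that kubectl get svc -o json prints. The helper names first_nodeport and probe_url are hypothetical, and taking the first port that carries a nodePort is an assumption; a service exposing several NodePorts may need an explicit choice.

```python
import json
from typing import Optional


def first_nodeport(svc_json: str) -> Optional[int]:
    """Return the first nodePort in a Service object, or None if it has none."""
    svc = json.loads(svc_json)
    for port in svc.get("spec", {}).get("ports", []):
        if "nodePort" in port:
            return port["nodePort"]
    return None


def probe_url(nodeport: int, probe_path: str) -> str:
    """Build the URL the checker would request on the control node."""
    return f"http://localhost:{nodeport}{probe_path}"
```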

Per-service silence: add "disabled": true to a service entry to skip it without deleting it.
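One way the checker might honor that flag when selecting services, assuming a missing "disabled" key means the service is enabled. The helper names enabled_services and services_for_db are illustrative only.

```python
def enabled_services(cfg: dict) -> list:
    """Return service entries that are not marked "disabled": true."""
    return [svc for svc in cfg["services"] if not svc.get("disabled", False)]


def services_for_db(cfg: dict, db) -> list:
    """Return enabled services whose "db" field equals db (None = standalone)."""
    return [svc for svc in enabled_services(cfg) if svc.get("db") == db]
```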

Verify actual probe paths during implementation — the paths above are reasonable defaults but each needs a quick curl sanity check. Specifically double-check: Authentik (/-/health/live/ vs /-/health/ready/), Garage (root /health endpoint), Vaultwarden (/alive returns 200 plain-text timestamp — confirmed), n8n (/healthz).


Check catalog

One function per requirement, sharing an internal probe_service(svc_cfg) helper.

| Function | Covers | Mechanism |
|---|---|---|
| check_nats() | #1 | kubectl exec NATS pod to run nats server check connection; fallback HTTP GET localhost:<monitoring_nodeport>/healthz |
| check_postgres() | #3 | kubectl exec postgres pod to run pg_isready -U postgres |
| check_mariadb() | #2 | kubectl exec mariadb pod to run mariadb-admin ping --silent |
| check_ghost_blogs() | #4 | probe_service for every service whose name starts with ghost |
| check_mariadb_dependents() | #5 | probe_service for every non-ghost service where db == "mariadb" |
| check_postgres_dependents() | #6 | probe_service for every service where db == "postgres" |
| check_standalone_services() | implicit | probe_service for every service where db == null |
| check_all_nodeports() | #7 | kubectl get svc -A -o json; for every nodePort, TCP connect localhost:<nodeport>; failure = nothing listening |
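The NodePort sweep could look roughly like this. It is a sketch under assumptions: all_nodeports and is_listening are hypothetical names, the input is the JSON text from kubectl get svc -A -o json, and the 3-second connect timeout is arbitrary.

```python
import json
import socket


def all_nodeports(svc_list_json: str) -> list:
    """Return sorted (namespace, name, nodePort) tuples for every NodePort."""
    found = []
    for item in json.loads(svc_list_json).get("items", []):
        meta = item["metadata"]
        for port in item.get("spec", {}).get("ports", []):
            if "nodePort" in port:
                found.append((meta["namespace"], meta["name"], port["nodePort"]))
    return sorted(found)


def is_listening(port: int, host: str = "localhost", timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A TCP connect says nothing useful about UDP NodePorts, so if the cluster exposes any, the port's protocol field may need to be checked before probing.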

probe_service(svc): resolves NodePort via kubectl, calls requests.get(f"http://localhost:{nodeport}{svc['probe_path']}", timeout=10), compares status to expected, returns an Issue on mismatch or on exception.
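A hedged sketch of the HTTP half of probe_service. It simplifies the description above in two ways that are assumptions of this example: the NodePort is passed in rather than resolved via kubectl, and the result is a plain dict in payload shape rather than an Issue. The http_get parameter is an injection point for tests and defaults to requests.get.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def probe_service(svc: dict, nodeport: int,
                  http_get: Optional[Callable] = None) -> Optional[dict]:
    """Return None when the service answers as expected, else an issue dict."""
    if http_get is None:
        import requests  # deferred so tests can inject a fake client

        http_get = requests.get

    url = f"http://localhost:{nodeport}{svc['probe_path']}"
    try:
        status = http_get(url, timeout=10).status_code
    except Exception as exc:
        detail = f"GET {url} failed: {type(exc).__name__}: {exc}"
    else:
        if status in svc["expected"]:
            return None
        detail = f"GET {url} returned {status}, expected one of {svc['expected']}"

    return {
        "component_name": svc["name"],
        "issue_detail": detail,
        "detected_at": datetime.now(timezone.utc).isoformat(),
    }
```

One thing to verify during implementation: requests follows redirects by default, so a 302 listed in expected is only ever observed if allow_redirects=False is passed. The sketch keeps the call exactly as the design states it.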

Root-cause hints in payload: if check_mariadb() produced an issue this run, any check_mariadb_dependents() failure gets "root_cause": "mariadb unreachable". Same pattern for postgres. Decorative — consumers decide what to do with it.
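The hint could be attached after the fact with a helper along these lines. annotate_root_cause is a hypothetical name, and issues are shown as plain dicts for brevity.

```python
def annotate_root_cause(dependent_issues: list, db_name: str, db_failed: bool) -> list:
    """Add a root_cause hint to dependent issues when their database check failed."""
    if not db_failed:
        return dependent_issues
    return [
        {**issue, "root_cause": f"{db_name} unreachable"}
        for issue in dependent_issues
    ]
```

The function returns new dicts instead of mutating its input, which keeps the per-check results easy to assert on in tests.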


Error handling

def run_all_checks(cfg) -> list[Issue]:
    issues = []
    for fn in [check_nats, check_postgres, check_mariadb,
               check_ghost_blogs, check_mariadb_dependents,
               check_postgres_dependents, check_standalone_services,
               check_all_nodeports]:
        try:
            issues.extend(fn(cfg))
        except Exception as e:
            issues.append(Issue(
                component_name=f"healthcheck.{fn.__name__}",
                issue_detail=f"check function raised: {type(e).__name__}: {e}",
                detected_at=now_iso(),
                root_cause="healthcheck bug or missing dependency"))
    return issues
  • No single check can halt the pipeline.
  • NATS connect failure is loud-logged; checks still run; individual publish failures are logged but don't stop the rest.
  • Issue is a small dataclass; to_dict() serialises to the exact NATS payload schema.
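For reference, Issue and now_iso might be defined along these lines. This is a sketch: emitting UTC timestamps and omitting root_cause from the payload when it is unset are assumptions, chosen because the schema marks that field optional.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


def now_iso() -> str:
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Issue:
    component_name: str
    issue_detail: str
    detected_at: str
    root_cause: Optional[str] = None

    def to_dict(self) -> dict:
        """Serialise to the NATS payload schema, dropping an unset root_cause."""
        payload = asdict(self)
        if payload["root_cause"] is None:
            del payload["root_cause"]
        return payload
```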

Deployment

install.sh (run once on pve-control as samantha, with sudo where needed):

set -euo pipefail
sudo mkdir -p /opt/homelab-health
sudo rsync -a --delete ./ /opt/homelab-health/ --exclude=install.sh --exclude=tests
sudo chown -R samantha:samantha /opt/homelab-health
python3 -m venv /opt/homelab-health/venv
/opt/homelab-health/venv/bin/pip install -r /opt/homelab-health/requirements.txt
sudo cp homelab-health.service homelab-health.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now homelab-health.timer

homelab-health.service:

[Unit]
Description=Homelab internal health checks
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
User=samantha
ExecStart=/opt/homelab-health/venv/bin/python /opt/homelab-health/checker.py
StandardOutput=journal
StandardError=journal

homelab-health.timer:

[Unit]
Description=Run homelab health checks every 10 minutes

[Timer]
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target

Testing

Unit tests (tests/test_checks.py, pytest):

  • Each check function takes a config object — easily stubbed.
  • probe_service accepts an injected HTTP client so tests don't hit real services.
  • Mock subprocess.run for kubectl calls.
  • Assert the exact Issue list returned for each failure shape.
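A minimal example of the subprocess-mocking approach, in pytest style. The helper kubectl_json is hypothetical and exists only to give the test something to exercise; the real checker may structure its kubectl calls differently.

```python
import json
import subprocess
from unittest import mock


def kubectl_json(args: list) -> dict:
    """Hypothetical helper: run kubectl with -o json and parse its stdout."""
    out = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        capture_output=True, text=True, check=True, timeout=15,
    )
    return json.loads(out.stdout)


def test_kubectl_json_parses_stdout():
    fake = subprocess.CompletedProcess(
        args=[], returncode=0, stdout='{"items": []}', stderr=""
    )
    # Patch subprocess.run so no real kubectl binary is needed.
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert kubectl_json(["get", "svc", "-A"]) == {"items": []}
    run.assert_called_once()
    assert run.call_args.args[0] == ["kubectl", "get", "svc", "-A", "-o", "json"]
```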

Manual smoke test: checker.py --dry-run logs all issues to stdout but skips the NATS publish. Run it ad hoc on pve-control during development.
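The flag could be wired up with argparse roughly as below. parse_args is a hypothetical name and the help text is illustrative.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse checker.py command-line options."""
    parser = argparse.ArgumentParser(description="Homelab internal health checks")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="log issues to stdout but skip the NATS publish",
    )
    return parser.parse_args(argv)
```

The main function would then check args.dry_run before connecting to NATS.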

End-to-end verification after install:

  1. systemctl list-timers homelab-health.timer shows next fire time.
  2. Manually fire once: sudo systemctl start homelab-health.service.
  3. journalctl -u homelab-health -n 200 shows outcome.
  4. On workstation: nats sub homelab_health_issue (against the cluster NATS).
  5. Break mediawiki (kubectl scale deploy mediawiki -n default --replicas=0) and wait ≤10 min — expect a message on the subject, with component_name:"mediawiki".
  6. Restore (--replicas=1) and confirm alerts stop on the next tick.

Open items / future

  • Leaf/LAN NATS fallback: add FALLBACK_NATS_URL env-var hook in checker.py (unused for now). When the leaf NATS comes online, publish there too on connect failure to primary.
  • NATS auth: current assumption is local anonymous publish is allowed. If auth is added, introduce a nats.creds_path field in checks.json pointing at a creds file on pve-control.
  • k8s Python client migration: replace the two remaining kubectl subprocess calls with the kubernetes library for a fully in-process script.
  • Recovery events: if downstream consumers want a "resolved" signal, add a small local state file (JSON on disk) to detect transitions and publish recovery events.
  • Per-namespace grouping: not needed now; if service list grows beyond ~25, reconsider organizing checks.json by namespace for readability.
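If the recovery-events item is ever picked up, the transition detection could be as small as the sketch below. Everything here is an assumption for illustration: the function names, and the state file format (a JSON array of currently failing component names).

```python
import json
from pathlib import Path


def load_failing(state_path: Path) -> set:
    """Read the previously failing component names; empty set if no usable state."""
    try:
        return set(json.loads(state_path.read_text()))
    except (FileNotFoundError, json.JSONDecodeError):
        return set()


def save_failing(state_path: Path, failing: set) -> None:
    """Persist the currently failing component names for the next tick."""
    state_path.write_text(json.dumps(sorted(failing)))


def recovered(previous: set, current: set) -> list:
    """Components that were failing last tick and are healthy now."""
    return sorted(previous - current)
```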