Python checker runs on pve-control via systemd timer every 10 min, publishes issues to NATS subject homelab_health_issue. Checks NATS, Postgres, MariaDB, Ghost blogs, DB dependents, standalone services, and every NodePort. Silent when healthy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Homelab Health: Internal Checks Design

**Status: design approved 2026-04-20. Ready to write implementation plan.**

---

## Resume Notes (for next session)

You and I brainstormed this design across one session. All design questions are answered
and all three design sections are approved. The next steps per the brainstorming skill are:

1. **This file exists:** design committed (or ready to commit).
2. **Next action:** invoke `superpowers:writing-plans` to turn this design into a
   step-by-step implementation plan.
3. After the implementation plan is written, execute it (writing-plans → executing-plans).

Do NOT re-open any design decisions in the new session unless something here is
obviously wrong; the decisions below are settled.

**Test canary:** when verifying the installed system end-to-end, break **mediawiki**
(e.g. scale to 0 replicas), not the Ghost blogs. The Ghost blogs are production; MediaWiki is
expendable for a "does the alert fire" test.

---

## Goals

Add a second layer of cluster health monitoring that runs **inside** the K3s cluster
(on the pve-control node, rather than from a workstation) and reports structural /
semantic problems to NATS. The existing `k3s/scripts/check-health.sh`
(workstation-driven canary) stays in place unchanged.

Requirements, as given:

0. NATS itself up
1. MariaDB up
2. PostgreSQL up
3. Internal Ghost blog ports respond to HTTP correctly
4. All other services depending on MariaDB respond correctly
5. All services depending on PostgreSQL respond correctly
6. Something is listening at every NodePort

Plus implicit: standalone services (Vaultwarden, Garage, etc.) also get probed.

**Output contract:** publish NATS messages on subject `homelab_health_issue` with
JSON body:

```json
{
  "component_name": "<str>",
  "issue_detail": "<str>",
  "detected_at": "<ISO8601 timestamp>",
  "root_cause": "<optional str>"
}
```

---

## Decisions (settled)

| Decision | Choice | Why |
|---|---|---|
| Where it runs | systemd timer on **pve-control** | Master K3s control node; kubectl locally; always on. |
| Language | **Python 3** | User expertise; structured JSON; clean error handling. |
| HTTP probes | `requests` library | No subprocess per probe; in-process. |
| NATS publish | `nats-py` library | In-process; one cohesive Python process. |
| kubectl use | **subprocess** (kept for now) | Only two call sites; revisit later with `kubernetes` client. |
| DB auth for probes | **sidestepped** | Use `kubectl exec <pod> -- pg_isready` / `mariadb-admin ping`; no creds on pve-control. |
| Orchestration | Single script, one function per check category | Simple; matches "one function per check" ask. |
| Schedule | Every **10 minutes** | User said no more frequent than that. |
| Deduplication | **Stateless** | Re-fires every tick while failing; consumer handles aggregation. |
| Healthy publishes | **None** | Silent when OK. Only problems on the wire. |
| Recovery events | **None** | Reports stop when fixed; absence = healthy. |
| Service config | **JSON file** (`checks.json`) | Pythonic; easy to edit/commit; lives alongside `checker.py`. |
| NodePort discovery | **Live from `kubectl get svc -A -o json`** | Source of truth is the cluster; no drift. |
| NATS-down fallback | **stdout + non-zero exit** | Workstation canary + `systemctl status` surface failures. Future leaf/LAN NATS fallback via env var hook (deferred). |

---

## Architecture

**Deployment layout on pve-control:**

```
/opt/homelab-health/
├── checker.py        # Python entrypoint, one function per check
├── checks.json       # service catalog + NATS/DB config
└── venv/             # virtualenv with nats-py, requests
/etc/systemd/system/
├── homelab-health.service
└── homelab-health.timer
```

**Source of truth in repo:**

```
k3s/health/
├── home_lab_health.md        # this file
├── checker.py
├── checks.json
├── requirements.txt          # nats-py, requests
├── install.sh                # runs on pve-control, sets up venv + units
├── homelab-health.service
├── homelab-health.timer
└── tests/
    └── test_checks.py
```

**Runtime flow each tick:**

1. Load `checks.json`.
2. Connect to NATS with a 3s timeout. On failure: log loudly, still run checks, publish nothing, exit 1.
3. Run each check function in sequence, each wrapped in `try/except`; exceptions in one check never stop the others (they become a `healthcheck.<fn>` meta-issue).
4. Each check returns `list[Issue]`. Main loop aggregates.
5. Log every issue to stdout (journal).
6. For each issue, publish one NATS message to `homelab_health_issue`.
7. Exit 0 if there are zero issues and the NATS connection succeeded, 1 otherwise. `systemctl status` + journalctl give humans visibility.

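
The runtime flow above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the planned `checker.py`: `run_tick`, `connect`, and the `Publish` type are hypothetical names, issues are plain dicts, and the NATS client is injected so the control flow can be exercised without a broker. A real `connect` would presumably wrap nats-py's `nats.connect(...)` and flush or drain the connection before the process exits.

```python
import asyncio
import json
from typing import Awaitable, Callable, Optional

# Hypothetical type: an async callable taking (subject, payload bytes).
Publish = Callable[[str, bytes], Awaitable[None]]


async def run_tick(
    run_checks: Callable[[], list[dict]],
    connect: Callable[[], Awaitable[Publish]],
    subject: str = "homelab_health_issue",
) -> int:
    """Return the process exit code for one timer tick."""
    publish: Optional[Publish] = None
    try:
        # Step 2: bounded connect; a failure must not prevent the checks.
        publish = await asyncio.wait_for(connect(), timeout=3)
    except Exception as exc:
        print(f"NATS CONNECT FAILED: {type(exc).__name__}: {exc}")

    issues = run_checks()  # steps 3-4: the aggregated issue list

    for issue in issues:
        print(f"ISSUE {json.dumps(issue)}")  # step 5: journal via stdout
        if publish is None:
            continue  # NATS is down: log only
        try:
            await publish(subject, json.dumps(issue).encode())  # step 6
        except Exception as exc:
            # One failed publish does not stop the remaining ones.
            print(f"publish failed for {issue.get('component_name')}: {exc}")

    # Step 7: healthy only when connected and nothing was reported.
    return 0 if publish is not None and not issues else 1
```
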

---

## Config schema (`checks.json`)

```json
{
  "nats": {
    "url": "nats://nats.default.svc.cluster.local:4222",
    "subject": "homelab_health_issue",
    "monitoring_nodeport": 32388
  },
  "databases": [
    {
      "name": "postgres",
      "namespace": "default",
      "pod_label": "app=postgres",
      "probe_cmd": ["pg_isready", "-U", "postgres"]
    },
    {
      "name": "mariadb",
      "namespace": "default",
      "pod_label": "app=mariadb",
      "probe_cmd": ["mariadb-admin", "ping", "--silent"]
    }
  ],
  "services": [
    {"name": "ghost1", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "ghost2", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "ghost3", "namespace": "fulfillment", "db": "mariadb",
     "probe_path": "/ghost/api/admin/site/", "expected": [200, 401]},
    {"name": "mediawiki", "namespace": "default", "db": "mariadb",
     "probe_path": "/", "expected": [200, 302]},
    {"name": "forgejo", "namespace": "sjasoft", "db": "postgres",
     "probe_path": "/api/healthz", "expected": [200]},
    {"name": "authentik-server", "namespace": "default", "db": "postgres",
     "probe_path": "/-/health/live/", "expected": [200, 204]},
    {"name": "listmonk", "namespace": "default", "db": "postgres",
     "probe_path": "/api/health", "expected": [200]},
    {"name": "n8n", "namespace": "default", "db": "postgres",
     "probe_path": "/healthz", "expected": [200]},
    {"name": "mattermost", "namespace": "default", "db": "postgres",
     "probe_path": "/api/v4/system/ping", "expected": [200]},
    {"name": "vaultwarden", "namespace": "default", "db": null,
     "probe_path": "/alive", "expected": [200]},
    {"name": "garage", "namespace": "default", "db": null,
     "probe_path": "/health", "expected": [200]},
    {"name": "garage-webui", "namespace": "default", "db": null,
     "probe_path": "/", "expected": [200, 302]}
  ]
}
```

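
As an illustration of how a `databases` entry might drive a credential-free probe, here is a hedged sketch. `probe_database` and the injectable `run` parameter are hypothetical names, and the two-step approach (look up one pod by `pod_label`, then `kubectl exec` the `probe_cmd`) is an assumption about how the pod name gets resolved; the design only fixes that `kubectl exec` is used.

```python
import subprocess
from typing import Callable, Optional

# Hypothetical alias: anything with the calling convention of subprocess.run.
Runner = Callable[..., subprocess.CompletedProcess]


def probe_database(db: dict, run: Runner = subprocess.run) -> Optional[str]:
    """Return None when the probe succeeds, else a short failure description."""
    # Step 1: find one pod by label; jsonpath prints only the pod name.
    find = run(
        ["kubectl", "get", "pods", "-n", db["namespace"], "-l", db["pod_label"],
         "-o", "jsonpath={.items[0].metadata.name}"],
        capture_output=True, text=True, timeout=15,
    )
    pod = find.stdout.strip()
    if find.returncode != 0 or not pod:
        return f"no pod found for label {db['pod_label']}"

    # Step 2: run the configured probe command inside that pod.
    probe = run(
        ["kubectl", "exec", "-n", db["namespace"], pod, "--", *db["probe_cmd"]],
        capture_output=True, text=True, timeout=15,
    )
    if probe.returncode != 0:
        return f"{' '.join(db['probe_cmd'])} exited {probe.returncode}"
    return None
```
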
**Probe URL resolution:** at runtime, `kubectl get svc -n <ns> <name> -o json` →
extract `.spec.ports[].nodePort` → probe `http://localhost:<nodeport><probe_path>`.

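
A rough sketch of that resolution step, assuming hypothetical helpers `resolve_nodeport` and `probe_url` and an injectable `run` for testing. Picking the first port that carries a `nodePort` is an assumption; the design does not say what to do when a Service exposes several.

```python
import json
import subprocess
from typing import Callable, Optional


def resolve_nodeport(name: str, namespace: str,
                     run: Callable[..., subprocess.CompletedProcess] = subprocess.run
                     ) -> Optional[int]:
    """Return the Service's NodePort, or None if it cannot be determined."""
    out = run(
        ["kubectl", "get", "svc", "-n", namespace, name, "-o", "json"],
        capture_output=True, text=True, timeout=15,
    )
    if out.returncode != 0:
        return None
    ports = json.loads(out.stdout).get("spec", {}).get("ports", [])
    # Assumption: when several ports have a nodePort, probe the first one.
    for port in ports:
        if "nodePort" in port:
            return port["nodePort"]
    return None


def probe_url(nodeport: int, probe_path: str) -> str:
    """Build the localhost URL that the HTTP probe will request."""
    return f"http://localhost:{nodeport}{probe_path}"
```
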
**Per-service silence:** add `"disabled": true` to a service entry to skip it without
deleting it.

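
One possible way to apply the `disabled` flag while sorting the catalog into the categories the checks need is sketched below. `group_services` and the group keys are hypothetical; the grouping rules (name prefix `ghost`, then the `db` field) follow the config above, and an unknown `db` value is deliberately left to raise so a config typo is not silently ignored.

```python
def group_services(services: list[dict]) -> dict[str, list[dict]]:
    """Drop disabled entries and bucket the rest by check category."""
    groups: dict[str, list[dict]] = {
        "ghost": [], "mariadb": [], "postgres": [], "standalone": [],
    }
    for svc in services:
        if svc.get("disabled", False):
            continue  # silenced without being deleted from checks.json
        if svc["name"].startswith("ghost"):
            groups["ghost"].append(svc)
        elif svc["db"] is None:  # JSON null loads as Python None
            groups["standalone"].append(svc)
        else:
            groups[svc["db"]].append(svc)  # KeyError on an unknown db name
    return groups
```
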
**Verify actual probe paths during implementation.** The paths above are reasonable
defaults, but each needs a quick curl sanity check. Specifically double-check:
Authentik (`/-/health/live/` vs `/-/health/ready/`), Garage (root `/health` endpoint),
Vaultwarden (`/alive` returns 200 with a plain-text timestamp; confirmed), n8n (`/healthz`).

---

## Check catalog

One function per requirement, sharing an internal `probe_service(svc_cfg)` helper.

| Function | Covers | Mechanism |
|---|---|---|
| `check_nats()` | #0 | `kubectl exec` NATS pod to run `nats server check connection`; fallback HTTP GET `localhost:<monitoring_nodeport>/healthz` |
| `check_postgres()` | #2 | `kubectl exec` postgres pod to run `pg_isready -U postgres` |
| `check_mariadb()` | #1 | `kubectl exec` mariadb pod to run `mariadb-admin ping --silent` |
| `check_ghost_blogs()` | #3 | `probe_service` for every service whose name starts with `ghost` |
| `check_mariadb_dependents()` | #4 | `probe_service` for every non-ghost service where `db == "mariadb"` |
| `check_postgres_dependents()` | #5 | `probe_service` for every service where `db == "postgres"` |
| `check_standalone_services()` | implicit | `probe_service` for every service where `db == null` |
| `check_all_nodeports()` | #6 | `kubectl get svc -A -o json`; for every `nodePort`, TCP connect `localhost:<nodeport>`; failure = nothing listening |

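
The NodePort sweep in the last row might look something like the sketch below. `collect_nodeports` and `is_listening` are hypothetical helper names; skipping non-TCP ports is an added assumption (a TCP connect cannot say anything about a UDP NodePort), and the 3-second timeout is illustrative.

```python
import socket


def collect_nodeports(svc_list: dict) -> list[tuple[str, int]]:
    """Extract (namespace/name, nodePort) pairs from `kubectl get svc -A -o json`."""
    found = []
    for item in svc_list.get("items", []):
        meta = item["metadata"]
        for port in item.get("spec", {}).get("ports", []):
            # Assumption: only TCP NodePorts are checked.
            if "nodePort" in port and port.get("protocol", "TCP") == "TCP":
                found.append((f"{meta['namespace']}/{meta['name']}", port["nodePort"]))
    return found


def is_listening(port: int, host: str = "localhost", timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable
        return False
```
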
**`probe_service(svc)`:** resolves NodePort via kubectl, calls
`requests.get(f"http://localhost:{nodeport}{svc['probe_path']}", timeout=10)`,
compares status to `expected`, returns an `Issue` on mismatch or on exception.

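
A minimal sketch of that helper, under a few assumptions: the NodePort is passed in already resolved, the issue is returned as a plain dict rather than an `Issue`, and `http_get` is an injected callable (defaulting to `requests.get`) so unit tests need no network. One detail worth checking during implementation: `requests.get` follows redirects by default, so a `302` in `expected` would only ever be observed with `allow_redirects=False`.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def probe_service(svc: dict, nodeport: int,
                  http_get: Optional[Callable] = None) -> Optional[dict]:
    """Return None when the service answers as expected, else an issue dict."""
    if http_get is None:
        import requests  # imported lazily so tests can inject a fake instead
        http_get = requests.get

    url = f"http://localhost:{nodeport}{svc['probe_path']}"
    try:
        status = http_get(url, timeout=10).status_code
    except Exception as exc:
        # Connection refused, timeout, DNS failure, and so on.
        detail = f"GET {url} failed: {type(exc).__name__}: {exc}"
    else:
        if status in svc["expected"]:
            return None
        detail = f"GET {url} returned {status}, expected one of {svc['expected']}"

    return {
        "component_name": svc["name"],
        "issue_detail": detail,
        "detected_at": datetime.now(timezone.utc).isoformat(),
    }
```
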
**Root-cause hints in payload:** if `check_mariadb()` produced an issue this run,
any `check_mariadb_dependents()` failure gets `"root_cause": "mariadb unreachable"`.
Same pattern for postgres. The hint is decorative; consumers decide what to do with it.

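
The hinting rule can be expressed in a few lines. This sketch assumes issues are plain dicts and uses a hypothetical helper name, `add_root_cause`; how the database result is threaded between check functions is left open by the design.

```python
def add_root_cause(dependent_issues: list[dict], db_name: str,
                   db_issues: list[dict]) -> list[dict]:
    """Tag dependents' issues when their database already failed this run."""
    if not db_issues:
        return dependent_issues  # database looked healthy: no hint
    hint = f"{db_name} unreachable"
    # Build new dicts so the caller's issue objects are not mutated.
    return [{**issue, "root_cause": hint} for issue in dependent_issues]
```
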

---

## Error handling

```python
def run_all_checks(cfg) -> list[Issue]:
    issues = []
    for fn in [check_nats, check_postgres, check_mariadb,
               check_ghost_blogs, check_mariadb_dependents,
               check_postgres_dependents, check_standalone_services,
               check_all_nodeports]:
        try:
            issues.extend(fn(cfg))
        except Exception as e:
            # A crashing check becomes a meta-issue instead of aborting the run.
            issues.append(Issue(
                component_name=f"healthcheck.{fn.__name__}",
                issue_detail=f"check function raised: {type(e).__name__}: {e}",
                detected_at=now_iso(),
                root_cause="healthcheck bug or missing dependency"))
    return issues
```

- No single check can halt the pipeline.
- A NATS connect failure is logged loudly; checks still run. Individual publish failures
  are logged but don't stop the rest.
- `Issue` is a small dataclass; `to_dict()` serialises to the exact NATS payload schema.

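
For reference, `Issue` and `now_iso()` could be as small as the sketch below. The field names come from the output contract; omitting `root_cause` from the payload when it is unset (rather than sending `null`) and using UTC timestamps are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


def now_iso() -> str:
    """Current time as an ISO 8601 string (UTC is an assumption)."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


@dataclass
class Issue:
    component_name: str
    issue_detail: str
    detected_at: str
    root_cause: Optional[str] = None

    def to_dict(self) -> dict:
        """Serialise to the NATS payload schema."""
        payload = asdict(self)
        if payload["root_cause"] is None:
            # Assumption: the optional field is omitted instead of sent as null.
            del payload["root_cause"]
        return payload


# Example: the bytes that would go on the wire for one issue.
wire = json.dumps(Issue("mediawiki", "HTTP 503", now_iso()).to_dict()).encode()
```
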

---

## Deployment

**`install.sh` (run once on pve-control as samantha, with sudo where needed):**

```bash
#!/usr/bin/env bash
# Run from the k3s/health/ directory of the repo checkout on pve-control.
set -euo pipefail

# Copy the checker into place (install.sh and tests stay in the repo).
sudo mkdir -p /opt/homelab-health
sudo rsync -a --delete --exclude=install.sh --exclude=tests ./ /opt/homelab-health/
sudo chown -R samantha:samantha /opt/homelab-health

# Build the virtualenv as the unprivileged user.
python3 -m venv /opt/homelab-health/venv
/opt/homelab-health/venv/bin/pip install -r /opt/homelab-health/requirements.txt

# Install and start the systemd units.
sudo cp homelab-health.service homelab-health.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now homelab-health.timer
```

**`homelab-health.service`:**

```ini
[Unit]
Description=Homelab internal health checks
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
User=samantha
ExecStart=/opt/homelab-health/venv/bin/python /opt/homelab-health/checker.py
StandardOutput=journal
StandardError=journal
```

**`homelab-health.timer`:**

```ini
[Unit]
Description=Run homelab health checks every 10 minutes

[Timer]
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target
```


---

## Testing

**Unit tests** (`tests/test_checks.py`, pytest):
- Each check function takes a config object, so it is easily stubbed.
- `probe_service` accepts an injected HTTP client so tests don't hit real services.
- Mock `subprocess.run` for kubectl calls.
- Assert the exact `Issue` list returned for each failure shape.

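
A test in this style might look like the following sketch. `kubectl_stdout` is a hypothetical stand-in for a kubectl call site in `checker.py`, defined here only so the example is self-contained; the point is the `mock.patch("subprocess.run", ...)` pattern, which pytest would collect via the `test_` prefix.

```python
import subprocess
from unittest import mock


def kubectl_stdout(args: list[str]) -> str:
    """Hypothetical call site: run kubectl and return its stdout."""
    done = subprocess.run(["kubectl", *args], capture_output=True,
                          text=True, check=True)
    return done.stdout


def test_kubectl_call_is_mocked():
    fake = subprocess.CompletedProcess(args=[], returncode=0,
                                       stdout='{"items": []}', stderr="")
    # Patch the module attribute so no real kubectl binary is needed.
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert kubectl_stdout(["get", "svc", "-A", "-o", "json"]) == '{"items": []}'
    run.assert_called_once()
    # The first positional argument is the command list that was built.
    assert run.call_args.args[0][:3] == ["kubectl", "get", "svc"]
```
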
**Manual smoke test:** `checker.py --dry-run` logs all issues to stdout but skips
NATS publish. Run ad-hoc on pve-control during development.

**End-to-end verification after install:**
1. `systemctl list-timers homelab-health.timer` shows next fire time.
2. Manually fire once: `sudo systemctl start homelab-health.service`.
3. `journalctl -u homelab-health -n 200` shows outcome.
4. On workstation: `nats sub homelab_health_issue` (against the cluster NATS).
5. Break **mediawiki** (`kubectl scale deploy mediawiki -n default --replicas=0`) and
   wait ≤10 min; expect a message on the subject with `"component_name": "mediawiki"`.
6. Restore (`--replicas=1`) and confirm alerts stop on the next tick.


---

## Open items / future

- **Leaf/LAN NATS fallback:** add `FALLBACK_NATS_URL` env-var hook in `checker.py`
  (unused for now). When the leaf NATS comes online, publish there too on connect
  failure to primary.
- **NATS auth:** current assumption is local anonymous publish is allowed. If auth is
  added, introduce a `nats.creds_path` field in `checks.json` pointing at a creds
  file on pve-control.
- **k8s Python client migration:** replace the two remaining `kubectl` subprocess
  calls with the `kubernetes` library for a fully in-process script.
- **Recovery events:** if downstream consumers want a "resolved" signal, add a small
  local state file (JSON on disk) to detect transitions and publish recovery events.
- **Per-namespace grouping:** not needed now; if service list grows beyond ~25,
  reconsider organizing `checks.json` by namespace for readability.

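
The `FALLBACK_NATS_URL` hook from the first item could start as small as the sketch below. `nats_urls` is a hypothetical helper that only builds the ordered list of servers to try; the connect-and-retry logic around it is not shown and remains deferred.

```python
import os
from typing import Mapping


def nats_urls(primary: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return NATS URLs in the order they should be tried."""
    urls = [primary]
    fallback = env.get("FALLBACK_NATS_URL", "").strip()
    # Unset or empty means the hook is inactive, which is today's behaviour.
    if fallback and fallback != primary:
        urls.append(fallback)
    return urls
```
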