homelab/k3s
Samantha Atkins 58bfd422d4 Add homelab internal health checker
Python checker runs on pve-control via systemd timer every 10 min,
publishes issues to NATS subject homelab_health_issue. Checks NATS,
Postgres, MariaDB, Ghost blogs, DB dependents, standalone services,
and every NodePort. Silent when healthy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 15:48:07 -04:00
authentik K3s cluster on Proxmox with WireGuard mesh networking 2026-04-07 01:23:13 -04:00
forgejo K3s cluster on Proxmox with WireGuard mesh networking 2026-04-07 01:23:13 -04:00
garage added garage, mattermost, etc 2026-04-18 18:28:55 -04:00
ghost Switch Ghost1 SMTP from Postmark to Mailgun 2026-04-19 18:57:07 -04:00
health Add homelab internal health checker 2026-04-20 15:48:07 -04:00
listmonk Add Listmonk, Mattermost manifests; Ghost SMTP and device verification fix 2026-04-11 18:07:35 -04:00
mariadb K3s cluster on Proxmox with WireGuard mesh networking 2026-04-07 01:23:13 -04:00
mattermost added garage, mattermost, etc 2026-04-18 18:28:55 -04:00
mediawiki added garage, mattermost, etc 2026-04-18 18:28:55 -04:00
monerod K3s cluster on Proxmox with WireGuard mesh networking 2026-04-07 01:23:13 -04:00
n8n Fix n8n NodePort conflict with Mattermost (32374 → 32376) 2026-04-12 14:09:27 -04:00
nats added garage, mattermost, etc 2026-04-18 18:28:55 -04:00
postgres K3s cluster on Proxmox with WireGuard mesh networking 2026-04-07 01:23:13 -04:00
redis added garage, mattermost, etc 2026-04-18 18:28:55 -04:00
resilience cleanup 2026-04-17 20:33:17 -04:00
scripts added garage, mattermost, etc 2026-04-18 18:28:55 -04:00
snikket K3s cluster on Proxmox with WireGuard mesh networking 2026-04-07 01:23:13 -04:00
storage added garage, mattermost, etc 2026-04-18 18:28:55 -04:00
synapse K3s cluster on Proxmox with WireGuard mesh networking 2026-04-07 01:23:13 -04:00
vaultwarden added garage, mattermost, etc 2026-04-18 18:28:55 -04:00
README.md Update Running Services table with today's deploys 2026-04-18 18:30:57 -04:00

K3s Cluster — Setup & Deployment Notes

This is the production cluster running on Proxmox VMs, connected via WireGuard hub-and-spoke. The VirtualBox learning cluster this replaced is retired.


WireGuard Mesh — Node Assignments

Hub: DO droplet at 138.197.87.251:51820, WG IP 10.0.0.1/24

Node            | vmbr1 IP       | WG IP     | Proxmox Host
pve-control     | 10.10.10.151   | 10.0.0.6  | pve
pve-worker      | 10.10.10.126   | 10.0.0.7  | pve
adder-control   | 10.10.10.185   | 10.0.0.8  | adder
adder-worker    | 10.10.10.83    | 10.0.0.9  | adder
game-control    | 10.10.10.158   | 10.0.0.10 | game
game-worker-hdd | 10.10.10.186   | 10.0.0.11 | game
game-worker-ssd | 10.10.10.153   | 10.0.0.12 | game
fat_mama        | 192.168.40.220 | 10.0.0.13 | workstation (VBox, bridged LAN)

IPs 10.0.0.2 through 10.0.0.5 are reserved (old VirtualBox K3s nodes, leave alone).

All Proxmox VMs are Debian Trixie on vmbr1 (10.10.10.0/24); fat_mama is the exception and sits on the bridged LAN. Inter-node traffic runs over WireGuard (10.0.0.0/24).
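
Before assigning a WG IP to a new node, a small check along these lines can catch collisions with the hub, the reserved block, or an existing node. This is only a sketch: the function name check_wg_ip and the ASSIGNED dict are illustrative, and the values are copied by hand from the table above rather than read from any wg0.conf.

```python
import ipaddress

# Assumption: values copied by hand from the node table, not read from wg0.conf.
WG_NET = ipaddress.ip_network("10.0.0.0/24")
HUB = ipaddress.ip_address("10.0.0.1")
# Reserved block left over from the old VirtualBox nodes: 10.0.0.2 to 10.0.0.5.
RESERVED = {ipaddress.ip_address(f"10.0.0.{n}") for n in range(2, 6)}
ASSIGNED = {
    "pve-control": "10.0.0.6",
    "pve-worker": "10.0.0.7",
    "adder-control": "10.0.0.8",
    "adder-worker": "10.0.0.9",
    "game-control": "10.0.0.10",
    "game-worker-hdd": "10.0.0.11",
    "game-worker-ssd": "10.0.0.12",
    "fat_mama": "10.0.0.13",
}


def check_wg_ip(candidate: str) -> list[str]:
    """Return a list of problems with a proposed WG IP (empty list means OK)."""
    ip = ipaddress.ip_address(candidate)
    problems = []
    if ip not in WG_NET:
        problems.append(f"{ip} is outside {WG_NET}")
    if ip == HUB:
        problems.append(f"{ip} is the hub")
    if ip in RESERVED:
        problems.append(f"{ip} is reserved")
    taken = {ipaddress.ip_address(v): k for k, v in ASSIGNED.items()}
    if ip in taken:
        problems.append(f"{ip} is already assigned to {taken[ip]}")
    return problems


if __name__ == "__main__":
    for candidate in ("10.0.0.3", "10.0.0.6", "10.0.0.14"):
        print(candidate, check_wg_ip(candidate) or "OK")
```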


K3s Install

Prerequisites — each VM must be on the WireGuard mesh first

WireGuard is configured via wg0.conf on each node (hub-and-spoke through DO droplet). Verify connectivity: ping 10.0.0.1 from the node.

First control plane node (cluster init)

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--cluster-init --disable traefik \
  --node-ip=<10.0.0.x> --flannel-iface=wg0" sh -

# Get token for other nodes to join
sudo cat /var/lib/rancher/k3s/server/node-token

Second and third control plane nodes

curl -sfL https://get.k3s.io | K3S_TOKEN=<token> \
  INSTALL_K3S_EXEC="server --server https://<control-1-mesh-ip>:6443 --disable traefik \
  --node-ip=<this-node-mesh-ip> --flannel-iface=wg0" sh -

Note: start the exec string with server and pass --server. When K3S_URL is set and the exec string begins with a flag, the install script defaults to agent mode, which joins the node as a worker instead of a control plane peer. etcd needs a majority of members for quorum, so 3 control nodes tolerate 1 failure while 2 tolerate none. Never stop at 2.
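
The reasoning in that note is plain majority-quorum arithmetic. A minimal sketch follows; the function names are illustrative and not part of K3s or etcd.

```python
def etcd_quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - etcd_quorum(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} control nodes: quorum {etcd_quorum(n)}, "
              f"tolerates {failures_tolerated(n)} failure(s)")
```

Going from 3 to 4 members raises the quorum without adding fault tolerance, which is why odd cluster sizes are the usual recommendation.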

Workers

curl -sfL https://get.k3s.io | K3S_URL=https://<any-control-mesh-ip>:6443 K3S_TOKEN=<token> \
  INSTALL_K3S_EXEC="--node-ip=<this-node-mesh-ip> --flannel-iface=wg0" sh -

kubeconfig for normal user (on any control node)

mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown samantha:samantha ~/.kube/config
export KUBECONFIG=~/.kube/config   # also add to ~/.bashrc
# Update server IP in config if needed:
sed -i 's/127.0.0.1/<control-1-mesh-ip>/' ~/.kube/config

Label workers

kubectl label node <name> node-role.kubernetes.io/worker=worker

GPU Worker Nodes — adder and game

Both Proxmox hosts adder and game have RTX 2070 GPUs available for PCIe passthrough.

Proxmox PCIe passthrough setup (on each Proxmox host)

# Enable IOMMU in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# (use amd_iommu=on for AMD hosts)
update-grub
reboot

# Blacklist nvidia drivers on host so GPU is free for passthrough:
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
update-initramfs -u
reboot

In Proxmox UI: VM Hardware → Add → PCI Device → select the RTX 2070 → check "All Functions" and "Primary GPU" if it is the only GPU.

Inside the GPU worker VM — install NVIDIA drivers

apt-get install -y linux-headers-$(uname -r)
# Add non-free repo if needed:
apt-get install -y nvidia-driver firmware-misc-nonfree
reboot
# Verify:
nvidia-smi

Install NVIDIA device plugin in K3s

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Label GPU nodes

kubectl label node k3s-adder nvidia.com/gpu=true
kubectl label node k3s-game nvidia.com/gpu=true

Verify GPU is schedulable

kubectl get nodes -o json | jq '.items[].status.capacity'
# Should show nvidia.com/gpu: "1" on adder and game

Scheduling a workload to a GPU node

resources:
  limits:
    nvidia.com/gpu: 1
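
The fragment above only shows the resource limit. As a rough sketch of where it sits in a complete manifest, the following builds a minimal Pod as JSON (which kubectl accepts as well as YAML). The helper name, pod name, and image are hypothetical placeholders, and the nodeSelector assumes the nvidia.com/gpu=true label from the labeling step. Depending on how the NVIDIA container runtime is set up on the node, a runtimeClassName may also be needed; that is not covered here.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Assumes GPU nodes carry the nvidia.com/gpu=true label.
            "nodeSelector": {"nvidia.com/gpu": "true"},
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": ["nvidia-smi"],
                    # The limit is what makes the scheduler pick a GPU node.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image; substitute a CUDA-capable image you trust.
    manifest = gpu_pod_manifest("gpu-smoke-test", "example.invalid/cuda-base:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to kubectl apply -f - on a control node.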

Namespaces — one per venture

kubectl create namespace sjasoft
kubectl create namespace fulfillment
kubectl create namespace privacy-practice

Secrets are always created per namespace — never share secrets across namespaces.


Secrets

Real secret values are never stored in files. Always create secrets directly on a control node.

# Pattern — adapt per service and namespace
kubectl create secret generic <name> \
  --namespace <namespace> \
  --from-literal=<key>='<value>'

# Generate passwords with:
openssl rand -base64 24
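
For scripted setups, the same pattern can be sketched in Python. Both helper names are illustrative; generate_password mirrors openssl rand -base64 24 using the standard library, and kubectl_secret_args only builds the argument list shown above without running anything.

```python
import base64
import secrets


def generate_password(num_bytes: int = 24) -> str:
    """Base64-encode num_bytes of OS randomness, like `openssl rand -base64`."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def kubectl_secret_args(name: str, namespace: str, key: str, value: str) -> list[str]:
    """Argument list for kubectl; pass to subprocess.run without shell=True."""
    return [
        "kubectl", "create", "secret", "generic", name,
        "--namespace", namespace,
        f"--from-literal={key}={value}",
    ]


if __name__ == "__main__":
    # Hypothetical secret name and key; the namespace is one of those created above.
    args = kubectl_secret_args("example-db", "sjasoft", "password", generate_password())
    print(" ".join(args[:-1]), "--from-literal=password=<redacted>")
```

Passing the list to subprocess.run avoids shell quoting problems with generated values that contain characters such as + or /.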

NodePort Registry

NodePorts must be unique across the entire cluster (range 30000-32767). Any NodePort is reachable on any node's WireGuard IP — K3s routes internally. Caddy on each venture ingress VPS proxies to any node's WG IP + NodePort.

Port  | Service             | Notes
32368 | ghost1              | blog.the-fulfillment.org
32369 | ghost2              | blog.privacy-practice.com
32370 | ghost3              | blog.sjasoft.com
32371 | forgejo             | git.sjasoft.com
32372 | authentik (HTTP)    | auth.sjasoft.com (use this behind Caddy)
32373 | authentik (HTTPS)   | skip (Caddy handles TLS)
32374 | mattermost          | deployed
32375 | listmonk            | deployed
32376 | n8n                 | deployed
32377 | vaultwarden         | deployed
32379 | monerod (RPC)       | planned
32380 | monerod (P2P)       | planned
32381 | snikket (HTTP)      | planned
32382 | snikket (C2S)       | planned
32383 | snikket (S2S)       | planned
32384 | snikket (proxy65)   | planned
32385 | synapse             | planned
32386 | nats (client)       | deployed
32387 | nats (websocket)    | deployed
32388 | nats (monitoring)   | deployed
32389 | nats (leafnode)     | deployed
32390 | garage (S3 API)     | deployed
32391 | garage-webui        | deployed
32392 | mediawiki           | deployed
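
Since a duplicate NodePort only shows up when the second Service fails to apply, it can help to lint the registry before editing manifests. The following is a sketch under the assumption that the registry is kept as (port, service) pairs; check_registry is an illustrative name, not an existing script in this repo.

```python
# Kubernetes default NodePort range; range() excludes the stop value, so 32767 is included.
NODEPORT_RANGE = range(30000, 32768)


def check_registry(entries: list[tuple[int, str]]) -> list[str]:
    """Return human-readable problems: out-of-range ports and duplicates."""
    problems = []
    seen: dict[int, str] = {}
    for port, service in entries:
        if port not in NODEPORT_RANGE:
            problems.append(f"{service}: port {port} is outside 30000-32767")
        if port in seen:
            problems.append(f"{service}: port {port} already used by {seen[port]}")
        else:
            seen[port] = service
    return problems


if __name__ == "__main__":
    # Two services claiming the same port are reported once, against the later entry.
    print(check_registry([(32374, "mattermost"), (32374, "n8n")]))
```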

Caddy Pattern — venture ingress VPS

Each venture has its own ingress VPS with its own public IP. Caddy on each proxies to a different node's mesh IP for the same cluster — ventures look unrelated from outside.

# Example — any node's WG IP works for any NodePort
blog.the-fulfillment.org {
    reverse_proxy 10.0.0.6:32368
}

git.sjasoft.com {
    reverse_proxy 10.0.0.8:32371
}

auth.sjasoft.com {
    reverse_proxy 10.0.0.10:32372
}

Pick any node's WG IP per service — they all work. Use different nodes per venture so ventures look unrelated from outside. See the WireGuard mesh table above for IPs.
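
Because every site block has the same shape, the Caddyfile entries can be generated from a small mapping. This is a sketch only: caddy_site and the SITES list are illustrative, with values taken from the example above. In practice each venture's VPS would receive only its own blocks.

```python
def caddy_site(domain: str, wg_ip: str, node_port: int) -> str:
    """Render one Caddyfile site block that proxies to a node's WG IP and NodePort."""
    return f"{domain} {{\n    reverse_proxy {wg_ip}:{node_port}\n}}\n"


# Values taken from the Caddy example in this README.
SITES = [
    ("blog.the-fulfillment.org", "10.0.0.6", 32368),
    ("git.sjasoft.com", "10.0.0.8", 32371),
    ("auth.sjasoft.com", "10.0.0.10", 32372),
]

# Blank line between blocks, matching the hand-written layout.
caddyfile = "\n".join(caddy_site(*site) for site in SITES)

if __name__ == "__main__":
    print(caddyfile)
```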


Current Deployment Status (2026-04-16)

K3s v1.34.6 cluster fully operational. WireGuard full mesh (direct peer-to-peer over vmbr1, hub for external traffic). Headscale removed — too buggy (0.28.x dropped nodes randomly).

Cluster Nodes

Node            | Role                | WG IP     | Proxmox Host       | Resources
pve-control     | control-plane, etcd | 10.0.0.6  | pve                | 2 CPU, 2GB RAM, 20GB
pve-worker      | worker              | 10.0.0.7  | pve                | 8 CPU, 58GB RAM, 3.3TB
adder-control   | control-plane, etcd | 10.0.0.8  | adder              | 2 CPU, 2GB RAM, 20GB
adder-worker    | worker              | 10.0.0.9  | adder              | 10 CPU, 58GB RAM, 1.7TB
game-control    | control-plane, etcd | 10.0.0.10 | game               | 2 CPU, 2GB RAM, 20GB
game-worker-hdd | worker              | 10.0.0.11 | game               | 4 CPU, 6GB RAM, 1.4TB HDD
game-worker-ssd | worker              | 10.0.0.12 | game               | 10 CPU, 8GB RAM, 200GB SSD
fat_mama        | worker              | 10.0.0.13 | workstation (VBox) | 20 CPU, 21GB RAM, 200GB

Running Services

Scheduler-assigned node in parens reflects current placement (unpinned services may move on restart). Pinned services have nodeName in their manifest.

Service | Node | NodePort | Domain | Status
postgres:16 | pve-worker (pinned) | ClusterIP | - | running
mariadb:11 | adder-worker (pinned) | ClusterIP | - | running
ghost1 | unpinned (game-worker-ssd) | 32368 | blog.the-fulfillment.org | running
ghost2 | unpinned (pve-worker) | 32369 | blog.privacy-practice.com | running
ghost3 | unpinned (adder-worker) | 32370 | blog.sjasoft.com | running
forgejo:9 | unpinned (pve-worker) | 32371 | git.sjasoft.com | running
authentik server | unpinned (adder-worker) | 32372 | auth.sjasoft.com | running
authentik worker | unpinned (adder-worker) | - | - | running
listmonk | unpinned (pve-worker) | 32375 | - | running
n8n | unpinned (game-worker-ssd) | 32376 | - | running
vaultwarden | unpinned (game-worker-hdd) | 32377 | - | running
mattermost:10.11 | unpinned (fatmama) | 32374 | chat.the-fulfillment.org | running
nats (w/ leafnode + JetStream) | unpinned (fatmama) | 32386-32389 | - | running
redis:7 (8GB, AOF) | fatmama (pinned, kernel tuning) | ClusterIP | - | running
garage:v2.3 (S3 API + admin + web) | unpinned (adder-worker) w/ anti-affinity on game-worker-hdd | 32390 (S3), 31540 (web), 3903 ClusterIP (admin) | s3.sjasoft.com, page.sjasoft.com | running
garage-webui | unpinned (pve-worker) | 32391 | s3-web.sjasoft.com | running
mediawiki:1.43 | unpinned (pve-worker) | 32392 | wiki.the-fulfillment.org | running
nfs-subdir-external-provisioner | kube-system | - | - | running (StorageClass nas-nfs on /volume1/samantha-private)

Remaining Services to Deploy

synapse, snikket, monerod, plane (blocked on design decisions)

Next Steps

  • Add VirtualBox workstation VMs as workers to this cluster
  • Wire up remaining Ghost blogs in Caddy
  • Deploy remaining services from k3s/ manifests

Install Method

K3s was installed using /etc/rancher/k3s/config.yaml on each node (not INSTALL_K3S_EXEC env vars, which get lost in nested SSH). Binary was downloaded once to pve and distributed via scp. Use INSTALL_K3S_SKIP_DOWNLOAD=true when binary is pre-staged.
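
The config.yaml keys are the K3s flag names without the leading dashes, so the flags from the install commands earlier map over directly. The actual files used on this cluster are not shown here; the following is a hedged sketch of what a generator could look like, with k3s_config and its role values being hypothetical names.

```python
def k3s_config(node_ip: str, role: str, server_url: str = "", token: str = "") -> str:
    """Render /etc/rancher/k3s/config.yaml content for one node.

    role is "init" (first control node), "server" (additional control node),
    or "agent" (worker). The role labels are local to this sketch.
    """
    if role not in ("init", "server", "agent"):
        raise ValueError(f"unknown role: {role}")
    lines = [f'node-ip: "{node_ip}"', 'flannel-iface: "wg0"']
    if role in ("init", "server"):
        # --disable is a server-only flag, so workers do not get it.
        lines += ["disable:", "  - traefik"]
    if role == "init":
        lines.append("cluster-init: true")
    else:
        if not server_url or not token:
            raise ValueError("joining nodes need server_url and token")
        lines += [f'server: "{server_url}"', f'token: "{token}"']
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(k3s_config("10.0.0.6", "init"))
    print(k3s_config("10.0.0.7", "agent", "https://10.0.0.6:6443", "<token>"))
```

The config file does not decide whether a node runs as server or agent; that still comes from the command in the systemd unit the install script writes.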