
Samantha Atkins 58bfd422d4 Add homelab internal health checker Python checker runs on pve-control via systemd timer every 10 min, publishes issues to NATS subject homelab_health_issue. Checks NATS, Postgres, MariaDB, Ghost blogs, DB dependents, standalone services, and every NodePort. Silent when healthy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>		2026-04-20 15:48:07 -04:00
authentik	K3s cluster on Proxmox with WireGuard mesh networking	2026-04-07 01:23:13 -04:00
forgejo	K3s cluster on Proxmox with WireGuard mesh networking	2026-04-07 01:23:13 -04:00
garage	added garage, mattermost, etc	2026-04-18 18:28:55 -04:00
ghost	Switch Ghost1 SMTP from Postmark to Mailgun	2026-04-19 18:57:07 -04:00
health	Add homelab internal health checker	2026-04-20 15:48:07 -04:00
listmonk	Add Listmonk, Mattermost manifests; Ghost SMTP and device verification fix	2026-04-11 18:07:35 -04:00
mariadb	K3s cluster on Proxmox with WireGuard mesh networking	2026-04-07 01:23:13 -04:00
mattermost	added garage, mattermost, etc	2026-04-18 18:28:55 -04:00
mediawiki	added garage, mattermost, etc	2026-04-18 18:28:55 -04:00
monerod	K3s cluster on Proxmox with WireGuard mesh networking	2026-04-07 01:23:13 -04:00
n8n	Fix n8n NodePort conflict with Mattermost (32374 → 32376)	2026-04-12 14:09:27 -04:00
nats	added garage, mattermost, etc	2026-04-18 18:28:55 -04:00
postgres	K3s cluster on Proxmox with WireGuard mesh networking	2026-04-07 01:23:13 -04:00
redis	added garage, mattermost, etc	2026-04-18 18:28:55 -04:00
resilience	cleanup	2026-04-17 20:33:17 -04:00
scripts	added garage, mattermost, etc	2026-04-18 18:28:55 -04:00
snikket	K3s cluster on Proxmox with WireGuard mesh networking	2026-04-07 01:23:13 -04:00
storage	added garage, mattermost, etc	2026-04-18 18:28:55 -04:00
synapse	K3s cluster on Proxmox with WireGuard mesh networking	2026-04-07 01:23:13 -04:00
vaultwarden	added garage, mattermost, etc	2026-04-18 18:28:55 -04:00
README.md	Update Running Services table with today's deploys	2026-04-18 18:30:57 -04:00

README.md

K3s Cluster — Setup & Deployment Notes

This is the production cluster running on Proxmox VMs, connected via WireGuard hub-and-spoke. The VirtualBox learning cluster this replaced is retired.

WireGuard Mesh — Node Assignments

Hub: DO droplet at 138.197.87.251:51820, WG IP 10.0.0.1/24

Node	vmbr1 IP	WG IP	Proxmox Host
pve-control	10.10.10.151	10.0.0.6	pve
pve-worker	10.10.10.126	10.0.0.7	pve
adder-control	10.10.10.185	10.0.0.8	adder
adder-worker	10.10.10.83	10.0.0.9	adder
game-control	10.10.10.158	10.0.0.10	game
game-worker-hdd	10.10.10.186	10.0.0.11	game
game-worker-ssd	10.10.10.153	10.0.0.12	game
fat_mama	192.168.40.220	10.0.0.13	workstation (VBox, bridged LAN)

IPs 10.0.0.2–10.0.0.5 are reserved (old VirtualBox K3s nodes, leave alone).

All VMs are Debian Trixie on vmbr1 (10.10.10.0/24). Inter-node traffic runs over WireGuard (10.0.0.0/24).

K3s Install

Prerequisites — each VM must be on the WireGuard mesh first

WireGuard is configured via wg0.conf on each node (hub-and-spoke through DO droplet). Verify connectivity: ping 10.0.0.1 from the node.
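Beyond a ping to the hub, handshake age is another way to spot a dead tunnel. The sketch below is a minimal filter over the output of `wg show wg0 latest-handshakes`. The function name `stale_peers` and the 180-second threshold in the usage line are assumptions, not part of this setup.

```shell
# Read `wg show wg0 latest-handshakes` output on stdin (PUBKEY<TAB>EPOCH).
# Print the public key of every peer that has never completed a handshake
# (timestamp 0) or whose last handshake is more than $2 seconds before $1.
# $1 = current time in epoch seconds, $2 = maximum acceptable age in seconds.
stale_peers() {
    awk -v now="$1" -v max="$2" '$2 == 0 || now - $2 > max { print $1 }'
}

# Usage on a node (threshold of 180 s is an assumed value):
#   sudo wg show wg0 latest-handshakes | stale_peers "$(date +%s)" 180
```

No output means every peer has completed a handshake within the threshold.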

First control plane node (cluster init)

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--cluster-init --disable traefik \
  --node-ip=<10.0.0.x> --flannel-iface=wg0" sh -

# Get token for other nodes to join
sudo cat /var/lib/rancher/k3s/server/node-token

Second and third control plane nodes

curl -sfL https://get.k3s.io | K3S_TOKEN=<token> \
  INSTALL_K3S_EXEC="server --server https://<control-1-mesh-ip>:6443 --disable traefik \
  --node-ip=<this-node-mesh-ip> --flannel-iface=wg0" sh -

Note: pass the explicit server subcommand with --server, and do not set K3S_URL here. When K3S_URL is set and INSTALL_K3S_EXEC contains only flags, the install script defaults to the agent role, so the node would come up as a worker rather than a control plane peer. etcd needs a majority of its members to stay available: 3 control nodes tolerate 1 failure, while 2 tolerate none. Never stop at 2.
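The quorum arithmetic behind that note can be sketched as follows. The function names are illustrative. The only assumption is the standard majority rule for etcd (quorum is more than half of the members).

```shell
# Quorum is a strict majority of the voting members: floor(n / 2) + 1.
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Members that can fail while quorum is still reachable: n - quorum.
etcd_fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the table for small cluster sizes.
for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(etcd_quorum "$n") tolerates=$(etcd_fault_tolerance "$n")"
done
```

Going from 1 to 2 members, or from 3 to 4, raises the quorum without raising the number of tolerated failures, which is why even sizes add risk rather than resilience.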

Workers

curl -sfL https://get.k3s.io | K3S_URL=https://<any-control-mesh-ip>:6443 K3S_TOKEN=<token> \
  INSTALL_K3S_EXEC="--node-ip=<this-node-mesh-ip> --flannel-iface=wg0" sh -

kubeconfig for normal user (on any control node)

mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown samantha:samantha ~/.kube/config
export KUBECONFIG=~/.kube/config   # also add to ~/.bashrc
# Update server IP in config if needed:
sed -i 's/127\.0\.0\.1/<control-1-mesh-ip>/' ~/.kube/config

Label workers

kubectl label node <name> node-role.kubernetes.io/worker=worker

GPU Worker Nodes — adder and game

Both Proxmox hosts adder and game have RTX 2070 GPUs available for PCIe passthrough.

Proxmox PCIe passthrough setup (on each Proxmox host)

# Enable IOMMU in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# (use amd_iommu=on for AMD hosts)
update-grub
reboot

# Blacklist nvidia drivers on host so GPU is free for passthrough:
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
update-initramfs -u
reboot
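After the reboot it is worth confirming that the IOMMU is active and seeing which group the GPU landed in, before adding the PCI device to a VM. The helper below is a minimal sketch: `list_iommu_groups` is a hypothetical name, and it assumes `lspci` (pciutils) is installed on the Proxmox host.

```shell
# Print "group N: <lspci description>" for every device in every IOMMU group.
# $1 optionally overrides the sysfs root (default: /sys/kernel/iommu_groups).
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue        # glob did not match: no IOMMU groups
        group=${dev%/devices/*}          # strip "/devices/<pci-address>"
        echo "group ${group##*/}: $(lspci -nns "${dev##*/}")"
    done
}

# Usage on the Proxmox host:
#   list_iommu_groups | grep -i nvidia
```

Empty output suggests the IOMMU is not enabled. The GPU and its audio function normally appear in the same group.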

In Proxmox UI: VM Hardware → Add → PCI Device → select the RTX 2070 → check "All Functions" and "Primary GPU" if it is the only GPU.

Inside the GPU worker VM — install NVIDIA drivers

apt-get install -y linux-headers-$(uname -r)
# Add non-free repo if needed:
apt-get install -y nvidia-driver firmware-misc-nonfree
reboot
# Verify:
nvidia-smi

Install NVIDIA device plugin in K3s

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Label GPU nodes

kubectl label node k3s-adder nvidia.com/gpu=true
kubectl label node k3s-game nvidia.com/gpu=true

Verify GPU is schedulable

kubectl get nodes -o json | jq '.items[].status.capacity'
# Should show nvidia.com/gpu: "1" on adder and game

Scheduling a workload to a GPU node

resources:
  limits:
    nvidia.com/gpu: 1
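For context, here is how that fragment might sit inside a complete manifest. This is a sketch, not a manifest from this repo: the pod name `gpu-smoke-test` and the CUDA sample image are assumptions, and the nodeSelector relies on the nvidia.com/gpu=true label from "Label GPU nodes". Depending on how the NVIDIA container runtime is set up in K3s, a runtimeClassName may also be required.

```shell
# Print a one-shot Pod manifest that requests a single GPU.
# ASSUMPTIONS: pod name and image are illustrative placeholders.
gpu_smoke_pod() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu: "true"
  containers:
    - name: cuda
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Usage on a control node:
#   gpu_smoke_pod | kubectl apply -f -
#   kubectl logs gpu-smoke-test
```

The resource limit is what makes the scheduler reserve a GPU. The nodeSelector only narrows the candidate nodes.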

Namespaces — one per venture

kubectl create namespace sjasoft
kubectl create namespace fulfillment
kubectl create namespace privacy-practice

Secrets are always created per namespace — never share secrets across namespaces.

Secrets

Real secret values are never stored in files. Always create secrets directly on a control node.

# Pattern — adapt per service and namespace
kubectl create secret generic <name> \
  --namespace <namespace> \
  --from-literal=<key>='<value>'

# Generate passwords with:
openssl rand -base64 24
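The two commands above can be combined so that the generated password never touches a file or the shell history. This is a minimal sketch; `create_generated_secret` is a hypothetical helper, and the names in the usage line are placeholders.

```shell
# Create a generic secret whose single key holds a freshly generated password.
# $1 = namespace, $2 = secret name, $3 = key inside the secret.
create_generated_secret() {
    kubectl create secret generic "$2" \
        --namespace "$1" \
        --from-literal="$3=$(openssl rand -base64 24)"
}

# Usage on a control node (placeholder names):
#   create_generated_secret <namespace> <name> <key>
```

The value is still passed as a command-line argument, exactly as in the pattern above, so it is briefly visible in the process list on that node.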

NodePort Registry

NodePorts must be unique across the entire cluster (range 30000-32767). Any NodePort is reachable on any node's WireGuard IP — K3s routes internally. Caddy on each venture ingress VPS proxies to any node's WG IP + NodePort.

Port	Service	Notes
32368	ghost1	blog.the-fulfillment.org
32369	ghost2	blog.privacy-practice.com
32370	ghost3	blog.sjasoft.com
32371	forgejo	git.sjasoft.com
32372	authentik (HTTP)	auth.sjasoft.com — use this behind Caddy
32373	authentik (HTTPS)	skip — Caddy handles TLS
32374	mattermost	deployed
32375	listmonk	deployed
32376	n8n	deployed
32377	vaultwarden	deployed
32379	monerod (RPC)	planned
32380	monerod (P2P)	planned
32381	snikket (HTTP)	planned
32382	snikket (C2S)	planned
32383	snikket (S2S)	planned
32384	snikket (proxy65)	planned
32385	synapse	planned
32386	nats (client)	deployed
32387	nats (websocket)	deployed
32388	nats (monitoring)	deployed
32389	nats (leafnode)	deployed
32390	garage (S3 API)	deployed
32391	garage-webui	deployed
32392	mediawiki	deployed
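To keep this registry in step with the cluster, the allocated ports can be read back from the API and the next free one picked before writing a new manifest. The helpers below are a sketch: the function names are hypothetical, and `jq` is assumed to be installed on the control node.

```shell
# List every allocated NodePort as "PORT NAMESPACE/SERVICE", sorted by port.
list_nodeports() {
    kubectl get svc --all-namespaces -o json |
        jq -r '.items[]
               | .metadata.namespace as $ns
               | .metadata.name as $name
               | .spec.ports[]?
               | select(.nodePort != null)
               | "\(.nodePort) \($ns)/\($name)"' |
        sort -n
}

# Read "PORT ..." lines on stdin and print the lowest port >= $1 that is
# not listed, staying inside the NodePort range (upper bound 32767).
first_free_nodeport() {
    awk -v start="$1" '
        { used[$1] = 1 }
        END {
            for (p = start; p <= 32767; p++)
                if (!(p in used)) { print p; exit }
        }'
}

# Usage on a control node:
#   list_nodeports
#   list_nodeports | first_free_nodeport 32368
```

This only sees ports already allocated in the cluster. Ports reserved in the table for services not yet deployed still have to be checked by hand.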

Caddy Pattern — venture ingress VPS

Each venture has its own ingress VPS with its own public IP. Caddy on each VPS proxies to the mesh IP of a different node in the same cluster, so the ventures look unrelated from outside.

# Example — any node's WG IP works for any NodePort
blog.the-fulfillment.org {
    reverse_proxy 10.0.0.6:32368
}

git.sjasoft.com {
    reverse_proxy 10.0.0.8:32371
}

auth.sjasoft.com {
    reverse_proxy 10.0.0.10:32372
}

Pick any node's WG IP per service — they all work. Use different nodes per venture so ventures look unrelated from outside. See the WireGuard mesh table above for IPs.
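The claim that every node answers for every NodePort can be spot-checked from an ingress VPS or any mesh member. The loop below is a sketch: `probe_nodeport` is a hypothetical helper, the IP list is copied from the node table, and it only makes sense for HTTP services.

```shell
# WireGuard IPs of all cluster nodes, from the node table in this README.
WG_IPS="10.0.0.6 10.0.0.7 10.0.0.8 10.0.0.9 10.0.0.10 10.0.0.11 10.0.0.12 10.0.0.13"

# Print "IP:PORT HTTP_STATUS" for each node. $1 = NodePort to probe.
# curl reports status 000 when the connection fails or times out.
probe_nodeport() {
    for ip in $WG_IPS; do
        code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "http://$ip:$1/") || true
        echo "$ip:$1 $code"
    done
}

# Usage:
#   probe_nodeport 32368
```

Any node that reports 000 while the others return a normal status is a candidate for a WireGuard or kube-proxy problem on that node.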

Current Deployment Status (2026-04-16)

K3s v1.34.6 cluster fully operational. WireGuard full mesh (direct peer-to-peer over vmbr1, hub for external traffic). Headscale removed — too buggy (0.28.x dropped nodes randomly).

Cluster Nodes

Node	Role	WG IP	Proxmox Host	Resources
pve-control	control-plane, etcd	10.0.0.6	pve	2 CPU, 2GB RAM, 20GB
pve-worker	worker	10.0.0.7	pve	8 CPU, 58GB RAM, 3.3TB
adder-control	control-plane, etcd	10.0.0.8	adder	2 CPU, 2GB RAM, 20GB
adder-worker	worker	10.0.0.9	adder	10 CPU, 58GB RAM, 1.7TB
game-control	control-plane, etcd	10.0.0.10	game	2 CPU, 2GB RAM, 20GB
game-worker-hdd	worker	10.0.0.11	game	4 CPU, 6GB RAM, 1.4TB HDD
game-worker-ssd	worker	10.0.0.12	game	10 CPU, 8GB RAM, 200GB SSD
fat_mama	worker	10.0.0.13	workstation (VBox)	20 CPU, 21GB RAM, 200GB

Running Services

Scheduler-assigned node in parens reflects current placement (unpinned services may move on restart). Pinned services have nodeName in their manifest.

Service	Node	NodePort	Domain	Status
postgres:16	pve-worker (pinned)	ClusterIP	—	running
mariadb:11	adder-worker (pinned)	ClusterIP	—	running
ghost1	unpinned (game-worker-ssd)	32368	blog.the-fulfillment.org	running
ghost2	unpinned (pve-worker)	32369	blog.privacy-practice.com	running
ghost3	unpinned (adder-worker)	32370	blog.sjasoft.com	running
forgejo:9	unpinned (pve-worker)	32371	git.sjasoft.com	running
authentik server	unpinned (adder-worker)	32372	auth.sjasoft.com	running
authentik worker	unpinned (adder-worker)	—	—	running
listmonk	unpinned (pve-worker)	32375	—	running
n8n	unpinned (game-worker-ssd)	32376	—	running
vaultwarden	unpinned (game-worker-hdd)	32377	—	running
mattermost:10.11	unpinned (fatmama)	32374	chat.the-fulfillment.org	running
nats (w/ leafnode + JetStream)	unpinned (fatmama)	32386-32389	—	running
redis:7 (8GB, AOF)	fatmama (pinned — kernel tuning)	ClusterIP	—	running
garage:v2.3 (S3 API + admin + web)	unpinned (adder-worker) w/ anti-affinity on game-worker-hdd	32390 (S3), 31540 (web), 3903 ClusterIP (admin)	s3.sjasoft.com, page.sjasoft.com	running
garage-webui	unpinned (pve-worker)	32391	s3-web.sjasoft.com	running
mediawiki:1.43	unpinned (pve-worker)	32392	wiki.the-fulfillment.org	running
nfs-subdir-external-provisioner	kube-system	—	—	running (StorageClass `nas-nfs` on /volume1/samantha-private)

Remaining Services to Deploy

synapse, snikket, monerod, plane (blocked on design decisions)

Next Steps

Add VirtualBox workstation VMs as workers to this cluster
Wire up remaining Ghost blogs in Caddy
Deploy remaining services from k3s/ manifests

Install Method

K3s was installed using /etc/rancher/k3s/config.yaml on each node (not INSTALL_K3S_EXEC env vars, which get lost in nested SSH). Binary was downloaded once to pve and distributed via scp. Use INSTALL_K3S_SKIP_DOWNLOAD=true when binary is pre-staged.
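A config file equivalent to the flags in the K3s Install section might look like the output of the sketch below. The actual files used on these nodes are not shown in this README, so the key set is an assumption derived from the flags above (K3s config keys are the flag names without the leading dashes). `k3s_server_config` is a hypothetical helper, and `<token>` is left as a placeholder.

```shell
# Print a K3s server config for a control plane node.
# $1 = this node's WireGuard IP.
# $2 = mesh IP of an existing server to join; omit it on the first node,
#      which then initialises the cluster instead.
k3s_server_config() {
    echo "node-ip: $1"
    echo "flannel-iface: wg0"
    echo "disable:"
    echo "  - traefik"
    if [ -z "${2:-}" ]; then
        echo "cluster-init: true"
    else
        echo "server: https://$2:6443"
        echo "token: <token>"   # placeholder, fill in on the node
    fi
}

# Usage on the node, with the k3s binary already copied into place:
#   k3s_server_config 10.0.0.6 | sudo tee /etc/rancher/k3s/config.yaml
#   curl -sfL https://get.k3s.io | INSTALL_K3S_SKIP_DOWNLOAD=true sh -s - server
```

Because the settings live in a file on the node, nothing depends on environment variables surviving a chain of SSH hops.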
