homelab/K3s-SESSION-STATE.md
Samantha Atkins 759ef949bc K3s cluster on Proxmox with WireGuard mesh networking
Replaced Headscale (too buggy in 0.28.x — random node drops) with direct
WireGuard hub-and-spoke + full mesh. 7 Proxmox VMs across 3 hosts form a
K3s v1.34.6 cluster: 3 control-plane/etcd nodes, 4 workers.

Running services: postgres, mariadb, ghost (x3), forgejo, authentik.
All unpinned services use local-path StorageClass. Databases pinned to
pve-worker and adder-worker with local PVs.

Includes VM provisioning scripts (create-debian-template.sh, clone-vm.sh),
K3s manifests for all services, and full deployment docs in k3s/README.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:23:13 -04:00


K3s Session State

Saved: 2026-04-06 (end of session 3)

Current State

New Proxmox-based K3s cluster in progress. All 7 Proxmox VMs are created and on the WireGuard mesh. K3s not yet installed. The VirtualBox cluster is being retired: its services (ghost, forgejo, postgres, mariadb) keep running there until migration is complete.

Proxmox VMs

| Node | vmbr1 IP | WG IP | Proxmox Host | Role |
|---|---|---|---|---|
| pve-control | 10.10.10.151 | 10.0.0.6 | pve | k3s control plane |
| pve-worker | 10.10.10.126 | 10.0.0.7 | pve | k3s worker |
| adder-control | 10.10.10.185 | 10.0.0.8 | adder | k3s control plane |
| adder-worker | 10.10.10.83 | 10.0.0.9 | adder | k3s worker |
| game-control | 10.10.10.158 | 10.0.0.10 | game | k3s control plane |
| game-worker-hdd | 10.10.10.186 | 10.0.0.11 | game | k3s worker (local-lvm/HDD) |
| game-worker-ssd | 10.10.10.153 | 10.0.0.12 | game | k3s worker (game-ssd/NVMe) |

WG IPs 10.0.0.2-10.0.0.5 are reserved (old VirtualBox nodes, do not reuse).

Hub: DO droplet at 138.197.87.251:51820, WG IP 10.0.0.1.

VM Specs

| Node | vCPUs | RAM | Disk | Storage |
|---|---|---|---|---|
| pve-control | 2 | 2GB | 20G | local-lvm |
| pve-worker | 6 | 8GB | 100G | local-lvm |
| adder-control | 2 | 2GB | 20G | local-lvm |
| adder-worker | 6 | 8GB | 100G | local-lvm |
| game-control | 2 | 2GB | 20G | local-lvm |
| game-worker-hdd | 6 | 8GB | 200G | local-lvm (HDD) |
| game-worker-ssd | 10 | 8GB | 200G | game-ssd (NVMe) |

Network Architecture

  • All VMs on vmbr1 (10.10.10.0/24), DHCP
  • WireGuard mesh via DO hub — all nodes have static WG IPs (10.0.0.0/24)
  • Full mesh: all nodes have each other as explicit WireGuard peers (not just hub-and-spoke)
  • K3s will use --flannel-iface=wg0 so all cluster traffic runs over WireGuard
  • Caddy at DO hub proxies external traffic to any node's WG IP + NodePort
  • Tailscale/Headscale abandoned — too unreliable for cluster networking
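
A rough sketch of what "full mesh plus hub" means for one node's wg0.conf: the hub is a peer, and so is every other VM. The node names and WG IPs come from the Proxmox VMs table; the `<...-pubkey>` values are placeholders, `PersistentKeepalive = 25` is an assumption, and `mesh_peers` is a hypothetical helper name. Endpoint lines for the VM peers are left out because this note does not record which address the VMs use to reach each other; they would have to be added for direct peering to come up.

```shell
#!/bin/sh
# Node name -> WireGuard IP, copied from the Proxmox VMs table.
NODES="pve-control:10.0.0.6
pve-worker:10.0.0.7
adder-control:10.0.0.8
adder-worker:10.0.0.9
game-control:10.0.0.10
game-worker-hdd:10.0.0.11
game-worker-ssd:10.0.0.12"

# Print the [Peer] stanzas for one node's wg0.conf:
# the hub first, then every VM except the node itself.
# Public keys are placeholders; VM peer Endpoints are intentionally omitted.
mesh_peers() (
  self="$1"
  printf '[Peer]\n# hub (DO droplet)\nPublicKey = <hub-pubkey>\n'
  printf 'Endpoint = 138.197.87.251:51820\nAllowedIPs = 10.0.0.1/32\n'
  printf 'PersistentKeepalive = 25\n'   # assumption, not from the source
  echo "$NODES" | while IFS=: read -r name ip; do
    [ "$name" = "$self" ] && continue
    printf '\n[Peer]\n# %s\nPublicKey = <%s-pubkey>\nAllowedIPs = %s/32\n' \
      "$name" "$name" "$ip"
  done
)
```

For example, `mesh_peers pve-control` prints seven stanzas: the hub and the six other VMs. Each peer gets a /32 in AllowedIPs so traffic to a node's WG IP goes straight to that peer rather than through the hub.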

Proxmox Host Specs

  • pve: workstation i9-13900KF, 96GB RAM
  • adder: Proxmox node with RTX 2070, 4TB NVMe available
  • game: Proxmox node with RTX 2070, 16GB RAM, 256GB NVMe (game-ssd) + 2TB HDD (local-lvm)

VM Provisioning

Template & Clone Scripts

Scripts at ~/private/Knowledge/repos/homelab/proxmox/scripts/:

  • create-debian-template.sh <VMID> <NAME> [STORAGE] [BRIDGE]
    • Defaults: STORAGE=local-lvm, BRIDGE=vmbr1
    • Bakes in: qemu-guest-agent, curl, wget, nano, rsync, htop, tmux, emacs-nox, nfs-common, tailscale
    • Zeroes /etc/machine-id, removes /etc/ssh/ssh_host_* (Cloud-Init regenerates on first boot)
    • Does NOT create .ssh or set keys — done post-boot via qm set
  • clone-vm.sh <TEMPLATE_VMID> <NEW_VMID> <NAME> [CORES] [MEMORY_MB] [DISK_SIZE] [STORAGE]
    • Defaults: 2 cores, 2048MB RAM, 20G disk, local-lvm storage
    • Full clone, auto-starts the VM

Post-Clone Formula (confirmed working)

  1. Clone: ./clone-vm.sh <template> <vmid> <name> [cores] [mem] [disk] [storage]
  2. Get IP: qm guest cmd <vmid> network-get-interfaces
  3. Set SSH key: qm set <vmid> --sshkeys <pubkey-file>
  4. Reboot VM: qm reboot <vmid>
  5. SSH in: ssh samantha@<ip>
  6. Configure WireGuard on the VM
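
The host-side part of this formula (steps 1 to 4) could be wrapped roughly as below. This is a sketch, not the repo's script: `run` and `post_clone` are hypothetical helper names, and the default is a dry run that only prints the commands. Steps 5 and 6 stay manual.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it on the Proxmox host.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# post_clone <template_vmid> <new_vmid> <name> <pubkey-file>
# Uses clone-vm.sh defaults for cores, memory, disk and storage.
post_clone() (
  template="$1"; vmid="$2"; name="$3"; pubkey="$4"
  run ./clone-vm.sh "$template" "$vmid" "$name"        # step 1: full clone, auto-starts
  run qm guest cmd "$vmid" network-get-interfaces      # step 2: needs the guest agent to be up
  run qm set "$vmid" --sshkeys "$pubkey"               # step 3: set the SSH key
  run qm reboot "$vmid"                                # step 4: reboot the VM
)
```

`post_clone 199 101 pve-control key.pub` prints the four commands in order; the VMID 101 and the file name key.pub are made up for illustration. When run for real, step 2 may need a short wait after the clone starts, since the guest agent is not available until the VM has booted.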

VMID Convention

  • pve: 100-199 (templates at 199)
  • adder: 200-299 (templates at 299 — currently 200 exists, destroy after use)
  • game: 300-399 (templates at 399 — currently 300 exists, destroy after use)
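
The convention is easy to check mechanically before creating a VM. A minimal sketch, with `vmid_host` and `template_vmid` as hypothetical helper names:

```shell
#!/bin/sh
# Print the Proxmox host that owns a VMID under the convention above.
# Fails for non-numeric input or a VMID outside 100-399.
vmid_host() {
  case "$1" in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;
  esac
  if   [ "$1" -ge 100 ] && [ "$1" -le 199 ]; then echo pve
  elif [ "$1" -ge 200 ] && [ "$1" -le 299 ]; then echo adder
  elif [ "$1" -ge 300 ] && [ "$1" -le 399 ]; then echo game
  else echo "unassigned"; return 1
  fi
}

# Print the VMID reserved for templates on a host (top of its range).
template_vmid() {
  case "$1" in
    pve)   echo 199 ;;
    adder) echo 299 ;;
    game)  echo 399 ;;
    *)     return 1 ;;
  esac
}
```

For example, `vmid_host 150` prints `pve` and `template_vmid game` prints `399`.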

Useful Proxmox CLI

  • qm guest cmd <VMID> network-get-interfaces — get VM IP
  • qm set <VMID> --vga std --delete serial0 — fix serial console
  • qm destroy <VMID> --purge — remove VM
  • qm list — list all VMs
  • vgs — check local-lvm free space
  • pvesh get /nodes/<nodename>/status — CPU/memory usage

Immediate Next Steps

  1. Install K3s on pve-control first (--cluster-init)
  2. Join adder-control and game-control as control plane peers
  3. Join all 4 workers
  4. Label workers and GPU nodes
  5. Create namespaces: sjasoft, fulfillment, privacy-practice
  6. Migrate services from old VirtualBox cluster

K3s Install — see k3s/README.md for full commands

  • Control plane uses --cluster-init on first node, --server on subsequent nodes
  • All nodes use --flannel-iface=wg0 and --node-ip=
  • Traefik disabled on all nodes
  • 3 control plane nodes for HA etcd (tolerates 1 failure)
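
The flag combinations above can be sketched per role as follows. k3s/README.md remains the authoritative source for the real commands; here `k3s_args` is a hypothetical helper, and passing the node's WG IP as `--node-ip` is an assumption, since the value is not recorded in this note. `--disable=traefik` is only passed to servers because it is a server option.

```shell
#!/bin/sh
# WG IP of the first control-plane node (pve-control in the VM table).
FIRST_SERVER_IP="10.0.0.6"

# Print the k3s arguments for a node.
#   $1 = role: init | server | agent
#   $2 = this node's WireGuard IP (assumed to be the --node-ip value)
k3s_args() {
  _common="--flannel-iface=wg0 --node-ip=$2"
  case "$1" in
    init)   echo "server --cluster-init --disable=traefik $_common" ;;
    server) echo "server --server https://$FIRST_SERVER_IP:6443 --disable=traefik $_common" ;;
    agent)  echo "agent --server https://$FIRST_SERVER_IP:6443 $_common" ;;
    *)      echo "unknown role: $1" >&2; return 1 ;;
  esac
}
```

With the standard install script, a joining control-plane node would then look something like `curl -sfL https://get.k3s.io | K3S_TOKEN=<token> sh -s - $(k3s_args server 10.0.0.8)`, where the token is read from /var/lib/rancher/k3s/server/node-token on the first server.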

Running Services (old VirtualBox cluster — not yet migrated)

  • postgres:16 — ClusterIP:5432
  • mariadb:11 — ClusterIP:3306
  • ghost1/2/3 — NodePorts 32368/32369/32370
  • forgejo:9 — NodePort 32371, git.sjasoft.com

NodePort Registry

| Port | Service | Namespace |
|---|---|---|
| 32368 | ghost1 | fulfillment |
| 32369 | ghost2 | fulfillment |
| 32370 | ghost3 | fulfillment |
| 32371 | forgejo | sjasoft |
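
When porting the remaining services, the registry can be consulted before picking a port. A small sketch that hard-codes the four entries above and stays inside the default Kubernetes NodePort range (30000-32767); `port_owner` and `next_free_port` are hypothetical helper names.

```shell
#!/bin/sh
# Port, service, namespace: copied from the NodePort registry.
REGISTRY="32368 ghost1 fulfillment
32369 ghost2 fulfillment
32370 ghost3 fulfillment
32371 forgejo sjasoft"

# Print the service that owns a port; exit non-zero if the port is unregistered.
port_owner() {
  echo "$REGISTRY" | awk -v p="$1" '$1 == p { print $2; found = 1 } END { exit !found }'
}

# Print the lowest unregistered port at or above $1 within 30000-32767.
next_free_port() {
  _p="$1"
  [ "$_p" -lt 30000 ] && _p=30000
  while [ "$_p" -le 32767 ]; do
    port_owner "$_p" >/dev/null || { echo "$_p"; return 0; }
    _p=$((_p + 1))
  done
  return 1
}
```

`port_owner 32371` prints `forgejo`, and `next_free_port 32368` prints `32372`, the first port after the registered block.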

Manifests

All in Knowledge/repos/homelab/k3s/:

  • k3s/postgres/postgres.yaml
  • k3s/mariadb/mariadb.yaml
  • k3s/ghost/ghost.yaml
  • k3s/forgejo/forgejo.yaml
  • k3s/README.md (authoritative WG mesh table + K3s install commands)

Remaining Services to Port (from Proxmox Docker stack)

  • authentik.yml — SSO (postgres)
  • n8n.yml — automation (postgres)
  • vaultwarden.yml — passwords
  • nats.yml — messaging
  • monerod.yml — monero node
  • snikket.yml — XMPP
  • synapse.yml — Matrix

Known Issues / Notes

  • Tailscale/Headscale abandoned — unreliable, randomly drops nodes, requires manual reconnect
  • WireGuard full mesh is the correct approach for K3s cluster networking
  • kubectl requires KUBECONFIG=~/.kube/config in ~/.bashrc on control nodes
  • Secrets are namespace-scoped: a pod can only reference Secrets in its own namespace, so keep each secret in the same namespace as its consumer
  • game node only has 16GB RAM — allocate worker VMs conservatively
  • game-ssd is only 256GB NVMe — keep disk allocations conservative on game-worker-ssd
  • Templates should be destroyed after all clones are complete on each node
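
The two capacity notes for game can be checked with simple arithmetic on the nominal figures from the VM Specs table and the host spec list. This sketch compares allocated against available and ignores anything Proxmox may do with memory ballooning or thin provisioning; `sum` and `report` are hypothetical helper names.

```shell
#!/bin/sh
# Add up a list of integers.
sum() { _t=0; for _n in "$@"; do _t=$((_t + _n)); done; echo "$_t"; }

# Print "label: allocated/capacity status".
report() {
  if [ "$2" -gt "$3" ]; then _s="OVER"; else _s="ok"; fi
  echo "$1: $2/$3 $_s"
}

# RAM in GB: game-control, game-worker-hdd, game-worker-ssd vs. the 16GB host.
report "game RAM (GB)" "$(sum 2 8 8)" 16
# Disk in G on game-ssd: game-worker-ssd vs. the 256GB NVMe.
report "game-ssd disk (G)" "$(sum 200)" 256
```

On the figures in this note, the RAM line reports 18/16 OVER and the disk line reports 200/256 ok, so the configured VM memory on game already exceeds the host's physical RAM.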