cleanup

2026-04-17 20:33:17 -04:00 · b7c9dc81a0
commit b7c9dc81a0
parent 7ac26e302a
15 changed files with 247 additions and 44 deletions
--- a/.claude/projects/-home-samantha-private-Knowledge-repos-homelab/memory/MEMORY.md
+++ b/.claude/projects/-home-samantha-private-Knowledge-repos-homelab/memory/MEMORY.md
@@ -0,0 +1,3 @@
 - [SSH deployment approach](feedback_ssh_scripts.md) — stage on pve, distribute from there, don't nest SSH
 - [Newsletter stack status](project_newsletter_stack.md) — Ghost+Listmonk+n8n+Postmark partially wired, Substack under consideration
 - [Don't confirm memory edits](feedback_no_confirm_memory.md) — just edit memory files directly, no prompting
--- a/.claude/projects/-home-samantha-private-Knowledge-repos-homelab/memory/feedback_no_confirm_memory.md
+++ b/.claude/projects/-home-samantha-private-Knowledge-repos-homelab/memory/feedback_no_confirm_memory.md
@@ -0,0 +1,11 @@
 ---
 name: Don't ask permission to edit memory files
 description: Just edit memory files directly — they're under Claude's control, no need to confirm with user
 type: feedback
 ---
 Never prompt the user about memory file changes — no confirmation questions, no tool approval popups, nothing in the user's face. If a tool permission prompt would fire, that counts as asking.
 **Why:** Memory files are Claude's workspace. Any interruption about them — whether a question or a tool approval dialog — is unwanted friction.
 **How to apply:** Edit memory files silently. Don't deflect blame onto tool permissions — if it's in the user's face, it's Claude's fault.
--- a/.claude/projects/-home-samantha-private-Knowledge-repos-homelab/memory/feedback_ssh_scripts.md
+++ b/.claude/projects/-home-samantha-private-Knowledge-repos-homelab/memory/feedback_ssh_scripts.md
@@ -0,0 +1,14 @@
 ---
 name: Use scripts on jump host instead of long SSH chains
 description: Download binaries once and distribute, write scripts to pve and run from there instead of nested SSH commands
 type: feedback
 ---
 Don't run long nested SSH commands or download the same binary on every node separately. Instead:
 - Download binaries once (on pve or locally) and scp to target nodes
 - Write deployment scripts to pve and run them from there
 - Avoid the weird background task temp paths — keep things simple and visible
 **Why:** Long nested SSH chains are fragile (env vars get lost, quoting breaks), slow, and hard to debug. Downloading the same 60MB binary 7 times is wasteful when you can download once and distribute.
 **How to apply:** When deploying to multiple nodes, stage files and scripts on the jump host (pve), then distribute from there. Prefer simple, visible approaches over clever one-liners.
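The stage-then-distribute approach can be sketched as a small helper run on the jump host. This is only an illustration: the `run` and `distribute` helpers, the `NODES` list, and the `DRY_RUN` switch are assumptions, not part of the repo.

```bash
#!/bin/bash
# Sketch: copy one staged file from the jump host to every node.
# NODES and DRY_RUN are illustrative assumptions.
NODES="${NODES:-pve-control pve-worker}"   # hypothetical subset of host aliases
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print commands instead of running them

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

distribute() {
    # distribute <local-file> <remote-dir>: one scp per node, no nested SSH.
    for node in $NODES; do
        run scp "$1" "$node:$2/"
    done
}
```

With the defaults, `distribute /tmp/k3s /tmp` only prints the `scp` commands; setting `DRY_RUN=0` performs the copies.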
--- a/.claude/projects/-home-samantha-private-Knowledge-repos-homelab/memory/project_newsletter_stack.md
+++ b/.claude/projects/-home-samantha-private-Knowledge-repos-homelab/memory/project_newsletter_stack.md
@@ -0,0 +1,17 @@
 ---
 name: Newsletter stack status and frustrations
 description: Ghost+Listmonk+n8n+Postmark newsletter pipeline — partially wired, user considering Substack as alternative
 type: project
 ---
 Ghost CMS hard-codes Mailgun for newsletter sending — `bulk_email__provider: smtp` only handles transactional one-off emails (password resets, signup confirmations), NOT newsletters. Mailgun and SendGrid both rejected signup. Postmark works but account is in sandbox (under review, can only send to verified addresses).
 Current stack: Ghost (blog) → n8n (webhook automation) → Listmonk (newsletter sending) → Postmark (SMTP). Ghost fires webhooks on member.added and post.published to n8n. n8n workflow for member.added is partially built — webhook trigger works, HTTP Request node to Listmonk API not yet configured.
 Listmonk API user creation is confusing — no password field shown for API-type users. Admin credentials work for API access.
 User is frustrated with the complexity and seriously considering Substack for newsletters. The self-hosted stack requires chaining 4 services to do what Substack does natively.
 **Why:** Ghost's Mailgun lock-in is a design flaw that forces this complexity. User wants to own the stack but the overhead is high.
 **How to apply:** Don't push self-hosted over Substack — respect the tradeoff. If user continues with self-hosted, minimize friction. The n8n→Listmonk integration needs finishing: HTTP Request node with Basic Auth to listmonk:9000/api/subscribers.
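As a rough equivalent of the unfinished n8n HTTP Request node, the call can be exercised with curl first. This sketch assumes the `email`/`name`/`status`/`lists` body of Listmonk's subscriber endpoint; the credentials, the list ID, and both function names are placeholders, not values from this setup.

```bash
#!/bin/bash
# Sketch: create a Listmonk subscriber with Basic Auth.
LISTMONK_URL="${LISTMONK_URL:-http://listmonk:9000}"
LISTMONK_USER="${LISTMONK_USER:-admin}"      # assumption: admin credentials
LISTMONK_PASS="${LISTMONK_PASS:-changeme}"   # placeholder
LIST_ID="${LIST_ID:-1}"                      # hypothetical list ID

subscriber_payload() {
    # subscriber_payload <email> <name>: JSON body for POST /api/subscribers.
    # Values are inserted verbatim, so they must not contain quotes or backslashes.
    printf '{"email":"%s","name":"%s","status":"enabled","lists":[%s]}\n' \
        "$1" "$2" "$LIST_ID"
}

add_subscriber() {
    # add_subscriber <email> <name>: send the request (curl -d implies POST).
    curl -sS -u "$LISTMONK_USER:$LISTMONK_PASS" \
        -H 'Content-Type: application/json' \
        -d "$(subscriber_payload "$1" "$2")" \
        "$LISTMONK_URL/api/subscribers"
}
```

Once `add_subscriber` works from a shell inside the cluster, the same URL, auth, and body can be copied into the n8n node.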
--- a/.gitignore
+++ b/.gitignore
@@ -7,6 +7,7 @@
 # IDE
 .idea/
 .vscode/
+.remember/
 *.swp
 *.swo
 *~
@@ -15,5 +16,4 @@
 .DS_Store
 Thumbs.db
-# Claude Code
+
-.claude/
--- a/K3s-SESSION-STATE.md
+++ b/K3s-SESSION-STATE.md
@@ -1,11 +1,11 @@
 # K3s Session State
-# Saved: 2026-04-06 (end of session 3)
+# Saved: 2026-04-14
 ## Current State
-New Proxmox-based K3s cluster in progress. VirtualBox cluster retired.
+K3s v1.34.6 cluster fully operational on Proxmox VMs + KVM worker over WireGuard mesh.
-All 7 Proxmox VMs created and on WireGuard mesh. K3s not yet installed.
+fat_mama migrated from VirtualBox to KVM/libvirt on workstation 2026-04-14.
-Old VirtualBox services (ghost, forgejo, postgres, mariadb) still running on old cluster until migration complete.
+All Proxmox K3s VMs have onboot: 1 set (fixed 2026-04-12).
 ## Proxmox VMs
@@ -18,6 +18,7 @@ Old VirtualBox services (ghost, forgejo, postgres, mariadb) still running on old
 | game-control | 10.10.10.158 | 10.0.0.10 | game | k3s control plane |
 | game-worker-hdd | 10.10.10.186 | 10.0.0.11 | game | k3s worker (local-lvm/HDD) |
 | game-worker-ssd | 10.10.10.153 | 10.0.0.12 | game | k3s worker (game-ssd/NVMe) |
+| fat_mama | 192.168.40.220 | 10.0.0.13 | workstation (KVM/libvirt, macvtap enp4s0) | k3s worker |
 WG IPs 10.0.0.2–10.0.0.5 reserved (old VirtualBox nodes, do not reuse).
 Hub: DO droplet at 138.197.87.251:51820, WG IP 10.0.0.1
@@ -33,37 +34,40 @@ Hub: DO droplet at 138.197.87.251:51820, WG IP 10.0.0.1
 | game-control | 2 | 2GB | 20G | local-lvm |
 | game-worker-hdd | 6 | 8GB | 200G | local-lvm (HDD) |
 | game-worker-ssd | 10 | 8GB | 200G | game-ssd (NVMe) |
+| fat_mama | 12 | 20GB | 200G | /var/lib/libvirt/images (qcow2) |
 ## Network Architecture
-- All VMs on vmbr1 (10.10.10.0/24), DHCP
+- Proxmox VMs on vmbr1 (10.10.10.0/24), DHCP
+- fat_mama on LAN (192.168.40.0/24) via macvtap on enp4s0 — workstation host cannot directly ping/SSH to it; reachable from rest of LAN and via WireGuard at 10.0.0.13
 - WireGuard mesh via DO hub — all nodes have static WG IPs (10.0.0.0/24)
 - Full mesh: all nodes have each other as explicit WireGuard peers (not just hub-and-spoke)
-- K3s will use --flannel-iface=wg0 so all cluster traffic runs over WireGuard
+- K3s uses --flannel-iface=wg0 so all cluster traffic runs over WireGuard
 - Caddy at DO hub proxies external traffic to any node's WG IP + NodePort
 - Tailscale/Headscale abandoned — too unreliable for cluster networking
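As an illustration of the hub proxy entry, the helper below prints a Caddyfile site block for one service. The `caddy_site` name is hypothetical, and the choice of WG IP is an assumption: any node's WG IP should do, since a NodePort is exposed on every node.

```bash
#!/bin/bash
# Sketch: emit a Caddyfile site block that forwards a domain to <WG IP>:<NodePort>.
caddy_site() {
    # caddy_site <domain> <wg-ip> <nodeport>
    printf '%s {\n    reverse_proxy %s:%s\n}\n' "$1" "$2" "$3"
}
```

For example, `caddy_site git.sjasoft.com 10.0.0.11 32371 >> /etc/caddy/Caddyfile` on the hub would add the forgejo entry (the Caddyfile path is the packaged default and may differ).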
 ## Proxmox Host Specs
-- pve: workstation i9-13900KF, 96GB RAM
+- pve: Meerkat NUC, 64GB RAM, 4TB NVMe
-- adder: Proxmox node with RTX 2070, 4TB NVMe available
+- adder: Adder WS laptop, 32GB RAM, 2TB NVMe, RTX 2070
-- game: Proxmox node with RTX 2070, 16GB RAM, 256GB NVMe (game-ssd) + 2TB HDD (local-lvm)
+- game: old gaming PC, 16GB RAM, 256GB NVMe (game-ssd) + 2TB HDD (local-lvm)
+- workstation: i9-13900KF, 96GB RAM, RTX 4090, Fedora (runs fat_mama via KVM/libvirt)
 ## VM Provisioning
 ### Template & Clone Scripts
 Scripts at `~/private/Knowledge/repos/homelab/proxmox/scripts/`:
-- `create-debian-template.sh <VMID> <NAME> [STORAGE] [BRIDGE]`
+- `create-debian-template.sh <VMID> <n> [STORAGE] [BRIDGE]`
  - Defaults: STORAGE=local-lvm, BRIDGE=vmbr1
  - Bakes in: qemu-guest-agent, curl, wget, nano, rsync, htop, tmux, emacs-nox, nfs-common, tailscale
  - Zeroes /etc/machine-id, removes /etc/ssh/ssh_host_* (Cloud-Init regenerates on first boot)
  - Does NOT create .ssh or set keys — done post-boot via qm set
-- `clone-vm.sh <TEMPLATE_VMID> <NEW_VMID> <NAME> [CORES] [MEMORY_MB] [DISK_SIZE] [STORAGE]`
+- `clone-vm.sh <TEMPLATE_VMID> <NEW_VMID> <n> [CORES] [MEMORY_MB] [DISK_SIZE] [STORAGE]`
  - Defaults: 2 cores, 2048MB RAM, 20G disk, local-lvm storage
  - Full clone, auto-starts the VM
 ### Post-Clone Formula (confirmed working)
-1. Clone: `./clone-vm.sh <template> <vmid> <name> [cores] [mem] [disk] [storage]`
+1. Clone: `./clone-vm.sh <template> <vmid> <n> [cores] [mem] [disk] [storage]`
 2. Get IP: `qm guest cmd <vmid> network-get-interfaces`
 3. Set SSH key: `qm set <vmid> --sshkeys <pubkey-file>`
 4. Reboot VM: `qm reboot <vmid>`
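The steps listed so far can be sketched as a helper that prints the command sequence for review before anything runs on the Proxmox host. `post_clone_plan` is a hypothetical name; the commands themselves are the ones listed above.

```bash
#!/bin/bash
# Sketch: print the post-clone commands for one VM, in order.
post_clone_plan() {
    # post_clone_plan <template> <vmid> <name> <pubkey-file>
    printf './clone-vm.sh %s %s %s\n' "$1" "$2" "$3"
    # The guest agent needs a moment after first boot before this call answers.
    printf 'qm guest cmd %s network-get-interfaces\n' "$2"
    printf 'qm set %s --sshkeys %s\n' "$2" "$4"
    printf 'qm reboot %s\n' "$2"
}
```

Printing rather than executing keeps the sequence visible; the output can be pasted into a shell on the node one line at a time.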
@@ -83,14 +87,6 @@ Scripts at `~/private/Knowledge/repos/homelab/proxmox/scripts/`:
 - `vgs` — check local-lvm free space
 - `pvesh get /nodes/<nodename>/status` — CPU/memory usage
-## Immediate Next Steps
-1. Install K3s on pve-control first (--cluster-init)
-2. Join adder-control and game-control as control plane peers
-3. Join all 4 workers
-4. Label workers and GPU nodes
-5. Create namespaces: sjasoft, fulfillment, privacy-practice
-6. Migrate services from old VirtualBox cluster
 ## K3s Install — see k3s/README.md for full commands
 - Control plane uses --cluster-init on first node, --server on subsequent nodes
@@ -98,21 +94,28 @@ Scripts at `~/private/Knowledge/repos/homelab/proxmox/scripts/`:
 - Traefik disabled on all nodes
 - 3 control plane nodes for HA etcd (tolerates 1 failure)
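The control-plane flag choices can be sketched as a helper that builds the `k3s server` arguments. This is not a copy of k3s/README.md: `k3s_server_args` is a hypothetical name, and the join address in the usage note is a placeholder for the first server's WG IP.

```bash
#!/bin/bash
# Sketch: build k3s server arguments for the first or a joining control-plane node.
k3s_server_args() {
    # k3s_server_args [first-server-wg-ip]
    # No argument: bootstrap embedded etcd. With an argument: join that server.
    join_ip="${1:-}"
    args="server --flannel-iface=wg0 --disable=traefik"
    if [ -z "$join_ip" ]; then
        args="$args --cluster-init"
    else
        args="$args --server=https://$join_ip:6443"
    fi
    echo "$args"
}
```

A join might then look like `curl -sfL https://get.k3s.io | K3S_TOKEN=<token> sh -s - $(k3s_server_args 10.0.0.6)`, where `10.0.0.6` stands in for the first server's WG IP.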
-## Running Services (old VirtualBox cluster — not yet migrated)
-- postgres:16 — ClusterIP:5432
-- mariadb:11 — ClusterIP:3306
-- ghost1/2/3 — NodePorts 32368/32369/32370
-- forgejo:9 — NodePort 32371, git.sjasoft.com
+## Running Services
+| Service | NodePort | Domain | Namespace |
+|---|---|---|---|
+| ghost1 | 32368 | — | fulfillment |
+| ghost2 | 32369 | — | fulfillment |
+| ghost3 | 32370 | — | fulfillment |
+| forgejo | 32371 | git.sjasoft.com | sjasoft |
+| postgres | ClusterIP:5432 | — | default |
+| mariadb | ClusterIP:3306 | — | default |
+| authentik-server | — | — | default |
+| authentik-worker | — | — | default |
+| n8n | — | — | default |
+| listmonk | — | — | default |
-## NodePort Registry
-
-| Port | Service | Namespace |
-|---|---|---|
-| 32368 | ghost1 | fulfillment |
-| 32369 | ghost2 | fulfillment |
-| 32370 | ghost3 | fulfillment |
-| 32371 | forgejo | sjasoft |
+## Remaining Services to Deploy
+- vaultwarden.yml — passwords (ACTIVE)
+- mattermost.yml — chat (ACTIVE)
+- nats.yml — messaging
+- monerod.yml — monero node
+- snikket.yml — XMPP
+- synapse.yml — Matrix
 ## Manifests
@@ -123,15 +126,6 @@ All in Knowledge/repos/homelab/k3s/:
 - k3s/forgejo/forgejo.yaml
 - k3s/README.md (authoritative WG mesh table + K3s install commands)
-## Remaining Services to Port (from Proxmox Docker stack)
-- authentik.yml — SSO (postgres)
-- n8n.yml — automation (postgres)
-- vaultwarden.yml — passwords
-- nats.yml — messaging
-- monerod.yml — monero node
-- snikket.yml — XMPP
-- synapse.yml — Matrix
 ## Known Issues / Notes
 - Tailscale/Headscale abandoned — unreliable, randomly drops nodes, requires manual reconnect
 - WireGuard full mesh is the correct approach for K3s cluster networking
@@ -140,3 +134,5 @@ All in Knowledge/repos/homelab/k3s/:
 - game node only has 16GB RAM — allocate worker VMs conservatively
 - game-ssd is only 256GB NVMe — keep disk allocations conservative on game-worker-ssd
 - Templates should be destroyed after all clones are complete on each node
+- fat_mama macvtap: workstation host cannot directly ping/SSH to fat_mama; reachable from rest of LAN and via WireGuard at 10.0.0.13; SSH from pve-control or other LAN machines works fine
+- fat_mama disk image at /var/lib/libvirt/images/fat_mama.qcow2 on workstation
--- a/docs/.#tasks.org
+++ b/docs/.#tasks.org
@@ -0,0 +1 @@
 samantha@fedora.2598412:1776023928
--- a/docs/tasks.org
+++ b/docs/tasks.org
@@ -0,0 +1,34 @@
 * Security / Privacy
 ** DONE check wg hub cannot ssh into kube
 CLOSED: [2026-04-14 Tue 16:37]
 :LOGBOOK:
 - State "DONE"       from "ACTIVE"     [2026-04-14 Tue 16:37]
  Tested.  No peers can be ssh-ed to.
 :END:
 ** TODO stop login from proxmox kube nodes to LAN machines
 ** TODO set up wg bastion and LAN wg peer bastion for on the road access
 ** HOLD mullvad second account
 SCHEDULED: <2026-04-14 Tue>
 ** HOLD mullvad on proxmox nodes via CLI
 :LOGBOOK:
 - State "DONE"       from "BACKLOG"    [2026-04-14 Tue 17:00]
 :END:
 ** TODO privacy wg hub complete wg and caddy setup
 SCHEDULED: <2026-04-15 Wed>
 ** Backups
 *** TODO Automated Proxmox backups
 *** TODO special stuff for kube state??
 *** TODO specific database backups (dumpdb and friends? replicas?)
 * Monitoring
 ** ACTIVE Nats on workstation
 ** TODO n8n on workstation?
 * Kube Expansion
 ** TODO add mac VM
 * Non-Kube WG services
 ** TODO workstation as WG peer??
 * Services
 ** ACTIVE Mattermost
 ** ACTIVE VaultWarden
 * Integration
 ** TODO explore SSO
--- a/k3s/README.md
+++ b/k3s/README.md
@@ -18,6 +18,7 @@ Hub: DO droplet at 138.197.87.251:51820, WG IP 10.0.0.1/24
 | game-control | 10.10.10.158 | 10.0.0.10 | game |
 | game-worker-hdd | 10.10.10.186 | 10.0.0.11 | game |
 | game-worker-ssd | 10.10.10.153 | 10.0.0.12 | game |
+| fat_mama | 192.168.40.220 | 10.0.0.13 | workstation (KVM/libvirt, macvtap enp4s0) |
 IPs 10.0.0.2–10.0.0.5 are reserved (old VirtualBox K3s nodes, leave alone).
@@ -217,6 +218,7 @@ hub for external traffic). Headscale removed — too buggy (0.28.x dropped nodes
 | game-control | control-plane, etcd | 10.0.0.10 | game | 2 CPU, 2GB RAM, 20GB |
 | game-worker-hdd | worker | 10.0.0.11 | game | 4 CPU, 6GB RAM, 1.4TB HDD |
 | game-worker-ssd | worker | 10.0.0.12 | game | 10 CPU, 8GB RAM, 200GB SSD |
+| fat_mama | worker | 10.0.0.13 | workstation (KVM/libvirt) | 20 CPU, 21GB RAM, 200GB |
 ### Running Services
--- a/k3s/resilience/deploy-resilience.sh
+++ b/k3s/resilience/deploy-resilience.sh
@@ -0,0 +1,55 @@
 #!/bin/bash
 # Deploy k3s resilience configs to all cluster nodes.
 # Run from workstation where SSH aliases work.
 set -e
 SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
 CONTROL_NODES="pve-control adder-control game-control"
 WORKER_NODES="pve-worker adder-worker game-worker-hdd game-worker-ssd fat_mama"
 ALL_NODES="$CONTROL_NODES $WORKER_NODES"
 echo "=== Deploying k3s resilience to all nodes ==="
 for host in $ALL_NODES; do
    echo "--- $host ---"
    # Copy scripts
    scp "$SCRIPT_DIR/wait-for-wg0.sh" "$host:/tmp/"
    scp "$SCRIPT_DIR/k3s-flannel-watchdog.sh" "$host:/tmp/"
    scp "$SCRIPT_DIR/k3s-flannel-watchdog.service" "$host:/tmp/"
    scp "$SCRIPT_DIR/k3s-flannel-watchdog.timer" "$host:/tmp/"
    ssh "$host" bash <<'REMOTE'
        sudo install -m 755 /tmp/wait-for-wg0.sh /usr/local/bin/
        sudo install -m 755 /tmp/k3s-flannel-watchdog.sh /usr/local/bin/
        sudo cp /tmp/k3s-flannel-watchdog.service /etc/systemd/system/
        sudo cp /tmp/k3s-flannel-watchdog.timer /etc/systemd/system/
        # Determine which k3s unit is installed on this node (works even if it is stopped)
        if systemctl cat k3s.service >/dev/null 2>&1; then
            K3S_SVC="k3s"
        else
            K3S_SVC="k3s-agent"
        fi
        # Install systemd drop-in for wg0 dependency
        sudo mkdir -p /etc/systemd/system/${K3S_SVC}.service.d
        cat <<EOF | sudo tee /etc/systemd/system/${K3S_SVC}.service.d/wait-wg0.conf
 [Unit]
 After=wg-quick@wg0.service
 Wants=wg-quick@wg0.service
 [Service]
 ExecStartPre=/usr/local/bin/wait-for-wg0.sh
 EOF
        sudo systemctl daemon-reload
        sudo systemctl enable --now k3s-flannel-watchdog.timer
        # $host is not set on the remote side (quoted heredoc), so report the remote hostname
        echo "$(hostname): done (service=$K3S_SVC)"
 REMOTE
 done
 echo "=== All nodes configured ==="
--- a/k3s/resilience/k3s-flannel-watchdog.service
+++ b/k3s/resilience/k3s-flannel-watchdog.service
@@ -0,0 +1,6 @@
 [Unit]
 Description=K3s flannel watchdog
 [Service]
 Type=oneshot
 ExecStart=/usr/local/bin/k3s-flannel-watchdog.sh
--- a/k3s/resilience/k3s-flannel-watchdog.sh
+++ b/k3s/resilience/k3s-flannel-watchdog.sh
@@ -0,0 +1,25 @@
 #!/bin/bash
 # Watchdog: restart k3s if flannel.1 interface is missing.
 # Runs via systemd timer every 60s.
 # Only act if k3s is running but flannel.1 is gone
 K3S_UNIT=$(systemctl is-active k3s 2>/dev/null)
 K3S_AGENT_UNIT=$(systemctl is-active k3s-agent 2>/dev/null)
 if [ "$K3S_UNIT" != "active" ] && [ "$K3S_AGENT_UNIT" != "active" ]; then
    exit 0  # k3s isn't running, nothing to do
 fi
 if ip link show flannel.1 >/dev/null 2>&1; then
    exit 0  # flannel is fine
 fi
 # flannel.1 is missing — restart the appropriate service
 echo "$(date): flannel.1 missing, restarting k3s"
 logger -t k3s-watchdog "flannel.1 interface missing — restarting k3s"
 if [ "$K3S_UNIT" = "active" ]; then
    systemctl restart k3s
 elif [ "$K3S_AGENT_UNIT" = "active" ]; then
    systemctl restart k3s-agent
 fi
--- a/k3s/resilience/k3s-flannel-watchdog.timer
+++ b/k3s/resilience/k3s-flannel-watchdog.timer
@@ -0,0 +1,9 @@
 [Unit]
 Description=Check flannel health every 60s
 [Timer]
 OnBootSec=90
 OnUnitActiveSec=60
 [Install]
 WantedBy=timers.target
--- a/k3s/resilience/k3s-wait-wg0.conf
+++ b/k3s/resilience/k3s-wait-wg0.conf
@@ -0,0 +1,11 @@
 # /etc/systemd/system/k3s.service.d/wait-wg0.conf
 # (or k3s-agent.service.d/ on worker nodes)
 #
 # Ensures k3s waits for wg0 before starting flannel.
 [Unit]
 After=wg-quick@wg0.service
 Wants=wg-quick@wg0.service
 [Service]
 ExecStartPre=/usr/local/bin/wait-for-wg0.sh
--- a/k3s/resilience/wait-for-wg0.sh
+++ b/k3s/resilience/wait-for-wg0.sh
@@ -0,0 +1,19 @@
 #!/bin/bash
 # Wait for wg0 interface to be up with an IP before allowing k3s to start.
 # Used as ExecStartPre in k3s systemd drop-in.
 MAX_WAIT=120
 INTERVAL=2
 ELAPSED=0
 while [ $ELAPSED -lt $MAX_WAIT ]; do
    if ip link show wg0 >/dev/null 2>&1 && ip addr show wg0 | grep -q 'inet '; then
        echo "wg0 is up with IP after ${ELAPSED}s"
        exit 0
    fi
    sleep $INTERVAL
    ELAPSED=$((ELAPSED + INTERVAL))
 done
 echo "ERROR: wg0 not up after ${MAX_WAIT}s — starting k3s anyway"
 exit 0  # don't block k3s forever, let the watchdog handle it
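The polling loop above could also be factored into a reusable helper, which makes the timeout behaviour easy to try without WireGuard. This is a sketch only; `wait_for` and `wg0_ready` are hypothetical names and are not part of the committed script.

```bash
#!/bin/bash
# Sketch: generic "poll until a command succeeds or time runs out" helper.
wait_for() {
    # wait_for <max-seconds> <interval-seconds> <command...>
    # Returns 0 as soon as the command succeeds, 1 once the time budget is spent.
    max="$1"; interval="$2"; shift 2
    elapsed=0
    while [ "$elapsed" -lt "$max" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

wg0_ready() {
    # True when wg0 exists and carries an IPv4 address.
    ip -4 addr show wg0 2>/dev/null | grep -q 'inet '
}
```

`wait_for 120 2 wg0_ready || echo "wg0 not up, starting k3s anyway"` mirrors the original behaviour of never blocking k3s indefinitely.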