cleanup
parent 7ac26e302a
commit b7c9dc81a0
15 changed files with 247 additions and 44 deletions
@@ -0,0 +1,3 @@
+- [SSH deployment approach](feedback_ssh_scripts.md) — stage on pve, distribute from there, don't nest SSH
+- [Newsletter stack status](project_newsletter_stack.md) — Ghost+Listmonk+n8n+Postmark partially wired, Substack under consideration
+- [Don't confirm memory edits](feedback_no_confirm_memory.md) — just edit memory files directly, no prompting
@@ -0,0 +1,11 @@
+---
+name: Don't ask permission to edit memory files
+description: Just edit memory files directly — they're under Claude's control, no need to confirm with user
+type: feedback
+---
+
+Never prompt the user about memory file changes — no confirmation questions, no tool approval popups, nothing in the user's face. If a tool permission prompt would fire, that counts as asking.
+
+**Why:** Memory files are Claude's workspace. Any interruption about them — whether a question or a tool approval dialog — is unwanted friction.
+
+**How to apply:** Edit memory files silently. Don't deflect blame onto tool permissions — if it's in the user's face, it's Claude's fault.
@@ -0,0 +1,14 @@
+---
+name: Use scripts on jump host instead of long SSH chains
+description: Download binaries once and distribute, write scripts to pve and run from there instead of nested SSH commands
+type: feedback
+---
+
+Don't run long nested SSH commands or download the same binary on every node separately. Instead:
+- Download binaries once (on pve or locally) and scp to target nodes
+- Write deployment scripts to pve and run them from there
+- Avoid the weird background task temp paths — keep things simple and visible
+
+**Why:** Long nested SSH chains are fragile (env vars get lost, quoting breaks), slow, and hard to debug. Downloading the same 60MB binary 7 times is wasteful when you can download once and distribute.
+
+**How to apply:** When deploying to multiple nodes, stage files and scripts on the jump host (pve), then distribute from there. Prefer simple, visible approaches over clever one-liners.
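The staging pattern described in this memory file could be sketched roughly as follows. This is an illustration only, not a script from the commit: `BINARY_URL`, `NODES`, `STAGE_DIR`, and `DEST` are hypothetical placeholders, and the sketch assumes it runs on the jump host (pve) with working SSH access to each node. Setting `DRY_RUN=1` prints the commands instead of running them.

```bash
#!/bin/bash
# Sketch only: download a binary once on the jump host, then copy it to each node.
# BINARY_URL, NODES, STAGE_DIR and DEST are hypothetical placeholders.

BINARY_URL="${BINARY_URL:-https://example.invalid/tool.tar.gz}"
NODES="${NODES:-node-a node-b}"
STAGE_DIR="${STAGE_DIR:-$HOME/stage}"
DEST="${DEST:-/tmp/}"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

stage_and_distribute() {
    local file node
    file="$STAGE_DIR/$(basename "$BINARY_URL")"
    run mkdir -p "$STAGE_DIR" || return 1

    # Download once; skip if the file is already staged.
    if [ ! -f "$file" ]; then
        run curl -fsSL -o "$file" "$BINARY_URL" || return 1
    fi

    # One plain scp per node, no nested SSH.
    for node in $NODES; do
        run scp "$file" "$node:$DEST" || return 1
    done
}
```

Calling `stage_and_distribute` with `DRY_RUN=1` first makes the plan visible before anything is copied, which fits the "simple and visible" preference stated above.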
@@ -0,0 +1,17 @@
+---
+name: Newsletter stack status and frustrations
+description: Ghost+Listmonk+n8n+Postmark newsletter pipeline — partially wired, user considering Substack as alternative
+type: project
+---
+
+Ghost CMS hard-codes Mailgun for newsletter sending — `bulk_email__provider: smtp` only handles transactional one-off emails (password resets, signup confirmations), NOT newsletters. Mailgun and SendGrid both rejected signup. Postmark works but account is in sandbox (under review, can only send to verified addresses).
+
+Current stack: Ghost (blog) → n8n (webhook automation) → Listmonk (newsletter sending) → Postmark (SMTP). Ghost fires webhooks on member.added and post.published to n8n. n8n workflow for member.added is partially built — webhook trigger works, HTTP Request node to Listmonk API not yet configured.
+
+Listmonk API user creation is confusing — no password field shown for API-type users. Admin credentials work for API access.
+
+User is frustrated with the complexity and seriously considering Substack for newsletters. The self-hosted stack requires chaining 4 services to do what Substack does natively.
+
+**Why:** Ghost's Mailgun lock-in is a design flaw that forces this complexity. User wants to own the stack but the overhead is high.
+
+**How to apply:** Don't push self-hosted over Substack — respect the tradeoff. If user continues with self-hosted, minimize friction. The n8n→Listmonk integration needs finishing: HTTP Request node with Basic Auth to listmonk:9000/api/subscribers.
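Before wiring the n8n HTTP Request node, the same call can be tried by hand. The sketch below is an assumption-laden illustration, not part of the commit: `LISTMONK_URL`, `LISTMONK_USER`, `LISTMONK_PASS`, and the list ID are placeholders, and the JSON field names (`email`, `name`, `status`, `lists`) follow my reading of the Listmonk subscribers API and should be checked against the installed version. The payload builder does no JSON escaping, so it only suits simple test values.

```bash
#!/bin/bash
# Sketch only: create a Listmonk subscriber with Basic Auth.
# LISTMONK_URL, LISTMONK_USER and LISTMONK_PASS are hypothetical placeholders.

LISTMONK_URL="${LISTMONK_URL:-http://listmonk:9000}"

# Build the JSON body. No escaping is done, so keep inputs simple.
subscriber_payload() {
    local email="$1" name="$2" list_id="$3"
    printf '{"email":"%s","name":"%s","status":"enabled","lists":[%s]}' \
        "$email" "$name" "$list_id"
}

# POST the payload to the subscribers endpoint named in the note above.
add_subscriber() {
    curl -fsS -u "$LISTMONK_USER:$LISTMONK_PASS" \
        -H 'Content-Type: application/json' \
        -d "$(subscriber_payload "$@")" \
        "$LISTMONK_URL/api/subscribers"
}
```

A call such as `add_subscriber reader@example.com "Test Reader" 1` mirrors what the n8n node would send for a `member.added` webhook, with the email and name taken from the Ghost payload.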
.gitignore (vendored, 4 lines changed)
@@ -7,6 +7,7 @@
 # IDE
 .idea/
 .vscode/
+.remember/
 *.swp
 *.swo
 *~
@@ -15,5 +16,4 @@
 .DS_Store
 Thumbs.db
 
-# Claude Code
-.claude/
+
@@ -1,11 +1,11 @@
 # K3s Session State
-# Saved: 2026-04-06 (end of session 3)
+# Saved: 2026-04-14
 
 ## Current State
 
-New Proxmox-based K3s cluster in progress. VirtualBox cluster retired.
-All 7 Proxmox VMs created and on WireGuard mesh. K3s not yet installed.
-Old VirtualBox services (ghost, forgejo, postgres, mariadb) still running on old cluster until migration complete.
+K3s v1.34.6 cluster fully operational on Proxmox VMs + KVM worker over WireGuard mesh.
+fat_mama migrated from VirtualBox to KVM/libvirt on workstation 2026-04-14.
+All Proxmox K3s VMs have onboot: 1 set (fixed 2026-04-12).
 
 ## Proxmox VMs
 
@@ -18,6 +18,7 @@ Old VirtualBox services (ghost, forgejo, postgres, mariadb) still running on old
 | game-control | 10.10.10.158 | 10.0.0.10 | game | k3s control plane |
 | game-worker-hdd | 10.10.10.186 | 10.0.0.11 | game | k3s worker (local-lvm/HDD) |
 | game-worker-ssd | 10.10.10.153 | 10.0.0.12 | game | k3s worker (game-ssd/NVMe) |
+| fat_mama | 192.168.40.220 | 10.0.0.13 | workstation (KVM/libvirt, macvtap enp4s0) | k3s worker |
 
 WG IPs 10.0.0.2–10.0.0.5 reserved (old VirtualBox nodes, do not reuse).
 Hub: DO droplet at 138.197.87.251:51820, WG IP 10.0.0.1
@@ -33,37 +34,40 @@ Hub: DO droplet at 138.197.87.251:51820, WG IP 10.0.0.1
 | game-control | 2 | 2GB | 20G | local-lvm |
 | game-worker-hdd | 6 | 8GB | 200G | local-lvm (HDD) |
 | game-worker-ssd | 10 | 8GB | 200G | game-ssd (NVMe) |
+| fat_mama | 12 | 20GB | 200G | /var/lib/libvirt/images (qcow2) |
 
 ## Network Architecture
 
-- All VMs on vmbr1 (10.10.10.0/24), DHCP
+- Proxmox VMs on vmbr1 (10.10.10.0/24), DHCP
+- fat_mama on LAN (192.168.40.0/24) via macvtap on enp4s0 — workstation host cannot directly ping/SSH to it; reachable from rest of LAN and via WireGuard at 10.0.0.13
 - WireGuard mesh via DO hub — all nodes have static WG IPs (10.0.0.0/24)
 - Full mesh: all nodes have each other as explicit WireGuard peers (not just hub-and-spoke)
-- K3s will use --flannel-iface=wg0 so all cluster traffic runs over WireGuard
+- K3s uses --flannel-iface=wg0 so all cluster traffic runs over WireGuard
 - Caddy at DO hub proxies external traffic to any node's WG IP + NodePort
 - Tailscale/Headscale abandoned — too unreliable for cluster networking
 
 ## Proxmox Host Specs
 
-- pve: workstation i9-13900KF, 96GB RAM
-- adder: Proxmox node with RTX 2070, 4TB NVMe available
-- game: Proxmox node with RTX 2070, 16GB RAM, 256GB NVMe (game-ssd) + 2TB HDD (local-lvm)
+- pve: Meerkat NUC, 64GB RAM, 4TB NVMe
+- adder: Adder WS laptop, 32GB RAM, 2TB NVMe, RTX 2070
+- game: old gaming PC, 16GB RAM, 256GB NVMe (game-ssd) + 2TB HDD (local-lvm)
+- workstation: i9-13900KF, 96GB RAM, RTX 4090, Fedora (runs fat_mama via KVM/libvirt)
 
 ## VM Provisioning
 
 ### Template & Clone Scripts
 Scripts at `~/private/Knowledge/repos/homelab/proxmox/scripts/`:
-- `create-debian-template.sh <VMID> <NAME> [STORAGE] [BRIDGE]`
+- `create-debian-template.sh <VMID> <n> [STORAGE] [BRIDGE]`
   - Defaults: STORAGE=local-lvm, BRIDGE=vmbr1
   - Bakes in: qemu-guest-agent, curl, wget, nano, rsync, htop, tmux, emacs-nox, nfs-common, tailscale
   - Zeroes /etc/machine-id, removes /etc/ssh/ssh_host_* (Cloud-Init regenerates on first boot)
   - Does NOT create .ssh or set keys — done post-boot via qm set
-- `clone-vm.sh <TEMPLATE_VMID> <NEW_VMID> <NAME> [CORES] [MEMORY_MB] [DISK_SIZE] [STORAGE]`
+- `clone-vm.sh <TEMPLATE_VMID> <NEW_VMID> <n> [CORES] [MEMORY_MB] [DISK_SIZE] [STORAGE]`
   - Defaults: 2 cores, 2048MB RAM, 20G disk, local-lvm storage
   - Full clone, auto-starts the VM
 
 ### Post-Clone Formula (confirmed working)
-1. Clone: `./clone-vm.sh <template> <vmid> <name> [cores] [mem] [disk] [storage]`
+1. Clone: `./clone-vm.sh <template> <vmid> <n> [cores] [mem] [disk] [storage]`
 2. Get IP: `qm guest cmd <vmid> network-get-interfaces`
 3. Set SSH key: `qm set <vmid> --sshkeys <pubkey-file>`
 4. Reboot VM: `qm reboot <vmid>`
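The four post-clone steps visible in this hunk can be chained into one helper. The sketch below is a hedged illustration, not a script from the repository: `post_clone` and `run` are hypothetical names, it assumes it runs on the Proxmox host from the scripts directory, and it passes only the three required arguments to `clone-vm.sh`. Because `qm guest cmd` needs the guest agent to be running inside the VM, the IP query may need a retry right after the clone starts.

```bash
#!/bin/bash
# Sketch only: the post-clone steps as one function.
# post_clone and run are hypothetical helper names.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

post_clone() {
    local template="$1" vmid="$2" name="$3" pubkey="$4"

    # 1. Clone from the template (script defaults apply for cores, memory, disk).
    run ./clone-vm.sh "$template" "$vmid" "$name" || return 1

    # 2. Ask the guest agent for the VM's interfaces and addresses.
    run qm guest cmd "$vmid" network-get-interfaces || return 1

    # 3. Inject the SSH public key through Cloud-Init settings.
    run qm set "$vmid" --sshkeys "$pubkey" || return 1

    # 4. Reboot the VM after setting the key, as in step 4 above.
    run qm reboot "$vmid"
}
```

Running `DRY_RUN=1 post_clone 9000 101 demo ~/.ssh/id_ed25519.pub` prints the four commands in order without touching any VM; the IDs and key path in that call are made-up examples.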
@@ -83,14 +87,6 @@ Scripts at `~/private/Knowledge/repos/homelab/proxmox/scripts/`:
 - `vgs` — check local-lvm free space
 - `pvesh get /nodes/<nodename>/status` — CPU/memory usage
 
-## Immediate Next Steps
-1. Install K3s on pve-control first (--cluster-init)
-2. Join adder-control and game-control as control plane peers
-3. Join all 4 workers
-4. Label workers and GPU nodes
-5. Create namespaces: sjasoft, fulfillment, privacy-practice
-6. Migrate services from old VirtualBox cluster
-
 ## K3s Install — see k3s/README.md for full commands
 
 - Control plane uses --cluster-init on first node, --server on subsequent nodes
@@ -98,21 +94,28 @@ Scripts at `~/private/Knowledge/repos/homelab/proxmox/scripts/`:
 - Traefik disabled on all nodes
 - 3 control plane nodes for HA etcd (tolerates 1 failure)
 
-## Running Services (old VirtualBox cluster — not yet migrated)
+## Running Services
 
-- postgres:16 — ClusterIP:5432
-- mariadb:11 — ClusterIP:3306
-- ghost1/2/3 — NodePorts 32368/32369/32370
-- forgejo:9 — NodePort 32371, git.sjasoft.com
+| Service | NodePort | Domain | Namespace |
+|---|---|---|---|
+| ghost1 | 32368 | — | fulfillment |
+| ghost2 | 32369 | — | fulfillment |
+| ghost3 | 32370 | — | fulfillment |
+| forgejo | 32371 | git.sjasoft.com | sjasoft |
+| postgres | ClusterIP:5432 | — | default |
+| mariadb | ClusterIP:3306 | — | default |
+| authentik-server | — | — | default |
+| authentik-worker | — | — | default |
+| n8n | — | — | default |
+| listmonk | — | — | default |
 
-## NodePort Registry
-
-| Port | Service | Namespace |
-|---|---|---|
-| 32368 | ghost1 | fulfillment |
-| 32369 | ghost2 | fulfillment |
-| 32370 | ghost3 | fulfillment |
-| 32371 | forgejo | sjasoft |
+## Remaining Services to Deploy
+- vaultwarden.yml — passwords (ACTIVE)
+- mattermost.yml — chat (ACTIVE)
+- nats.yml — messaging
+- monerod.yml — monero node
+- snikket.yml — XMPP
+- synapse.yml — Matrix
 
 ## Manifests
 
@@ -123,15 +126,6 @@ All in Knowledge/repos/homelab/k3s/:
 - k3s/forgejo/forgejo.yaml
 - k3s/README.md (authoritative WG mesh table + K3s install commands)
 
-## Remaining Services to Port (from Proxmox Docker stack)
-- authentik.yml — SSO (postgres)
-- n8n.yml — automation (postgres)
-- vaultwarden.yml — passwords
-- nats.yml — messaging
-- monerod.yml — monero node
-- snikket.yml — XMPP
-- synapse.yml — Matrix
-
 ## Known Issues / Notes
 - Tailscale/Headscale abandoned — unreliable, randomly drops nodes, requires manual reconnect
 - WireGuard full mesh is the correct approach for K3s cluster networking
@@ -140,3 +134,5 @@ All in Knowledge/repos/homelab/k3s/:
 - game node only has 16GB RAM — allocate worker VMs conservatively
 - game-ssd is only 256GB NVMe — keep disk allocations conservative on game-worker-ssd
 - Templates should be destroyed after all clones are complete on each node
+- fat_mama macvtap: workstation host cannot directly ping/SSH to fat_mama; reachable from rest of LAN and via WireGuard at 10.0.0.13; SSH from pve-control or other LAN machines works fine
+- fat_mama disk image at /var/lib/libvirt/images/fat_mama.qcow2 on workstation
docs/.#tasks.org (symbolic link, 1 line changed)
@@ -0,0 +1 @@
+samantha@fedora.2598412:1776023928
docs/tasks.org (normal file, 34 lines changed)
@@ -0,0 +1,34 @@
+* Security / Privacy
+** DONE check wg hub cannot ssh into kube
+CLOSED: [2026-04-14 Tue 16:37]
+:LOGBOOK:
+- State "DONE" from "ACTIVE" [2026-04-14 Tue 16:37]
+Tested. No peers can be ssh-ed to.
+:END:
+** TODO stop login from proxmox kube nodes to LAN machines
+** TODO set up wg bastion and LAN wg peer bastion for on the road access
+** HOLD mullvad second account
+SCHEDULED: <2026-04-14 Tue>
+** HOLD mullvad on proxmox nodes via CLI
+:LOGBOOK:
+- State "DONE" from "BACKLOG" [2026-04-14 Tue 17:00]
+:END:
+** TODO privacy wg hub complete wg and caddy setup
+SCHEDULED: <2026-04-15 Wed>
+** Backups
+*** TODO Automated Proxmox backups
+*** TODO special stuff for kube state??
+*** TODO specific database backups (dumpdb and friends? replicas?)
+* Monitoring
+** ACTIVE Nats on workstation
+** TODO n8n on workstation?
+* Kube Expansion
+** TODO add mac VM
+* Non-Kube WG services
+** TODO workstation as WG peer??
+* Services
+** ACTIVE Mattermost
+** ACTIVE VaultWarden
+* Integration
+** TODO explore SSO
+
@@ -18,6 +18,7 @@ Hub: DO droplet at 138.197.87.251:51820, WG IP 10.0.0.1/24
 | game-control | 10.10.10.158 | 10.0.0.10 | game |
 | game-worker-hdd | 10.10.10.186 | 10.0.0.11 | game |
 | game-worker-ssd | 10.10.10.153 | 10.0.0.12 | game |
+| fat_mama | 192.168.40.220 | 10.0.0.13 | workstation (VBox, bridged LAN) |
 
 IPs 10.0.0.2–10.0.0.5 are reserved (old VirtualBox K3s nodes, leave alone).
 
@@ -217,6 +218,7 @@ hub for external traffic). Headscale removed — too buggy (0.28.x dropped nodes
 | game-control | control-plane, etcd | 10.0.0.10 | game | 2 CPU, 2GB RAM, 20GB |
 | game-worker-hdd | worker | 10.0.0.11 | game | 4 CPU, 6GB RAM, 1.4TB HDD |
 | game-worker-ssd | worker | 10.0.0.12 | game | 10 CPU, 8GB RAM, 200GB SSD |
+| fat_mama | worker | 10.0.0.13 | workstation (VBox) | 20 CPU, 21GB RAM, 200GB |
 
 ### Running Services
 
k3s/resilience/deploy-resilience.sh (executable file, 55 lines changed)
@@ -0,0 +1,55 @@
+#!/bin/bash
+# Deploy k3s resilience configs to all cluster nodes.
+# Run from workstation where SSH aliases work.
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+CONTROL_NODES="pve-control adder-control game-control"
+WORKER_NODES="pve-worker adder-worker game-worker-hdd game-worker-ssd fat_mama"
+ALL_NODES="$CONTROL_NODES $WORKER_NODES"
+
+echo "=== Deploying k3s resilience to all nodes ==="
+
+for host in $ALL_NODES; do
+    echo "--- $host ---"
+
+    # Copy scripts
+    scp "$SCRIPT_DIR/wait-for-wg0.sh" "$host:/tmp/"
+    scp "$SCRIPT_DIR/k3s-flannel-watchdog.sh" "$host:/tmp/"
+    scp "$SCRIPT_DIR/k3s-flannel-watchdog.service" "$host:/tmp/"
+    scp "$SCRIPT_DIR/k3s-flannel-watchdog.timer" "$host:/tmp/"
+
+    ssh "$host" bash <<'REMOTE'
+    sudo install -m 755 /tmp/wait-for-wg0.sh /usr/local/bin/
+    sudo install -m 755 /tmp/k3s-flannel-watchdog.sh /usr/local/bin/
+    sudo cp /tmp/k3s-flannel-watchdog.service /etc/systemd/system/
+    sudo cp /tmp/k3s-flannel-watchdog.timer /etc/systemd/system/
+
+    # Determine which k3s service runs on this node
+    if systemctl is-active k3s >/dev/null 2>&1; then
+        K3S_SVC="k3s"
+    else
+        K3S_SVC="k3s-agent"
+    fi
+
+    # Install systemd drop-in for wg0 dependency
+    sudo mkdir -p /etc/systemd/system/${K3S_SVC}.service.d
+    cat <<EOF | sudo tee /etc/systemd/system/${K3S_SVC}.service.d/wait-wg0.conf
+[Unit]
+After=wg-quick@wg0.service
+Wants=wg-quick@wg0.service
+
+[Service]
+ExecStartPre=/usr/local/bin/wait-for-wg0.sh
+EOF
+
+    sudo systemctl daemon-reload
+    sudo systemctl enable --now k3s-flannel-watchdog.timer
+
+    # The heredoc is quoted, so $host is not set on the remote side; use the remote hostname.
+    echo "$(hostname): done (service=$K3S_SVC)"
+REMOTE
+done
+
+echo "=== All nodes configured ==="
k3s/resilience/k3s-flannel-watchdog.service (normal file, 6 lines changed)
@@ -0,0 +1,6 @@
+[Unit]
+Description=K3s flannel watchdog
+
+[Service]
+Type=oneshot
+ExecStart=/usr/local/bin/k3s-flannel-watchdog.sh
k3s/resilience/k3s-flannel-watchdog.sh (executable file, 25 lines changed)
@@ -0,0 +1,25 @@
+#!/bin/bash
+# Watchdog: restart k3s if flannel.1 interface is missing.
+# Runs via systemd timer every 60s.
+
+# Only act if k3s is running but flannel.1 is gone
+K3S_UNIT=$(systemctl is-active k3s 2>/dev/null)
+K3S_AGENT_UNIT=$(systemctl is-active k3s-agent 2>/dev/null)
+
+if [ "$K3S_UNIT" != "active" ] && [ "$K3S_AGENT_UNIT" != "active" ]; then
+    exit 0  # k3s isn't running, nothing to do
+fi
+
+if ip link show flannel.1 >/dev/null 2>&1; then
+    exit 0  # flannel is fine
+fi
+
+# flannel.1 is missing — restart the appropriate service
+echo "$(date): flannel.1 missing, restarting k3s"
+logger -t k3s-watchdog "flannel.1 interface missing — restarting k3s"
+
+if [ "$K3S_UNIT" = "active" ]; then
+    systemctl restart k3s
+elif [ "$K3S_AGENT_UNIT" = "active" ]; then
+    systemctl restart k3s-agent
+fi
k3s/resilience/k3s-flannel-watchdog.timer (normal file, 9 lines changed)
@@ -0,0 +1,9 @@
+[Unit]
+Description=Check flannel health every 60s
+
+[Timer]
+OnBootSec=90
+OnUnitActiveSec=60
+
+[Install]
+WantedBy=timers.target
k3s/resilience/k3s-wait-wg0.conf (normal file, 11 lines changed)
@@ -0,0 +1,11 @@
+# /etc/systemd/system/k3s.service.d/wait-wg0.conf
+# (or k3s-agent.service.d/ on worker nodes)
+#
+# Ensures k3s waits for wg0 before starting flannel.
+
+[Unit]
+After=wg-quick@wg0.service
+Wants=wg-quick@wg0.service
+
+[Service]
+ExecStartPre=/usr/local/bin/wait-for-wg0.sh
k3s/resilience/wait-for-wg0.sh (executable file, 19 lines changed)
@@ -0,0 +1,19 @@
+#!/bin/bash
+# Wait for wg0 interface to be up with an IP before allowing k3s to start.
+# Used as ExecStartPre in k3s systemd drop-in.
+
+MAX_WAIT=120
+INTERVAL=2
+ELAPSED=0
+
+while [ $ELAPSED -lt $MAX_WAIT ]; do
+    if ip link show wg0 >/dev/null 2>&1 && ip addr show wg0 | grep -q 'inet '; then
+        echo "wg0 is up with IP after ${ELAPSED}s"
+        exit 0
+    fi
+    sleep $INTERVAL
+    ELAPSED=$((ELAPSED + INTERVAL))
+done
+
+echo "ERROR: wg0 not up after ${MAX_WAIT}s — starting k3s anyway"
+exit 0  # don't block k3s forever, let the watchdog handle it