
Homelab Foundation Tracker

Goal: Reliable, documented, recoverable infrastructure that supports the project, learning, and home-cloud paths (Path A, Path B, Path D).

Last updated: 2026-06-04 late evening - Phase 2 essentially DONE (VLAN 10 trunk cutover + host vmbr0.10/.40 + NAS ens18 static + iLO moved to VLAN 40 + pfSense SERVERS->MGMT block rule + dead 10.0.110.x removed + tracker cleanup). Foundation Phase 3 (PBS + DNS + Traefik + Authelia + monitoring) is now the next lift. Phase 1 hardware items (cable, riser, RAID battery, SSDs) remain.

Phase 1: Recovery and basic operability (RECOVERY DONE; networking/storage in progress)

Wedding photos recovery — DONE. The primary recovery target that started this whole effort 2026-05-27 has been extracted from the ddrescue'd apple-1tb.img HFS+ image and placed on the homeNas SMB share, confirmed reachable from Windows. Original mission accomplished. The .img file remains on s-tank as the source-of-truth archive (can be deleted or moved to cold storage when convenient).

DONE

  • PVE host 10.0.110.5/24 persistence (vmbr0 post-up stanza)
  • OMV VM 189 10.0.110.11/24 persistence (interfaces.d/ fragment, survives salt regen)
  • iLO relocated from dead 10.0.100.3 to live 10.0.110.3
  • iLO Advanced license activated (35DPH-SVSXJ-HGBJN-C7N5R-2SS4W)
  • iLO web UI, SSH ([email protected] with legacy KEX flags), IPMI-over-LAN all working
  • iLO Server Name renamed: ECL-ESX02.eclh.lan -> arochukwu
  • iLO Subsystem Name: ILOCZ212806VN -> arochukwu-ilo
  • iLO time zone: Atlantic/Reykjavik -> Europe/Amsterdam
  • iLO DNS servers configured (8.8.8.8, 9.9.9.9, 1.1.1.1)
  • Windows SSH aliases: ssh ilo / ssh pve / ssh homenas
  • ipmitool installed on PVE for in-band IPMI
  • Bond architecture simplified (2026-05-31): nested LACP (bond0 LACP + bond1 LACP under bond2 active-backup) collapsed to single bond0 in active-backup over enp3s0f0+enp3s0f1, primary enp3s0f0, primary-reselect always. bond1 and bond2 removed. vmbr0 uplink moved from bond2 to bond0. enp4s0f0+enp4s0f1 now standalone (reserved for future 10GbE NIC, dedicated storage path, or second switch link). vmbr0 MAC settled at b4:99:ba:bb:1a:8a. All IPs preserved (10.0.10.5/24 + 10.0.110.5/24 + vmbr0.100 at 10.0.100.5/24). VM 189 firewall stack untouched.
  • Bulk obsolete-entity cleanup (2026-05-31 late evening): all 22 obsolete entities destroyed. Phase 2 clean destroys on local-lvm: VM 199 mgmt-workstation + VM 8000 ubuntu-mini-cloud (~12.1 GB actually reclaimed). Phase 3 messy destroys on offline fast NFS: VMs 101-109 (k8s cluster #1), VMs 181-187 (k8s cluster #2 hap-*), VM 170 kali-blue, VM 197 kali-purple, VM 198 mgmt-devops — confs removed, 36 disk files orphaned on fast. LXC 500 demo-centos: pct destroy --purge --force returned exit 255 (refuses offline storage); resolved with rm /etc/pve/lxc/500.conf as a manual workaround, disk file orphaned. Net result: only VM 189 remains. 37 orphan disk files on fast pending cleanup when storage returns. Full audit trail at D:\PVE\pre-destruction-inventory-20260531-205954.txt + D:\PVE\orphaned-fast-disks-20260531-212124.txt (copies also on host at /root/).
  • VLAN 100 retired (2026-05-31 late evening): vmbr0.100 sub-interface (was 10.0.100.5/24) removed from /etc/network/interfaces after read-only verification confirmed zero consumers (no listeners, no ARP, no service binding). ifreload -a applied cleanly. Routes for 10.0.100.0/24 auto-cleared; bridge VID 100 membership on vmbr0 dropped. SSH unaffected (lives on vmbr0's 10.0.110.5/24 secondary). Backup: /etc/network/interfaces.backup-pre-vlan100-removal-20260531-214723.
  • VM 189 net0 retagged VLAN 70 → 10 (2026-05-31 late evening): pre-stages homeNas for the new SERVERS VLAN 10 design. Hot-applied via qm set 189 --net0 ...tag=10 — VM never restarted, PID 1521 preserved across the change. Bridge port fwpr189p0 PVID now 10. Side effect: net0 stats counter reset (virtio NIC was detached and re-attached, a known qm-set side effect). Net0 remains dormant in operational terms: ens18 inside the guest has no IPv4 because no DHCP source exists on VLAN 10 yet; will activate when Phase 2 brings up VLAN 10. All real traffic (SMB, SSH to homeNas) continues on net1 (untagged 10.0.110.11). Backup: /root/189.conf.backup-pre-vlan-retag-20260531-215559.

PENDING - hardware maintenance window (waiting on RAID BBWC battery arrival)

  • Riser install (chassis-open task)
  • RAID controller BBWC battery replacement (this is the actual gate - not yet arrived)
  • CMOS / RTC battery replacement (CR2032, ~€2 - bundle into same chassis-open)
  • Visual PSU vent dust inspection (no part needed, just do it)
  • (optionally) SPP/BIOS firmware flash if ISO is safely obtained by then

DONE earlier

  • ~~Cat6A cable run between rooms~~ - DONE (fish tape work completed; this is what made the VLAN trunk physically possible. Confirmed by user 2026-06-04.)

Storage controller path — DECIDED 2026-06-04: Path B1 Hybrid

Full detail in D:\PVE\reference_p410i_ssd_compatibility.md. Path B1 keeps the P410i as boot controller AND adds a SAS HBA + ZFS pool for new data SSDs. Best of both worlds: no PVE reinstall, BBWC battery still useful, ZFS for new family-cloud / Path A data tier. P410i sale deferred to a future "clean Path B" cleanup (probably dd/rsync clone of boot off P410i first, since fresh PVE install on G7 is a known red spot).

Baseline storage inventory captured 2026-06-04 (Session 7)

smartctl -d cciss probe (since HPE MCP repo doesn't have ssacli for bookworm) captured: SPCC 512 GB SSD in slot 0 (PVE boot - healthy, ~3 mo runtime), three 450 GB SAS HDDs in slots 1-3 (not 146 GB as previously documented), four Crucial M4 512 GB SSDs in slots 4-7. Three of the four Crucial M4 drives are 79-86% worn (Wear_Leveling_Count attribute 173) - plan to retire on the same timeline as the SSD purchase. Full report: D:\PVE\p410i-baseline-20260604-222809.txt. Full analysis in reference_p410i_ssd_compatibility.md "Baseline inventory" section.

Implication for Path B1 purchase: buy 4+ NEW SSDs minimum; do NOT plan to keep the three worn Crucial M4s in the new array.
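A minimal sketch of how a wear figure can be pulled out of `smartctl -A` output. It assumes the normalized VALUE column of attribute 173 starts at 100 and counts down as the drive wears, so wear is taken as 100 minus VALUE; the function name `wear_pct` is illustrative and this is not necessarily how the numbers in the baseline report were derived.

```shell
# wear_pct: read `smartctl -A` output on stdin and print an estimated wear
# percentage from attribute 173 (Wear_Leveling_Count).
# Assumption: the normalized VALUE (4th column) starts at 100 and counts down.
wear_pct() {
  awk '$1 == 173 { print 100 - $4 }'
}
```

On the host this might be used as `smartctl -d cciss,4 -A /dev/sg0 | wear_pct`, where both the drive index and the device node are placeholders to adjust per slot.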

PENDING - hardware purchase (Path B1 shopping list)

  • LSI 9305-16i HBA in IT mode (16 native SAS-3 lanes, no expander needed, no firmware flash needed). €130-250 used on eBay UK/DE / Marktplaats. Confirm: full-height bracket, firmware ≥16.00.10.00, IT mode pre-loaded (not IR mode), heatsink intact. Cables NOT included.
  • 2× SFF-8643 → SFF-8087 fan-out cables for HBA → backplane(s). ~€15-25 each. (Or SFF-8643 → SFF-8643 if the 2nd backplane uses HD connectors.)
  • HP 516966-B21 second 8-bay SFF backplane. €30-80 used. Confirm connector type (SFF-8087 vs SFF-8643) before buying cables.
  • HP 651687-001 2.5" SFF Smart Drive caddies. €5-15 each, one per drive bay populated.
  • Data SSDs — per the parked reference doc (Intel S35x0 enterprise pulls are the sweet spot; AVOID list documented).
  • CR2032 CMOS batteries — already on hand (Kay confirmed).
  • P410i cache module (HP 462968-B21) — confirm presence on first chassis-open via ssacli. If absent, source ~€20-40 used; without it, BBWC battery does nothing.
  • (optional, eventual) MikroTik CRS310 to replace TL-SG108E — no longer blocking anything since VLAN trunking works on the TL-SG108E.

DONE in Session 5-7 (3-4 June 2026)

  • VLAN 10 trunk cutover (host on tagged vmbr0.10 = 10.0.10.5, switch ports 2/3 as trunks saved to flash, NAS ens18 static at 10.0.10.11 via interfaces.d/10-ens18-vlan10.conf)
  • MGMT VLAN 40 presence on host (vmbr0.40 = 172.16.1.5, no gateway by design)
  • Dead 10.0.110.5 post-up removed from vmbr0
  • iLO moved 10.0.110.3 → 172.16.1.3 via in-band KCS (channel 2) + cold reset. No switch reconfig needed (iLO port 6 was already untagged VLAN 40)
  • Passwordless ssh ilo via DSA-1024 key (RSA permanently incompatible with iLO 3 v1.94 firmware - critical finding documented in reference_ilo3_v194_quirks.md)
  • pfSense block rule SERVERS → MGMT (cross-VLAN routing to 172.16.1.0/24 blocked at firewall; intra-VLAN-40 access still works by L2)
  • bridge-vids 2-4094 → 10 20 30 40 on vmbr0 (hardening; defense-in-depth alongside switch tagging)
  • NAS /etc/resolv.conf repointed from dead 10.0.70.1 → 10.0.10.1 (pfSense) + 1.1.1.1 fallback
  • NAS guest main-file iface ens18 inet dhcp → inet manual (stops orphan dhclient on every ifup; long-term should also be reflected via OMV web UI)
  • NAS net1 detached + 99-ens19-recovery.conf removed (vestigial untagged path retired; only ens18 + ens19's actual NIC on the guest now)
  • /etc/pve/storage.cfg updated (Backup-NAS + fast pointed at 10.0.10.11; both disable 1)
  • Laptop ~/.ssh/config aliases updated to live IPs (pve → 172.16.1.5, ilo → 172.16.1.3 with DSA key, homenas → 10.0.10.11)
  • reference_ilo3_v194_quirks.md memory extended with DSA-only + RIBCL-TLS-handshake findings

PENDING - still legacy

  • Orphaned disk file cleanup on fast (~37 files, ~561 GB allocated; see D:\PVE\orphaned-fast-disks-20260531-212124.txt) — gated on fast NFS becoming reachable; currently disable 1 in storage.cfg.

PENDING - non-blocking, do anytime

  • CLAUDE.md continuous update via Claude Code
  • ~~iLO SSH key auth (optional, eliminates password prompts)~~ — DONE Session 7, DSA-1024 only (RSA rejected by firmware)
  • Boot order cleanup (remove Floppy)
  • ~~Fix OMV-native ens18 config (System → Network → Interfaces) so OMV salt regen never re-introduces inet dhcp — direct edit in Session 7 is transient~~ — DONE 2026-06-04 late evening: ens18 set to Method=Static, Address 10.0.10.11/24, GW 10.0.10.1, DNS via OMV web UI; OMV database now has correct config so salt regen produces the right file. DNS now via systemd-resolved (upstream 10.0.10.1 primary, 1.1.1.1 fallback) - more robust than plain resolv.conf. Discovered side effect: OMV's Method=DHCP setting was untouched by Session 7's direct file edit, so clicking Apply anywhere in OMV web UI would regenerate iface ens18 inet dhcp from database. Tonight's web UI cleanup is the proper long-term fix.
  • ~~Inspect VM 199 (mgmt-workstation)~~ — DONE 2026-05-31 late evening: destroyed clean on local-lvm, no inspection performed (15 months without need = sufficient evidence per user)
  • ~~Inspect VM 8000 (ubuntu-mini-cloud) and decide retention~~ — DONE 2026-05-31 late evening: destroyed clean on local-lvm; will be rebuilt fresh as part of Path B B.2 cloud-init learning

Phase 2: Network rebuild (ESSENTIALLY DONE 2026-06-04)

Originally framed as BLOCKED on Phase 1 cable/riser — turned out the cabling wasn't a blocker for the VLAN trunking work (cable is for moving a drop, not for the trunk that already existed). Phase 2 was executed across Sessions 5-7:

  • ✅ pfSense Netgate SG-1100 operational as perimeter firewall + L3 router
  • ✅ TL-SG108E configured with VLANs 10/40 trunking to pfSense + host (ports 2/3 tagged, ports 6/7 untagged VLAN 40, saved to flash)
  • ✅ VLAN 10 SERVERS = production homelab traffic (host 10.0.10.5 + NAS 10.0.10.11)
  • ✅ VLAN 40 MGMT = 172.16.1.0/24 (host 172.16.1.5, iLO 172.16.1.3, laptop 172.16.1.100)
  • ✅ iLO moved 10.0.110.3 → 172.16.1.3 (port 6 was already untagged VLAN 40; no switch reconfig needed)
  • ✅ pfSense firewall rules restricting MGMT VLAN access (Block SERVERS → MGMT)
  • ✅ Replaced temporary 10.0.110.x recovery IPs with permanent VLAN 10 + VLAN 40 IPs
  • ⏳ VLAN 20 IOT = isolated IoT devices (allocated, no clients yet)
  • ⏳ VLAN 30 GUEST = guest WiFi (allocated, no clients yet)

VLANs 20 and 30 are reserved in the switch + host bridge but have no consumers; activation happens when IoT / guest networks need them.

Phase 3: Service infrastructure (unblocked by Phase 2; rollout items 1-11 DONE as of 2026-06-07)

Architecture decisions locked 2026-05-31 — see CLAUDE.md "Architecture decisions locked" note and home-cloud-tracker.md. The infrastructure here serves both Path A (developer services) and Path D (family home cloud) — they share PBS, Traefik, Authelia, DNS.

Rollout sequence (do in this order):

  1. Proxmox Backup Server VM — DONE 2026-06-04/05 (Session 7). VM 200 (pbs) running PBS 4.2-1, 4 GB RAM / 2 vCPU / 32 GB OS disk on local-lvm, IP 10.0.10.20 on VLAN 10. Apple HDD (J550001MGG9RUC, 1 TB) passed through as scsi1 and configured as a Removable Datastore named apple-tank (mount path /mnt/datastore/apple-tank, UUID-bound for removable safety). Registered in PVE as storage pbs-apple-tank. First backup of VM 189 succeeded (32 GB OS disk → 9 min, 77% sparse, dedup'd). Restore drill PASSED via filesystem inspection (mounted /dev/pve/vm-190-disk-0 read-only, verified Debian 11.9 + OMV install + Session 7 network config fragment preserved + GRUB/kernel files present). VM 190 destroyed cleanly post-drill. Caveat: PBS's post-restore validation step hung mid-operation (SSH to PBS started rejecting connections; web UI stayed alive). Needs PBS-side investigation (resource limits? chunk handler tuning?) before re-attempting a from-scratch end-to-end restore. NOT blocking - backup + restore both proven functional.
  2. Internal DNS LXC built + cutover DONE 2026-06-05 (Session 9). LXC 250 dns-internal, Debian 12 (bookworm) + Unbound 1.17.1, 1 vCPU / 512 MB RAM / 8 GB disk, unprivileged, onboot=1. Static 10.0.10.53/24 on VLAN 10, gw 10.0.10.1. Listening on both 10.0.10.53:53 (external clients) and 127.0.0.1:53 (LXC self-use), TCP + UDP. Authoritative for hm.iamkay.eu. static zone with initial A + PTR records: dns-internal/arochukwu/homenas/pbs/arochukwu-ilo/traefik. Full recursive resolver for everything else with DNSSEC validation enabled (cloudflare.com response carries ad flag - end-to-end validation confirmed working). ACL allows 10.0.10.0/24 + 172.16.1.0/24, refuses everything else. Config at /etc/unbound/unbound.conf.d/hm-iamkay.conf. Used a fresh VMID (250), NOT LXC 500. Cutover completed same session: PVE host /etc/resolv.conf, NAS systemd-resolved drop-in, iLO DNS resolvers, AND pfSense Unbound Domain Override (hm.iamkay.eu → 10.0.10.53) all in place. Section 11 #10 (iLO DNS repoint) now DONE. With the pfSense Domain Override, every device on the network that uses pfSense for DNS resolves *.hm.iamkay.eu via LXC 250 - no per-device config needed for any future service. Pre-cutover backups: /etc/resolv.conf.backup-pre-dns-cutover-20260605 on host; 99-dns.conf.backup-pre-cutover-20260605 on NAS. Deferred: pfSense DHCP scope DNS hand-out (no DHCP clients on VLAN 10 yet); pfSense Unbound secondary-slave-zone for hm.iamkay.eu (would let pfSense answer internal queries even when LXC 250 is offline - future HA hardening).
  3. Reverse proxy + TLS — Traefik DONE 2026-06-05 (Session 9). LXC 251 traefik, Debian 12 + Traefik v3.7.4, 1 vCPU / 1024 MB RAM / 8 GB disk, unprivileged, onboot=1. Static 10.0.10.10/24 on VLAN 10. Listening on :80 (HTTP → HTTPS redirect) and :443 (TLS). Static config at /etc/traefik/traefik.yml, dynamic configs in /etc/traefik/dynamic/. ACME with Let's Encrypt production endpoint, DNS-01 challenge via Cloudflare API (token in /etc/traefik/cloudflare.token 0600 traefik:traefik). Wildcard cert acquired for *.hm.iamkay.eu + hm.iamkay.eu, valid 2026-06-05 → 2026-09-03 (Let's Encrypt R3-equivalent intermediate YR1). Auto-renews. Cert state in /etc/traefik/acme/acme.json (0600 traefik:traefik). Dashboard published at https://traefik.hm.iamkay.eu/dashboard/ behind BasicAuth middleware (admin user; password file at /root/traefik-build/traefik-dashboard.password on PVE host until Vaultwarden lands). Verified end-to-end from laptop via pfSense Domain Override → internal DNS → Traefik → LE cert (browser-trusted, no warnings) → BasicAuth prompt → dashboard. Build script and logs at /root/traefik-build/ on PVE host. Build gotcha for future reference: nested-heredoc + bash-backtick interactions across SSH layers will silently mangle YAML config files — always write configs to a local file first, scp to PVE, then pct push to the LXC. Caught two failures of this type during Session 9; final dashboard.yml was generated via Write→scp→sed→pct push pattern.
  4. Secrets vault - Vaultwarden DONE 2026-06-05 (Phase 3.4). Self-hosted Bitwarden-compatible password vault. LXC 252 vaultwarden, Debian 12 + Docker-in-LXC running vaultwarden/server:latest (Vaultwarden 1.36.0), 1 vCPU / 1024 MB / 8 GB on local-lvm, nesting=1,keyctl=1 features enabled for Docker, unprivileged, onboot=1 + systemd unit vaultwarden-compose.service. Static 10.0.10.7/24 on VLAN 10. Container listens on 10.0.10.7:8080 (HTTP); fronted via Traefik at https://vault.hm.iamkay.eu/ with LE wildcard cert (added vault.hm.iamkay.eu to LXC 250 Unbound → 10.0.10.10; Traefik dynamic config /etc/traefik/dynamic/vault.yml proxies to http://10.0.10.7:8080). SQLite backend, data at /opt/vaultwarden/data/, PBS-backed daily via the existing job. Lockdown done same session: signups_allowed=false + invitations_allowed=false (no self-registration); admin token rotated + argon2id-hashed via vaultwarden hash --preset bitwarden parameters (m=65536, t=3, p=4), hash stored in config.json admin_token field (config.json overrides docker-compose env); plain token surfaced to Kay for entry in vault; Cloudflare API token rotated (rolled the existing Edit zone DNS 4 traefik-hm token in Cloudflare web UI, new value pushed to /etc/traefik/cloudflare.token, Traefik restarted, dashboard + cert chain verified intact). 12-item credential migration completed by Kay manually via Vaultwarden web UI (folders: Homelab/Infrastructure, Homelab/Services, Homelab/API keys, Homelab/SSH keys, Personal, Family).
     Break-glass card printing: DEFERRED until after Phase 3.5 Authelia (per Kay 2026-06-05). Reason: Authelia will generate additional critical secrets (admin password, JWT signing keys, MFA seed/backup codes) that should live on the same card. Printing once after Authelia means one card with the complete recovery set instead of two prints. Card contents to include at that point: Vaultwarden master, PVE root, iLO kay, Authelia admin + recovery codes, critical recovery IPs. Template lives at D:\PVE\break-glass-card.txt (current Vaultwarden-only version) - extend it before printing. Track until done.
  5. Authentication — Authelia SSO+MFA DONE 2026-06-05 (Phase 3.5). LXC 254 authelia, Debian 12 + Docker-in-LXC running authelia/authelia:latest 4.39.20 + redis:alpine sidecar (session storage). 1 vCPU / 1024 MB / 8 GB on local-lvm, nesting=1,keyctl=1 features for Docker, unprivileged, onboot=1 + systemd unit authelia-compose.service. Static 10.0.10.9/24 on VLAN 10. Authelia listens on 10.0.10.9:9091; fronted via Traefik at https://auth.hm.iamkay.eu/ with LE wildcard cert. SQLite storage (/opt/authelia/config/db.sqlite3); file-based user database (users_database.yml — single admin user kay with [email protected] + group admins; password argon2id-hashed with bitwarden preset m=65536/t=3/p=4); filesystem notifier (/opt/authelia/config/notification.txt — SMTP deferred until Path D family onboarding). PBS-backed daily. Traefik ForwardAuth integration: new /etc/traefik/dynamic/auth.yml defines the authelia middleware (http://10.0.10.9:9091/api/verify?rd=https%3A%2F%2Fauth.hm.iamkay.eu%2F) + the auth backend service; dashboard.yml replaces BasicAuth with the authelia@file middleware; vault.yml adds a higher-priority router for vault.hm.iamkay.eu/admin* through Authelia (user vault stays direct since Bitwarden mobile/desktop clients don't tolerate the redirect flow). Access control: default policy deny; auth.hm.iamkay.eu bypass; traefik.hm.iamkay.eu + vault.hm.iamkay.eu/^/admin.*$ require two_factor for group:admins. Session: 5 min inactive / 1h max / 1 month remember-me on cookie domain hm.iamkay.eu (SSO across all subdomains). Regulation: 3 failed attempts in 2 min → 5 min lockout (production-tight; may loosen to 10/5 min after enrollment-flow UX wrinkles). TOTP MFA: kay enrolled "Oak Techx Homelab" issuer (24:32:00 CEST), set as default method. Verified end-to-end: browser → traefik.hm.iamkay.eu/dashboard/ → 302 → auth.hm.iamkay.eu/?rd=... → login (password) → TOTP → dashboard loads with session cookie active across all subdomains ✅. 
Build gotchas surfaced: (a) Authelia 4.39 has multiple config deprecation warnings (auto-mapped jwt_secret, session domain, server.host/port, etc.) — works fine but config hygiene cleanup deferred; (b) tight regulation (3/2min) caused user lockout during enrollment-dialog-close UX wrinkle — cleared via redis-cli FLUSHALL + DELETE FROM authentication_logs WHERE username='kay' in SQLite; (c) Traefik router priority is auto-computed from rule length — explicit priority: 10 made the longer-rule router LOWER priority than the default Host-only router; fix is to remove the explicit priority and let Traefik auto-compute (longer-rule routers win automatically). Initial user credentials in Vaultwarden under "Authelia - kay user". Filesystem notifier verification codes retrievable via pct exec 254 -- tail /opt/authelia/config/notification.txt.
  6. Tailscale for remote/mobile access — DONE 2026-06-06 (~30 min, Session 9 extended). Path D.4 item (see home-cloud-tracker.md). LXC 255 tailscale-router, Debian 12 + Tailscale 1.98.4 + IP forwarding enabled + TUN device mounted (added lxc.cgroup2.devices.allow: c 10:200 rwm + lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file to /etc/pve/lxc/255.conf before first start). Static 10.0.10.12/24 on VLAN 10. Unprivileged, onboot=1. Configured as subnet router advertising 10.0.10.0/24 + 172.16.1.0/24 to the tailnet (captkay.github account, via GitHub login). Tailnet IP: 100.88.161.7. Tailnet IPv6: fd7a:115c:a1e0::ac35:a108. Subnet routes approved in Tailscale admin console. Split DNS configured: hm.iamkay.eu lookups from any tailnet device → forwarded to 10.0.10.53 (our DNS LXC 250). Kay's existing tailnet devices (iPhone72 + "md" laptop, both on tailnet from prior projects) auto-joined. Verified end-to-end: iPhone Safari → https://vault.hm.iamkay.eu/ loads with LE wildcard cert + Vaultwarden login → Kay's full vault renders with all 12 migrated items visible. Goal achieved: Vault + Authelia + every *.hm.iamkay.eu service reachable from phone/laptop anywhere on internet via Tailscale tunnel + LE-trusted TLS. WireGuard remains the purist alternative if Tailscale's coordination dependency ever becomes a concern (would migrate to Headscale self-hosted control plane). Bitwarden mobile app + laptop Tailscale install left as quick follow-ups (Kay already has both devices on the tailnet).
  7. Monitoring stack + Homepage launcher — DONE 2026-06-06 (Phase 3.6). LXC 256 monitoring, Debian 12 + Docker-in-LXC with 7 containers via docker-compose: Prometheus, Grafana, Loki, Promtail, cAdvisor, Uptime Kuma, Homepage. Static 10.0.10.14/24 on VLAN 10. 2 vCPU / 4 GB / 16 GB local-lvm. Features nesting=1,keyctl=1. Onboot=1 + systemd unit monitoring-compose.service. Components: Prometheus (30d retention, scrapes itself + Traefik + PVE host node_exporter at 10.0.10.5:9100 + cadvisor); Grafana (provisioned datasources for Prometheus + Loki; dashboard provider points at /var/lib/grafana/dashboards; Node Exporter Full dashboard #1860 imported with live PVE metrics streaming); Loki (filesystem storage, v13 schema, allow_structured_metadata, volume_enabled); Promtail (scrapes Docker logs + /var/log syslog/messages/daemon); Uptime Kuma (admin account created, signups auto-locked); cAdvisor (per-container metrics); Homepage launcher at home.hm.iamkay.eu with 4-section layout (Foundation/Admin/Observability/Coming Soon), siteMonitor probes with statusStyle: dot (10 green, 1 grey on iLO — known browser mixed-content blocking since iLO probe is HTTP, will be tracked via Uptime Kuma TCP probe instead). Behind Traefik + Authelia SSO+MFA: grafana.hm.iamkay.eu, kuma.hm.iamkay.eu, home.hm.iamkay.eu all require two_factor for group:admins per Authelia rules. Prometheus + Loki internal only (no external route, accessed via Grafana datasources). node_exporter installed on PVE host (prometheus-node-exporter Debian package, listens on 10.0.10.5:9100/metrics) so Prometheus can scrape host-level metrics. Build gotcha: Prometheus :9090 not exposed externally by design (only reachable inside Docker network); my Phase 7 verification curl on the host network returned HTTP 000 which triggered set -e to abort the script — resumed remaining phases manually. Credentials in Vaultwarden under "LXC 256 - monitoring (root)" + "Grafana - admin" + "Uptime Kuma - admin".
  8. DNS UX upgrade - Pi-hole DONE 2026-06-06 (Phase 3.6.5). Replaces LXC 250 (Unbound-only, now stopped via pct stop 250 for 1-week rollback window before destruction). LXC 253 pi-hole, Debian 12 + Docker-in-LXC running pihole/pihole:latest v6.6.2. Static 10.0.10.54/24 on VLAN 10. Bumped to 2 vCPU / 512 MB / 8 GB after initial 1 vCPU got overwhelmed by apt update DNS retry storms during the Session 10 marathon. Unbound sidecar dropped: mvance/unbound:latest in unprivileged LXC restart-looped trying to cp /dev/random (needs CAP_MKNOD which unprivileged LXC blocks). Worked around by using Pi-hole's direct upstream forwarding to 1.1.1.1 + 9.9.9.9 with DNSSEC validation enabled in Pi-hole settings - slightly weaker than full self-recursive but operationally simpler and the DNSSEC checks still happen. Add Unbound back later as either a privileged-LXC sidecar OR via a different upstream pattern. DNS records persisted via pihole.toml hosts = [...] array (toml-managed, not custom.list which Pi-hole v6 auto-wipes from template on restart). 24 internal records migrated (every *.hm.iamkay.eu + reverse PTRs for the major static IPs including pfSense 172.16.1.1). Admin UI at https://dns.hm.iamkay.eu/admin/ fronted by Traefik with LE wildcard, behind Authelia (cookie-domain SSO so it loads transparently when already authed). pfSense Domain Override repointed: hm.iamkay.eu → 10.0.10.54. Pi-hole v6 specific notes: (a) DNSMASQ_LISTENING env var deprecated → use FTLCONF_dns_listeningMode=ALL; (b) pihole reloaddns is broken in 6.6.2 (local: FTL_PID_FILE: readonly variable) → restart container instead. Vaultwarden entry "Pi-hole - kay admin". Cross-references: learning-tracker.md B.3 (Networking).

  9. GitLab CE + Runner DONE 2026-06-06 (Phase 3.7, Path A foundation). LXC 257 gitlab, Debian 12 + Docker-in-LXC running gitlab/gitlab-ce:latest + gitlab/gitlab-runner:latest containers via docker-compose. Static 10.0.10.50/24 on VLAN 10. Resources: 4 vCPU / 8 GB / 50 GB local-lvm. Features nesting=1,keyctl=1. External URL https://gitlab.hm.iamkay.eu/ via Traefik (LE wildcard, NOT behind Authelia per git/sync-client compatibility). External SSH port 2222 (container's :22 mapped to host 2222 to avoid conflict with LXC sshd). Runner registered with docker-executor, Docker Hub egress confirmed. Pipeline #1 PASSED end-to-end in 13s proving full chain: Docker-in-LXC + Traefik + LE + Pi-hole DNS + GitLab 2FA + Personal Access Token auth + Runner docker-executor + Docker Hub egress. Build script at /root/gitlab-build/build.sh on PVE host. Build gotchas captured: (a) gitlab/gitlab-ce takes 3-5 min to initialize past (health: starting); (b) Runner couldn't verify against https://gitlab.hm.iamkay.eu from inside Docker network because Docker bridge DNS resolves to internal-IP backend which only serves :80 not :443 — fix is to register Runner with --url http://gitlab.hm.iamkay.eu (HTTP for internal-bridge resolution). Cross-references: project-tracker.md Path A.

  10. Nextcloud DONE 2026-06-07 (Phase 3.8, Path D.1 foundation). Full detail in home-cloud-tracker.md D.1. Summary: LXC 258 + Docker-in-LXC (nextcloud:fpm-alpine + nginx-alpine + postgres:16-alpine + redis:alpine + cron). Static 10.0.10.60/24. Internal https://cloud.hm.iamkay.eu/ (LE wildcard) + external https://cloud.iamkay.eu/ (via Cloudflare Tunnel, separate LE cert). Data architecture: LOCAL data dir on LXC rootfs (50 GB), homenas NFS s-tank exported at /mnt/pve/nextcloud-data bind-mounted to /mnt/nfs in container, wired through Nextcloud's External Storage feature as "Family Storage" mount. NOT behind Authelia (sync clients).

  11. Cloudflare Tunnel DONE 2026-06-07 (Phase 3.9, Path D.3 foundation). Full detail in home-cloud-tracker.md D.3. LXC 259 cloudflared Docker-in-LXC. Tunnel homelab (ID 006ccc6c-c0c9-4811-9244-d6df77b8c966) routes cloud.iamkay.eu → Traefik internal. Cloudflare DNS CNAME cloud.iamkay.eu → <tunnel-id>.cfargotunnel.com proxied. Reusable for future external services (Jellyfin when D.2 lands → just add another Public Hostname route on the same tunnel).
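The heredoc mangling noted under item 3 can be sidestepped by writing the config to a local file with a quoted heredoc delimiter, which stops the shell from expanding backticks or `$` in the body. The sketch below shows only that local-write step; the function name `write_dashboard_yaml` and the router definition are illustrative and are not the real dashboard.yml.

```shell
# write_dashboard_yaml PATH: write a Traefik dynamic-config file locally.
# The quoted delimiter ('EOF') disables backtick and $ expansion, so the
# Host(`...`) rule reaches the file unchanged.
# The router below is a hypothetical example, not the production config.
write_dashboard_yaml() {
  cat > "$1" <<'EOF'
http:
  routers:
    dashboard:
      rule: "Host(`traefik.hm.iamkay.eu`)"
      service: api@internal
      middlewares:
        - authelia@file
EOF
}
```

From there the file would travel as described in item 3: copy it to the PVE host (for example `scp dashboard.yml pve:/root/traefik-build/`), then on the host run `pct push 251 /root/traefik-build/dashboard.yml /etc/traefik/dynamic/dashboard.yml`, so no config text ever passes through nested SSH quoting.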

After items 1-5 are done, Path A (GitLab CE — see project-tracker.md) and Path D (Nextcloud — see home-cloud-tracker.md) can start in parallel. Status: both started Session 10 (2026-06-06/07) — GitLab + Nextcloud both deployed.
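The `set -e` abort described in the item 7 build gotcha happens because curl exits non-zero when it cannot connect, even though it still prints 000 for `%{http_code}`. A hedged sketch of a probe that is safe to call from a `set -e` script follows; the names `http_code` and `expect_unreachable` are illustrative, not taken from the build script.

```shell
# http_code URL: print the HTTP status code for URL.
# curl exits non-zero when the connection fails (it still prints 000 for
# %{http_code}); `|| true` keeps that exit status from aborting a script
# that runs under `set -e`.
http_code() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1" || true
}

# expect_unreachable URL: succeed only when URL does NOT answer,
# for ports that are internal by design.
expect_unreachable() {
  [ "$(http_code "$1")" = "000" ]
}
```

In a build script this could read `if expect_unreachable http://10.0.10.14:9090/; then echo "Prometheus not exposed, as designed"; fi`, which records the expected result instead of treating it as a failure.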

Off-site cold backup target (Cloudflare R2 vs Backblaze B2) — decision deferred until before D.1 ships any family data.

Phase 4: Decommissioning obsolete pieces

Entity destruction (Proxmox-side): DONE 2026-05-31 late evening. All 22 obsolete entities destroyed. Only VM 189 remains. See reference_vm_inventory.md for the pre-cleanup snapshot, D:\PVE\pre-destruction-inventory-20260531-205954.txt for the per-entity destruction-time inventory, and SESSION-LOG.md 2026-05-31 late evening for the forensic walk.

What's left to do in Phase 4

Phase 2 follow-up: orphaned disk file cleanup on fast NFS. 37 disk files (~561 GB allocated, actual usage smaller due to qcow2 sparse + thin provisioning) remain physically on the fast NFS share because fast was offline during destroy and qm destroy couldn't reach to remove them. They're orphaned: no Proxmox config references them anymore, but the bytes are still on the NAS.

Trigger: once Phase 2 network rebuild brings fast NFS back online (VLAN 70 routing restored, homeNas exporting again).

Per-file inventory: D:\PVE\orphaned-fast-disks-20260531-212124.txt (also at /root/orphaned-fast-disks-20260531-212124.txt on host). Lists each orphan by fast:<vmid>/<filename> + allocated size + source entity.

Cleanup procedure:

  1. Confirm pvesm status shows fast active.
  2. Cross-check the orphan tracker against what's actually on disk:
    pvesm list fast | awk '{print $1}' | sort > /tmp/fast-actual.txt
    
    Then compare with the orphan tracker's fast:<id>/<file> entries.
  3. Remove orphan files (one option):
    for f in $(awk '/^fast:/ {print $1}' /root/orphaned-fast-disks-20260531-212124.txt); do
      pvesm free "$f" || echo "FAILED: $f"
    done
    
    pvesm free is the Proxmox-aware delete (cleans up the storage layer's tracking too). Direct rm on the NFS path also works but skips Proxmox's bookkeeping.
  4. Verify nothing left:
    pvesm list fast | grep -v "^Volid" | wc -l   # should be 0 if no other content
    
  5. Decide on the fast storage entry itself:
       • Keep if homeNas will continue exporting it for some future use
       • Remove (pvesm remove fast) if no longer needed
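The comparison in step 2 can be sketched as a small helper that reports differences in both directions. It assumes, as the removal loop in step 3 does, that tracker entries start with `fast:` in the first column; the function name `compare_orphans` and the output prefixes are illustrative.

```shell
# compare_orphans TRACKER ACTUAL
#   TRACKER: the orphan tracker file (first column holds fast:<id>/<file>)
#   ACTUAL:  the volid list saved in step 2 (/tmp/fast-actual.txt)
# Prints volids that appear on only one side.
compare_orphans() {
  want=$(mktemp)
  have=$(mktemp)
  # Keep only volid lines; this also drops the "Volid" header from pvesm list.
  awk '/^fast:/ {print $1}' "$1" | LC_ALL=C sort -u > "$want"
  awk '/^fast:/ {print $1}' "$2" | LC_ALL=C sort -u > "$have"
  # comm needs both inputs sorted the same way, hence LC_ALL=C throughout.
  LC_ALL=C comm -23 "$want" "$have" | sed 's/^/MISSING-ON-DISK /'
  LC_ALL=C comm -13 "$want" "$have" | sed 's/^/NOT-IN-TRACKER /'
  rm -f "$want" "$have"
}
```

Running `compare_orphans /root/orphaned-fast-disks-20260531-212124.txt /tmp/fast-actual.txt` before step 3 should print nothing if the tracker and the share agree; any NOT-IN-TRACKER line is a volume the removal loop would leave behind and deserves a look before deciding on the storage entry.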

Backup-NAS storage entry — no orphans, decide post-cleanup

Backup-NAS (NFS) has no VM/CT disks referencing it (verified pre-cleanup). When the network rebuild restores it, decide whether to keep the storage definition for future PBS off-site use or remove it (pvesm remove Backup-NAS).

Phase 5: Hardware monitoring (informational, no blocker)

  • Watch Temp 30 sensor (currently 69 C, alert if >85 C)
  • Watch both PSUs for recurrence of cold-start failures
  • Plan CMOS battery replacement (next maintenance window)
  • Plan BIOS update via SPP ISO (after sandboxed download succeeds)
  • Plan Crucial M4 SSD SMART check (Bay 5-8, ~14 years old) — folded into the comprehensive storage review below
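The Temp 30 watch above could be scripted against the ipmitool already installed on the PVE host. This is a sketch only: it assumes pipe-separated `ipmitool sdr` lines of the form `name | reading | status` with readings written as `NN degrees C`, and that the sensor is reported under the name "Temp 30"; the function name `temp_alert` is illustrative.

```shell
# temp_alert NAME LIMIT: read `ipmitool sdr`-style lines on stdin and print
# "<name> <temp> C ok" or "<name> <temp> C ALERT" for the named sensor.
temp_alert() {
  awk -F'|' -v name="$1" -v limit="$2" '
    {
      n = $1
      gsub(/^[ \t]+|[ \t]+$/, "", n)      # trim padding around the sensor name
      if (n != name) next
      for (i = 2; i <= NF; i++)
        if ($i ~ /degrees C/) {
          t = $i + 0                      # leading number of the reading field
          verdict = (t > limit) ? "ALERT" : "ok"
          printf "%s %d C %s\n", name, t, verdict
        }
    }'
}
```

With the threshold from the list above, a cron job might run `ipmitool sdr | temp_alert "Temp 30" 85` and act on any line ending in ALERT.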

Phase 5.5: Comprehensive storage review (multi-session, planned)

User-requested deep review of storage architecture. Recognised as a multi-session effort (not a single investigation). Scope includes:

  • Crucial M4 SSDs (Bay 5-8) age + wear assessment. 2011-era consumer drives; need SMART data, wear-level %, pending sectors, reallocation events. Plan replacement triggers.
  • Layered storage architecture audit. How local, local-lvm, fast, Backup-NAS, and the passthrough virtio disks for VM 189 actually map onto physical bays, RAID volumes, and NFS exports. Document with a clear topology diagram (current + target).
  • Backup posture. Today there are effectively no backups — VM 189 holds primary data, nothing copies it elsewhere. Build the PBS plan (homelab-tracker Phase 3 item 3.1) within this review.
  • Family-data risk surface. Once Path D goes live, data loss = lost family photos. Reconcile current single-point-of-failure storage with the family-cloud responsibility.

Draft prompt parked at: D:\PVE\planned-storage-review-prompt.md — use as the starting point for a future session when Kay is ready to undertake the review. Multi-session work; don't try to do it in one pass.

Phase 6: HA Proxmox cluster expansion + OPNsense (plan committed 2026-06-05)

Two HP EliteDesk 800 G5 SFF units arrived 2026-06-05. Two more to be purchased ~September 2026 (3 months from 2026-06-05, per Kay's 2026-06-05 statement "will buy more in 3 months time"): one as the 3rd cluster node (quorum tiebreaker), one dedicated to OPNsense. This plan supersedes the earlier "OPNsense + observability" framing for these units. Kay's intent (verbatim 2026-06-05): "despite the fact that the g7 can do all things I want to have other units too in the HA for now."

Plan for the 2 existing G5s (June-September 2026) — COMMITTED 2026-06-05 (after one bad-framing iteration). Kay's intent: "i don't want them to wait. i spent money to get them. they must not be in HA to use them for now" AND "we have not installed proxmox on the G5 units. lets do everything on G7 and migrate things appropriately after the foundation tasks and the G5s has been setup."

Storage spec for G5 units - COMMITTED 2026-06-07: Kay plans to buy 2× 2 TB NVMe drives (one per G5) ahead of the migration phase. Trigger context: G7's local-lvm thin pool became overcommitted during the Session 10 marathon (382 GB allocated vs 475 GB VG; 16 GB physically free) once the Nextcloud rootfs was bumped 50 → 200 GB to accommodate Kay's ~350 GB existing cloud-storage corpus. Kay's verbatim 2026-06-07: "we will have to move them to separate storages soon. i will buy 2tb nvme each for the G5 unites and we move them to each of them." Migration target: when both G5s have Proxmox + NVMe, the heaviest LXCs (258 Nextcloud, 257 GitLab) move OFF G7's local-lvm onto the per-node NVMe. G7 local-lvm then frees up significantly for what remains there (DNS, Traefik, Authelia, Vaultwarden, monitoring). Spec to verify before purchase: M.2 2280 NVMe Gen3 x4 fits the G5 SFF's onboard M.2 slot; G5 800 SFF docs confirm 1 onboard M.2 slot. Brand preference deferred; budget-friendly consumer-class drives (Samsung 970 EVO Plus, WD Black SN770, Crucial P3 Plus 2 TB) are acceptable since these serve a 24/7 homelab, not a write-heavy DB workload. Note that the P3 Plus is QLC rather than TLC, and that the SN770 and P3 Plus are Gen4 drives that would run at Gen3 speed in a Gen3 slot.
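
A thin pool is overcommitted when the virtual sizes of its volumes add up to more than the pool can physically hold. The sketch below illustrates that arithmetic only; the function name and the numbers in the example are hypothetical and are not the G7's real figures (those come from lvs on the host).

```sh
# Print how far the summed virtual sizes exceed the pool size, in GB.
# A result of zero or less means the pool is not overcommitted.
# Usage: thin_overcommit_gb <pool_gb> <volume_gb>...
thin_overcommit_gb() {
    pool_gb=$1
    shift
    total=0
    for size_gb in "$@"; do
        total=$((total + size_gb))
    done
    echo $((total - pool_gb))
}

# Hypothetical example: a 400 GB pool backing 200 + 150 + 100 GB volumes.
thin_overcommit_gb 400 200 150 100   # prints 50
```

Running the same check before any future rootfs resize would flag the overcommit before the pool fills.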

Three parallel tracks:

  1. G7 keeps carrying everything for now. All current services (LXC 250 DNS, 251 Traefik, 252 Vaultwarden, 254 Authelia, VM 189 homeNas, VM 200 PBS) stay put. All NEW foundation work in the queue (Tailscale, Monitoring stack, Pi-hole, GitLab CE for Path A, Nextcloud for Path D) also lands on G7 during the foundation phase. G7 has the headroom; this is the lowest-friction path.

  2. G5 #1 + G5 #2: install Proxmox as a setup task (independent of foundation work). DEFERRED 2026-06-07 — waiting on 2 TB NVMe drives before install (Kay's 2026-06-07 call: "lets keep the G5 installations for now. i want to get the extra nvme cards first"). Originally scheduled 2026-06-07 (Sunday) per Kay's 2026-06-05 commit; reason for deferral: avoid installing PVE on stock disk and migrating to NVMe later. Install with the right storage from day 1. Trigger to resume: 2x 2 TB NVMe in hand (see Phase 6 "Storage spec for G5 units" note). Fresh PVE 8.x install on each (use latest version available at install time, NOT 8.1.5), no cluster yet, no workloads. Network config: VLAN 10 (same as G7), static IPs to be assigned. Goal: have both G5s booted, on the network, reachable via web UI, ready to receive migrated workloads when foundation work completes. This is a 2-3 hour setup task per unit (USB install + IP config + storage layout decision + PBS register).

  3. Migration phase (later, after foundation done AND G5s PVE-installed). Once foundation tasks are complete AND both G5s have Proxmox + are PBS-registered, evaluate which services move where:
    • Likely candidates for G5 #1: monitoring stack (Prometheus/Grafana/Loki/Uptime Kuma), DNS (Pi-hole + Unbound sidecar), Authelia. These are non-storage-dependent services that benefit from being decoupled from the storage host.
    • Likely candidates for G5 #2: GitLab CE + Runners, Nextcloud (compute side), Jellyfin transcode workers (application services).
    • Stays on G7: VM 189 homeNas (the storage), VM 200 PBS (the backup target, too disruptive to move), the ZFS data pool when it lands.
    • Migration mechanism: PBS-mediated. Backup on G7, restore on G5, switch DNS/Traefik to point at new IPs, verify, then remove from G7.
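
The PBS-mediated mechanism can be sketched as a printed checklist for one container. This is a dry-run helper only: the function name is hypothetical, the storage IDs passed in the example (pbs, local-nvme) are assumptions, and the backup volume ID is left as a placeholder to be read from the PBS datastore listing. For a VM, qmrestore takes the place of pct restore.

```sh
# Print the planned migration steps for one LXC; nothing is executed.
# Usage: plan_ct_migration <vmid> <backup-storage-id> <target-storage-id>
plan_ct_migration() {
    vmid=$1
    backup_store=$2
    target_store=$3
    printf 'on G7: vzdump %s --storage %s --mode snapshot\n' "$vmid" "$backup_store"
    printf 'on G5: pct restore %s <backup-volid> --storage %s\n' "$vmid" "$target_store"
    printf 'then:  repoint DNS/Traefik, verify, and only then remove %s from G7\n' "$vmid"
}

# Example for the Nextcloud container (storage IDs are assumptions):
plan_ct_migration 258 pbs local-nvme
```

Printing the plan first keeps the removal from G7 as an explicit last step that happens only after verification.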

When G5 #3 arrives (September 2026):

  1. Backup all VMs/LXCs on G5 #1 + G5 #2 via PBS.
  2. Cluster-build with all 3 G5s.
  3. Provision cluster shared storage (NFS off G7's ZFS pool, assuming HBA + ZFS landed by then).
  4. Restore VMs from PBS onto cluster shared storage.
  5. Enable HA groups for production-critical workloads.
  6. Decommission the standalone-node configs.

PBS-mediated migration is a known clean pattern. Expect 1 day of cluster work + workload-migration window. No data loss; PBS backups are the ground truth.

Topology (committed 2026-06-05)

| Unit | Status | Role | Workloads |
| --- | --- | --- | --- |
| G5 #1 | On hand | Node 1 - Infrastructure | GitLab CE + CI/CD runners + DNS + monitoring |
| G5 #2 | On hand | Node 2 - Applications | Assembyl + CommerceBridge staging + PostgreSQL + Redis + RabbitMQ |
| G5 #3 | To buy ~September 2026 | Node 3 - Quorum + workload reserve | Corosync quorum tiebreaker; spillover capacity for either role; runs lighter workloads (Uptime Kuma, log shipping, secondary monitoring) |
| G5 #4 | To buy ~September 2026 | OPNsense firewall | Replaces pfSense SG-1100 when ready; gains multi-gig throughput, more flexible VPN options. NOT part of the Proxmox cluster. |
| G7 arochukwu | On hand (production) | Standalone storage + workload host | VM 189 homeNas, VM 200 pbs, ZFS data pool (post-HBA). Stays OUTSIDE the cluster, keeping production storage decoupled from HA failover decisions. Provides shared storage to the cluster via NFS/iSCSI. |

Hardware as ordered (per unit)

  • Intel i5-9500 (6c/6t, Coffee Lake Refresh, 65W TDP, LGA1151)
  • 16 GB DDR4 UDIMM (likely 1× 16GB; check for second free slot on receipt)
  • 256 GB SSD (boot drive — fine for PVE root + a couple of VMs initially)
  • LGA1151 socket → CPU upgrade path is open

Planned upgrades (when budget allows)

  • 64 GB DDR4 desktop UDIMM (288-pin, NOT SODIMM — EliteDesk 800 G5 SFF takes full-size DIMMs). Note: 64 GB single-DIMM doesn't exist in DDR4 UDIMM consumer grade; this means 64 GB total — most likely 2× 32GB or 4× 16GB. Confirm board's max-per-slot at upgrade time (the SFF chassis has 2 or 4 DIMM slots depending on motherboard rev).
  • 2 TB NVMe M.2 for fast local storage (PVE root or VM disks). EliteDesk 800 G5 SFF has one M.2 NVMe slot under the 2.5" drive cage.
  • CPU upgrade to i9-9900 (8c/16t, 65W TDP — staying in the 65W envelope keeps SFF cooling happy). i9-9900K (95W) NOT recommended in the SFF chassis without thermal validation.

Location

Both units live in the kast (cupboard) upstairs alongside the TL-SG108E v1 switch — same rack zone as the G7's network endpoints. Short cable runs to switch ports.

Decisions committed 2026-06-05

  1. Quorum strategy — DECIDED 2026-06-05. Three matching G5s in the cluster (G5 #1 Infrastructure + G5 #2 Applications + G5 #3 Quorum/reserve). G7 stays standalone, decoupled from HA failover decisions. Cleanest possible split: production storage host (G7) is independent of the cluster's quorum and failover. Cluster can be rebooted, re-clustered, or rebuilt without touching VM 189 / VM 200. Cost: one extra G5 (~€330) vs the cheaper "G7-as-node-3" option, but worth it for the blast-radius decoupling.
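
The reason a third node matters comes down to majority arithmetic. A minimal sketch, assuming each node carries one vote (the corosync default); the function names are hypothetical.

```sh
# Votes needed for quorum: a strict majority of the node count.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Nodes that can fail while the remainder still holds a majority.
failures_tolerated() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

quorum_needed 3        # prints 2
failures_tolerated 3   # prints 1
failures_tolerated 2   # prints 0
```

A two-node cluster tolerates no failures, which is why G5 #3 is described as the quorum tiebreaker.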

Open architecture questions (need decisions before deploy)

  1. Shared storage for live migration. HA failover needs VM disks reachable from any surviving node. Three options:
    • NFS off the G7 (host an export from PVE root or the future ZFS pool). Simplest; introduces G7 as SPOF for the cluster's shared storage.
    • iSCSI off the G7 (LVM-thin or ZFS-block targets). More performant than NFS for VM disk I/O; more setup complexity.
    • Ceph across all three nodes: Proxmox-native, eliminates G7-as-SPOF, but needs at least 1 dedicated SSD per node and ~10 GbE for sane performance. Probably too ambitious for the 256 GB SSD initial config.
    • Default lean: NFS off the G7's ZFS pool, once the pool exists (post-HBA chassis-open). Revisit Ceph after both G5s have the NVMe upgrade.

  2. Cluster network. Corosync wants low-latency, low-jitter links, ideally a dedicated NIC/VLAN that is not shared with VM traffic. Options:
    • Single NIC per G5, all traffic on VLAN 10 (with corosync sharing the bus). Works for ≤3 nodes, but a sustained backup or migration can starve corosync and cause spurious node fences.
    • Dedicated VLAN for corosync over the same physical NIC (acceptable for 3-node).
    • Add USB-Ethernet adapters or PCIe x1 NICs to the G5s for a dedicated cluster ring. Cheap (~€10-30 per adapter) and meaningfully more reliable.
    • Default lean: dedicated VLAN (e.g., VLAN 50 = CLUSTER) on the existing NIC, set as corosync's link0. Upgrade to physical-NIC separation only if we see issues.

  3. Boot ordering. Node 1 hosts DNS, and Node 2 services depend on DNS resolution to function. Need either: (a) bring Node 1 up first manually, (b) make Node 2 fall back to public DNS at boot, or (c) host a secondary DNS resolver elsewhere (G7? pfSense Unbound?). Lowest-risk default: pfSense Unbound is the primary external resolver during cluster bring-up; the Node 1 internal DNS LXC serves only *.hm.iamkay.eu and Node 2 has pfSense as DNS fallback.

  4. Networking to the kast. All three cluster G5s need cable runs to TL-SG108E ports. Switch currently has ports 1 (pfSense uplink), 2-3 (G7 bond0 slaves), 6 (iLO), 7 (laptop drop) in use. Free: ports 4, 5, 8 (three ports: exactly enough for three G5s, but no headroom for the OPNsense G5 #4 or any future expansion). The switch runs out of ports when G5 #4 (OPNsense) arrives; a TL-SG108E upgrade or replacement (e.g., MikroTik CRS310 per Section 8 of CLAUDE.md) becomes blocking at that point. VLAN trunking config per cluster-node G5 port = same as G7 ports 2/3 (VLAN 10 + VLAN 40 tagged + PVID 1 native) plus VLAN 50 if we add it for corosync.

  5. pfSense → OPNsense migration (new, added 2026-06-05). When G5 #4 arrives (~September 2026 per the topology table), all current pfSense config moves to OPNsense: VLAN interfaces (10/40/50), firewall rules including the recent VLAN 40 lockdown, Kea-DHCP scopes, Unbound DNS resolver, gateway IPs. Two failure modes worth planning for: (a) cutover window where all VLAN routing is offline; (b) config-translation mistakes (pfSense and OPNsense share ancestry but rules/scopes are not 1:1 portable). Mitigation: stage OPNsense in parallel on the new G5, validate rule-by-rule with both firewalls reachable via VLAN 40 direct L2, then switch the WAN cable + LAN trunk over in a planned window. Keep pfSense SG-1100 powered-off-but-cabled as a one-cable rollback for the first week.
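
The switch-port budget from the kast networking question can be checked mechanically. A small sketch with a hypothetical function name; the port numbers in the example are the ones listed for the 8-port TL-SG108E.

```sh
# List the free ports on an N-port switch given the ports already in use.
# Usage: free_ports <total-ports> <used-port>...
free_ports() {
    total=$1
    shift
    used=" $* "          # pad with spaces so " 1 " never matches inside " 10 "
    free=""
    port=1
    while [ "$port" -le "$total" ]; do
        case "$used" in
            *" $port "*) ;;                  # port is taken
            *) free="$free $port" ;;         # port is free
        esac
        port=$((port + 1))
    done
    printf '%s\n' "${free# }"
}

# Ports 1 (pfSense uplink), 2-3 (G7 bond0), 6 (iLO), 7 (laptop drop) are in use.
free_ports 8 1 2 3 6 7   # prints: 4 5 8
```

Three free ports cover the three cluster nodes and leave none for G5 #4, which matches the point at which the switch upgrade becomes blocking.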

Phase 6 sequencing (provisional)

  1. Inventory + boot test each G5 standalone before cluster join. Confirm: BIOS rev, current OS state, NIC MAC, SSD health, RAM slot population, M.2 slot present, idle power draw at the wall.
  2. Install Proxmox VE 8.x on each G5 (USB-key install, default options). Set static IPs on VLAN 10. Match the G7's PVE major version.
  3. Confirm the quorum decision (decided 2026-06-05; see "Decisions committed" above) before joining anything to a cluster; it is easier to plan than to undo.
  4. Build the cluster (pvecm create on whichever node is the founder, then pvecm add on the others). Verify quorum + corosync logs clean.
  5. Provision shared storage per the shared-storage question above. Add as a storage entry on all cluster nodes.
  6. Migrate first non-critical VM (probably a fresh test VM, NOT VM 189 or VM 200) between nodes manually to validate the storage + cluster plumbing. Only after that works, configure HA groups.
  7. Place actual workloads per the topology table. GitLab and Assembyl/CommerceBridge are separate deployment projects in their own right; they have their own readiness gates (Path A in project-tracker.md).
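
The cluster-build step of the sequence above can be written out as a printed plan before anything is run. This is a sketch only: the function name, cluster name, and founder address in the example are hypothetical placeholders, and nothing is executed.

```sh
# Print the cluster bring-up commands in order; nothing is executed.
# Usage: plan_cluster_build <cluster-name> <founder-ip>
plan_cluster_build() {
    cluster_name=$1
    founder_ip=$2
    printf 'on founder:     pvecm create %s\n' "$cluster_name"
    printf 'on each joiner: pvecm add %s\n' "$founder_ip"
    printf 'on any node:    pvecm status\n'
}

# Hypothetical name and placeholder address:
plan_cluster_build g5-cluster '<founder-ip>'
```

If the dedicated corosync VLAN is adopted, the create and add steps would also need the link0 address for that VLAN; that detail is left out here because the addressing is not decided yet.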

Open dependencies

  • HBA + ZFS pool (Phase 2 / Section 8 in CLAUDE.md) — needs to land BEFORE NFS-off-G7 storage option becomes available
  • Internal DNS LXC (foundation Phase 3.2) — Node 1 will eventually host this, but the LXC itself can be built on G7 first and migrated later (DONE Session 9 on G7 as LXC 250; migrate to G5 cluster when stable)
  • Network plan for VLAN 50 (if we go that route for corosync)

Dual-use: k8s learning platform (Path B alignment)

The same 3-node G5 cluster also serves as the k8s learning testbed per learning-tracker.md B.5 (Containers and orchestration). Two parallel stacks on the same physical hardware:

  • Production = Proxmox HA + LXCs/VMs (what runs GitLab, Nextcloud, Authelia, etc.)
  • Learning = k3s cluster spanning the same 3 nodes, running non-load-bearing apps for Path B hands-on

Coexistence pattern: k3s installed on a separate VM on each node (so production LXCs and the k3s control plane don't share kernel) OR as a separate physical partition of the cluster's resources. Decide closer to deploy time. The key constraint: nothing on the k3s side can affect production VM availability.

Cross-reference: CLAUDE.md §9 "Future HA k8s cluster" placeholder, learning-tracker.md B.5.

VM/LXC naming convention (committed 2026-06-05)

Every VM/LXC gets a descriptive name. Never leave the default "VM <vmid>".

  • Production: role-only (homeNas, pbs-backup-server, gitlab, nextcloud, traefik, authelia, dns-internal)
  • Test / restore-drill / scratch: role + suffix (homenas-restore-test, pbs-restore-test)
  • Templates: prefix tpl- (tpl-debian12-cloudinit, tpl-ubuntu22-cloudinit)
  • Throwaway / sandbox: suffix -scratch (gitlab-scratch, nextcloud-scratch)

Set via qm set <vmid> --name <name> (VMs) or pct set <vmid> --hostname <name> (LXCs), or at create-time. A quick qm list or pct list is then self-documenting.
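
The convention is regular enough to check with a few pattern matches. A minimal sketch with a hypothetical function name; it assumes test and restore-drill names end in -test (as in the two examples above) and treats any name still starting with "VM " as unnamed.

```sh
# Classify a guest name according to the naming convention.
classify_guest_name() {
    case "$1" in
        ""|"VM "*)  echo unnamed ;;      # default name was never replaced
        tpl-*)      echo template ;;
        *-scratch)  echo sandbox ;;
        *-test)     echo test ;;         # assumed suffix for test/restore-drill
        *)          echo production ;;
    esac
}

classify_guest_name tpl-debian12-cloudinit   # prints template
classify_guest_name gitlab-scratch           # prints sandbox
classify_guest_name homenas-restore-test     # prints test
classify_guest_name homeNas                  # prints production
```

Feeding the name column of qm list through this function would surface any guest that slipped through without a descriptive name.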

Known issues / gotchas to remember

  • iLO 3 v1.94 quirks documented in reference_ilo3_v194_quirks.md
  • Proxmox is tightly bound to hostname "arochukwu" - do not rename
  • VM 189 homeNas + storage arrays are NEVER-TOUCH without explicit confirmation
  • Do NOT enable FIPS mode in iLO (would wipe all settings, license, users)
  • pfSense SG-1100 caps at ~1Gbps throughput
  • TL-SG108E supports static LAG only, not dynamic LACP