CLAUDE.md - Homelab Server Handover (Oak Techx)

This file gives Claude Code the context and rules to work on Kay's HP DL380 G7 Proxmox server. Read it fully before running ANY command on the server.

Last updated: 7 June 2026 (Session 10 marathon: Phase 3.6 Pi-hole + Phase 3.7 GitLab + Phase 3.8 Nextcloud + Phase 3.9 cloudflared tunnel DONE in one sitting; OMV shared-folder cleanup also done).

NEW: Phase 3.8 Nextcloud DONE 2026-06-07 - LXC 258 Docker-in-LXC (nextcloud:fpm-alpine + nginx-alpine + postgres:16-alpine + redis:alpine + cron) at https://cloud.hm.iamkay.eu/ (internal, LE wildcard) AND https://cloud.iamkay.eu/ (external via Cloudflare Tunnel, separate LE cert auto-issued for cloud.iamkay.eu). Static 10.0.10.60/24. Storage architecture (Kay's no-shortcuts call): Nextcloud's /data dir LOCAL on LXC rootfs (for app init/thumbnails/metadata), homenas NFS-exported nextcloud-data (on s-tank, 1.4 TB free) bind-mounted into LXC at /mnt/nfs and wired through Nextcloud's built-in External Storage feature as "Family Storage" mount. Reason for not using NFS as /var/www/html/data directly: nextcloud:fpm-alpine entrypoint does rsync --chown www-data:root which conflicts with NFS+unprivileged-LXC permission model regardless of all_squash/anonuid setup. External Storage is the upstream-blessed pattern for NAS backends. NOT behind Authelia (sync clients break with ForwardAuth redirects - Kay verified internal+external work end-to-end).

NEW: Phase 3.9 Cloudflare Tunnel DONE 2026-06-07 - LXC 259 cloudflared Docker-in-LXC running cloudflare/cloudflared:latest, static 10.0.10.70/24, token-mode tunnel homelab (ID 006ccc6c-c0c9-4811-9244-d6df77b8c966). External cloud.iamkay.eu routes through tunnel to Traefik at https://10.0.10.10:443 (route uses hostname not IP, plus extra_hosts: ["cloud.iamkay.eu:10.0.10.10"] in cloudflared docker-compose so cert validation passes - the new CF dashboard UI doesn't expose Origin Server Name / No TLS Verify, so we work around it at the resolver layer).

Trial-by-fire findings: (a) Traefik's web entrypoint has hard-coded HTTP→HTTPS redirect that fired on cloudflared's initial HTTP backend attempts → had to add LE cert for cloud.iamkay.eu and move CF tunnel to HTTPS; (b) LXC 251 Traefik's /etc/resolv.conf still pointed at dead 10.0.10.53 (stopped LXC 250 Unbound) → ACME calls couldn't resolve LE API → fixed via pct set 251 -nameserver 10.0.10.54 (Pi-hole); (c) Pi-hole was undersized at 1 vCPU and got overwhelmed by apt update DNS retry storms during the build → bumped to 2 vCPU, immediately stable.

NEW: OMV shared-folder cleanup 2026-06-07 - 333 GB reclaimed on r-tank: deleted obsolete shared folders (p-backup 333 GB stale vzdumps from destroyed VMs; ext-store 0 B; k8s-data 0 B; k8sdata 0 B; mac-store 1.7 GB TrueNAS ISO + .DS_Store cruft per Kay's call). Pre-cleanup OMV config.xml backed up to /root/config.xml.backup-pre-cleanup-20260606-230623 on homenas. Window-store (65 GB; contains wedding-2018-12-22 .mov recovery files in r-tank/windows/Recovered/) preserved untouched. New nextcloud-data shared folder created on s-tank with NFS export to 10.0.10.5/32 only (RW, subtree_check,insecure,all_squash,anonuid=100033,anongid=100033). 5 obsolete empty stub dirs at the tank-root level deferred (need per-target confirmation per §0 #2).

NEW: Phase 3.7 GitLab CE DONE 2026-06-06 - LXC 257 + Docker-in-LXC gitlab/gitlab-ce:latest + gitlab/gitlab-runner:latest at https://gitlab.hm.iamkay.eu/. Pipeline #1 PASSED end-to-end (13s) proving Docker-in-LXC + Traefik + LE + Pi-hole DNS + 2FA + PAT + Runner docker-executor + Docker Hub egress all working. NOT behind Authelia (git+SSH/HTTPS sync clients).

NEW: Phase 3.6.5 Pi-hole replacement DONE 2026-06-06 - LXC 253 (replaces LXC 250 Unbound, now stopped via pct stop 250 for 1-week rollback window). DNS records in pihole.toml hosts = [...] array (toml-managed, persistent across container restart). pfSense Domain Override repointed to LXC 253. Pi-hole admin behind Authelia (cookie-domain SSO so it loads silently when already authenticated elsewhere).

Original Session 9: opportunistic chassis-open completed - riser + CR2032 CMOS battery + P410i BBWC battery all installed. BBWC firmware-rejected: genuine HP 462976-001 / 460499-001 Ni-MH 4.8V battery (Mfg 03/26, VARTA Germany) reports Status=OK, but P410i firmware 3.66 marks cache as "Permanently Disabled" with reason "wrong backup power source is attached to the cache module" - 3.66 doesn't recognize the newer-rev battery signature. Fix path: P410i firmware update to 6.64 via SPP 2017.10.1 (Section 11 #19, now load-bearing not optional). Operating in write-through mode until then; safe but slower.

ssacli 6.15-11.0 successfully sideloaded from HPE MCP debian/dists/bookworm/current/non-free/binary-amd64/Packages.gz route - Section 11 #15 toolchain unblocked. Daily PBS backup job created (02:00, all VMs except 200, GFS retention). VM 190 orphan LV reclaimed.

Storage decision: Path B1 (HYBRID) - committed. P410i KEEPS the boot role; SAS HBA (LSI 9305-16i target) ADDED for new ZFS data pool. Detail in reference_p410i_ssd_compatibility.md.

NEW: HA Proxmox cluster plan COMMITTED 2026-06-05 - 2x EliteDesk 800 G5 SFF on hand + 2 more to buy ~September 2026 (3 months from now per Kay's 2026-06-05 statement: G5 #3 = cluster quorum/reserve, G5 #4 = OPNsense to replace pfSense). G7 stays STANDALONE (decoupled from cluster's HA failover). Architecture in homelab-tracker.md Phase 6.

NEW: Phase 3.3 Traefik DONE 2026-06-05 - LXC 251 + Traefik v3.7.4 + LE production wildcard cert *.hm.iamkay.eu (valid through Sept 3 2026, auto-renews via Cloudflare DNS-01) + dashboard at https://traefik.hm.iamkay.eu/dashboard/ with BasicAuth. pfSense Unbound Domain Override sends all hm.iamkay.eu queries to LXC 250.

NEW: Phase 3.6 Monitoring stack + Homepage launcher DONE 2026-06-06 - LXC 256 Docker-in-LXC with 7 containers (Prometheus + Grafana + Loki + Promtail + cAdvisor + Uptime Kuma + Homepage). Grafana dashboards via Prometheus (PVE host node_exporter) + Loki (container logs). Homepage launcher at home.hm.iamkay.eu with 4-section tile layout + colored status dots. All behind Authelia SSO+MFA.

NEW: Phase 3.5 Authelia SSO+MFA DONE 2026-06-05 - LXC 254 Docker-in-LXC (authelia 4.39.20 + redis sidecar) at https://auth.hm.iamkay.eu/, file-based user kay with argon2id password, TOTP "Oak Techx Homelab" enrolled, ForwardAuth middleware on Traefik replaces BasicAuth on dashboard + protects vault /admin*, cookie domain hm.iamkay.eu gives SSO across all subdomains. Verified end-to-end: traefik dashboard loads via session cookie after single login.

NEW: Phase 3.4 Vaultwarden DONE 2026-06-05 - LXC 252 + Docker-in-LXC vaultwarden/server:latest at https://vault.hm.iamkay.eu/, signups disabled, admin token argon2-hashed, Cloudflare token rolled, 12 credentials migrated by Kay. Secrets vault gap CLOSED.

NEW: admin services routed through Traefik with LE cert 2026-06-05 - arochukwu/pbs/homenas.hm.iamkay.eu all behind Traefik with LE wildcard, browser-trusted. iLO stays direct (TLS 1.0 incompatible).

DEFERRED until after Phase 3.5 Authelia (per Kay): break-glass card - single paper print should include Vaultwarden master + Authelia admin/MFA + PVE root + iLO kay together, so we wait until Authelia generates those secrets. Template at D:\PVE\break-glass-card.txt.

NEW: SPP G7.1.3 + SPP 2017.10.1 ISOs DOWNLOADED + SHA256-VERIFIED + STAGED in PVE ISO storage 2026-06-05 (Section 11 #19) - both ISO files at /var/lib/vz/template/iso/, visible in PVE web UI as ISO Images. NOT YET FLASHED to the controller - this is staging only. Provenance + checksums recorded in /var/lib/vz/template/iso/SPP-CHECKSUMS.txt. The actual firmware flash (mount via iLO Virtual Media → boot → SPP updater → P410i fw 6.64 → BBWC enable) happens at the second chassis-open (~July with HBA install).

The homelab supports four parallel goals: Path A (self-hosted developer services replacing paid SaaS — see project-tracker.md), Path B (skills curriculum — see learning-tracker.md), Path C (content + thought leadership — see content-tracker.md, opened 2026-06-07 when LinkedIn series + iamkay.eu/oak-techx.com blogs became a tracked path), and Path D (family home cloud serving 3-5 family members — see home-cloud-tracker.md). All four share the same foundation tracked in homelab-tracker.md.

REQUIRED READING BEFORE ANY WORK

Every Claude session working on this project MUST read these files first, in this order:

D:\PVE\CLAUDE.md (this file) - project handover, current state, safety rules
D:\PVE\homelab-tracker.md - foundation infrastructure status and pending work
D:\PVE\project-tracker.md - Path A: revenue/cost-impact work tracking
D:\PVE\learning-tracker.md - Path B: skill-building curriculum tracking
D:\PVE\content-tracker.md - Path C: content + thought leadership track (LinkedIn series + iamkay.eu / oak-techx.com blogs)
D:\PVE\home-cloud-tracker.md - Path D: family home cloud services
D:\PVE\SESSION-LOG.md - forensic detail from prior sessions

Do NOT propose work without reading all seven. The trackers are the source of truth for what is in flight, what is blocked, and what depends on what.

Architecture decisions locked 2026-05-31

These are committed, not options. Do not propose alternatives without explicit new context from Kay (e.g. platform changes, scale shift, family request):

External access: Cloudflare Tunnel for user-facing services. Admin VPN: WireGuard OR Tailscale (decision deferred to deploy time, see home-cloud-tracker.md D.4 — both use WireGuard datapath, differ in control plane / port-forwarding need).
DNS: Cloudflare hosts iamkay.eu and hm.iamkay.eu (registrar Namecheap). Internal *.hm.iamkay.eu served by a fresh Debian 12 LXC + Unbound, built as part of Phase 3. The pre-existing LXC 500 ("demo-centos") was investigated 2026-05-31 and is NOT a DNS server — treated as obsolete, cleanup gated on Phase 2.
TLS: Let's Encrypt wildcard *.hm.iamkay.eu via DNS-01 (Cloudflare API).
Reverse proxy: Traefik (TLS termination, hostname routing).
Auth: Authelia (SSO + MFA in front of every user-facing service).
User-facing services: Nextcloud (files/photos/calendar), Jellyfin (media). Existing homeNas SMB shares remain available on LAN.
Developer services (Path A): GitLab CE + Runner, staging VMs.
Infrastructure: Proxmox Backup Server (mandatory before any Path D data lands), Prometheus + Grafana monitoring, Uptime Kuma, log shipping for audit.
Off-site cold backup: Cloudflare R2 vs Backblaze B2 — choice deferred to before D.1 ships data.

Rollout sequence (committed): foundation Phase 1 → Phase 2 → Phase 3 (PBS → DNS → Traefik+TLS → Authelia → monitoring) → Path A and Path D start in parallel.

0. READ THIS FIRST - HARD SAFETY RULES

This server holds DATA THAT IS NOT YET BACKED UP. Treat every disk operation as potentially destructive until backups are confirmed. The following rules are absolute and override any task instruction:

1. NEVER run a destructive disk command (dd, mkfs, wipefs, sgdisk, parted, lvremove, vgremove, pvremove, qm destroy, pct destroy, zpool destroy, rm -rf on data paths) without an EXPLICIT, SPECIFIC confirmation from Kay in that session naming the exact target.
2. NEVER touch these until Kay confirms data is recovered/backed up:
   - VM 189 (homeNas) and its disks
   - The two RAID5 arrays inside VM 189: md0 (~839 GB, "r-tank") and md1 (~1.4 TB, "s-tank"). These are mdadm RAID5 at the block layer with btrfs on top - NOT direct btrfs RAID5. The mdadm member disks /dev/sd[b-h] are passed raw from Proxmox into the VM.
   - Storage entries: Backup-NAS, fast
3. NEVER reboot or shut down the server without asking first. The 10.0.110.x "recovery path" is DEAD and the 10.0.110.5 post-up was REMOVED from vmbr0 on 2026-06-04. Two real recovery paths now: (a) VLAN 40 direct L2 from the laptop to host 172.16.1.5 (works as long as switch ports 2/3 still carry VLAN 40 tagged), and (b) the G7 physical console at the chassis (keyboard + monitor) - ultimate fallback if VLAN 40 also breaks. Plan accordingly before any change that could lose vmbr0.10, vmbr0.40, or the tagged-VLAN trunk on switch ports 2/3.
4. Propose, then wait. For anything that changes disk, network, or VM state, print the exact command(s) you intend to run, explain the effect and the blast radius, and stop. Do not execute until Kay says go.
5. Read-only is always safe to run (ls, cat, ip addr, lsblk, qm list, pvesm status, zpool status, mount, df, pvs/vgs/lvs). Prefer these.
6. No em dashes in any file, comment, commit message, or email you generate for Kay. Use a hyphen or reword. (Standing preference.)
7. NEVER enable iLO FIPS Mode (iLO web UI -> Administration -> Security -> Encryption page). Enabling FIPS factory-resets iLO and wipes: the iLO Advanced license, the kay admin account, the static IP 172.16.1.3 / gw / mask, the Server Name / FQDN / Subsystem Name, the time zone, the IPMI/DCMI-over-LAN setting, every cipher / SSH / SNMP customisation made in sessions 3+7. This is a permanent rule, not gated on data recovery. Documented here so future sessions don't get curious and toggle it.

If a requested action conflicts with rules 1-3 or 7, refuse and explain why.
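As a rough aid for applying the rules above, the sketch below sorts a proposed command into one of three buckets using only the command names listed in the destructive and read-only rules. `classify_cmd` and its three labels are hypothetical helpers, not an existing tool, and the sketch is deliberately conservative: any `rm` counts as destructive, and anything not explicitly on the read-only list falls into `propose-first`.

```shell
#!/bin/sh
# classify_cmd PROG [ARGS...]: print "destructive", "read-only" or
# "propose-first" for a proposed command, based on the lists in the
# safety rules. Hypothetical helper; it never runs the command itself.
classify_cmd() {
    prog=${1:-}
    sub=${2:-}
    # Two-word commands named explicitly in the rules.
    case "$prog $sub" in
        "qm destroy"|"pct destroy"|"zpool destroy")
            echo destructive; return ;;
        "qm list"|"pvesm status"|"zpool status")
            echo read-only; return ;;
    esac
    case "$prog" in
        dd|mkfs|mkfs.*|wipefs|sgdisk|parted|lvremove|vgremove|pvremove|rm)
            echo destructive ;;
        ls|cat|lsblk|df|pvs|vgs|lvs)
            echo read-only ;;
        ip)
            # Only the bare "ip addr" listing is on the read-only list.
            if [ "$sub" = addr ] && [ "$#" -eq 2 ]; then echo read-only
            else echo propose-first; fi ;;
        mount)
            # "mount" with no arguments only lists current mounts.
            if [ "$#" -eq 1 ]; then echo read-only
            else echo propose-first; fi ;;
        *)
            # Everything else changes state or is unknown: propose, then wait.
            echo propose-first ;;
    esac
}
```

For example, `classify_cmd qm destroy 189` prints `destructive` and `classify_cmd pvesm status` prints `read-only`. The result is only a first filter; the rules themselves stay authoritative.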

1. WHAT THIS SERVER IS

Hardware: HP DL380 G7 (NOT G8 - all part numbers must be G7 compatible). 2x Xeon X5675 (12c/24t), 148 GB DDR3 ECC, onboard P410i RAID controller.
Hypervisor: Proxmox VE 8.1.5, kernel 6.5.13-3-pve. Already installed and WORKING on the P410i 512 GB logical volume (/dev/sda). Boots fine from HDD.
Hostname: arochukwu
Status: Recovered on 27-28 May 2026 after ~15 months powered off. Last prior login was Feb 2024. Everything came back cleanly.

Known firmware noise (harmless, ignore)

ACPI: SPCR firmware bug, BIOS has corrupted hw-PMU resources, pcc_cpufreq_probe: Too many CPUs - all cosmetic G7 quirks.

2. WHAT TO KEEP vs WHAT IS OBSOLETE

Kay's explicit instruction: keep only homeNas and the storage system. Everything else is obsolete and will be removed - but only AFTER data recovery is confirmed.

KEEP (do not touch without confirmation)

VM 189 (homeNas) - the NAS VM. Primary data-recovery target. Auto-starts.
Storage system - the two mdadm RAID5 arrays (md0 r-tank, md1 s-tank) with btrfs on top, inside VM 189, and whatever data lives on them.

OBSOLETE - CLEANUP COMPLETE 2026-05-31

All 22 obsolete entities destroyed 2026-05-31 evening. Only VM 189 (homeNas) remains on this Proxmox host. qm list shows one VM. pct list is empty. /etc/pve/qemu-server/ contains only 189.conf. /etc/pve/lxc/ is empty.

Audit trail (preserved on host + laptop):
- /root/pre-destruction-inventory-20260531-205954.txt (host)
- /root/orphaned-fast-disks-20260531-212124.txt (host)
- D:\PVE\pre-destruction-inventory-20260531-205954.txt (laptop)
- D:\PVE\orphaned-fast-disks-20260531-212124.txt (laptop)
- MD5 verified identical across host and laptop copies.
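If the MD5 comparison of the audit files ever needs repeating, a minimal sketch follows. It assumes both copies are reachable from one shell (for example after copying the laptop file to the host) and that `md5sum` is available. `same_md5` is a hypothetical helper name.

```shell
#!/bin/sh
# same_md5 FILE_A FILE_B: succeed only if both files exist and have the
# same MD5 digest. Read-only: nothing is written or modified.
same_md5() {
    # Reading via stdin makes md5sum print "<digest>  -", so the file
    # name never affects the comparison.
    a=$(md5sum < "$1" | cut -d' ' -f1)
    b=$(md5sum < "$2" | cut -d' ' -f1)
    # An unreadable file yields an empty digest, which must not count
    # as a match.
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example (paths are placeholders for the host copy and a copied-over
# laptop file):
#   same_md5 /root/orphaned-fast-disks-20260531-212124.txt /tmp/laptop-copy.txt \
#       && echo "copies identical"
```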

Destruction summary:

| Phase | Entities | Result |
|---|---|---|
| Phase 2 (local-lvm clean destroy) | VM 199 mgmt-workstation, VM 8000 ubuntu-mini-cloud | 2 destroyed cleanly. Disk files removed. ~12.1 GB actually reclaimed on local-lvm (thin-provisioned, less than allocated sizes). |
| Phase 3 (`fast` NFS messy destroy) | VMs 101-109 (k8s cluster #1), VMs 181-187 (k8s cluster #2 `hap-*`), VM 170 kali-blue, VM 197 kali-purple, VM 198 mgmt-devops | 19 destroyed via `qm destroy --purge --skiplock`. Conf files removed. Disk files orphaned on `fast` (storage offline, qm couldn't reach to remove them). |
| Phase 3 fallback (LXC 500) | LXC 500 demo-centos | `pct destroy` refused (stricter about offline storage than qm). Conf removed manually with `rm /etc/pve/lxc/500.conf`. Disk file orphaned. |
| Total | 22 entities | 22 confs removed; 37 disk files (~561 GB allocated) orphaned on `fast` |

Remaining cleanup work — orphan disk files on fast: the 37 disk files for the 20 fast-backed entities are still physically on the NFS share (because fast was offline during destroy). They become reachable for deletion once Phase 2 network rebuild restores fast.

Required reading when planning that Phase 2 follow-up cleanup:
1. D:\PVE\orphaned-fast-disks-20260531-212124.txt - per-file orphan inventory
2. D:\PVE\pre-destruction-inventory-20260531-205954.txt - pre-destruction state snapshot
3. D:\PVE\homelab-tracker.md Phase 4 - cleanup procedure (pvesm free loop)
4. D:\PVE\reference_pct_vs_qm_destroy.md - the qm-vs-pct asymmetry (only relevant if any new LXC ever needs destruction with offline storage)
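The actual cleanup procedure lives in homelab-tracker.md Phase 4 and is not reproduced here. Purely to illustrate the "propose, then wait" shape such a pvesm free loop could take, the sketch below prints the commands instead of running them. It assumes an input file with one Proxmox volume ID per line (for example `fast:101/vm-101-disk-0.qcow2`). That format is an assumption and may not match the real orphan inventory, and `print_free_cmds` is a hypothetical name.

```shell
#!/bin/sh
# print_free_cmds FILE: for each volume ID on the `fast` storage listed in
# FILE, print the `pvesm free` command that WOULD remove it.
# Dry run only: nothing is deleted. Kay reviews the output first.
print_free_cmds() {
    while IFS= read -r volid; do
        case "$volid" in
            ''|'#'*) continue ;;                        # blank line or comment
            fast:*)  printf 'pvesm free %s\n' "$volid" ;;
            *)       printf '# skipped (not on fast): %s\n' "$volid" ;;
        esac
    done < "$1"
}
```

Anything not on `fast` is reported and skipped rather than guessed at, so a stray line in the input cannot turn into a delete command for another storage.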

For historical context (these notes apply to entities that no longer exist):

The 22 obsolete entities all had conf files last touched in a single 5-hour build window on 2024-03-24 (01:07 - 06:10 CET) — built in one sitting and untouched since.
Two parallel k8s builds (cluster #1 on VLAN 10, cluster #2 hap-* mostly on VLAN 70 with VM 187 drifted to tag 60).
VLAN 60 + 192.168.60.x was the obsolete-test-era fingerprint — five entities referenced it (VMs 170/187/198/8000 + LXC 500). That era is now fully removed from the host config; the fingerprint is only useful for understanding any disk-content forensics if Kay ever re-mounts an orphaned .qcow2 for inspection.

Do NOT recreate these VMIDs without explicit confirmation. If new VMs need creation for Path A or Path D, allocate fresh VMIDs rather than reusing 100-200 / 500 / 8000 range numbers — keeps the orphan-tracker references unambiguous if any forensic mount is ever needed.
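A small guard for the VMID rule above might look like the following sketch. `vmid_is_fresh` is a hypothetical helper that only checks the retired ranges named in this section (100-200, 500, 8000). It does not check whether the ID is already in use on the host.

```shell
#!/bin/sh
# vmid_is_fresh ID: exit 0 if ID is outside the retired VMID ranges,
# 1 if it falls inside them, 2 if ID is not a plain number.
vmid_is_fresh() {
    case "$1" in
        ''|*[!0-9]*) return 2 ;;
    esac
    if [ "$1" -ge 100 ] && [ "$1" -le 200 ]; then return 1; fi
    if [ "$1" -eq 500 ] || [ "$1" -eq 8000 ]; then return 1; fi
    return 0
}

# Example: vmid_is_fresh 260 && echo "260 avoids the retired ranges"
```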

3. THE PLAN (ordered - do not reorder, each step gates the next)

1. Recover homeNas data. Boot VM 189, inspect contents, identify what Kay wants off it. Confirm with Kay what is irreplaceable.
2. Inspect the two mdadm RAID5 arrays (md0 r-tank, md1 s-tank; btrfs on top). Mount READ-ONLY first. Catalogue what data exists. Report to Kay before anything else.
3. Back up everything irreplaceable to a safe target (external USB drive or a network share) BEFORE any destructive step. Verify the backup is readable.
4. Remove obsolete VMs/templates/containers (the section 2 list) - only after steps 1-3 are confirmed done.
5. Migrate Proxmox install from the 512 GB P410i RAID volume to the SD card. The SD card is the intended new Proxmox root so the RAID/drive bays can be freed entirely for storage. (See section 5 for migration options + open question.)
6. Free the 512 GB RAID volume once Proxmox boots from SD.
7. Storage rebuild - BLOCKED on hardware purchase. Kay has 3x 146 GB SAS drives currently in the array and wants to REMOVE them, replacing with SSDs. SSDs must be bought first. Nothing happens in those bays until SSDs arrive. NOTE: P410i is an old controller and is picky about some SSDs - check SSD compatibility before ordering.

4. CURRENT NETWORK STATE (important for remote access)

The server's networking is elaborate (built by Kay previously). Do NOT "simplify" or rewrite it without explicit instruction.

4 physical NICs (enp3s0f0, enp3s0f1, enp4s0f0, enp4s0f1)
  bond0 = enp3s0f0 + enp3s0f1   (active-backup, primary=enp3s0f0,
                                 primary-reselect always)
    vmbr0 (Proxmox bridge, MAC b4:99:ba:bb:1a:8a, VLAN-aware, vids 2-4094)
      vmbr0 itself = inet manual (no IP on the untagged self-port,
                                  since 2026-06-04; 10.0.110.5 post-up
                                  removed Phase 2 same day)
      vmbr0.10 (tagged VLAN 10 sub-iface, vlan-raw-device vmbr0)
        10.0.10.5/24            <- primary management IP, gw 10.0.10.1.
                                   Default route: default via 10.0.10.1
                                   dev vmbr0.10.
      vmbr0.40 (tagged VLAN 40 sub-iface, vlan-raw-device vmbr0)
        172.16.1.5/24           <- MGMT-VLAN presence, NO gateway (by
                                   design - default stays on vmbr0.10).
                                   Direct L2 reachable from VLAN 40
                                   laptop (TTL 64). Added 2026-06-04.
      (no vmbr0.100 - retired 2026-05-31)
enp4s0f0, enp4s0f1 standalone (reserved for future use: 10 GbE NIC,
                               dedicated storage link, second switch link)

Switch (TL-SG108E v1, saved to flash 2026-06-04):
  port 1 = pfSense SG-1100 uplink (trunk, VLAN 10 + VLAN 40 tagged)
  port 2 = bond0 slave enp3s0f0 \  TRUNK: VLAN 10 tagged + VLAN 40
  port 3 = bond0 slave enp3s0f1 /  tagged, PVID 1, VLAN 1 untagged
                                   membership retained as native
  ports 6-7 = MGMT VLAN 40 untagged endpoints (laptop lives here when
              on a wired drop)
  ports 4,5,8 = other endpoints (not in this scope)

VM 189 attaches via two firewall-bridge stacks (fwbr189i0/i1, fwpr189p0/p1,
fwln189i0/i1, tap189i0/i1):
  net0 (homeNas data path)     <- VLAN tag 10 (retagged 2026-05-31 from
                                  legacy 70; bridge port fwpr189p0 PVID=10).
                                  Inside guest: ens18 STATIC 10.0.10.11/24,
                                  gw 10.0.10.1, set via OMV-bypass fragment
                                  /etc/network/interfaces.d/10-ens18-vlan10.conf
                                  (auto + inet manual + post-up addr + post-up
                                  route). Reachable from VLAN 40 (TTL 63, one
                                  pfSense hop). Active path for ALL NAS traffic
                                  since 2026-06-04.
  net1 (legacy untagged path)  <- bridge=vmbr0 untagged. ens19 inside guest
                                  is 10.0.110.11/24. Dead cruft; frames egress
                                  into VLAN 1 at the switch and go nowhere.
                                  Removal scheduled (see deferred cleanup list).
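The ens18 fragment is described above only by its shape (auto + inet manual + post-up addr + post-up route). A plausible reconstruction is sketched below for orientation. The exact `ip` command lines are an assumption, not a copy of the real /etc/network/interfaces.d/10-ens18-vlan10.conf, and `print_ens18_fragment` is a hypothetical helper that only prints text.

```shell
#!/bin/sh
# print_ens18_fragment: print an ifupdown fragment of the described shape
# for the NAS guest. Prints only; it does not write to interfaces.d/.
print_ens18_fragment() {
    cat <<'EOF'
auto ens18
iface ens18 inet manual
    post-up ip addr add 10.0.10.11/24 dev ens18
    post-up ip route add default via 10.0.10.1 dev ens18
EOF
}
```

Compare the output against the real fragment with a read-only `cat` inside the guest before relying on it.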

Bond architecture (simplified 2026-05-31)

Original design (until 2026-05-31): nested LACP — bond0 (LACP, enp4 pair) + bond1 (LACP, enp3 pair), both as slaves of bond2 (active-backup, primary=bond1). Assumed an LACP-capable upstream switch (a Linksys that has since been decommissioned).

LACP partner problem (discovered 2026-05-30): the current TP-Link TL-SG108E is smart-managed but only supports static LAG, not dynamic 802.3ad LACP. So LACPDUs were silently dropped, both bonds reported Partner Mac Address: 00:00:00:00:00:00, slave Partner Churn State: churned, and only 1-of-2 slaves per bond ever entered the active aggregator. Effective bandwidth was 1 Gb, not 2 Gb. bond2's active-backup layer was the part actually doing useful work.
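The all-zero partner MAC is the quickest symptom to check for if an 802.3ad bond is ever configured again. A read-only sketch follows, assuming the usual /proc/net/bonding/<bond> status file. `lacp_partner_missing` is a hypothetical helper name.

```shell
#!/bin/sh
# lacp_partner_missing FILE: succeed if the bonding status file reports an
# all-zero LACP partner MAC, i.e. no LACPDUs are coming back from the switch.
lacp_partner_missing() {
    grep -qi 'Partner Mac Address: 00:00:00:00:00:00' "$1"
}

# Example (read-only):
#   lacp_partner_missing /proc/net/bonding/bond0 && echo "no LACP partner seen"
```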

Simplified design (current, 2026-05-31): single bond0 in active-backup over enp3s0f0 + enp3s0f1, primary enp3s0f0, primary-reselect always (immediate failback to enp3s0f0 when it recovers, rather than waiting for the active slave to fail). bond1 and bond2 removed. vmbr0's uplink moved from bond2 to bond0. vmbr0's MAC settled at b4:99:ba:bb:1a:8a (enp3s0f0's permanent HW addr). All IPs and the VM 189 firewall stack preserved through the transition.

Reserved NICs: enp4s0f0 and enp4s0f1 are declared auto ... inet manual and brought up but enslaved to nothing. They auto-configured IPv6 SLAAC addresses from the upstream RA (ULA prefix fd7f:d6ce:544a:8dda::/64, advertised by the Google Nest) — harmless. Will be re-shaped (or RA-disabled) during the MGMT VLAN migration.

Process gotcha for future LACP-to-active-backup conversions: changing bond mode in /etc/network/interfaces and running ifreload -a is NOT enough. Linux won't change a bond's mode while the bond has slaves attached — the kernel returns Directory not empty on the /sys/class/net/<bond>/bonding/mode write, and ifreload swallows the warning instead of treating it as fatal. The fix is ifdown bond0 && ifup bond0 to fully destroy and recreate the bond device, so mode is set on the fresh device BEFORE slaves are attached.
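Because ifreload can swallow the failed mode write, it is worth confirming the mode after any such change. The sketch below reads the bonding sysfs file, whose content is the mode name followed by its number (for example `active-backup 1`). `bond_mode_is` is a hypothetical helper, and the optional second argument exists only so the check can be pointed at a different bond or a test file.

```shell
#!/bin/sh
# bond_mode_is WANT [FILE]: succeed if the bond's current mode name equals
# WANT. FILE defaults to bond0's sysfs mode file. Read-only.
bond_mode_is() {
    want=$1
    file=${2:-/sys/class/net/bond0/bonding/mode}
    # First word is the mode name; the trailing number is ignored.
    read -r have _ < "$file" || return 2
    [ "$have" = "$want" ]
}

# Example, after Kay has approved and run `ifdown bond0 && ifup bond0`:
#   bond_mode_is active-backup || echo "bond0 mode did NOT change"
```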

Pre-change backup: /etc/network/interfaces.backup-pre-bond-simplification-20260531-134934.

Phase-2 prep work done 2026-05-31 late evening (in addition to bond simplification)

VLAN 100 retired. vmbr0.100 sub-interface (was 10.0.100.5/24) removed from /etc/network/interfaces. Routes for 10.0.100.0/24 auto-cleared. Bridge VID 100 membership on vmbr0 dropped. No service was using it (verified zero listeners + zero ARP entries pre-removal). Backup: /etc/network/interfaces.backup-pre-vlan100-removal-20260531-214723.
VM 189 net0 retagged 70 → 10 via qm set 189 --net0 ...tag=10 (hot, no VM restart, PID 1521 preserved). Pre-stages net0 for the new SERVERS VLAN 10 design. Dormant at the time of the retag - ens18 inside the guest had no IPv4 (no DHCP source on VLAN 10 yet), so net0 carried only broadcast/multicast trickle (~1 pkt/min) and all real traffic stayed on net1 (untagged, 10.0.110.11). Superseded 2026-06-04: ens18 now holds static 10.0.10.11/24 and net0 is the active path for all NAS traffic. Backup: /root/189.conf.backup-pre-vlan-retag-20260531-215559.
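A retag like the one above has to resubmit the whole net0 value, since `qm set --net0` replaces the entire string. The sketch below shows one way to derive the new value from the old one without retyping it. `retag_net0` is a hypothetical helper, and the MAC in the example is a placeholder, not VM 189's real address.

```shell
#!/bin/sh
# retag_net0 NET0_VALUE NEW_TAG: print NET0_VALUE with its tag= field set
# to NEW_TAG, appending the field if it is absent. Pure string edit.
retag_net0() {
    case ",$1," in
        *,tag=*) printf '%s\n' "$1" | sed "s/tag=[0-9][0-9]*/tag=$2/" ;;
        *)       printf '%s,tag=%s\n' "$1" "$2" ;;
    esac
}

# Example with a placeholder value:
#   retag_net0 'virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,firewall=1,tag=70' 10
```

The current value can be read with `qm config 189` (read-only). The resulting `qm set` line is then proposed to Kay rather than executed, in line with the safety rules.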

Phase-2 cleanup still deferred

These VLAN-related items still reference legacy state and need attention during the Phase 2 network rebuild:

bridge-vids 2-4094 on vmbr0 — narrow to actually-used VIDs (10/20/30/40 + whatever STORAGE becomes, if separate).
/etc/pve/storage.cfg still has Backup-NAS and fast pointing at server 10.0.70.11 (legacy VLAN 70 IP). homeNas's primary IP is now 10.0.10.11 - these references need updating. Open question whether to keep fast / Backup-NAS storage entries at all — see homelab-tracker.md Phase 4.
Orphaned disk files on fast NFS (~37 files, ~561 GB allocated) — see homelab-tracker.md Phase 4 + D:\PVE\orphaned-fast-disks-20260531-212124.txt.

Deferred cleanup after 2026-06-04 VLAN cutover

Tracked but NOT yet acted on. Each is a separate small change:

10.0.110.x cruft on the NAS: the 99-ens19-recovery.conf fragment and the guest's net1 (Proxmox VM conf) are vestigial. Detach net1 from VM 189 and remove the fragment.
Boot-level dhclient bond0 launcher: the rogue dhclient is NOT in /etc/network/interfaces. Something at boot launches it (process started at 00:06 on the VLAN-10 cutover day). Needs hunting (systemd units, ifupdown hooks, /etc/dhcp/, NetworkManager-leftovers, etc.) and disabling so it doesn't respawn.
NAS /etc/resolv.conf still points at dead 10.0.70.1. Repoint to pfSense (10.0.10.1) or internal DNS once it exists.
NAS guest main-file iface ens18 inet dhcp stanza still in the OMV-managed /etc/network/interfaces. ifupdown2 merges it with the fragment's inet manual, running post-up AND spawning a fresh dhclient on every ifup ens18. Static IP survives (post-up sets it before dhclient gives up) but the orphan dhclient lingers until SIGTERMed. Neutralize via OMV workbench or by overriding in interfaces.d/.
Narrow bridge-vids 2-4094 on vmbr0 to actually-used VIDs (10, 40, plus whatever STORAGE becomes if separate). Hardening only.

DONE in earlier passes of this session (kept here as a historical pointer):
- ~~10.0.110.x cruft on the host: post-up/pre-down lines removed~~ (Phase 2, 2026-06-04 evening)
- ~~Host MGMT VLAN 40 presence: vmbr0.40 with 172.16.1.5/24~~ (Phase 1, 2026-06-04 evening)

The persistent config lives in /etc/network/interfaces. Latest backups: /etc/network/interfaces.backup-pre-vmbr0.10-20260604-083934 (pre VLAN 10 cutover), /etc/network/interfaces.backup-pre-vmbr40-20260604-181535 (pre VLAN 40 add), /etc/network/interfaces.backup-pre-110removal-20260604-182308 (pre 110 removal).
Host primary IP = 10.0.10.5/24 on vmbr0.10 (tagged VLAN 10 sub-interface, gateway 10.0.10.1, default route lives here). Added 2026-06-04 morning.
Host MGMT IP = 172.16.1.5/24 on vmbr0.40 (tagged VLAN 40 sub-interface, NO gateway by design - default stays on vmbr0.10). Added 2026-06-04 evening. Stanza: auto vmbr0.40 iface vmbr0.40 inet static address 172.16.1.5/24 vlan-raw-device vmbr0
NAS management IP = 10.0.10.11/24 on guest ens18 (VLAN 10), set via /etc/network/interfaces.d/10-ens18-vlan10.conf (auto + inet manual + post-up addr + post-up route). Lives under interfaces.d/ so OMV salt regeneration of the main file doesn't clobber it (same idiom as the legacy 99-ens19-recovery.conf).
Legacy NAS ens19 (10.0.110.11/24) still exists but reaches nothing (untagged frames egress VLAN 1 at the switch, dead path). Persisted via /etc/network/interfaces.d/99-ens19-recovery.conf. Slated for removal (see deferred cleanup list).
The "recovery via 10.0.110.x" pattern is RETIRED. It was a one-time Nest/IoT bootstrap from before the upstairs cable existed - never a real recovery method. Real recovery now: (a) VLAN 40 direct L2 to 172.16.1.5, (b) G7 physical console at the chassis if VLAN 40 also breaks. The iLO no longer lives on the retired range: it moved from 10.0.110.3 to 172.16.1.3 on MGMT VLAN 40 on 2026-06-04 (see section 6); that move was Section 11 #8.
Web UI: https://10.0.10.5:8006/ (via VLAN 10) or https://172.16.1.5:8006/ (via VLAN 40, same L2 segment as laptop). Login realm: "Linux PAM standard authentication", user root.
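For readability, the vmbr0.40 stanza quoted inline in the list above is laid out line by line in the sketch below. The content comes from this section. Only the indentation and the `print_vmbr0_40_stanza` wrapper are added for illustration, and the function prints text without touching /etc/network/interfaces.

```shell
#!/bin/sh
# print_vmbr0_40_stanza: print the host MGMT VLAN 40 stanza as it would
# appear in /etc/network/interfaces. No gateway line, by design: the
# default route stays on vmbr0.10.
print_vmbr0_40_stanza() {
    cat <<'EOF'
auto vmbr0.40
iface vmbr0.40 inet static
    address 172.16.1.5/24
    vlan-raw-device vmbr0
EOF
}
```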

Network environment status (as of 2026-06-04 evening)

pfSense SG-1100 in production, VLAN-trunked into the TL-SG108E v1 (switch port 1, carrying VLAN 10 + VLAN 40 tagged). TL-SG108E ports 2 and 3 are now full trunks into the G7 bond0: VLAN 10 tagged + VLAN 40 tagged, PVID 1 with VLAN 1 retained as native (saved to flash 2026-06-04). The laptop on VLAN 40 (172.16.1.0/24) reaches the host two ways: 10.0.10.5 via pfSense routing (TTL 63, one hop) and 172.16.1.5 directly on L2 (TTL 64, zero hops - this is the new real recovery path). The dead 10.0.110.5 secondary IP that used to live on vmbr0 was removed in Phase 2 the same day; the Google Nest 10.0.110.x bootstrap network is gone. MikroTik CRS310 swap still planned (Section 8b) but not urgent for any current functionality.
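The TTL figures above are what distinguish the two paths: a reply arriving with TTL 64 crossed no router, while TTL 63 crossed one (pfSense). A minimal sketch of that arithmetic follows, assuming the replying host starts at TTL 64 as the figures in this section imply. `hops_from_ttl` is a hypothetical helper.

```shell
#!/bin/sh
# hops_from_ttl TTL: print how many routed hops a reply crossed, assuming
# the sender's initial TTL was 64. Exit 2 on input outside 0-64.
hops_from_ttl() {
    case "$1" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$1" -le 64 ] || return 2
    echo $((64 - $1))
}

# Example: a ping reply from 10.0.10.5 showing ttl=63
#   hops_from_ttl 63    # one hop, i.e. the routed path through pfSense
```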

5. PROXMOX BOOT-DRIVE STORY (revised 2026-06-04 for Path B1 Hybrid)

Path B1 Hybrid (committed) keeps the P410i as boot controller. No PVE reinstall, no boot-disk migration in this round. The original three options (fresh install / clone-to-SD / no migration) are deferred to a future "clean Path B" cleanup, when boot can be moved off the P410i so the controller can be sold.

Current boot story:
- PVE 8.1.5 boots from the P410i 512 GB logical volume (/dev/sda), built from the existing 3x 146 GB SAS drives in hardware RAID.
- This volume stays as-is through the B1 chassis-open. BBWC battery installed 2026-06-05 but cache is firmware-rejected on 3.66 ("Permanently Disabled" - 3.66 doesn't recognize the newer-rev battery signature). Cache will re-enable after P410i firmware update to 6.64 (Section 11 #19 SPP 2017.10.1, now load-bearing). Until then: write-through mode on PVE root + VM boot disks. Safe, just slower.

Future "clean Path B" task (deferred, not blocking anything):
- Migrate PVE boot off the P410i onto either a small SATA boot SSD on the HBA OR the SD card.
- Method: probably option B from the original three (clone via dd/rsync + GRUB reinstall) since fresh install on G7 is a known red spot (BIOS quirks, slow Virtual Media install over old TLS).
- After successful migration, P410i can be removed and sold (with cache module + BBWC battery as a complete kit).
- Best done after a few months of operating B1, when ZFS + PBS are proven and a known-good backup of the host config exists.

6. CONNECTION DETAILS (fill in before first use)

Proxmox host (arochukwu)

SSH (two paths, either works):
ssh <user>@172.16.1.5 (VLAN 40 direct L2, TTL 64; preferred since 2026-06-04 evening - independent of pfSense routing).
ssh <user>@10.0.10.5 (via VLAN 10, reachable from VLAN 40 laptop through pfSense, TTL 63). The laptop's ssh pve alias in ~/.ssh/config still points at the dead 10.0.110.5 - update it (point at 172.16.1.5).
Web UI: https://172.16.1.5:8006/ (VLAN 40) or https://10.0.10.5:8006/ (VLAN 10). Login realm: "Linux PAM standard authentication", user root.
ipmitool 1.8.19-4+deb12u2 is installed (sideloaded 2026-05-30 with freeipmi-common 1.6.10-1 + libfreeipmi17 1.6.10-1+b1 deps, SHA256-verified against bookworm main Packages.xz). KCS via /dev/ipmi0 is live since boot; all ipmi_* kernel modules already loaded.

homeNas VM 189 (OMV)

SSH: ssh <user>@10.0.10.11 (key auth, via VLAN 10). The laptop's ssh homenas alias in ~/.ssh/config still points at the dead 10.0.110.11 - update it.
Legacy ens19 path at 10.0.110.11 still exists inside the guest but reaches nothing; slated for removal (see deferred cleanup list).

iLO 3 (arochukwu BMC) - on MGMT VLAN 40 as of 2026-06-04

IP: 172.16.1.3 (changed 2026-06-04 from 10.0.110.3 via in-band KCS, which had previously been changed 2026-05-30 from the stale ESXi-era 10.0.100.3). Static, mask 255.255.255.0, gw 172.16.1.1, channel 2. iLO's RJ45 is in TL-SG108E port 6 (untagged VLAN 40 access). 802.1q VLAN ID on channel 2 = Disabled (untagged on the wire; the switch port does the VLAN tagging).
Firmware: 1.94 dated 2020-12-06 (final release for iLO 3).
Admin user: kay at user ID 1 — HP factory "Administrator" slot was renamed kay long ago; there is no user literally named "Administrator". Password reset 2026-05-30 — stashed in Bitwarden.
iLO Advanced license: activated 2026-05-30. Key 35DPH-SVSXJ-HGBJN-C7N5R-2SS4W (community-shared key). Unlocks Virtual Media (network ISO boot), Integrated Remote Console, power graphs, advanced auth.
Server identity in iLO: Server Name = arochukwu, Server FQDN = arochukwu.hm.iamkay.eu (renamed 2026-05-30 via iLO web UI Access Settings from stale ESXi-era ECL-ESX02.eclh.lan). iLO Subsystem Name = arochukwu-ilo (renamed 2026-05-30 via Network → Dedicated Network Port → General, required an iLO reset to apply — the web UI flagged "There are pending changes that may not take effect until iLO is reset"). Time Zone corrected from Atlantic/Reykjavik (GMT) to Europe/Amsterdam (CET/CEST) via Network → Dedicated Network Port → SNTP (also needed the iLO reset to apply).
iLO DNS resolvers configured to public ones as an interim setup (no internal DNS / Pi-hole yet): Primary 8.8.8.8, Secondary 9.9.9.9, Tertiary 1.1.1.1. Repoint to internal resolvers once pfSense Unbound or Pi-hole is up — tracked in Section 11 #10.
Web UI: https://172.16.1.3/ — only speaks TLS 1.0. Use Firefox with about:config → security.tls.version.min = 1. Will not load in hardened browsers without that toggle.
SSH to iLO: ssh [email protected] — passwordless via DSA-1024 key since 2026-06-04 (Session 7). Laptop key: ~/.ssh/id_dsa_ilo. Required client config (already in ~/.ssh/config on the laptop): KexAlgorithms +diffie-hellman-group1-sha1,diffie-hellman-group14-sha1, HostKeyAlgorithms +ssh-rsa,ssh-dss, PubkeyAcceptedAlgorithms +ssh-rsa,ssh-dss, Ciphers +aes128-cbc,3des-cbc, MACs +hmac-sha1, IdentityFile ~/.ssh/id_dsa_ilo, IdentitiesOnly yes. CRITICAL: iLO 3 v1.94 ONLY accepts DSA client keys. RSA-2048, RSA-4096, RSA-no-comment, and BEGIN/END-wrapped variants all return "DSA Public Key Import Error: Invalid input" in the web UI and "Bad SSH key" via the CLI's oemhp_loadSSHKey URL-fetch path. See reference_ilo3_v194_quirks.md for the full forensic detail. Method that worked: oemhp_loadSSHKey /map1/accounts1/kay -source http://<host>/id_dsa_ilo.pub from inside an interactive SSH session to iLO (HTTP-fetch on the same VLAN; modern OpenSSL 3.0 + Windows Schannel both reject iLO's TLS handshake so RIBCL/HTTPS upload paths are dead too).
IPMI 2.0 over LAN: enabled on port 623 (2026-05-30 via iLO web UI Access Settings → IPMI/DCMI over LAN). Usable as ipmitool -H 172.16.1.3 -U kay -P ... -I lanplus .... Security caveat: UDP/623 is currently open to anything on VLAN 40 (172.16.1.0/24). Lock down via pfSense MGMT VLAN 40 firewall rules (Section 11 #9). Not appropriate for an internet-exposed segment.
NIC selection: dedicated mode (confirmed working — the IP change took effect on the dedicated RJ45 cable Kay had plugged in). The earlier hypothesis that iLO was in shared-LOM mode turned out to be wrong; the actual gating issue was that iLO 3 has gratuitous ARP disabled by default and the IP change alone did not announce the new binding. The cold reset (ipmitool mc reset cold) forced the iLO network stack to re-bind, after which the laptop could reach it.
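For a machine that does not yet carry the iLO entry in ~/.ssh/config, the legacy client options listed above can be passed on the command line instead. This is a dry-run sketch: `ilo_ssh_cmd` is a hypothetical helper that only prints the command, and the `kay@172.16.1.3` target in the usage line is assembled from the admin user and IP recorded in this section. Newer OpenSSH builds may have dropped DSA support entirely, in which case these options will not help.

```shell
# Print (do not run) an ssh command carrying the legacy options iLO 3 needs.
# Option values are copied from the client config described above.
ilo_ssh_cmd() {
  local target="$1"
  printf 'ssh'
  # printf reuses the format once per argument, giving one -o per option
  printf ' -o %s' \
    'KexAlgorithms=+diffie-hellman-group1-sha1,diffie-hellman-group14-sha1' \
    'HostKeyAlgorithms=+ssh-rsa,ssh-dss' \
    'PubkeyAcceptedAlgorithms=+ssh-rsa,ssh-dss' \
    'Ciphers=+aes128-cbc,3des-cbc' \
    'MACs=+hmac-sha1' \
    'IdentityFile=~/.ssh/id_dsa_ilo' \
    'IdentitiesOnly=yes'
  printf ' %s\n' "$target"
}

# Usage: review the printed line, then paste it to connect.
#   ilo_ssh_cmd kay@172.16.1.3
```

Printing rather than executing keeps the step reviewable, in line with the one-confirmed-step-at-a-time rule in Section 7.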

Stash / handover¶

Bitwarden holds (or should hold) the Proxmox root password and the new iLO kay password. Kay knows Proxmox root from memory.

iLO surfaces to ignore¶

HPE System Management Homepage link (shown on the iLO SNMP page as https://arochukwu:2381): dead end. SMH is HP's old host-side management agent; not installed on this box (no Debian / Proxmox build exists), and not worth installing. The link is cosmetic; nothing listens on port 2381. Don't try to follow it, don't try to sideload SMH for it.

7. HOW TO BEHAVE IN THIS REPO¶

Work one confirmed step at a time. Show output, summarise, propose next step, wait.
Prefer read-only discovery commands. Narrate blast radius before any write.
Keep a running log of what was actually executed (append to SESSION-LOG.md).
If unsure whether something is destructive, assume it is and ask.
Never delete a "keep" item or an unclassified item. When in doubt, leave it.

8. NEXT FOUNDATION MILESTONES (revised 2026-06-04 after Path B decision)¶

Originally agreed 28 May 2026; revised after Path B (HBA + ZFS) was locked. Network milestone (b) was completed in Sessions 5-7. Storage milestones (a/c/d) are collapsed into the Path B controller swap below.

DONE¶

(b) ~~MikroTik CRS310 + proper VLAN switching~~ — VLAN trunking is functional via the TL-SG108E now (Sessions 5-7). CRS310 swap remains a future hardware-purchase task but is NO LONGER blocking anything.

NEXT (in order) — Path B1 Hybrid¶

Pre-work (no chassis open):
~~Keep BBWC battery order~~ DONE 2026-06-05 (installed during opportunistic chassis-open). Note: cache is firmware-rejected on 3.66; re-enables after SPP 2017.10.1 / firmware 6.64.
Buy: LSI 9305-16i HBA in IT mode (16 native SAS-3 lanes, no expander, no flashing) — €130-250 used on eBay UK/DE / Marktplaats. Confirm full-height bracket, firmware 16.00.10.00+, and that SFF-8643 → SFF-8087 fan-out cables are accounted for (cables NOT included with HBA, ~€15-25 each, need 2)
Buy: HP 516966-B21 second 8-bay backplane; HP 651687-001 caddies as needed; data SSDs (per reference_p410i_ssd_compatibility.md)
ZFS curriculum prerequisite — at least basic pool/dataset/snapshot/send-recv fluency before building the data pool. Curriculum items B.5 (containers) and B.7 (backup & DR) cover the territory.
Host config backup (precautionary even though B1 is low-risk): vzdump VM 189 to external USB; snapshot /etc/pve, /etc/network/interfaces, storage.cfg, qemu-server/*.conf off-host (laptop already has CLAUDE.md / trackers + the various interfaces.backup-* files).
Second chassis-open (when HBA + 2nd backplane + SSDs in hand). BBWC + riser + CMOS already installed in the 2026-06-05 chassis-open:
Install LSI 9305-16i HBA in a PCIe x8 slot
Install 2nd backplane in chassis front (bays 9-16)
Install new data SSDs into the new bays
Cable: HBA SFF-8643 ports → backplane(s) SFF-8087/8643 connectors (2 cables typically — one per backplane half)
P410i firmware update via SPP 2017.10.1 (or standalone .scexe) — this is the BBWC enable trigger; bundle into the same window since chassis is open and iLO Virtual Media is needed anyway
PSU dust check; close chassis
Post-swap (software, host stays up):
Confirm HBA visible: lspci | grep LSI; expect mpt3sas driver loaded
Confirm IT mode firmware: sas3flash -listall (or via dmesg | grep mpt3sas)
Confirm new SSDs visible: lsblk, ls /dev/disk/by-id/
Install ZFS tools if missing: apt install zfsutils-linux (PVE 8 has it built-in)
Build ZFS pool on the new SSDs. Design choices (RAIDZ2 vs mirrors, recordsize, compression=lz4 default, ashift=12 for SSDs) - decide at build time, informed by SSD count + capacity. Common starting layout: 4× SSDs in RAIDZ2 → 2-disk fault tolerance.
Register the ZFS pool as a Proxmox storage entry (pvesm add zfspool tank ...)
Phase 3 PBS built on the new ZFS pool (gets native dedup + ZFS snapshots). Then DNS LXC, Traefik, Authelia, monitoring per homelab-tracker.md Phase 3.
Future "clean Path B" (deferred, not blocking): migrate PVE boot off the P410i, sell P410i + cache module + BBWC battery kit. Method probably dd/rsync clone (G7 fresh install is a red spot). See Section 5.
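A dry-run sketch of the pool-build step from the post-swap list, assuming the "common starting layout" (RAIDZ2 with ashift=12 and lz4 compression) ends up being the choice. The pool name tank is borrowed from the pvesm example above; `zpool_create_cmd` and the disk IDs are hypothetical placeholders. The helper prints the command and never runs it, so the blast radius can be narrated first.

```shell
# Print (do not run) a zpool create command for a RAIDZ2 pool.
# Arguments: pool name, then /dev/disk/by-id/ names of the member disks.
zpool_create_cmd() {
  local pool="$1"
  shift
  if [ "$#" -lt 4 ]; then
    echo "this sketch expects at least 4 disks, got $#" >&2
    return 1
  fi
  local cmd="zpool create -o ashift=12 -O compression=lz4 $pool raidz2"
  local d
  for d in "$@"; do
    cmd="$cmd /dev/disk/by-id/$d"
  done
  printf '%s\n' "$cmd"
}

# Usage with placeholder IDs (take the real ones from ls /dev/disk/by-id/):
#   zpool_create_cmd tank ssd-A ssd-B ssd-C ssd-D
```

Using by-id names rather than /dev/sdX keeps the pool definition stable if device letters shift after the chassis is reopened.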

Only after the above should Section 9 (new HA k8s cluster) begin.

9. FUTURE BUILD - NEW HA K8S CLUSTER (placeholder, dual-track)¶

Once the foundation is rebuilt, a new HA k8s cluster will be designed. This section is a placeholder; do not start sketching topology until Section 8 is complete AND the G5 cluster (Phase 6) is stood up.

Two parallel tracks, deliberately separate to keep concerns clear:

Production stack = LXCs + Docker-in-LXC for multi-container apps, with VM-level HA via the Proxmox cluster on the 3 G5s + G7 storage host. This is what serves Path A (GitLab CE + staging) and Path D (Nextcloud + Jellyfin) — chosen for backup simplicity, low complexity, and PBS-friendly lifecycle.
k8s learning + experimentation = k3s (start) or kubeadm (depth) installed on the SAME 3 G5s (or a subset), running non-load-bearing workloads. Goal: hands-on k8s fluency (Path B item B.5, see learning-tracker.md), GitOps practice with ArgoCD/Flux against the self-hosted GitLab. NOT a production replacement for the LXC stack.

Hard design constraints already agreed for whichever path becomes production-bearing:

Must NOT depend on homeNas for VM disks, etcd storage, container images, or anything on the cluster's critical path. The old cluster's root disks living on fast NFS (which is exported by homeNas itself) is the exact anti-pattern to avoid - it's why those nine VMs are unrecoverable today.
Everything in git - cluster definition, manifests, ingress, secrets scheme, the lot. No hand-edited cluster state.
Designed after the G5 cluster lands (Phase 6) and Phase 3.5 monitoring is up. Cluster decisions follow from those foundations, not the other way round.

Fill this section out with concrete topology + tooling choices when the G5 cluster is stable AND Path B has reached B.5 readiness (~August 2026 earliest).

10. SESSION DECISION LOG¶

Per-command history lives in SESSION-LOG.md. This section captures headline decisions only, dated.

Session 1 - 27-28 May 2026¶

Server cold-started cleanly after ~15 months off. VM 189 (homeNas) booted; mdadm RAID5 arrays md0 + md1 both clean.
Two external USB drives triaged: Apple HDD (J550001MGG9RUC) kept as recovery candidate; Samsung HM160HI zero-wiped for disposal.
ddrescue of Apple HDD launched inside VM 189 → s-tank/recovery/apple-1tb.img.

Session 2 - 28 May 2026¶

ddrescue continuing (~24% at session start, 1 read error). .map file preserved progress through a process restart.
K8s cluster (VMs 101-109) confirmed unrecoverable. All nine VM root disks AND cloud-init disks live on the fast NFS storage, which is exported by VM 189 itself - the exact anti-pattern called out in Section 9. Marked obsolete; cleanup executed Session 4.
Plan reordered: Section 8 (foundation milestones) added, superseding Section 3 steps 5-7. Section 9 placeholder added for the future HA k8s build.

Session 3 - 30 May 2026¶

ddrescue FINISHED. Image at s-tank/recovery/apple-1tb.img is exactly 1,000,204,886,016 B with 4 KB (8 LBAs at 124.5 GB) bad in the .map. HFS+ triage + ClamAV pending (Section 11 #17).
ipmitool 1.8.19-4+deb12u2 sideloaded on host with freeipmi-common 1.6.10-1 + libfreeipmi17 1.6.10-1+b1 deps (SHA256-verified against bookworm main Packages.xz).
iLO 3 IP changed 10.0.100.3 -> 10.0.110.3 via in-band KCS on channel 2 (channel 1 doesn't exist on iLO 3). Static, mask 255.255.255.0, gw 10.0.110.1. Cold reset (mc reset cold) was needed for the new IP to actually be reachable: iLO 3 has gratuitous ARP disabled by default and lan set ipaddr alone doesn't announce the new binding upstream. Admin password reset on user ID 1 (kay).
iLO web UI brought to current state: Server Name arochukwu, FQDN arochukwu.hm.iamkay.eu (was stale ESXi-era ECL-ESX02.eclh.lan), Subsystem Name arochukwu-ilo (was ILOCZ212806VN), Time Zone Europe/Amsterdam, DNS 8.8.8.8/9.9.9.9/1.1.1.1 (interim public). iLO Advanced license activated (key in Section 6). IPMI/DCMI over LAN enabled on UDP/623. Subsystem rename + Time Zone change each required an iLO reset to apply.
OS hostname = arochukwu.hm.iamkay.eu everywhere it matters (kernel, /etc/hostname, /etc/hosts, hostnamectl, /etc/pve members). Standalone PVE node (no Corosync).
Recovery IPs persisted this session (10.0.110.5 on vmbr0, 10.0.110.11 on NAS ens19). BOTH NOW OBSOLETE: vmbr0 post-up removed Session 6, NAS ens19 still pending removal (deferred list).
Windows laptop SSH aliases set up in ~/.ssh/config: ssh ilo, ssh pve, ssh homenas. All three need updating now that 10.0.110.x is dead (Section 6 has the new targets).
LACP partner missing on both bonds - TL-SG108E supports only static LAG, not dynamic 802.3ad. [RESOLVED in Session 4, 2026-05-31 — bond simplified to active-backup.]

Session 4 - 31 May 2026¶

Bond architecture simplified from nested LACP (bond0 LACP + bond1 LACP under bond2 active-backup) to single bond0 in plain active-backup over enp3s0f0 + enp3s0f1, primary enp3s0f0, primary-reselect always. bond1 and bond2 removed entirely. vmbr0's bridge-ports moved from bond2 to bond0. vmbr0's MAC settled at b4:99:ba:bb:1a:8a. All IPs and VM 189's firewall stack preserved through the transition. SSH and laptop reachability verified after change. See Section 4 for the full architecture description and the ifreload-vs-ifdown process gotcha (Linux won't change a bond's mode while slaves are attached; needed ifdown bond0 && ifup bond0 to complete the conversion).
enp4s0f0 and enp4s0f1 left as standalone (reserved). They auto-configured IPv6 SLAAC from the upstream Nest RA (ULA prefix fd7f:d6ce:544a:8dda::/64). Harmless; deferred to the MGMT VLAN rework.
iLO Integrated Remote Console verified non-functional on the laptop's current Windows setup. IRC requires either Java (applet path) or .NET (standalone IRC.exe path); both are broken / removed in modern Windows + browser combinations. Fallback paths verified working: iLO SSH (ssh ilo -> [email protected] with legacy KEX/cipher flags) and iLO VSP (virtual serial port, accessed via the SSH session). Physical keyboard+monitor at the chassis remains the ultimate fallback. iLO 3 v1.94 IRC quirk added to reference_ilo3_v194_quirks.md.
Documentation refactor: created three project trackers (homelab-tracker.md, project-tracker.md, learning-tracker.md) at D:\PVE\ and added a "REQUIRED READING BEFORE ANY WORK" callout near the top of CLAUDE.md so future sessions read all five docs in order. Also created D:\PVE\reference_ilo3_v194_quirks.md (ported from agent memory) so the iLO quirks reference lives alongside the project handover.

Session 5 - 3-4 June 2026¶

VLAN 10 trunk cutover COMPLETE. Host management IP moved from untagged on vmbr0 to tagged VLAN 10 on vmbr0.10 (sub-interface with vlan-raw-device vmbr0). TL-SG108E ports 2 and 3 (G7 bond0 slaves) converted from PVID 10 untagged-egress to PVID 1 with VLAN 10 TAGGED egress, VLAN 1 untagged retained as the native. Switch config SAVED to flash. NAS VM 189 ens18 statically pinned at 10.0.10.11/24 (gw 10.0.10.1) via the OMV-bypass fragment /etc/network/interfaces.d/10-ens18-vlan10.conf using the auto + inet manual + post-up addr + post-up route pattern. End-to-end reachability verified: laptop on VLAN 40 reaches 10.0.10.5 and 10.0.10.11 with TTL 63 (one pfSense hop). See Section 4 for the current topology and the Section 4 "Deferred cleanup" list for what's still open.
Rogue dhclient bond0 on host (PID 1228722) released cleanly via dhclient -r bond0. Was holding 10.0.10.107/24 as a phantom DHCP lease and creating a duplicate 10.0.10.0/24 route via bond0. Cleanup is RUNTIME ONLY; whatever launches it on boot was NOT hunted in this session - on the deferred list.
Orphan dhclient ens18 inside the guest killed. The OMV salt-managed main /etc/network/interfaces still has iface ens18 inet dhcp alongside the new fragment, and ifupdown2 merges both stanzas: post-up commands run AND dhclient is spawned. The static IP survives (post-up ip addr add runs before dhclient gives up) but the dhclient process needs an explicit SIGTERM after each ifup. On the deferred list.
Asymmetric VLAN 10 tagging at the switch identified and corrected. Pre-cutover, TL-SG108E ports 2/3 had PVID 10 untagged-egress so pfSense's tagged-VID-10 replies arrived at the host untagged - and the bridge never delivered them to the guest's VID-10 port (fwpr189p0). Setting ports 2/3 to tagged egress on VLAN 10 fixed the reply path.
pfSense SERVERS interface confirmed = VLAN 10 on mvneta0 (tagged on pfSense's side). Switch port 1 (pfSense uplink) was already a correct VLAN 10 trunk; no pfSense or port-1 changes were made in this session.
Pre-cutover backup of /etc/network/interfaces: /etc/network/interfaces.backup-pre-vmbr0.10-20260604-083934.

Session 6 - 4 June 2026 (evening)¶

MGMT VLAN 40 presence on host: ADDED + PROVEN (Phase 1). New stanza auto vmbr0.40 / iface vmbr0.40 inet static / address 172.16.1.5/24 / vlan-raw-device vmbr0 (deliberately NO gateway line - default stays on vmbr0.10). TL-SG108E ports 2 and 3 had VLAN 40 ADDED as Tagged members (existing VLAN 10 tagged config untouched). Saved to flash AFTER GATE 1 passed: laptop pinged 172.16.1.5 with TTL 64 (direct L2, zero hops - same VLAN 40 broadcast domain as the laptop) and 10.0.10.5 with TTL 63 (existing VLAN 10 path still routed via pfSense). Bridge self-port now carries VID 1 PVID + VID 10 + VID 40 - ifupdown2 added VID 40 to vmbr0 automatically when it brought up vmbr0.40 (as expected on a vlan_filtering=1 bridge).
Dead 10.0.110.5 cruft on host: REMOVED (Phase 2). The post-up ip addr add 10.0.110.5/24 dev vmbr0 || true and matching pre-down ... lines were stripped from the vmbr0 stanza. ifreload -a at the G7 console dropped the IP; ip route now shows exactly one default (via 10.0.10.1 dev vmbr0.10) and the 10.0.110.0/24 link route is gone. Both reachability paths still work post-removal (172.16.1.5 TTL 64, 10.0.10.5 TTL 63).
Real recovery method has shifted. Before today, the doc treated the G7 chassis console as the only safety net. Now: VLAN 40 direct L2 (172.16.1.5) is the practical recovery path; chassis console is the ultimate fallback if VLAN 40 also breaks. Section 0 rule #3 and Section 4 reflect this.
Backups stacked this session: /etc/network/interfaces.backup-pre-vmbr40-20260604-181535 (pre Phase 1), /etc/network/interfaces.backup-pre-110removal-20260604-182308 (pre Phase 2).
pfSense and switch port 1 untouched. NAS / VM 189 / ens19 untouched. No reboot.
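The TTL readings used as gates above (64 = direct L2, 63 = one pfSense hop) can be turned into a hop count mechanically. A minimal sketch, assuming the sender starts from one of the common initial TTLs 64, 128 or 255; that assumption and the helper name `ttl_hops` are mine, not part of the original procedure.

```shell
# Estimate router hops from an observed ping TTL.
# Assumes the sender's initial TTL is the nearest of 64, 128 or 255
# at or above the observed value.
ttl_hops() {
  local ttl="$1" initial
  if [ "$ttl" -le 64 ]; then
    initial=64
  elif [ "$ttl" -le 128 ]; then
    initial=128
  else
    initial=255
  fi
  echo $((initial - ttl))
}

# Usage:
#   ttl_hops 64   # direct L2 reply from a Linux host
#   ttl_hops 63   # same host reached through one router
```

A result of 0 on the VLAN 40 address and 1 on the VLAN 10 address is the pattern Gate 1 relied on.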

Session 8 - 4-5 June 2026 (overnight) — PBS deployment + restore drill¶

Phase 3 #1 PBS deployed end-to-end. VM 200 (pbs) running PBS 4.2-1 on Debian, 4 GB RAM / 2 vCPU / 32 GB OS disk on local-lvm, static IP 10.0.10.20 on VLAN 10. Apple HDD (J550001MGG9RUC, 1 TB) passed through to VM 200 as scsi1 via /dev/disk/by-id/. Configured as Removable Datastore apple-tank in PBS (UUID-bound for safe physical unplug/replug). Registered in PVE as storage pbs-apple-tank.
First backup of VM 189 (homeNas) SUCCEEDED. vzdump snapshot mode, 32 GB OS disk only (the seven virtio2-virtio8 passthroughs have backup=0 and are correctly skipped). 9 min duration, 60.7 MiB/s, 77% sparse (24.69 GiB zeros), dedup'd. TASK OK.
Restore drill PASSED via filesystem inspection. Restore to VMID 190 written successfully (7.85 GB actual data → /dev/pve/vm-190-disk-0) but pbs-restore process hung in poll() during PBS-side validation. Killed the stuck process, cleared the VM 190 lock, mounted the restored disk read-only via losetup -P, and verified the contents: hostname homenas, Debian 11.9 (Bullseye), OMV config XML present, the Session 7 network fragment 10-ens18-vlan10.conf with inet manual + post-up addr/route preserved exactly, GRUB + kernel files all there. The backup captured the exact runtime state we set in Session 7 and is fully restorable. VM 190 destroyed cleanly after verification.
PBS health side-issue (tracked for next session): during restore, PBS VM's SSH started rejecting connections (web UI stayed alive). Likely chunk-handler resource pressure on the 4 GB VM during validation. Needs investigation (memory tuning? swap? chunk-cache size?) before re-attempting an end-to-end through-the-UI restore.
Backup-NAS storage entry removed (pvesm remove Backup-NAS). Legacy NFS entry that was never used. PBS now is the only backup target. fast storage entry kept until Phase 4 orphan cleanup.
/etc/resolv.conf on host got a nameserver 1.1.1.1 fallback earlier in the evening - resilience against future pfSense Unbound restarts. It already saved this session when Unbound stopped during the PBS install as a side-effect of the firewall Apply Changes.
bridge-vids on vmbr0 narrowed 2-4094 → 10 20 30 40 (host-side defense-in-depth alongside switch tagging).
pfSense DNS Resolver (Unbound) restarted by Kay in web UI after the Apply Changes side-effect. Verified via TCP probe to :53 and a successful nslookup. Now stable.
Wedding photos confirmed off the recovery image (already on SMB share - the original mission of this whole homelab effort is complete). apple-1tb.img was already deleted from s-tank earlier; only the .map / .log metadata remained.
iLO web UI tweaks done by Kay: Dedicated Port → IPv4 → unchecked Enable WINS Server Registration; updated DNS resolvers (Section 11 #10). iLO reset performed.
Crucial M4 SMART findings (via smartctl -d cciss,N): three of four (slots 5/6/7) are 79-86% worn (Wear_Leveling_Count VALUE 014/017/021), one (slot 4) is unusually pristine despite 5.8 yr power-on. Full inventory in D:\PVE\p410i-baseline-20260604-222809.txt. Means: buy minimum 4 new SSDs for the new ZFS pool, do not plan to keep three of the M4s in the array.
Storage architecture decision committed: Path B1 Hybrid (full detail in D:\PVE\reference_p410i_ssd_compatibility.md). LSI 9305-16i HBA in IT mode + 2nd backplane to be added during chassis-open window. P410i stays as boot controller. BBWC battery in transit stays useful. ZFS data pool on new SSDs on the HBA. P410i sale deferred to later "clean Path B" cleanup.
HBA purchase reminder scheduled for 2026-07-04 10:00 Europe/Amsterdam (routine trig_017JdC7WHwpGFwQ4TwZUMs8z).
WireGuard OR Tailscale captured as sibling options in home-cloud-tracker.md D.4; CLAUDE.md architecture decision updated.

Session 7 - 4 June 2026 (late evening)¶

iLO moved from 10.0.110.3 → 172.16.1.3 (MGMT VLAN 40). Method: in-band IPMI over /dev/ipmi0 (KCS) - no out-of-band reachability was available pre-change since 10.0.110.x had been retired earlier same day. Sequence:
ipmitool lan set 2 ipsrc static
ipmitool lan set 2 ipaddr 172.16.1.3
ipmitool lan set 2 netmask 255.255.255.0
ipmitool lan set 2 defgw ipaddr 172.16.1.1
ipmitool mc reset cold
The mandatory cold reset (~60 s downtime) re-binds iLO's network stack so the new IP actually announces on the wire - iLO 3's gratuitous-ARP-disabled quirk means lan set ipaddr alone is not enough (matches the Session 3 finding).
No switch reconfiguration needed. iLO's RJ45 was already in TL-SG108E port 6, which Session 6 left as untagged VLAN 40 access (alongside port 7 for the laptop). iLO 802.1q VLAN ID stays Disabled on channel 2 (untagged on the wire); the switch port does the VLAN tagging on the trunk side.
Verified reachability: host (172.16.1.5) -> 172.16.1.3 = 0.3 ms RTT, MAC b4:99:ba:bb:1a:92 learned on vmbr0.40 directly. Laptop (172.16.1.100) -> 172.16.1.3 = 1 ms RTT, TTL 255 (iLO BMC default, not Linux's 64). RMCP+ probe accepts connection (auth-rejects bad password = LAN listener up). Web UI / SSH / IPMI all alive.
Doc updates this session: Section 0 rule #7 (FIPS warning - IP), Section 6 iLO heading + IP / web URL / SSH / ipmitool lines, Section 11 #8 marked DONE + #9 now actionable. ssh ilo alias on the laptop needs to point at 172.16.1.3 (Section 6).
Passwordless ssh ilo via DSA-1024 key — DONE later same evening. Laptop SSH config aliases (pve, ilo, homenas) all updated to new IPs (172.16.1.5 / 172.16.1.3 / 10.0.10.11). All three tested working: key auth for pve + homenas (ed25519), DSA-1024 for ilo. Section 11 #11 marked DONE. Forensic findings worth recording (also in reference_ilo3_v194_quirks.md):
iLO 3 v1.94 web UI "DSA Public Key Import Error: Invalid input" is LITERAL - the import path only accepts DSA. RSA paste attempts (2048, 4096, no-comment, BEGIN/END wrapped, RFC 4716) all fail.
The RIBCL XML escape hatch is unreachable from modern crypto stacks: both Debian 12 OpenSSL 3.0 and Windows Schannel reject iLO 3's TLS handshake (ancient ciphers compiled out, not just SECLEVEL-disabled).
Working method: HTTP-served .pub file (python http.server on host /tmp, VLAN 40 same broadcast domain as iLO), then via interactive SSH+password to iLO, run oemhp_loadSSHKey /map1/accounts1/kay -source http://172.16.1.5:8080/id_dsa_ilo.pub.
The oemhp_loadSSHKey command needs the user target path BEFORE the -source flag - omitting it gives "Invalid target".
The web-UI "Authorize New Key" textarea is genuinely broken for non-DSA input on this firmware; do not retry it. Use the CLI path.
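The key-load steps above can be sketched as a small helper that enforces the argument order (account target first, then -source). `ilo_loadkey_cmd` is a hypothetical name; it only prints the CLP line to paste into the interactive iLO SSH session. The serving command in the comments assumes python3 is on the host and mirrors the /tmp and port 8080 details recorded above.

```shell
# Print the iLO CLP line that imports a public key over HTTP.
# The account path must come BEFORE -source, or iLO answers "Invalid target".
ilo_loadkey_cmd() {
  local account="$1" url="$2"
  if [ -z "$account" ] || [ -z "$url" ]; then
    echo "usage: ilo_loadkey_cmd <account> <url>" >&2
    return 1
  fi
  printf 'oemhp_loadSSHKey /map1/accounts1/%s -source %s\n' "$account" "$url"
}

# Usage:
#   1. On the host, serve the key (stop the server afterwards; it exposes /tmp):
#        python3 -m http.server 8080 --bind 172.16.1.5 --directory /tmp
#   2. Print the line and paste it into the iLO SSH session:
#        ilo_loadkey_cmd kay http://172.16.1.5:8080/id_dsa_ilo.pub
```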
pfSense MGMT VLAN 40 lockdown — DONE. Added a Block rule on the SERVERS interface: SERVERS subnets -> MGMT subnets, any/any, IPv4, log enabled, placed between the existing "Block SERVERS to LAN" and "Allow SERVERS to internet" rules. Six-way verification confirmed the cutover: laptop on VLAN 40 still reaches everything (intra-VLAN, bypasses pfSense by design); VM 189 on VLAN 10 can no longer reach iLO or host vmbr0.40 via cross-VLAN routing but can still reach the pfSense gateway and the internet. Section 11 #9 marked DONE.

Session 9 - 5 June 2026¶

Opportunistic chassis-open completed earlier same day (during a household power outage, before this session's work). Installed: riser, CR2032 CMOS battery, P410i BBWC battery. Boot recovered after the install; "Sr timeout" during initial boot resolved itself by the next probe.
Morning tasks (pre-BBWC-probe) - DONE:
PBS apple-tank datastore converted to Removable Datastore (UUID-bound), so the Apple HDD can be physically unplugged/replugged without data loss. Commands: proxmox-backup-manager datastore remove apple-tank --keep-job-configs true, umount /mnt/datastore/apple-tank, then proxmox-backup-manager datastore create apple-tank apple-tank --backing-device c8845d1c-ede7-4067-9025-137670b47ac1 --reuse-datastore true.
NAS DNS regression fixed: created /etc/systemd/resolved.conf.d/99-dns.conf with DNS=10.0.10.1 1.1.1.1 and Domains=hm.iamkay.eu so OMV salt regeneration of the main interfaces file no longer breaks resolution. Long-term: also configure DNS in OMV web UI so the drop-in becomes belt-and-braces.
Daily PBS backup job created via pvesh create /cluster/backup --schedule '02:00' --storage pbs-apple-tank --all 1 --exclude 200 --mode snapshot --compress zstd --mailnotification failure --mailto [email protected] --prune-backups 'keep-last=3,keep-daily=7,keep-weekly=4,keep-monthly=12,keep-yearly=2'. Job ID 1286cbbf-cfee-4097-aef7-db3a55054658.
Stale legacy backup job deleted (backup-a7cdd9ce-bd47, schedule "sun 01:00", storage Backup-NAS — the storage entry was removed in Session 8 so the job would have started failing). Now only the new daily job exists.
VM 190 orphan LV reclaimed: lvremove -y /dev/pve/vm-190-disk-0 cleared the disposable restore-test VM's leftover disk. Thin pool pve/data dropped 6.80% → 4.71% (~7.3 GB freed, matches the actual written data from the restore drill).
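A quick arithmetic cross-check of the reclaim figures, assuming the restore drill's 7.85 GB is a decimal figure and the "~7.3 GB freed" is really binary GiB. The implied pool size is derived from these numbers only and was not measured; `thinpool_check` is a hypothetical helper name.

```shell
# Cross-check: restore-drill bytes written vs thin pool usage drop.
thinpool_check() {
  awk 'BEGIN {
    written_gib = 7.85e9 / 2^30   # 7.85 GB written, assumed decimal bytes
    drop_points = 6.80 - 4.71     # pve/data usage before and after lvremove
    printf "written: %.2f GiB\n", written_gib
    printf "pool drop: %.2f points\n", drop_points
    printf "implied pool size: %.0f GiB\n", written_gib / (drop_points / 100)
  }'
}

thinpool_check
```

The first line comes out at 7.31 GiB, which is consistent with the ~7.3 figure quoted for the freed space.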
BBWC firmware-rejection discovered (the headline finding). After clean shutdown + battery install + cold power-on via ssh ilo "power on" (SMASH CLP — works non-interactively), all logical drives showed Write cache: disabled in dmesg. Sideloaded ssacli 6.15-11.0 from HPE MCP to get a definitive verdict.
Battery is genuine + correct: HP Spares 462976-001, Part 460499-001, Ni-MH 4.8V 650mAh, mfg 03/26, VARTA Germany, RMN HSTNM-B011. Standard P410i BBWC battery, freshly manufactured. Photo on file.
ssacli verdict: Cache Board Present: True, Battery/Capacitor Status: OK, but Cache Status: Permanently Disabled with Cache Disable Reason: Permanent disable condition. Cache has been disabled because the wrong backup power source is attached to the cache module.
Root cause: P410i firmware 3.66 (released ~2010-2011) doesn't recognize the newer-rev battery's internal signature. The "battery revision allow-list" lives in firmware ROM, not NVRAM, so no amount of cold-booting / AC-draining will change the verdict. Same SKU as factory but internal revision moved on.
Fix path: P410i firmware update to 6.64 (latest, in SPP 2017.10.1). Standalone .scexe flash also possible. Both elevate Section 11 #19 (SPP) from "nice-to-have" to load-bearing for BBWC enable.
Operational reality until firmware update: write-through mode on PVE root + VM boot disks. Safe (write-through is the safer mode anyway), just slower. VM 189's RAID5 data arrays inside the guest are unaffected — they're on the mdadm layer, not P410i write cache.
ssacli sideload path that finally worked (after multiple URL failures last session): the HPE MCP repo uses a current/ subdirectory under each dist. Walk: https://downloads.linux.hpe.com/SDR/repo/mcp/debian/dists/bookworm/current/non-free/binary-amd64/Packages.gz, parse for Package: ssacli, fetch the Filename: field which gives the pool/... path relative to debian/. SHA256 verified against the declared one in Packages.gz. Captured in reference_hpe_mcp_layout.md for next time.
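The Packages.gz walk described above can be sketched as a stanza parser. This is a minimal sketch, not the exact commands that were run: `pkg_lookup` is a hypothetical helper that reads an uncompressed Debian Packages index on stdin and prints the Filename and SHA256 fields for one package. The curl lines in the comments assume curl and gunzip are available on the host.

```shell
# Print "<Filename> <SHA256>" for a named package from a Debian Packages
# index on stdin. Stanzas are blank-line separated, one field per line.
pkg_lookup() {
  awk -v pkg="$1" '
    BEGIN { RS = ""; FS = "\n" }
    {
      name = ""; file = ""; sum = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^Package: /)  name = substr($i, 10)
        if ($i ~ /^Filename: /) file = substr($i, 11)
        if ($i ~ /^SHA256: /)   sum  = substr($i, 9)
      }
      if (name == pkg) print file, sum
    }'
}

# Usage (network access required; base URL as recorded above):
#   base=https://downloads.linux.hpe.com/SDR/repo/mcp/debian
#   curl -fsSL "$base/dists/bookworm/current/non-free/binary-amd64/Packages.gz" \
#     | gunzip | pkg_lookup ssacli
# Then download "$base/<Filename>" and compare sha256sum against the printed hash.
```

If the index carries several versions of the package, one line is printed per stanza and the version still has to be picked by hand.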
VM 200 (PBS) did not auto-start on first probe despite onboot: 1. Did start a moment later on its own — boot timing race with NFS / pveproxy. False alarm; nothing to fix.
ipmitool sdr does NOT expose P410i BBWC status as a sensor on iLO 3. Only PSUs show up. Battery / cache state is queryable only via ssacli / iLO web UI Storage page. Worth knowing for future observability planning (Section 11 #12 — Prometheus + IPMI exporter won't capture BBWC; would need an ssacli-textfile-collector or iLO RIBCL scraper instead).
iLO power-on via SSH SMASH CLP confirmed working: ssh ilo "power on" is non-interactive, returns status=0 COMMAND COMPLETED, takes effect immediately. Same path for "power" (status query) and "power off" / "power reset". No web UI / IPMI-over-LAN needed.
2x HP EliteDesk 800 G5 SFF arrived. Intended use: HA Proxmox cluster (per Kay; reverses the earlier "OPNsense + observability" plan that lived in homelab-tracker Phase 6). Plan COMMITTED later same session: 2x more G5s to buy ~September 2026 (G5 #3 = cluster quorum/reserve, G5 #4 = OPNsense replacing pfSense); G7 stays standalone decoupled from the cluster's HA failover. Architecture details reconciled into homelab-tracker.md Phase 6 with 4 remaining open questions (shared storage, cluster network, boot ordering, switch-port exhaustion when G5 #4 arrives).
k8s framing correction + Path B re-engagement plan (added 2026-06-05 mid-Session-9). Earlier in this session I described k8s as "probably never for this homelab" — that contradicted CLAUDE.md §9 (placeholder for "Future HA k8s cluster") and learning-tracker.md B.5. Corrected across all three docs: k8s is a planned learning-track target on the future G5 cluster (Path B), not foundation-stack production. The LXC pattern stays as the production foundation choice. Both stacks coexist on the same G5 hardware once Phase 6 lands. Path B (learning) has been deferred during Sessions 4-9 due to foundation pressure; explicit plan to re-engage learning-driven sessions after Phase 3.5 monitoring wraps. Detail in learning-tracker.md and homelab-tracker.md Phase 6.
Authelia SSO+MFA DONE (Phase 3.5 of homelab-tracker). LXC 254 authelia, Debian 12 + Docker-in-LXC running authelia/authelia 4.39.20 + redis:alpine sidecar. Static 10.0.10.9/24 on VLAN 10. Authelia at 9091 internal; fronted via Traefik at https://auth.hm.iamkay.eu/. File-based user database (kay/[email protected], argon2id-hashed password, group admins), SQLite storage, filesystem notifier (codes in /opt/authelia/config/notification.txt; SMTP deferred to Path D family onboarding). Traefik ForwardAuth wired up: auth.yml defines authelia@file middleware; dashboard.yml replaces BasicAuth with Authelia; vault.yml splits into user-vault router (direct) + /admin* router (through Authelia). Access control: default deny; auth.hm.iamkay.eu bypass; traefik.hm.iamkay.eu + vault.hm.iamkay.eu/^/admin.*$ require two_factor for group:admins. Session: 5 min inactive / 1h max / 1 month remember-me on cookie domain hm.iamkay.eu → SSO transparent across all subdomains. TOTP MFA "Oak Techx Homelab" enrolled for kay via filesystem-notifier flow. End-to-end verified: traefik.hm.iamkay.eu/dashboard/ → 302 → Authelia → password + TOTP → dashboard loads, session cookie carries across subdomains. Build gotchas: (a) Traefik router priority is rule-length-auto; explicit priority: 10 made the more-specific router LOWER than default, requiring removal of the priority key; (b) Authelia 4.39 has auto-mapped config deprecations (jwt_secret, session domain, server.host/port) — works but hygiene cleanup deferred; (c) production-tight regulation (3/2min) bit during enrollment-flow UX (user closing the verification dialog revokes the code) — cleared via Redis FLUSHALL + SQLite authentication_logs DELETE. Initial credentials in Vaultwarden under "Authelia - kay user". TOTP seed in same item.
Vaultwarden + admin TLS routing + secrets-vault lockdown DONE (Phase 3.4 of homelab-tracker). LXC 252 vaultwarden, Debian 12 + Docker-in-LXC running vaultwarden/server:latest 1.36.0 (LXC features nesting=1,keyctl=1). Static 10.0.10.7/24 on VLAN 10. Container on 10.0.10.7:8080, fronted via Traefik at https://vault.hm.iamkay.eu/ with the existing LE wildcard cert. SQLite backend; data at /opt/vaultwarden/data/; PBS-backed daily. Container lifecycle via vaultwarden-compose.service systemd unit. Initial Vaultwarden binary install attempt failed because vaultwarden upstream doesn't ship prebuilt binaries (only Docker images) — pivoted to Docker-in-LXC pattern (the "right" pattern for upstream-Docker-distributed apps; first concrete example of the LXC-foundation + Docker-for-multicontainer-apps strategy described in CLAUDE.md §9). Admin services repointed through Traefik same session: DNS records for arochukwu/pbs/homenas updated in LXC 250 Unbound from their direct backend IPs to 10.0.10.10 (Traefik); new Traefik dynamic config /etc/traefik/dynamic/admin-services.yml routes via insecureSkipVerify serversTransport (backends have self-signed certs). All four (vault + arochukwu/pve + pbs + homenas) verified browser-trusted from Kay's laptop. arochukwu-ilo stays DIRECT (iLO 3 TLS 1.0 only — Traefik's Go TLS stack refuses; documented in reference_ilo3_v194_quirks.md). Lockdown done same session: signups + invitations disabled (config.json signups_allowed=false, invitations_allowed=false); admin token rotated, argon2id-hashed with vaultwarden's bitwarden preset (m=65536, t=3, p=4), stored in config.json (overrides docker-compose env); Cloudflare API token rolled in Cloudflare web UI ("Edit zone DNS 4 traefik-hm" → Roll, scope unchanged at iamkay.eu), new value pushed to /etc/traefik/cloudflare.token 0600 traefik:traefik, Traefik restarted, dashboard + cert chain verified intact. 
12-item credential migration done by Kay via Vaultwarden web UI (folders: Homelab/Infrastructure, Homelab/Services, Homelab/API keys, Homelab/SSH keys, Personal, Family). Build gotchas surfaced in Session 9: (a) docker-compose v1 interpolates $ in env values — argon2 hash needs $$ escaping in YAML (but config.json overrides so the YAML escape is cosmetic-only at runtime); (b) bash's ${VAR//\$/\$\$} substitution expands $$ to bash PID — use sed 's/\$/\$\$/g' instead; (c) nested-heredocs through SSH+pct exec consistently mangle YAML — use Write→scp→pct push pattern. Pi-hole queued as Phase 3.6 in homelab-tracker.md (DNS UX upgrade with Unbound sidecar, after Phase 3.5 Authelia). DEFERRED: break-glass card (paper printout with master pw + critical recovery creds) — template at D:\PVE\break-glass-card.txt, Kay to fill + print + store + delete; tracker keeps it surfaced.
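Gotchas (a) and (b) both come down to doubling every `$` before the value reaches docker-compose. A minimal sketch of the same transformation as the `sed 's/\$/\$\$/g'` workaround, done in Python so no shell expansion is involved. The function name is hypothetical and the sample string is a placeholder with the shape of an argon2id hash, not a real one:

```python
def escape_for_compose(value: str) -> str:
    """Double every '$' so Compose interpolation leaves the value intact."""
    return value.replace("$", "$$")


# Placeholder only: parameters match the bitwarden preset noted in this
# document (m=65536, t=3, p=4); salt and hash parts are dummy text.
sample = "$argon2id$v=19$m=65536,t=3,p=4$c2FsdA$aGFzaA"

print(escape_for_compose(sample))
```

Because the replacement is a plain string operation, there is no equivalent of bash turning `$$` into a PID.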
Traefik LXC built + LE wildcard cert + dashboard reachable end-to-end DONE (Phase 3.3 of homelab-tracker). LXC 251 traefik, Debian 12 + Traefik v3.7.4, 1 vCPU / 1024 MB / 8 GB on local-lvm, unprivileged, onboot=1. Static 10.0.10.10/24 on VLAN 10. ACME with Let's Encrypt production endpoint, DNS-01 via Cloudflare API (token cfut_…13ae in /etc/traefik/cloudflare.token 0600 traefik:traefik — needs rotation once Vaultwarden lands since the value appeared in chat history). Wildcard cert *.hm.iamkay.eu valid 2026-06-05 → 2026-09-03, auto-renews. Dashboard at https://traefik.hm.iamkay.eu/dashboard/ behind BasicAuth (admin user; pw stashed at /root/traefik-build/traefik-dashboard.password until Vaultwarden). pfSense Unbound Domain Override added (hm.iamkay.eu → 10.0.10.53) so every network device using pfSense for DNS resolves *.hm.iamkay.eu via the internal Unbound LXC — no per-device hosts file needed for any future service. Verified end-to-end from Kay's laptop: nslookup returns 10.0.10.10, browser loads dashboard with green-padlock LE cert, BasicAuth prompt works. Build script + logs preserved at /root/traefik-build/ on PVE host. Build gotcha noted: nested heredoc + backtick interactions across SSH layers silently mangle YAML configs — always write configs locally, scp to host, then pct push to LXC; caught two failures of this type during Session 9.
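The wildcard cert window above works out to a standard 90-day Let's Encrypt lifetime. A small sketch for checking how much validity is left on a given day; the 30-day renewal window is an assumption for illustration and should be replaced with whatever the Traefik resolver is actually configured to use:

```python
from datetime import date

NOT_BEFORE = date(2026, 6, 5)  # validity start, from the note above
NOT_AFTER = date(2026, 9, 3)   # validity end, from the note above
RENEW_WINDOW_DAYS = 30         # assumption, not read from the Traefik config


def days_left(today: date) -> int:
    """Days of validity remaining on the given day."""
    return (NOT_AFTER - today).days


def renewal_due(today: date) -> bool:
    """True once the cert is inside the assumed renewal window."""
    return days_left(today) <= RENEW_WINDOW_DAYS


print((NOT_AFTER - NOT_BEFORE).days)
```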
Internal DNS LXC built + cutover DONE (Phase 3.2 of homelab-tracker). LXC 250 dns-internal, Debian 12 (bookworm) + Unbound 1.17.1, 1 vCPU / 512 MB / 8 GB on local-lvm, unprivileged, onboot=1. Static 10.0.10.53/24 on VLAN 10. Authoritative for hm.iamkay.eu. with initial A + PTR records (dns-internal, arochukwu, homenas, pbs, arochukwu-ilo). Full recursive resolver elsewhere with DNSSEC validation working end-to-end (ad flag confirmed). ACL allows VLAN 10 + VLAN 40. Listens on both 10.0.10.53:53 and 127.0.0.1:53. Initial root password generated + provided to Kay for Bitwarden; SSH key auth from host's root via injected authorized_keys. Cutover completed same session: PVE host /etc/resolv.conf → primary 10.0.10.53 + fallback 1.1.1.1; NAS /etc/systemd/resolved.conf.d/99-dns.conf → DNS=10.0.10.53 1.1.1.1; iLO web UI (manual by Kay) → primary 10.0.10.53, secondary 172.16.1.1 (pfSense MGMT), tertiary 1.1.1.1. Section 11 #10 marked DONE.
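The A + PTR pairs described above can be generated rather than hand-typed. A minimal sketch that emits Unbound `local-data` and `local-data-ptr` lines; only the two addresses stated in this document are included, and the function name and layout are assumptions, not a copy of the config on LXC 250:

```python
import ipaddress

ZONE = "hm.iamkay.eu"

# Only hosts whose addresses appear in this document; others are omitted.
RECORDS = {
    "dns-internal": "10.0.10.53",
    "arochukwu-ilo": "172.16.1.3",
}


def unbound_lines(host: str, ip: str) -> list[str]:
    """Build the forward (A) and reverse (PTR) lines for one host."""
    ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    fqdn = f"{host}.{ZONE}."
    return [
        f'local-data: "{fqdn} IN A {ip}"',
        f'local-data-ptr: "{ip} {fqdn}"',
    ]


for host, ip in RECORDS.items():
    for line in unbound_lines(host, ip):
        print(line)
```

Generating both lines from one table keeps forward and reverse records from drifting apart when an address changes.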
SPP firmware update ISOs DOWNLOADED + SHA256-verified + STAGED in PVE ISO storage 2026-06-05 (14:07 download start, 14:21 watchdog completed). NOT FLASHED - this is file-staging only. The actual firmware update to the P410i controller happens at the second chassis-open via iLO Virtual Media + SPP boot, NOT today. Two ISOs in /var/lib/vz/template/iso/, visible in PVE web UI as ISO Images. Both downloaded directly to PVE host (no Windows AV in the path), SHA256 cross-verified against two independent sources BEFORE download AND re-verified against downloaded bytes:
PRIMARY: SPP G7.1.3 Gen7-specific post-production stream (P04243_001_spp-g7.1-SPPG71.3.iso, 5,159,948,288 bytes / 4.81 GB, SHA256 99293396855a023d2d8c0df270dfb9b60ef83878c32a83ca102565f93c5ac80f, from ftp.paulla.asso.fr mirror, file type confirmed as bootable ISO 9660 with volume label SPPG71).
FALLBACK: SPP 2017.10.1 multi-gen (P01456_001_spp-2017.10.1-SPP2017101.2017_1027.10.iso, 5,431,967,744 bytes / 5.06 GB, SHA256 7d5d7ec56c185610a1b711e1ddd601e3bea7111c959827204b3aa30fb39f6eef, from archive.org Internet-Archive-preserved copy, file type confirmed as bootable ISO 9660 with volume label SPP2017101). One transient HTTP 500 from archive.org on the initial 2017.10.1 connection self-resolved via curl's --retry 3 --retry-delay 10; final downloaded bytes verify clean against the expected SHA. Provenance + checksums recorded in /var/lib/vz/template/iso/SPP-CHECKSUMS.txt. Apply during the second chassis-open (alongside HBA install) via iLO Virtual Media — this is the BBWC enable trigger. Section 11 #19 has full detail.
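The size + SHA256 check described above can be repeated at any time before mounting an ISO. A minimal sketch, assuming nothing about the existing watchdog script (this is not that script; function names are hypothetical). It streams the file in chunks so a 5 GB ISO does not have to fit in memory:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_sha256: str, expected_size: int) -> bool:
    """Cheap size check first, then the full hash comparison."""
    if path.stat().st_size != expected_size:
        return False
    return sha256_of(path) == expected_sha256.lower()


# Example call with the PRIMARY values recorded above:
# verify(
#     Path("/var/lib/vz/template/iso/P04243_001_spp-g7.1-SPPG71.3.iso"),
#     "99293396855a023d2d8c0df270dfb9b60ef83878c32a83ca102565f93c5ac80f",
#     5_159_948_288,
# )
```

Checking the byte count first fails fast on a truncated download without hashing the whole file.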

11. PENDING TASKS (prioritized, agreed 2026-05-30)

These are the tactical items queued after Session 3. Section 8 still holds the high-level foundation milestones (a)-(d); this list is the finer-grained breakdown plus tasks that sit alongside those milestones.

Maintenance window bundle (revised 2026-06-05 after BBWC install + firmware-rejection finding)

First chassis-open DONE 2026-06-05 (opportunistic, during power outage). Installed: riser (#1), CR2032 CMOS battery (#4), P410i BBWC battery. The BBWC physically lives in the controller now but cache is firmware-rejected on 3.66 — see Session 9 log + Section 5. Firmware update to 6.64 will flip the cache to enabled.

Second chassis-open gate: HBA + 2nd backplane + SSDs in hand. When all parts are present, bundle ALL of these in the second chassis-open:

LSI 9305-16i HBA installation in a PCIe x8 slot (Path B1)
HP 516966-B21 2nd 8-bay backplane installation (bays 9-16)
New data SSDs installed into bays 9-16 (or wherever they fit)
SFF-8643 ↔ SFF-8087 cables between HBA and new backplane (2 cables typically)
P410i firmware update to 6.64 via the staged SPP ISO (#19; G7.1.3 primary, 2017.10.1 fallback). Now load-bearing for BBWC enable, not optional. Bundle this with the chassis-open since iLO Virtual Media is needed and chassis is open anyway.
Visual inspection of PSU vents for dust accumulation

Items DONE in earlier chassis work (kept here as historical pointer):
- ~~Riser install (#1)~~ DONE 2026-06-05
- ~~CMOS / RTC battery replacement (#4)~~ DONE 2026-06-05
- ~~P410i BBWC battery installation~~ DONE 2026-06-05 (cache firmware-rejected; enables after #19)
- ~~Cat6A cable run (#2)~~ DONE 2026-06-04 (Sessions 5-7)

Hardware / physical

Riser install — DONE 2026-06-05 (Session 9). Installed during the opportunistic power-outage chassis-open.
Cat6A cable run — DONE 2026-06-04. Fish-tape work completed; cable is the physical substrate for the VLAN trunk used by Sessions 5-7.
Decide and buy SSDs for the P410i array (Section 8a). Specs: compatibility-vet against P410i first.
CMOS / RTC battery replacement — DONE 2026-06-05 (Session 9). CR2032 installed during the opportunistic chassis-open. Verify on next cold-from-AC-removal boot that timestamps no longer show [NOT SET] in the iLO IML.

Network

TP-Link TL-SG108E VLAN config from laptop. Laptop direct, 192.168.0.100/24 to the switch admin IP. Plan VLANs (10 servers, 40 management, 70 storage, 100 mgmt sub-iface) before the CRS310 purchase so the same VLAN map can be re-applied.
Bond architecture simplification — DONE 2026-05-31. bond0 is active-backup over enp3s0f0/enp3s0f1; bond1/bond2 removed. Detail in Section 4 + Session 4 log entry.
MikroTik CRS310 + proper VLAN switching (Section 8b).
Move iLO to MGMT VLAN 40 — DONE 2026-06-04 (Session 7). New IP 172.16.1.3/24, gw 172.16.1.1, in TL-SG108E port 6 (already untagged VLAN 40, no switch reconfig was needed). Untagged on the iLO side (802.1q VLAN ID = Disabled). Switched via in-band KCS + cold reset. Detail in Session 7 log entry.
Lock down cross-VLAN access to MGMT — DONE 2026-06-04 (Session 7). pfSense rule on the SERVERS interface (VLAN 10) blocks any traffic from SERVERS subnets to MGMT subnets (172.16.1.0/24). Sits between the existing "Block SERVERS to LAN" and "Allow SERVERS to internet" rules so internet egress from VLAN 10 still works. Verified six ways: laptop (VLAN 40) -> iLO + host VLAN-40 IP still work (intra-VLAN, pfSense not in path); VM 189 (VLAN 10) -> iLO + host VLAN-40 IP now blocked; VM 189 -> pfSense gw + internet still work. Note: intra-VLAN-40 access to iLO/host UDP/623 stays open by design (pfSense cannot filter L2 traffic on the same broadcast domain); blast radius for IPMI exposure is now just the VLAN 40 segment (laptop + host + iLO itself).
iLO IPv4 DNS — DONE 2026-06-05 (Session 9). Web UI Network -> Dedicated Network Port -> IPv4. DNS resolvers now point at the internal DNS LXC: Primary 10.0.10.53 (LXC 250 dns-internal), Secondary 172.16.1.1 (pfSense MGMT IP — same VLAN as iLO, no cross-VLAN routing for fallback), Tertiary 1.1.1.1 (public last-resort). Three-tier failure model: Tier 1 internal LXC; Tier 2 pfSense Unbound (would NXDOMAIN on hm.iamkay.eu queries since pfSense doesn't know the zone yet — could be fixed with a forward zone on pfSense Unbound, future hardening); Tier 3 public. Remaining cosmetic cleanup (deferred, non-blocking): uncheck "WINS Server Registration" on the same page.
SSH key auth for ssh ilo — DONE 2026-06-04 (Session 7). DSA-1024 keypair at ~/.ssh/id_dsa_ilo on the laptop, authorized on iLO for user kay via oemhp_loadSSHKey CLI command + HTTP-fetched key. RSA does NOT work on iLO 3 v1.94. Detail in Session 7 log entry and Section 6 iLO subsection.
Configure iLO SNMP alert destinations once observability stack exists. Current state: SNMP Alerts enabled, Forward IM Agent SNMP enabled, SNMP Pass-thru enabled, but all three Alert Destination IP boxes are empty — alerts are generated but go nowhere. Configure once Box 2 (the HP EliteDesk 800 G5 SFF observability host) is online. Where: iLO web UI -> Administration -> Management -> SNMP Settings -> SNMP Alert Destination(s). Two integration paths: a) SNMP receiver (LibreNMS or Prometheus snmp_exporter): point iLO destinations at receiver IP; async push on fan / PSU / temp / drive events. Traditional approach. b) IPMI exporter for Prometheus (preferred for this stack): Prometheus scrapes iLO directly via IPMI-over-LAN; pull model, integrates cleanly with Grafana. Use cipher suite 3 (-C 3) to dodge the cipher-negotiation warning. More granular than SNMP — reads sensor values, not just events. Decision deferred until observability architecture is chosen.
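For the cross-VLAN lockdown item above, the placement of the MGMT block rule matters because pfSense evaluates interface rules top-down and the first match decides. A minimal sketch of that ordering; the MGMT subnet is from this document, while the LAN subnet and the rule name for the MGMT block are hypothetical placeholders:

```python
import ipaddress
from typing import List, NamedTuple


class Rule(NamedTuple):
    name: str
    action: str  # "block" or "pass"
    dst: str     # destination network in CIDR form


# Rules on the SERVERS (VLAN 10) interface, in evaluation order.
# The LAN subnet below is a hypothetical placeholder.
SERVERS_RULES = [
    Rule("Block SERVERS to LAN", "block", "192.168.1.0/24"),
    Rule("Block SERVERS to MGMT", "block", "172.16.1.0/24"),
    Rule("Allow SERVERS to internet", "pass", "0.0.0.0/0"),
]


def decide(dst_ip: str, rules: List[Rule] = SERVERS_RULES) -> str:
    """Return the action of the first rule whose destination matches."""
    addr = ipaddress.ip_address(dst_ip)
    for rule in rules:
        if addr in ipaddress.ip_network(rule.dst):
            return rule.action
    return "block"  # traffic matching no rule is dropped


print(decide("172.16.1.3"))  # iLO on MGMT
print(decide("1.1.1.1"))     # internet egress
```

Moving the allow rule above the MGMT block would make `decide("172.16.1.3")` return `"pass"`, which is why the block has to sit between the two existing rules.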

Storage / host

Migrate Proxmox to SD card (Section 8c).
Free the 512 GB P410i RAID volume (Section 8d).
Investigate P410i RAID layout. Toolchain ready as of Session 9 (ssacli 6.15-11.0 installed on host). Run ssacli ctrl all show config detail to map physical drives -> logical volumes / passthroughs. Identify which drives back Proxmox boot vs OMV backing storage vs anything else. Record drive serial numbers for future replacement planning. The Session 9 BBWC probe already captured controller-level detail (cache + battery state); the remaining inventory work is the per-drive mapping and SMART correlation (feeds #16).
Crucial M4 SSDs (Bay 5-8) age check. These are 2011-era consumer SSDs running in the array. Need SMART health check: smartctl -a /dev/sdX once device paths are known (output of #15 informs the mapping). Plan for replacement if any show >80% wear, pending sectors, or reallocation events. Crucial M4 famously had a firmware bug at 5184 power-on hours; the patched firmware (0309) should already be on these but worth confirming in the SMART output.
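The replacement criteria for the M4 drives can be written down as a small check so every drive is judged the same way. A minimal sketch, assuming the three values have already been read out of the `smartctl -a` output by hand or by a separate parser (the class and field names are hypothetical, and no attribute parsing is attempted here):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SmartSummary:
    wear_pct: int            # percent of rated life used
    pending_sectors: int     # sectors waiting for reallocation
    reallocated_events: int  # reallocation events recorded so far


def replacement_reasons(s: SmartSummary) -> List[str]:
    """Return every criterion from the plan above that the drive trips."""
    reasons = []
    if s.wear_pct > 80:
        reasons.append(f"wear {s.wear_pct}% > 80%")
    if s.pending_sectors > 0:
        reasons.append(f"{s.pending_sectors} pending sectors")
    if s.reallocated_events > 0:
        reasons.append(f"{s.reallocated_events} reallocation events")
    return reasons


# Made-up values for illustration, not readings from the real drives.
print(replacement_reasons(SmartSummary(85, 0, 2)))
```

An empty list means the drive passes all three criteria; the firmware version still needs a manual look, since it is not covered by this check.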

Recovery follow-through

HFS+ triage of apple-1tb.img - DONE. Wedding photos (the primary recovery target that started this whole effort 2026-05-27) extracted from the HFS+ image and placed on the homeNas SMB share, confirmed reachable from Windows. The original s-tank/recovery/apple-1tb.img (~1 TB ddrescue'd image with 4 KB bad sectors at LBA 124.5 GB) remains on s-tank as the source-of-truth archive - can be deleted to reclaim space OR moved to external cold storage when convenient. The ClamAV scan recommended at recovery time can still be run against the SMB share as a hygiene check before long-term family-cloud serving.

Firmware / maintenance (low priority - not urgent)

Update DL380 G7 System BIOS from P67 (2011-05-05) to P67 (2018 final release). What the newer BIOS brings:
- CPU microcode update (Spectre / Meltdown mitigations)
- Improved memory training
- Misc bug fixes

Install path: USB-key boot, OR iLO Virtual Media (now usable thanks to the iLO Advanced license applied 2026-05-30: mount the ISO over the network, no physical media trip required). HP part: SP59653.exe / sp59653.zip from HPE Support Center. Not urgent: no observed operational issues with the current 2011 BIOS, just stale microcode. Prefer item #19 (SPP) over this if SPP can be safely obtained, since SPP bundles BIOS + iLO + NIC + RAID + drive firmware in one operation.
HPE SPP for Gen7 - ISOs DOWNLOADED + STAGED 2026-06-05 (Session 9), LOAD-BEARING for BBWC enable. Was "Firmware / maintenance (low priority - not urgent)" until Session 9 discovered that P410i firmware 3.66 rejects the freshly-installed BBWC battery as "wrong backup power source" - a known issue fixed by P410i firmware 6.64, included in the Gen7 SPP. Until this is applied, the P410i runs in write-through mode (safe, slower).

Two ISOs acquired + verified + STAGED in PVE ISO storage 2026-06-05 (download initiated 14:07 CEST, watchdog completed at 14:21 CEST with both SHA256s matching expected; both ISO files now at /var/lib/vz/template/iso/). NOT YET FLASHED to the controller — staging only. The actual firmware update happens at the second chassis-open via iLO Virtual Media → boot from ISO → SPP updater interactively flashes P410i + BIOS + iLO + NIC firmware → reboot. Until then, P410i firmware stays at 3.66 and BBWC stays in "Permanently Disabled" state.

PRIMARY - SPP G7.1.3 (Gen7-specific post-production) ← explicitly the safest match for "must be DL380 G7":
- Filename: P04243_001_spp-g7.1-SPPG71.3.iso
- Size: 4.81 GB (5,159,948,288 bytes)
- SHA256: 99293396855a023d2d8c0df270dfb9b60ef83878c32a83ca102565f93c5ac80f
- Source: https://ftp.paulla.asso.fr/pub/firmwares/hpe/spp/P04243_001_spp-g7.1-SPPG71.3.iso
- SHA corroborated from two independent sources (paulla mirror notes.txt + dokumen.tips contents report)
- This is the Gen7-only post-production stream HPE created after dropping G7 from mainstream SPP. Last-of-its-kind for ProLiant G7.

FALLBACK - SPP 2017.10.1 (multi-gen, also covers G7):
- Filename: P01456_001_spp-2017.10.1-SPP2017101.2017_1027.10.iso
- Size: 5.06 GB (5,431,967,744 bytes)
- SHA256: 7d5d7ec56c185610a1b711e1ddd601e3bea7111c959827204b3aa30fb39f6eef
- Source: https://archive.org/details/p-01456-001-spp-2017.10.1-spp-2017101.2017-1027.10.iso
- SHA from digiboy.ir source; re-verified against the downloaded bytes by the watchdog (match).

Verified state (as of 2026-06-05 14:22 CEST): Both ISOs at /var/lib/vz/template/iso/, SHA256s match expected, file type confirmed as bootable ISO 9660. Provenance recorded in /var/lib/vz/template/iso/SPP-CHECKSUMS.txt. Ready to mount via iLO Virtual Media at the second chassis-open. The watchdog log (/root/spp-staging/verify-watchdog.log) and verification script (/root/spp-staging/verify-and-install.sh) remain on the host for audit; staging dir is otherwise empty.

Alternative lighter path: standalone P410i .scexe flash. HP's "Smart Component" Linux firmware updater for just the P410i (~5-10 MB), runnable on the PVE host, one reboot. Lower-risk than full SPP, doesn't update BIOS/iLO/NIC at the same time. The G7.1 ISO contains the same .scexe extractable via xorriso or 7z once verified, if a targeted-flash-only path is desired later. Not being downloaded separately - extract from SPP if/when needed.

Plan when ready:
1. Spin up isolated Linux container or VM (Proxmox LXC or Docker) with no persistent mounts to real data, network access only to the download target, throwaway environment.
2. Download SPP ISO from community mirror (ugg.li or alternative).
3. Verify integrity in the isolated env:
   - SHA256 vs any checksum published in HPE archived docs.
   - Mount ISO read-only, sanity-check the expected directory structure / catalogue files.
   - ClamAV (or similar) scan.
   - Cross-reference file size against HPE catalog.
4. If verified clean:
   - Copy ISO to safe storage location.
   - Record SHA256 in services.md (or equivalent).
5. Application path:
   - Mount ISO via iLO Virtual Media (Advanced license unlocks this).
   - Boot DL380 from virtual CD.
   - SPP auto-detects DL380 G7, presents available updates.
   - Flash BIOS + iLO + NIC + RAID + drive firmware in one run.
   - Reboot.

Apply during: the second chassis-open (alongside HBA + backplane + SSD install). BBWC enable depends on this; bundle them.

Alternative if SPP ISO can't be safely obtained: skip BIOS / SPP update entirely. Proxmox 8's kernel-level Spectre / Meltdown mitigations are sufficient for this homelab's threat model. Revisit only if the threat model changes (e.g. introducing untrusted multi-tenant workloads).
Boot order cleanup (cosmetic). Remove Floppy from the BIOS boot order to skip its POST probe. Saves ~2-3 seconds per boot. Not urgent. Do via BIOS F9 RBSU or via iLO web UI's Boot Order settings (now reachable with the web UI working).

Hardware aging / monitoring (informational, not urgent)

PSU intermittent failures during cold-start (May 2026) — awareness item, no action right now. Both PSUs are PS-2122-2H, ~14 years old. They showed intermittent failures during the cold-start in May 2026, correlated with memory training failures (so the DIMMs themselves are probably fine — it was PSU instability cascading into the memory subsystem during init). Probable cause: electrolytic capacitor drying after the 15-month dormancy; behaviour normalised after a few minutes of warm-up. Monitor: watch for recurrence after the next planned reboot — if the same symptoms appear on a stable, warmed-up system, the caps are likely past their service life. Mitigation if needed: replacement PSUs — used HP DPS-750RB-1 or HSTNS-PD30 from eBay / Marktplaats (~€25-50 each). Two PSUs in the chassis = swap one at a time without downtime if both are seated and healthy.
Temp 30 sensor - currently 69 C, alert if >85 C. Highest-running thermal point in the iLO sensor map. iLO's own thresholds are caution 110 C / critical 115 C, which leaves plenty of headroom, but a sustained climb above ~85 C suggests dust buildup in the heatsink / vents or a degrading fan, and warrants the dust inspection in the maintenance window bundle. No action right now; just keep an eye on it via iLO web UI Information -> System Information -> Temperatures, or via ipmitool sdr type Temperature.
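The three thresholds for Temp 30 can be folded into one classification so a reading maps straight to an action. A minimal sketch; the band names are hypothetical labels, and the reading itself would come from the iLO web UI or ipmitool as described above:

```python
ALERT_C = 85      # homelab watch threshold from this document
CAUTION_C = 110   # iLO caution threshold
CRITICAL_C = 115  # iLO critical threshold


def classify(temp_c: float) -> str:
    """Map a Temp 30 reading in degrees C to a named band."""
    if temp_c >= CRITICAL_C:
        return "critical"
    if temp_c >= CAUTION_C:
        return "caution"
    if temp_c > ALERT_C:
        return "inspect"  # schedule the dust / fan inspection
    return "ok"


print(classify(69))
```

A single high reading is less meaningful than a sustained one, so any alerting built on this would want to look at several consecutive samples.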