Storage Hardening Tracker - Oak Techx Homelab
Goal: put the storage in an efficient, sound structure using ONLY what we currently have. Zero purchases. No new hardware. No architectural rebuild. Every item below is either a free tool, a config change, or space reclaim on the existing DL380 G7 + OMV (VM 189) + the two mdadm-RAID5/btrfs arrays. Created 02 June 2026, after the Apple-image wedding recovery closed out.
Hard constraints (do not violate)
- NOTHING in this plan requires buying anything. If a step seems to need new hardware, it is out of scope - flag it, do not do it.
- Explicitly OUT of scope (already ruled out by Kay, ~2-year hardware horizon): HBA card, ZFS migration, replacing the 14-year-old Crucial M4 SSDs, MikroTik switch. None of that here.
- VM 189 (homeNas) and its arrays: every change is a config/software change inside the existing OMV stack. No array rebuilds, no disk repartitioning.
- propose-stop-confirm on anything that writes. Read-only discovery first.
- No em dashes in generated content for Kay. English only.
Current storage state (baseline, 02 June 2026)
- s-tank (md1 RAID5, btrfs): 1.4 TB, ~8% used / ~1.3 TB free after the apple-1tb.img deletion. Bays 5-8 = Crucial M4 SSDs, ~14 years old.
- r-tank (md0 RAID5, btrfs): 839 GB, ~48% used. Bays 2-4 = HP EG0450FBDSQ SAS.
- Both arrays clean, no degradation. P410i presents each drive as single-drive RAID-0, which MASKS SMART from the OS.
- Structure is sound: real btrfs subvolumes, shadow_copy2 working, weekly snapshots on r-tank.
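A read-only way to re-check the "both arrays clean" baseline is to parse /proc/mdstat. The sketch below is a minimal example under that assumption, not part of the existing tooling; the function name degraded_arrays is hypothetical. It prints the name of every md array whose member map contains an underscore (a missing or failed member).

```shell
# Print the name of each md array whose status map (e.g. [UU_]) shows a missing member.
# $1 = mdstat-format file to read; defaults to the live /proc/mdstat. Read-only.
degraded_arrays() {
    awk '
        /^md[0-9]+/ { name = $1 }
        match($0, /\[[U_]+\]/) && index(substr($0, RSTART, RLENGTH), "_") { print name }
    ' "${1:-/proc/mdstat}"
}
```

Run with no argument inside VM 189, it should print nothing while md0 and md1 are healthy. An array that is still resyncing also shows an underscore, so a hit means "look closer", not necessarily "drive dead".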
The two real risks this plan closes (both free to fix)
- DRIVE HEALTH IS INVISIBLE. P410i masks SMART; ssacli not installed. On 14-year-old M4 SSDs this is the biggest blind spot. Fix = install a free HP utility. No hardware.
- FAILURE ALERTS GO NOWHERE. OMV notifications were configured, but delivery was later disabled. A degraded RAID5 could go unnoticed until a second drive fails, and a second failure means data loss. Fix = re-enable existing OMV notifications via a free SMTP relay.
Redundancy without monitoring is redundancy whose failure you discover too late. These two fixes convert "I have RAID" into "I will hear about it in time to act."
TIER 0 - visibility and safety (do first, all free)
0.1 Install ssacli for drive SMART visibility
- What: HP Smart Storage Admin CLI. Free utility. Lets us read SMART/health the P410i currently hides.
- Why first: 14-year-old M4 SSDs in bays 5-8. We need to SEE their health.
- Cost: none (software). Risk: read tool, no array changes.
- Note: install on the PVE host (the P410i lives there). apt cannot reach the mirror in the current network - sideload the .deb from the laptop, same pattern as testdisk/ffmpeg. Get the ssacli .deb from HPE's package repository or another trusted mirror.
- Done when: `ssacli ctrl all show config` lists the controller and we can read per-physical-drive SMART status, especially the four M4 SSDs.
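Once ssacli is installed, the per-drive check could be reduced to a filter that prints only drives not reporting OK. This is a minimal sketch under stated assumptions: the controller is assumed to sit in slot 0 (confirm with `ssacli ctrl all show`), the status line is assumed to end in the drive state (as in "physicaldrive 1I:1:5 (...): OK"), and the function names failing_drives and check_p410i are hypothetical. ssacli reports the controller's view of each drive; if raw SMART attributes are needed, smartmontools can often read them through Smart Array controllers with its `-d cciss,N` device type, which would need testing on this host.

```shell
# Filter `ssacli ... pd all show status` text on stdin.
# Prints every physicaldrive line whose final word is not "OK".
failing_drives() {
    awk '/physicaldrive/ && $NF != "OK" { sub(/^[ \t]+/, ""); print }'
}

# Hypothetical wrapper for the PVE host. slot=0 is an assumption.
# Read-only: queries drive status, changes nothing on the arrays.
check_p410i() {
    ssacli ctrl slot=0 pd all show status | failing_drives
}
```

Empty output from `check_p410i` would mean every drive reports OK. Any printed line is a drive to inspect with `ssacli ctrl slot=0 pd all show detail`.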
0.2 Re-enable OMV failure notifications (SMTP relay)
- What: OMV notification system is installed but delivery is disabled. Wire it to a free SMTP relay (e.g. a Gmail account Kay already has, app password).
- Why: mdadm degrade + SMART warnings must reach Kay. Currently silent.
- Cost: none (uses an existing email account). Risk: config only.
- Verify the old vzdump/notification target too - the handover flagged a [email protected] address possibly sitting on 15 months of failure mail. Confirm or replace.
- Done when: a test notification arrives, AND mdadm/SMART event categories are enabled in OMV notification settings.
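Beyond the OMV test button, the mdadm alert path itself can be exercised end to end. The sketch below is a hedged example: it assumes the Debian default config path /etc/mdadm/mdadm.conf (OMV generates this file, so treat it as read-only here), and the function names are hypothetical. The first function reports which address mdadm would mail; the second asks mdadm to send one test alert per array.

```shell
# Print the MAILADDR that mdadm alerts go to; fail if none is configured.
# $1 = config file; defaults to the Debian location (assumption for this host).
mdadm_mailaddr() {
    local conf="${1:-/etc/mdadm/mdadm.conf}"
    awk '$1 == "MAILADDR" { print $2; found = 1 } END { exit !found }' "$conf"
}

# Send one test alert per array through the configured mail path.
# Does not touch the arrays; it only generates mail.
send_mdadm_test_alert() {
    mdadm --monitor --scan --oneshot --test
}
```

If `mdadm_mailaddr` fails, or prints an address nobody reads, a degrade event has nowhere to go regardless of the SMTP relay settings.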
0.3 Fix the open NFS export /export/k8sdata
- What: it currently exports to `*` (any host). Scope it to specific hosts or remove it (the k8s VMs that used it are destroyed).
- Why: security hygiene. Harmless today behind the firewall, but wrong.
- Cost: none. Risk: config; confirm nothing live still mounts it first.
- Done when: `exportfs -v` shows no wildcard export (exportfs prints a `*` client as `<world>`), or the k8sdata export is removed.
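Because `exportfs -v` wraps long paths onto a second line, it can be easier to audit the exports file directly. The following is a minimal sketch, assuming the standard /etc/exports layout (path followed by client(options) fields); the function name world_exports is hypothetical. OMV generates /etc/exports, so the actual change belongs in the OMV NFS share settings, not in a hand edit.

```shell
# Print every exported path that has a wildcard (*) client.
# $1 = exports file; defaults to /etc/exports. Read-only.
world_exports() {
    local file="${1:-/etc/exports}"
    awk '
        /^[ \t]*#/ { next }                       # skip comment lines
        {
            for (i = 2; i <= NF; i++)
                if ($i == "*" || index($i, "*(") == 1) { print $1; break }
        }
    ' "$file"
}
```

Empty output matches the done condition above. Before removing the export, `ss -tn state established '( sport = :2049 )'` on the NAS is one way to see whether any client still holds an NFS connection.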
TIER 1 - efficiency and structure (free, current hardware)
1.1 Enable btrfs compression (zstd) on r-tank
- What: turn on transparent zstd compression (start zstd:3) on r-tank.
- Why: free space savings on text/code/docs - exactly the Path A (GitLab) and Path D (Nextcloud) content profile. High impact, no hardware.
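- How: add compress=zstd:3 to the mount options (this affects new writes only), then recompress existing data with btrfs filesystem defragment -r -czstd. A balance does not recompress data. Caveat: defragmenting unshares extents held by snapshots, so used space on r-tank can rise until older snapshots expire; check free space before starting.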
- Cost: none. Risk: writes; do during low use, propose-stop-confirm.
- Done when: `compsize` shows zstd extents and a real compression ratio on existing r-tank data (`btrfs fi df` does not report compression).
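A quick check that the mount option is actually live could look like the sketch below. It is a minimal example, not OMV's own mechanism: the mount point /srv/r-tank is a placeholder (substitute the real path from `findmnt -t btrfs`), and the function names are hypothetical.

```shell
# Succeed if a comma-separated mount option string enables zstd compression.
has_zstd() {
    case ",$1," in
        *,compress=zstd*|*,compress-force=zstd*) return 0 ;;
        *) return 1 ;;
    esac
}

# Placeholder mount point (assumption); override with RTANK=/real/path.
RTANK="${RTANK:-/srv/r-tank}"

# Read-only: asks the kernel which options r-tank is mounted with.
check_rtank_compression() {
    has_zstd "$(findmnt -no OPTIONS "$RTANK")"
}
```

For the write steps, `mount -o remount,compress=zstd:3 "$RTANK"` applies the option without a reboot and `btrfs filesystem defragment -r -czstd "$RTANK"` recompresses existing files; both stay behind propose-stop-confirm. `compsize "$RTANK"` then reports the ratio. OMV manages its own fstab entries, so the persistent option likely belongs in OMV's mount configuration.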
1.2 Schedule weekly btrfs scrub on both arrays
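- What: cron a weekly `btrfs scrub` on r-tank and s-tank.
- Why: detects bit-rot before it spreads. Because each btrfs filesystem sits on a single md device, btrfs cannot see the RAID5 redundancy: scrub can only self-repair blocks that btrfs itself stores twice (for example DUP metadata). Damaged file data is reported, not fixed, and must be restored from a backup or another copy. On aging drives this matters more, not less.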
- Cost: none. Risk: read/verify pass; schedule off-peak.
- Done when: scrub runs on a schedule and results are captured in the notification path from 0.2.
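One way to get scrub results into the notification path is a small wrapper that cron runs and whose failures cron mails out. This is a sketch under assumptions: the function name scrub_all and the mount points passed to it are hypothetical, and it relies on `btrfs scrub start -B` running in the foreground and returning non-zero when the scrub could not run or found uncorrectable errors, per btrfs-scrub(8).

```shell
# Scrub each given btrfs mount point in the foreground, one after another.
# Returns non-zero if any scrub fails, so cron or a caller can alert on it.
scrub_all() {
    local rc=0 mnt
    for mnt in "$@"; do
        # -B: wait for completion; -d: print per-device statistics
        if ! btrfs scrub start -B -d "$mnt"; then
            echo "SCRUB FAILED: $mnt" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

A weekly cron entry calling a script that wraps `scrub_all` for the two mount points, at an off-peak hour, would cover both arrays. Running them sequentially, not in parallel, keeps load on the P410i predictable.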
1.3 Extend snapshot schedule to s-tank
- What: r-tank has weekly snapshots; s-tank has none. Add the same OMV snapshot schedule to s-tank subvolumes worth protecting.
- Why: parity of protection. s-tank now holds real content (and will hold more after Phase 2 reclaim).
- Cost: none (btrfs snapshots are cheap/CoW). Risk: config.
- Done when: s-tank shows a running snapshot schedule.
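If the OMV scheduler is not used for some subvolume, the underlying operation is a read-only btrfs snapshot. The sketch below shows that operation only; the naming scheme (name@timestamp) and both function names are hypothetical, and any snapshot meant to appear through shadow_copy2 would need a name matching the configured shadow:format, which this sketch does not attempt.

```shell
# Build a snapshot path of the form <dir>/<name>@<timestamp>.
# $1 = snapshot directory, $2 = subvolume name, $3 = timestamp (default: now, UTC)
snapshot_path() {
    local stamp="${3:-$(date -u +%Y%m%dT%H%M%S)}"
    printf '%s/%s@%s\n' "$1" "$2" "$stamp"
}

# Take a read-only snapshot of a mounted subvolume.
# $1 = subvolume path, $2 = snapshot directory on the same btrfs filesystem
take_snapshot() {
    btrfs subvolume snapshot -r "$1" "$(snapshot_path "$2" "$(basename "$1")")"
}
```

Snapshots are cheap to create but pin old extents, so a schedule on s-tank also needs a retention rule matching the one already running on r-tank.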
1.4 Delete empty/obsolete subvolumes (k8s-data, k8sdata)
- What: remove the now-empty k8s subvolumes left from the destroyed cluster.
- Why: tidy structure, removes confusion.
- Cost: none. Risk: destructive (subvolume delete) - confirm they are truly empty and unreferenced first. propose-stop-confirm.
- Done when: the empty k8s subvolumes are gone and `btrfs subvolume list` is clean.
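The "truly empty" check can be made mechanical so the delete refuses to run on anything that still holds data. A minimal sketch, assuming GNU find (as shipped with Debian/OMV); the function names are hypothetical. Nested subvolumes show up as directories, so a subvolume containing one also counts as non-empty here.

```shell
# Succeed only if $1 is a directory with no entries at all (hidden files included).
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(find "$1" -mindepth 1 -print -quit)" ]
}

# Delete a subvolume only when it is empty; otherwise refuse loudly.
# Destructive when it proceeds: run only after propose-stop-confirm.
delete_if_empty() {
    if is_empty_dir "$1"; then
        btrfs subvolume delete "$1"
    else
        echo "REFUSING: $1 is not empty" >&2
        return 1
    fi
}
```

This covers "empty" but not "unreferenced": shares, exports, or mounts that still point at the path (for example the k8sdata NFS export from 0.3) need to be cleared first.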
PHASE 2 DEPENDENT - space reclaim (free, waits on network)
2.1 Reclaim the 37 orphaned qcow2 + LXC 500 disk on s-tank (~561 GB)
- What: delete the disk files of the 22 destroyed VMs/LXC that were orphaned on s-tank. Recovers ~561 GB.
- Why: pure space reclaim, no hardware.
- Blocker: needs the `fast` NFS path reachable from the host, which needs Foundation Phase 2 (switch VLANs operational). Do NOT hand-delete before then per CLAUDE.md.
- Cost: none. Risk: destructive - own confirm-first step naming exact files, AFTER verifying each is an orphan of a destroyed VM.
- Done when: the orphans are gone and s-tank shows the reclaimed space after a btrfs filesystem sync.
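The "verify each is an orphan" step can be prepared as a list-only pass that compares disk image names against the VMIDs that still exist. This is a sketch under assumptions: it relies on the usual Proxmox naming pattern vm-<vmid>-disk-N for both VM and container volumes, the function name orphan_disks is hypothetical, and it deletes nothing.

```shell
# stdin: one disk image path per line.
# $1: file containing one live VMID per line.
# Prints only the paths whose embedded VMID is NOT in the live list.
orphan_disks() {
    awk -v live="$1" '
        BEGIN { while ((getline id < live) > 0) alive[id] = 1 }
        match($0, /vm-[0-9]+-disk/) {
            # "vm-" is 3 chars and "-disk" is 5, so the digits are RLENGTH - 8 long
            vmid = substr($0, RSTART + 3, RLENGTH - 8)
            if (!(vmid in alive)) print
        }'
}
```

The live list could come from `qm list` and `pct list` on the PVE host (first column, header row skipped), and the paths from a `find` over the storage's images directory once the NFS path is reachable. The output is the candidate list for the confirm-first step, not an instruction to delete.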
Suggested execution order
- Finish recovery teardown + SESSION-LOG (in progress, separate).
- Tier 0: 0.1 ssacli -> 0.2 notifications -> 0.3 k8sdata export. (Do 0.1 and 0.2 together-ish; they are the safety net.)
- Tier 1: 1.1 compression -> 1.2 scrub -> 1.3 s-tank snapshots -> 1.4 subvol cleanup.
- Phase 2 dependent: 2.1 orphan reclaim, once VLANs are live.
Tier 0 is the priority. On 14-year-old SSDs, getting SMART visibility (0.1) and working alerts (0.2) is the difference between a heads-up and a surprise loss. Everything else is improvement on an already-sound structure.
What this plan deliberately does NOT do
- No HBA, no IT-mode flash (P410i cannot anyway), no controller change.
- No ZFS. No filesystem migration. btrfs-on-mdadm stays.
- No replacing the M4 SSDs or the SAS drives. We monitor them, we do not buy.
- No new switch. The TL-SG108E handles the VLAN work.
- No restructuring of the array layout. We turn on dormant features and reclaim space within the structure that already exists.