Storage Review - arochukwu - 2026-05-31
Read-only investigation across PVE host + VM 189 (homeNas) + NFS exports + backup state. No system changes made. Goal: build a complete picture of storage architecture, health, and risks, to inform a future decision-making session.
Critical context: no operational backups exist. A vzdump job is configured but has been failing every Sunday since 2024-03 (target storage was unreachable). This is the central finding.
Section A: Physical drive inventory with health metrics
Drive identification (from May 2026 boot dmesg)
The DL380 G7 chassis has 8 bays + iLO SD reader. The P410i RAID controller fronts all 8 drives and presents each as a single-drive RAID-0 logical volume to the host. Per kernel boot log:
| Bay | SCSI | Drive type | Model | Notes |
|---|---|---|---|---|
| 1 | 2:0:1:0 | SATA SSD | SPCC Solid State (M.2 / SATA SSD, exact model not in dmesg) | Boot disk + Proxmox root + local-lvm thin pool |
| 2 | 2:0:2:0 | SAS HDD | HP EG0450FBDSQ (450 GB 10K SAS) | passthrough -> VM 189 vda -> md0 |
| 3 | 2:0:3:0 | SAS HDD | HP EG0450FBDSQ | passthrough -> VM 189 vdb -> md0 |
| 4 | 2:0:4:0 | SAS HDD | HP EG0450FBDSQ | passthrough -> VM 189 vdc -> md0 |
| 5 | 2:0:5:0 | SATA SSD | Crucial M4-CT512M4SSD2 512 GB (2011-era, ~14 yrs) | passthrough -> VM 189 vdd -> md1 |
| 6 | 2:0:6:0 | SATA SSD | Crucial M4-CT512M4SSD2 | passthrough -> VM 189 vde -> md1 |
| 7 | 2:0:7:0 | SATA SSD | Crucial M4-CT512M4SSD2 | passthrough -> VM 189 vdf -> md1 |
| 8 | 2:0:8:0 | SATA SSD | Crucial M4-CT512M4SSD2 | passthrough -> VM 189 vdg -> md1 |
| iLO SD | sdi | USB Flash Reader | "Single Flash Reader" 29.1 GB vfat | Empty, available for Proxmox-to-SD migration |
SMART data: NOT obtainable through current tooling
- From host side: smartctl -a /dev/sd[a-h] returns "HP LOGICAL VOLUME"; the P410i abstracts away the underlying drive SMART. smartctl -d scsi returned exit 4 for all 8, with no useful per-drive health data.
- From inside VM 189: the 7 virtio passthrough disks (vda-vdg) similarly return no SMART (virtio does not carry SMART through by default).
- What would unlock this: install ssacli (HPE) or hpssacli (older) on the host to query the P410i directly. ssacli is NOT installed. Per safety rules, I did not install it.
Until ssacli is sideloaded (same playbook as the ipmitool / hponcfg sideloads — pull .deb from HPE MCP repo or community mirror), the individual physical drive health is opaque. We know what models are in each bay; we do not know their power-on hours, wear levels, reallocation counts, or pending sectors.
This is the drive-health blind spot for the homelab right now. Item 0 on the priority list.
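One avenue that may be worth trying alongside the ssacli sideload: smartctl has a cciss,N device type meant for disks sitting behind HP Smart Array controllers. The sketch below only builds the read-only command lines; it does not run them. Whether the P410i on this host answers them is untested, and both the 0-based index-to-bay mapping and the use of /dev/sda as the handle for the controller are assumptions.

```python
def cciss_smartctl_commands(controller_dev="/dev/sda", drive_count=8):
    """Return read-only smartctl command lines, one per physical drive index.

    Assumptions (not verified on this host): the controller is reachable
    through `controller_dev`, and physical drives are indexed 0..drive_count-1.
    """
    return [
        ["smartctl", "-a", "-d", f"cciss,{index}", controller_dev]
        for index in range(drive_count)
    ]


if __name__ == "__main__":
    # Print the commands for manual review instead of executing them.
    for cmd in cciss_smartctl_commands():
        print(" ".join(cmd))
```

If any of these returns real drive attributes, the index-to-bay mapping still needs to be confirmed against serial numbers before trusting it.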
What we DO know
- All 8 host logical volumes are visible and responding to I/O (lsblk enumerates them cleanly, mdadm uses them inside VM 189 with no errors).
- No kernel medium error / unrecovered read messages on host since boot (3+ days uptime). The "critical medium error" lines visible in dmesg are from session 1's ddrescue work on the external Apple HDD via Sabrent enclosure, unrelated to chassis drives.
- BTRFS device stats inside VM 189: zero errors of every type (read, write, flush, corruption, generation) on both md0 and md1. Filesystem layer is happy with what the drives are returning.
Aging concern: Crucial M4 SSDs
The four Crucial M4-CT512M4SSD2 drives in bays 5-8 date from 2011, making them roughly 14 years old. Without SMART data we cannot determine their wear level or power-on hours. The Crucial M4 famously had a firmware bug that triggered at 5184 power-on hours and was fixed in firmware 0309; that fix should already be applied here (drives still working after this long have almost certainly passed that mark), but we cannot confirm it.
Single-failure tolerance: all four M4s are in md1 (4-disk RAID 5). md1 survives 1 drive failure; 2 simultaneous failures = total loss of the s-tank btrfs filesystem (and everything on it, including the 1.01 TiB of current data). With 4 drives of the same age and the same write-volume history, correlated failure risk is real; comparing per-drive wear indicators via ssacli, once it is available, would show how close together the four drives are.
Section B: RAID/array health
Host P410i layer
8 logical volumes (RAID-0 single-drive each). All visible, all responding to I/O. Health detail is opaque without ssacli.
The P410i is itself ~14-year-old hardware and has a battery-backed write cache (BBWC) whose battery / capacitor is slated for replacement (homelab-tracker Phase 1 PENDING, to be done during the same maintenance window as the riser install). A failed BBWC means write cache is disabled, which causes write performance to drop drastically. That does not immediately threaten data, but it is a stability concern.
VM 189 mdadm arrays: both healthy
md0 (homenas:0, r-tank, p-backup target):
- RAID-5 over 3x ~419 GiB devices (vda, vdb, vdc; backed by 3x HP EG0450FBDSQ SAS HDDs)
- Array size: 838.05 GiB
- Created: Sat Nov 11 09:49:22 2023
- State: clean [3/3] [UUU]
- Active devices: 3/3, failed: 0, spares: 0
- Last update: Sun May 31 01:17:34 2026
- Events: 825 (low; array is stable)

md1 (homenas:1, s-tank, ext-store target):
- RAID-5 over 4x ~476 GiB devices (vdd-vdg; backed by 4x Crucial M4 SSDs)
- Array size: 1.40 TiB (1535.82 GB)
- Created: Sat Nov 11 11:22:50 2023
- State: clean [4/4] [UUUU]
- Active devices: 4/4, failed: 0, spares: 0
- Last update: Fri May 29 16:18:16 2026
- Events: 934
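The array sizes can be sanity-checked with RAID-5 arithmetic: usable capacity is (members - 1) x member size, and the decimal GB printed on a drive label shrinks when expressed in binary GiB. A minimal sketch follows, assuming the nominal 450 GB and 512 GB capacities implied by the model names; the real logical volumes are slightly smaller, so the results only approximate what mdadm reports. By the same conversion, 1535.82 GB is about 1430 GiB, or 1.40 TiB.

```python
GIB = 2**30  # bytes in one GiB


def gb_to_gib(gb):
    """Convert decimal gigabytes (10^9 bytes) to binary gibibytes."""
    return gb * 10**9 / GIB


def raid5_usable(member_size, members):
    """RAID 5 keeps one member's worth of parity, so capacity is (n - 1) * size."""
    if members < 3:
        raise ValueError("RAID 5 needs at least 3 members")
    return member_size * (members - 1)


# Nominal drive sizes are assumptions taken from the model names.
md0_gib = raid5_usable(gb_to_gib(450), 3)  # 3x HP EG0450FBDSQ
md1_gib = raid5_usable(gb_to_gib(512), 4)  # 4x Crucial M4-CT512M4SSD2

if __name__ == "__main__":
    print(f"md0: about {md0_gib:.0f} GiB")
    print(f"md1: about {md1_gib:.0f} GiB = {md1_gib / 1024:.2f} TiB")
```

Both arrays tolerate exactly one failed member regardless of width, which is why the wider md1 carries no extra safety margin over md0.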
BTRFS filesystems: both clean, scrubbed recently
| FS | UUID | Backing | Size used | Last scrub | Errors |
|---|---|---|---|---|---|
| r-tank | 71c8857b-...-5ecc90a8abeb | /dev/md0 (838 GiB) | 336.25 GiB (~41%) | 2026-05-28 01:33, finished clean | 0 (all categories) |
| s-tank | 55cabf8d-...-67e9ba6e0e09 | /dev/md1 (1.40 TiB) | 1.01 TiB (~73%) | 2026-05-28 01:31, finished clean | 0 (all categories) |
Both scrubbed at the May 28 cold-start (OMV ran an auto-scrub after the 15-month dormancy). No errors found on either. Scrub durations: r-tank 21 min, s-tank 2 min (s-tank is much faster because SSD-backed).
Watch item: s-tank btrfs Data allocation is 96.58% used of allocated 1.05 TiB. BTRFS behaves badly when running out of unallocated space (operations slow / fail). The filesystem has 351 GiB unallocated on the device — Data can grow into that, so we are not in trouble yet — but at current rate of growth (if any), the unallocated buffer matters.
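The two numbers in this watch item measure different things: how full the already-allocated Data chunks are, and how much raw device space btrfs can still turn into new chunks. A small sketch of that arithmetic follows, using the rounded figures from this report converted to GiB; because the inputs are rounded, the fill ratio lands near, not exactly on, the 96.58% that btrfs reports. The function name and dictionary keys are illustrative only.

```python
def btrfs_data_headroom(data_allocated_gib, data_used_gib, unallocated_gib):
    """Summarise how much room Data has left on a single-device btrfs."""
    free_inside_chunks = data_allocated_gib - data_used_gib
    return {
        # Share of the allocated Data chunks that already holds data.
        "data_chunk_fill": data_used_gib / data_allocated_gib,
        # Space left inside chunks that are already allocated to Data.
        "free_inside_chunks_gib": free_inside_chunks,
        # Total room Data can grow into before the device is exhausted.
        "growth_room_gib": free_inside_chunks + unallocated_gib,
    }


# Rounded report figures: 1.05 TiB allocated, 1.01 TiB used, 351 GiB unallocated.
s_tank = btrfs_data_headroom(
    data_allocated_gib=1.05 * 1024,
    data_used_gib=1.01 * 1024,
    unallocated_gib=351,
)

if __name__ == "__main__":
    print(f"Data chunk fill: {s_tank['data_chunk_fill']:.1%}")
    print(f"Growth room: about {s_tank['growth_room_gib']:.0f} GiB")
```

The growth-room figure is the one to track over time; the chunk-fill percentage alone looks alarming even while hundreds of GiB remain unallocated.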
Section C: Capacity and utilization
PVE host
| Storage | Type | Size | Used | Available | % | Notes |
|---|---|---|---|---|---|---|
| local | dir on / | 98.5 GB | 46.7 GB | 46.8 GB | 47% | Only content: OMV install ISO (898 MB). The 46 GB used is mostly Proxmox system + logs. |
| local-lvm (pve/data) | LVM-thin | 348.79 GiB | 11.4 GiB | 337 GiB | 3.27% | Only VM 189 OS (32 GB allocated, 16.29% used = ~5.2 GB actual). 4 of 5 VMs worth of headroom for Phase 3 infrastructure. |
| Backup-NAS | NFS | 0 | 0 | 0 | disabled | Pre-staged for VLAN 10 + disabled (today) |
| fast | NFS | 0 | 0 | 0 | disabled | Same |
| pve/root (/) | ext4 (LV) | 94 GiB | 45 GiB | 45 GiB | 50% | Proxmox root |
| pve/swap | swap | 8 GiB | n/a | n/a | n/a | Swap LV |
| sda2 | vfat (partition) | 1 GiB | n/a | n/a | n/a | EFI partition |
pve/data thin pool: data 3.27%, metadata 0.53% — plenty of room.
VM 189 (homeNas)
| Mount | Backing | Size | Used | Available | % |
|---|---|---|---|---|---|
| / (root) | sda1 (32 GB virtio on local-lvm) | 31 GB | 4.0 GB | 25 GB | 14% |
| /srv/dev-disk-by-uuid-71c... (= /export/p-backup) | md0 / btrfs r-tank | 839 GB | 338 GB | 501 GB | 41% |
| /srv/dev-disk-by-uuid-55c... (= /export/ext-store, also bind-mounted at /export/k8s-data, /export/k8sdata) | md1 / btrfs s-tank | 1.4 TB | 1.1 TB | 389 GB | 73% |
s-tank at 73% (df), with 96.58% of its allocated btrfs Data chunks in use, is the headline capacity concern. Most of that 1.01 TiB is the 37 orphaned .qcow2 files from the destroyed k8s VMs (per D:\PVE\orphaned-fast-disks-20260531-212124.txt). Reclaiming them once VLAN 10 is up will free a substantial chunk of s-tank.
Section D: Configuration documentation (the full storage path)
| Physical | P410i controller | Host view | PVE storage | VM 189 passthrough | Guest mdadm | BTRFS | Mount | NFS export |
|---|---|---|---|---|---|---|---|---|
| Bay 1: SPCC SSD | RAID-0 logvol | /dev/sda: sda1 vfat; sda2 EFI; sda3 LVM2_member -> pve-root, pve-swap, pve-data (thin) | local (dir); local-lvm (lvm-thin on pve VG) | (not passthrough); VM 189 OS disk (32 GB alloc / 5.2 GB used): vm-189-disk-0 -> virtio sda inside VM 189 (OMV root, ext4) | none | ext4 pve-root | / | - |
| Bay 2: HP EG0450FBDSQ | RAID-0 logvol | /dev/sdb | passthrough | VM 189 virtio2 -> vda (raid mbr) | md0 raid5 (838 GB) | btrfs r-tank | /srv/...-71c... | /export/p-backup -> 10.0.10.5/24 |
| Bay 3: HP EG0450FBDSQ | RAID-0 logvol | /dev/sdc | passthrough | VM 189 virtio3 -> vdb (raid mbr) | md0 raid5 (838 GB) | btrfs r-tank | /srv/...-71c... | /export/p-backup -> 10.0.10.5/24 |
| Bay 4: HP EG0450FBDSQ | RAID-0 logvol | /dev/sdd | passthrough | VM 189 virtio4 -> vdc (raid mbr) | md0 raid5 (838 GB) | btrfs r-tank | /srv/...-71c... | /export/p-backup -> 10.0.10.5/24 |
| Bay 5: Crucial M4 SSD | RAID-0 logvol | /dev/sde | passthrough | VM 189 virtio5 -> vdd (raid mbr) | md1 raid5 (1.4 TB) | btrfs s-tank | /srv/...-55c... | /export/ext-store -> 10.0.10.5/24; /export/k8s-data -> 172.16.10.0/24; /export/k8sdata -> * (everyone!) |
| Bay 6: Crucial M4 SSD | RAID-0 logvol | /dev/sdf | passthrough | VM 189 virtio6 -> vde (raid mbr) | md1 raid5 (1.4 TB) | btrfs s-tank | /srv/...-55c... | same three exports as Bay 5 |
| Bay 7: Crucial M4 SSD | RAID-0 logvol | /dev/sdg | passthrough | VM 189 virtio7 -> vdf (raid mbr) | md1 raid5 (1.4 TB) | btrfs s-tank | /srv/...-55c... | same three exports as Bay 5 |
| Bay 8: Crucial M4 SSD | RAID-0 logvol | /dev/sdh | passthrough | VM 189 virtio8 -> vdg (raid mbr) | md1 raid5 (1.4 TB) | btrfs s-tank | /srv/...-55c... | same three exports as Bay 5 |
| iLO internal SD reader | (passthrough via USB) | /dev/sdi (29 GB) | - | - | - | vfat (empty) | - | - |
Nested complexity summary
5 layers between physical bay and end-user share for the SAS / M4 data:
bay -> P410i logical volume -> host /dev/sdN -> virtio passthrough -> mdadm raid5 (inside guest) -> btrfs (inside guest) -> NFS export (out of guest)
This is fragile in three specific ways:
- P410i is hardware RAID acting in single-drive mode. Every drive is one logical volume. This works but adds an opaque layer (no per-drive SMART without ssacli). True passthrough mode (HBA) would expose drive SMART through to mdadm and let smartctl work. P410i cannot be flashed to IT mode; would require replacing the controller with an LSI HBA.
- mdadm is doing the actual redundancy, inside the guest. The "hardware RAID" controller is doing nothing useful; its protection is single-drive RAID-0, which is no protection at all. If a bay fails, the corresponding /dev/vdX inside the guest goes bad and mdadm degrades the array. Recovery requires swapping the physical drive, re-creating the single-drive RAID-0 on the P410i (using ssacli), reattaching to the VM, and letting mdadm rebuild.
- btrfs on top of mdadm is a known "nest doll" choice. btrfs CAN do its own RAID and would prefer to. But with mdadm underneath, btrfs sees a single device per filesystem and can only do single-copy data (no btrfs-level redundancy). The DUP metadata gives some protection against corruption, but if mdadm gives btrfs bad data, btrfs has no second copy of data blocks to fall back on. Architecturally, this means btrfs is doing checksumming / scrub on data that mdadm trusts to be correct; if a drive returns wrong bytes silently, only btrfs notices, and for single-copy data it can report the corruption but not repair it.
Reverse mapping: what each export serves
| NFS export | Backing FS | Net dest | Purpose | State today |
|---|---|---|---|---|
| /export/p-backup | md0 / r-tank | 10.0.10.5/24 | Proxmox vzdump backup target (Backup-NAS storage in PVE) | disabled in PVE storage.cfg; the vzdump job has been failing every Sunday since 2024-03 |
| /export/ext-store | md1 / s-tank | 10.0.10.5/24 | VM disk store (was fast storage in PVE) | disabled in PVE storage.cfg; holds 37 orphaned .qcow2 files from destroyed VMs |
| /export/k8s-data | md1 / s-tank (bind) | 172.16.10.0/24 | k8s persistent volume (?) | unused; 172.16.10.0/24 is not a current VLAN |
| /export/k8sdata | md1 / s-tank (bind) | * (everyone!) | k8s persistent volume? | SECURITY ISSUE: exports to any IP that can reach the NFS port. Not actually reachable today because the network is locked down, but the config is sloppy. |
| /export (pseudo-root) | / | 10.0.10.5/24 + 172.16.10.0/24 + * | NFSv4 pseudo-root | read-only, low risk |
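A wildcard client like the one on /export/k8sdata is easy to miss by eye, so a small audit of /etc/exports can flag it mechanically. The sketch below parses exports-style text and reports every client spec that matches all hosts. The sample text is hypothetical: the paths and client networks follow the table above, but the mount options are invented for illustration and are not the options configured on VM 189.

```python
def wildcard_exports(exports_text):
    """Return (path, client_spec) pairs that export to every host.

    A client of "*" matches all hosts; a spec with no host before the
    option list, such as "(rw)", also applies to any client.
    """
    findings = []
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        path, *clients = line.split()
        for client in clients:
            host = client.split("(", 1)[0]
            if host in ("*", ""):
                findings.append((path, client))
    return findings


# Hypothetical sample: options are illustrative, not the real configuration.
SAMPLE_EXPORTS = """\
# paths and client specs as listed in the reverse-mapping table
/export/p-backup 10.0.10.5/24(rw,sync)
/export/ext-store 10.0.10.5/24(rw,sync)
/export/k8s-data 172.16.10.0/24(rw,sync)
/export/k8sdata *(rw,sync)
/export 10.0.10.5/24(ro,fsid=0) 172.16.10.0/24(ro,fsid=0) *(ro,fsid=0)
"""

if __name__ == "__main__":
    for path, client in wildcard_exports(SAMPLE_EXPORTS):
        print(f"open to all hosts: {path} {client}")
```

This simple parser does not handle quoted paths or backslash line continuations, which is acceptable for a quick audit but not for a general-purpose tool.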
Section E: Honest risk assessment
Single points of failure
- VM 189 itself. Everything depends on it being up. The OMV install, the NFS exports, the SMB shares — all live in this one VM. A guest-OS-level failure (OMV bug, kernel panic, accidental misconfig) takes down ALL family-data access until the VM is recovered.
- homeNas btrfs filesystems. Two single-copy btrfs filesystems, each with one device (md0 / md1). btrfs metadata is DUP-protected; data is not. A silent corruption that mdadm does not catch (rare but possible; bad cache battery + power loss + write-in-flight is the classic case) would be detected by btrfs checksums on the next read or scrub, but only metadata can be repaired from its DUP copy. A damaged data block has no second copy to restore from.
- P410i controller. Old, capacitors / BBWC suspected aging, replacement is non-trivial (requires identifying a P410i-compatible replacement controller and re-importing the logical volume config). A controller failure with no spare = chassis is down until a replacement arrives.
- The chassis itself. A single G7 with no second machine means any whole-server event (motherboard, PSU, fire, theft, flood) loses everything.
- Cooling / power. PSUs are the suspected-aging PS-2122-2H pair (homelab-tracker hardware monitoring item). Recent cold-start failures suggest cap drying.
Hardware showing age
| Component | Concern | Indicator |
|---|---|---|
| Crucial M4 SSDs (bays 5-8) | 2011-era consumer SSDs, ~14 years; SMART blind today | No data; need ssacli to confirm wear |
| HP EG0450FBDSQ SAS HDDs (bays 2-4) | Enterprise-grade but old (~14 yrs estimated, matches the chassis vintage) | Same SMART blind spot |
| P410i controller + BBWC | Cap drying suspected; BBWC replacement already queued | Phase 1 PENDING in homelab-tracker |
| PSUs (PS-2122-2H) | Cold-start failures during May 2026 boot | Documented in CLAUDE.md Phase 5 monitoring |
| CMOS battery | "[NOT SET]" timestamps in IML suggest dead | Already noted in maintenance bundle |
Architectural concerns
- Zero working backups. vzdump job exists, fires every Sunday, has been failing for 15+ months (target NFS was unreachable through the cold-start period AND through the network rebuild in progress). No off-site backup at all. No PBS deployed.
- Nested storage architecture is fragile (Section D). Five layers between physical bay and end-user data; P410i HW RAID is doing no useful work; SMART is masked.
- /export/k8sdata exports to *: sloppy config. Not exploitable today (network not routed) but will be the moment the network rebuild brings exposure.
- vzdump email notifications are going to [email protected], a Hotmail address. Worth confirming Kay still has access to that mailbox; if not, all those failure notifications from the last 15 months went into a void.
- s-tank has 96.58% of its allocated Data chunks in use and needs watching. Cleanup of the 37 orphaned .qcow2 files (when VLAN 10 is up and fast is reachable) will reclaim a substantial chunk.
Data-loss scenarios mapping
| Scenario | What is lost | Detected by | Recoverable from |
|---|---|---|---|
| Single drive failure in md0 (3-drive RAID 5) | nothing (array runs degraded; replace + rebuild) | btrfs device stats / mdadm event | Replace drive, re-create logvol on P410i, mdadm rebuild |
| Single drive failure in md1 (4-drive RAID 5) | nothing immediately; but high correlated-failure risk because 4 identical-vintage drives | same | same — but second failure during rebuild = total data loss |
| Two simultaneous drive failures in same array | The entire btrfs filesystem on that array | mdadm | Nothing — no backups. Data is gone. |
| Silent corruption (bad write, cache battery dying) | Probably nothing visible until btrfs scrub finds it; subtle file corruption | Next btrfs scrub | Metadata can be repaired from its DUP copy; data is single-copy, so btrfs can identify the damaged file but cannot repair it, and with no backups there is nothing to restore from |
| P410i controller failure | Whole chassis I/O stops; LV config may be lost if controller-side state is not recoverable | Boot or kernel error | Replace controller, re-import. Risk of mis-import losing data if new controller is a different revision. |
| VM 189 corruption (OMV bug, kernel panic) | Access to data; underlying mdadm + btrfs should survive | Service alerts (none configured) | Repair / reinstall OMV, re-mount existing btrfs |
| Whole-server event (PSU explosion, fire, theft, water damage) | Everything. No off-site backup. | obvious | Nothing. Total loss. |
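Several rows above rely on someone noticing an mdadm event, yet no service alerts are configured. As a stopgap, a degraded array can be spotted by reading the status fields in /proc/mdstat, where [3/2] [U_U] means three members expected, two active. A minimal sketch of that check follows. The sample text is hypothetical (the live arrays currently report [3/3] and [4/4]) and its block counts are illustrative.

```python
import re

ARRAY_RE = re.compile(r"^(md\d+)\s*:")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Map array name -> member flags for every array missing a member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            flags = status.group(3)
            if active < expected or "_" in flags:
                degraded[current] = flags
    return degraded


# Hypothetical sample showing md0 with one failed member.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 vdd[0] vdg[4] vdf[2] vde[1]
      1499820032 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid5 vda[0] vdc[3]
      878756864 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [U_U]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

Run inside VM 189 against the real /proc/mdstat, an empty result would mean both arrays have all members active. A proper fix is still alerting (for example mdadm's own monitor mode), which this sketch does not replace.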
How serious is this today?
Today, with no family data and no production workload running on it, the risk is "acceptable" in the sense that "if it breaks, only Kay's old data is lost." The moment Path D goes live and family data starts landing on Nextcloud (whose storage is backed by these same drives, either indirectly via PBS or by direct mount), the risk profile changes from "I lose my own old stuff" to "I lose my sister's wedding photos."
Path D must not go live until at least: PBS deployed + first backup + restore drill verified + off-site backup target chosen and operating + drive health visibility (ssacli) established.
Section F: Open questions for human decision
- ssacli sideload. Without it, we have no view into individual drive SMART. Is this OK to add to homelab-tracker Phase 1 PENDING and execute now (same sideload playbook as ipmitool — pull from HPE MCP or community mirror)?
- Crucial M4 replacement strategy. If ssacli reveals wear-level >80% on any of the four, we need a replacement plan. The bay layout suggests replacing one at a time (let mdadm rebuild between swaps), but: are we replacing with M4-compatible drives (other consumer SATA SSDs), enterprise SSDs (which are pickier with P410i), or doing a wholesale array migration to new drives during the foundation rebuild?
- HP EG0450FBDSQ replacement strategy. Same questions. These have been less-flagged as aging concerns but are equally old. SAS HDDs have a different replacement market than SATA SSDs.
- vzdump email destination: is [email protected] still live? If yes, that mailbox has 15+ months of "backup failed" emails sitting in it. Worth checking even just to confirm the alerting channel works.
- Sloppy NFS export /export/k8sdata to *: fix now or after VLAN rebuild? Fixing now is one line in /etc/exports + an exportfs -ra. Risk of doing it now: zero, since no client currently uses it. Risk of doing it later: forgetting and exposing it when the network is opened up.
- k8s-data and k8sdata exports: were these for the now-destroyed k8s clusters? If so, can the bind-mounts and exports be removed entirely?
- btrfs on mdadm vs btrfs native RAID: is now the time to plan a migration? Doing so requires destroying and recreating the arrays (with a full data evacuation first), which is a multi-day operation. ZFS as an alternative would require swapping the P410i for an HBA. None of these is a call for tonight; just flagging that the current "mdadm-then-btrfs" stack is architecturally suboptimal.
- The 32 GB SD card for Proxmox-to-SD migration is empty and sitting in the iLO reader slot. Phase ordering: do we Proxmox-to-SD-migrate before or after the storage rebuild?
- Off-site backup target — Cloudflare R2 vs Backblaze B2 — locked architecture says deferred until before D.1, but knowing now informs the PBS configuration we will deploy.
Section G: Recommended priority of follow-up items
Ordered by risk impact, not implementation effort.
TIER 0: Address immediately (no good reason to defer)
- Sideload ssacli so we can read individual drive SMART through the P410i. ~80 KB .deb, same playbook as ipmitool. Without this, every drive-health decision is blind.
- Fix /export/k8sdata NFS export from * to a specific subnet (or remove entirely if the k8s clusters are gone, which they are). One line edit, exportfs -ra to reload. Risk: zero.
- Verify or replace the vzdump notification email. If [email protected] is alive: check it for 15+ months of failure mails. If dead or unread: update the notification address and add a fallback (multi-recipient).
- Reclaim the 37 orphaned .qcow2 files on fast once VLAN 10 is up. Frees substantial space on s-tank (allocated Data chunks currently 96.58% full). This is already tracked in homelab-tracker Phase 4 but worth re-highlighting given how full s-tank is.
TIER 1: Address as part of Phase 2 / 3 (network + service infrastructure)
- Deploy Proxmox Backup Server. This is homelab-tracker.md Phase 3 item 3.1 already. After PBS is running and target storage (homeNas Backup-NAS re-enabled, or a separate target) is reachable, run a full backup of VM 189 and verify it. First successful PBS restore drill is the gate for Path D launch.
- Re-evaluate Backup-NAS and fast storage entries. Both are pre-staged for VLAN 10. Decide post-network-rebuild whether to keep + re-enable or pvesm remove entirely.
- Run a btrfs scrub on s-tank after orphan cleanup. Today it was clean, but after deleting ~561 GB of orphaned files we should re-scrub to confirm health. Scrub only verifies checksums; returning the emptied chunks to unallocated space is a separate step (a filtered btrfs balance), if the kernel has not already freed them.
TIER 2: Plan during Phase 2 maintenance window
- CMOS battery replacement + P410i BBWC replacement + visual PSU dust inspection — already tracked in homelab-tracker maintenance window bundle.
- PSU spares procurement — if cold-start failures recur after the upcoming reboot. Already in CLAUDE.md Section 11 #21 (informational).
TIER 3: Plan after Phase 2 / before Path D launch (D.1)
- Off-site backup target chosen and operating. Cloudflare R2 vs Backblaze B2 decision; PBS configured to push to it; encryption verified.
- Full end-to-end restore drill of VM 189 backup, restoring to a throwaway VM on local-lvm. Document RTO. This is the gate for D.1 (Nextcloud) — no family data lands before this passes.
TIER 4: Plan after Path D is operational
- Drive replacement strategy based on ssacli wear data. M4 SSDs probably need to go first; SAS HDDs second. Budget + sourcing.
- btrfs native RAID migration OR ZFS migration with HBA controller swap — long discussion, not a tonight item, not even a this-quarter item. Worth keeping on the radar as "the current architecture has limits we will eventually hit."
- Second machine for redundancy. Right now the chassis is a true single point of failure. The OPNsense / observability HP EliteDesk boxes (Phase 6 in homelab-tracker) are budgeted; an additional storage-capable second host is not. Probably out of scope for this homelab generation; flagged for completeness.
Appendix: Raw data files (on host)
Collected to /tmp/storage-review-20260531-234421/ on host and /tmp/storage-review-vm189-20260531-234518/ on VM 189. Includes individual smartctl outputs (mostly empty because of the P410i masking), pvesm / lvs / btrfs / mdadm outputs, and dmesg / journalctl extracts.
These get cleared on next reboot of host / VM 189. If long-term retention is desired, copy to /root/ on respective hosts.
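If retention is wanted, the copy can be scripted so it refuses to overwrite an earlier snapshot. A minimal sketch follows, assuming /root as the destination as suggested above; the function name is illustrative, and it would need to be run separately on the host and on VM 189 with the respective source directory.

```python
import shutil
from pathlib import Path


def preserve_review(src, dest_root="/root"):
    """Copy a review directory under dest_root, keeping its directory name.

    Raises FileExistsError if the destination already exists, so an
    earlier snapshot is never silently overwritten.
    """
    src = Path(src)
    dest = Path(dest_root) / src.name
    shutil.copytree(src, dest)
    return dest


if __name__ == "__main__":
    source = Path("/tmp/storage-review-20260531-234421")
    if source.is_dir():
        print(f"copied to {preserve_review(source)}")
    else:
        print(f"{source} not found; nothing copied")
```

Note that this is a write to the host filesystem, so it falls outside the read-only scope of the investigation itself and should only be run once that is approved.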
Report generated 2026-05-31 by read-only investigation. No system state was modified. Counter-check this report against any conflicting recollection or context Kay has — the investigation is technically complete but recommendations are subject to revision based on context I do not have.