Storage Review — arochukwu — 2026-05-31

Read-only investigation across PVE host + VM 189 (homeNas) + NFS exports + backup state. No system changes made. Goal: build a complete picture of storage architecture, health, and risks, to inform a future decision-making session.

Critical context: no operational backups exist. A vzdump job is configured but has been failing every Sunday since 2024-03 (target storage was unreachable). This is the central finding.


Section A: Physical drive inventory with health metrics

Drive identification (from May 2026 boot dmesg)

The DL380 G7 chassis has 8 bays + iLO SD reader. The P410i RAID controller fronts all 8 drives and presents each as a single-drive RAID-0 logical volume to the host. Per kernel boot log:

Bay | SCSI | Drive type | Model | Notes
1 | 2:0:1:0 | SATA SSD | SPCC Solid State (M.2 / SATA SSD, exact model not in dmesg) | Boot disk + Proxmox root + local-lvm thin pool
2 | 2:0:2:0 | SAS HDD | HP EG0450FBDSQ (450 GB 10K SAS) | passthrough -> VM 189 vda -> md0
3 | 2:0:3:0 | SAS HDD | HP EG0450FBDSQ | passthrough -> VM 189 vdb -> md0
4 | 2:0:4:0 | SAS HDD | HP EG0450FBDSQ | passthrough -> VM 189 vdc -> md0
5 | 2:0:5:0 | SATA SSD | Crucial M4-CT512M4SSD2 512 GB (2011-era, ~14 yrs) | passthrough -> VM 189 vdd -> md1
6 | 2:0:6:0 | SATA SSD | Crucial M4-CT512M4SSD2 | passthrough -> VM 189 vde -> md1
7 | 2:0:7:0 | SATA SSD | Crucial M4-CT512M4SSD2 | passthrough -> VM 189 vdf -> md1
8 | 2:0:8:0 | SATA SSD | Crucial M4-CT512M4SSD2 | passthrough -> VM 189 vdg -> md1
iLO SD | sdi | USB Flash Reader | "Single Flash Reader" 29.1 GB vfat | Empty, available for Proxmox-to-SD migration

SMART data — NOT obtainable through current tooling

  • From host side: smartctl -a /dev/sd[a-h] returns "HP LOGICAL VOLUME" — the P410i abstracts away the underlying drive SMART. smartctl -d scsi returned exit 4 for all 8 — no useful per-drive health data.
  • From inside VM 189: the 7 virtio passthrough disks (vda-vdg) similarly return no SMART (virtio does not carry SMART through by default).
  • What would unlock this: install ssacli (HPE) or hpssacli (older) on host to query P410i directly. ssacli is NOT installed. Per safety rules, I did not install it.
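
The "exit 4" result above can be read as a bitmask rather than a plain error code. The sketch below decodes a smartctl exit status into its set bits; the bit descriptions are paraphrased from the EXIT STATUS section of the smartctl man page, and the function and table names are illustrative only.

```python
# Paraphrased from the "EXIT STATUS" section of the smartctl man page.
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed or device did not identify itself",
    2: "a SMART or other ATA command failed, or a SMART data checksum error",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold at some point in the past",
    6: "device error log contains errors",
    7: "device self-test log contains errors",
}


def decode_smartctl_exit(status: int) -> list[str]:
    """Return the meaning of every bit set in a smartctl exit status."""
    return [text for bit, text in SMARTCTL_EXIT_BITS.items() if status & (1 << bit)]


if __name__ == "__main__":
    # Exit 4 is what the report saw for all eight logical volumes.
    for reason in decode_smartctl_exit(4):
        print(reason)
```

Read this way, exit 4 means only bit 2 is set: the SMART command itself failed, which is consistent with the controller not passing SMART through. It does not say anything about the health of the drive behind the logical volume.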

Until ssacli is sideloaded (same playbook as the ipmitool / hponcfg sideloads — pull .deb from HPE MCP repo or community mirror), the individual physical drive health is opaque. We know what models are in each bay; we do not know their power-on hours, wear levels, reallocation counts, or pending sectors.

This is the drive-health blind spot for the homelab right now. Tier 0, item 1 on the priority list.

What we DO know

  • All 8 host logical volumes are visible and responding to I/O (lsblk enumerates them cleanly, mdadm uses them inside VM 189 with no errors).
  • No kernel medium error / unrecovered read messages on host since boot (3+ days uptime). The "critical medium error" lines visible in dmesg are from session 1's ddrescue work on the external Apple HDD via Sabrent enclosure — unrelated to chassis drives.
  • BTRFS device stats inside VM 189: zero errors of every type (read, write, flush, corruption, generation) on both md0 and md1. Filesystem layer is happy with what the drives are returning.

Aging concern: Crucial M4 SSDs

The four Crucial M4-CT512M4SSD2 drives in bays 5-8 are 2011-era parts, roughly 14 years old; their actual power-on time is unknown. Without SMART data we cannot determine their wear level. The Crucial M4 famously had a firmware bug that triggered at 5184 power-on hours and was fixed in firmware 0309; that fix should already be applied on these (they would have passed that mark years ago if they are still working), but we cannot confirm.

Single-failure tolerance: all four M4s are in md1 (4-disk RAID 5). md1 survives 1 drive failure; 2 simultaneous failures = total loss of the s-tank btrfs filesystem (and everything on it, including the 1.01 TB of current data). With 4 drives of the same age and the same write-volume history, correlated failure risk is real — running a wear-balance check via ssacli would be informative.


Section B: RAID/array health

Host P410i layer

8 logical volumes (RAID-0 single-drive each). All visible, all responding to I/O. Health detail is opaque without ssacli.

The P410i is itself ~14-year-old hardware and has a backup battery / capacitor (BBWC) that is slated for replacement (homelab-tracker Phase 1 PENDING, to be done during the same maintenance window as the riser install). A failed BBWC means the write cache is disabled, which causes write performance to drop drastically. That does not immediately threaten data, but it is a stability concern.

VM 189 mdadm arrays — both healthy

md0 (homenas:0, r-tank, p-backup target):

  • RAID-5 over 3x ~419 GiB devices (vda, vdb, vdc; backed by 3x HP EG0450FBDSQ SAS HDDs)
  • Array size: 838.05 GiB
  • Created: Sat Nov 11 09:49:22 2023
  • State: clean [3/3] [UUU]
  • Active devices: 3/3, failed: 0, spares: 0
  • Last update: Sun May 31 01:17:34 2026
  • Events: 825 (low; array is stable)

md1 (homenas:1, s-tank, ext-store target):

  • RAID-5 over 4x ~476 GiB devices (vdd-vdg; backed by 4x Crucial M4 SSDs)
  • Array size: 1.40 TiB (1535.82 GB)
  • Created: Sat Nov 11 11:22:50 2023
  • State: clean [4/4] [UUUU]
  • Active devices: 4/4, failed: 0, spares: 0
  • Last update: Fri May 29 16:18:16 2026
  • Events: 934
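
The array sizes above follow from the RAID 5 rule that one member's worth of space goes to parity. The sketch below is a sanity check on that arithmetic and on the GB / GiB / TiB conversions. It assumes the per-member figures (~419 and ~476) are GiB, which is what the reported md0 array size implies; the function name is illustrative.

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte
TIB = 2**40   # binary tebibyte


def raid5_usable(member_bytes: int, members: int) -> int:
    """RAID 5 keeps one member's worth of parity: usable = (n - 1) * member."""
    if members < 3:
        raise ValueError("RAID 5 needs at least 3 members")
    return (members - 1) * member_bytes


# Assumption: the "~419" and "~476" member sizes in the report are GiB.
md0_usable = raid5_usable(419 * GIB, 3)
md1_usable = raid5_usable(476 * GIB, 4)

print(f"md0 usable ~{md0_usable / GIB:.0f} GiB")
print(f"md1 usable ~{md1_usable / GIB:.0f} GiB")

# The decimal figure mdadm reports for md1, expressed in TiB.
print(f"1535.82 GB = {1535.82 * GB / TIB:.2f} TiB")
```

With rounded member sizes the results land within a few GiB of the reported array sizes, and 1535.82 GB works out to roughly 1.40 TiB.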

BTRFS filesystems — both clean, scrubbed recently

FS | UUID | Backing | Used | Last scrub | Errors
r-tank | 71c8857b-...-5ecc90a8abeb | /dev/md0 (838 GiB) | 336.25 GiB (~41%) | 2026-05-28 01:33, finished clean | 0 (all categories)
s-tank | 55cabf8d-...-67e9ba6e0e09 | /dev/md1 (1.40 TiB) | 1.01 TiB (~73%) | 2026-05-28 01:31, finished clean | 0 (all categories)

Both scrubbed at the May 28 cold-start (OMV ran an auto-scrub after the 15-month dormancy). No errors found on either. Scrub durations: r-tank 21 min, s-tank 2 min (s-tank is much faster because SSD-backed).

Watch item: s-tank btrfs Data allocation is 96.58% used of the allocated 1.05 TiB. BTRFS behaves badly when it runs out of unallocated space (operations slow down or fail). The filesystem still has 351 GiB unallocated on the device, and Data can grow into that, so we are not in trouble yet; the unallocated figure is the one to watch as data grows.
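
The two numbers in this watch item measure different things: how full the already-allocated Data chunks are, and how much of the device is still unallocated. A minimal sketch of that distinction follows, using the rounded figures from this report. The 50 GiB alert threshold is a hypothetical placeholder, not a btrfs recommendation, and because the inputs are rounded the fill ratio comes out near 96% rather than exactly 96.58%.

```python
from dataclasses import dataclass

GIB = 2**30
TIB = 2**40


@dataclass
class BtrfsUsage:
    data_allocated: int  # bytes reserved in Data chunks
    data_used: int       # bytes actually written inside those chunks
    unallocated: int     # bytes on the device not yet assigned to any chunk

    def data_fill(self) -> float:
        """Fraction of the allocated Data chunks that is in use."""
        return self.data_used / self.data_allocated

    def needs_attention(self, min_unallocated: int = 50 * GIB) -> bool:
        # Hypothetical rule of thumb: flag the filesystem once the
        # unallocated pool drops below the chosen threshold.
        return self.unallocated < min_unallocated


# Rounded figures for s-tank taken from this report.
s_tank = BtrfsUsage(
    data_allocated=int(1.05 * TIB),
    data_used=int(1.01 * TIB),
    unallocated=351 * GIB,
)

print(f"Data chunks {s_tank.data_fill():.1%} full")
print("needs attention:", s_tank.needs_attention())
```

A high fill ratio on its own is normal for btrfs; it is the unallocated pool shrinking toward zero that causes the slow or failing operations described above.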


Section C: Capacity and utilization

PVE host

Storage | Type | Size | Used | Available | % | Notes
local | dir on / | 98.5 GB | 46.7 GB | 46.8 GB | 47% | Only content: OMV install ISO (898 MB). The 46 GB used is mostly Proxmox system + logs.
local-lvm (pve/data) | LVM-thin | 348.79 GiB | 11.4 GiB | 337 GiB | 3.27% | Only VM 189 OS (32 GB allocated, 16.29% used = ~5.2 GB actual). 4 of 5 VMs worth of headroom for Phase 3 infrastructure.
Backup-NAS | NFS | 0 | 0 | 0 | disabled | Pre-staged for VLAN 10 + disabled (today)
fast | NFS | 0 | 0 | 0 | disabled | Same
pve/root (/) | ext4 (LV) | 94 GiB | 45 GiB | 45 GiB | 50% | Proxmox root
pve/swap | swap | 8 GiB | n/a | n/a | n/a | Swap LV
sda2 | vfat (partition) | 1 GiB | n/a | n/a | n/a | EFI partition

pve/data thin pool: data 3.27%, metadata 0.53% — plenty of room.

VM 189 (homeNas)

Mount | Backing | Size | Used | Available | %
/ (root) | sda1 (32 GB virtio on local-lvm) | 31 GB | 4.0 GB | 25 GB | 14%
/srv/dev-disk-by-uuid-71c... (= /export/p-backup) | md0 / btrfs r-tank | 839 GB | 338 GB | 501 GB | 41%
/srv/dev-disk-by-uuid-55c... (= /export/ext-store, also bind-mounted at /export/k8s-data, /export/k8sdata) | md1 / btrfs s-tank | 1.4 TB | 1.1 TB | 389 GB | 73%

s-tank at 73% (df) / 96.58% allocated (btrfs) is the headline capacity concern. Most of that 1.01 TB is the 37 orphaned .qcow2 files from the destroyed k8s VMs (per D:\PVE\orphaned-fast-disks-20260531-212124.txt). Reclaiming them once VLAN 10 is up will free a substantial chunk of s-tank.


Section D: Configuration documentation — the full storage path

PHYSICAL                P410i CONTROLLER       HOST VIEW           PVE STORAGE     VM 189 PASSTHROUGH    GUEST MDADM       BTRFS             MOUNT                NFS EXPORT
==========              =================      =========           ===========     ==================    ============      =====             =====                ==========

Bay 1: SPCC SSD         RAID-0 logvol          /dev/sda            local (dir)     (not passthrough)     none              ext4 pve-root     /                    -
                                               sda1 vfat           local-lvm       VM 189 OS disk
                                               sda2 EFI            (lvm-thin       (32 GB alloc /
                                               sda3 LVM2_member     on pve VG)     5.2 GB used)
                                               -> pve-root
                                               -> pve-swap
                                               -> pve-data (thin)
                                                  -> vm-189-disk-0 -----------------> virtio sda inside VM 189 (OMV root, ext4)

Bay 2: HP EG0450FBDSQ   RAID-0 logvol          /dev/sdb            passthrough     VM 189 virtio2 -> vda (raid mbr) -+
Bay 3: HP EG0450FBDSQ   RAID-0 logvol          /dev/sdc            passthrough     VM 189 virtio3 -> vdb (raid mbr) -+-> md0 raid5 (838 GB) -> btrfs r-tank -> /srv/...-71c... -> /export/p-backup -> 10.0.10.5/24
Bay 4: HP EG0450FBDSQ   RAID-0 logvol          /dev/sdd            passthrough     VM 189 virtio4 -> vdc (raid mbr) -+

Bay 5: Crucial M4 SSD   RAID-0 logvol          /dev/sde            passthrough     VM 189 virtio5 -> vdd (raid mbr) -+
Bay 6: Crucial M4 SSD   RAID-0 logvol          /dev/sdf            passthrough     VM 189 virtio6 -> vde (raid mbr) -+-> md1 raid5 (1.4 TB) -> btrfs s-tank -> /srv/...-55c... -> /export/ext-store -> 10.0.10.5/24
Bay 7: Crucial M4 SSD   RAID-0 logvol          /dev/sdg            passthrough     VM 189 virtio7 -> vdf (raid mbr) -+                                              -> /export/k8s-data -> 172.16.10.0/24
Bay 8: Crucial M4 SSD   RAID-0 logvol          /dev/sdh            passthrough     VM 189 virtio8 -> vdg (raid mbr) -+                                              -> /export/k8sdata -> * (everyone!)

iLO Internal SD reader  (passthrough via USB)  /dev/sdi (29 GB)    -               -                     -                 vfat (empty)      -                    -

Nested complexity summary

5 layers between physical bay and end-user share for the SAS / M4 data:

bay -> P410i logical volume -> host /dev/sdN -> virtio passthrough -> mdadm raid5 (inside guest) -> btrfs (inside guest) -> NFS export (out of guest)

This is fragile in three specific ways:

  1. P410i is hardware RAID acting in single-drive mode. Every drive is one logical volume. This works but adds an opaque layer (no per-drive SMART without ssacli). True passthrough mode (HBA) would expose drive SMART through to mdadm and let smartctl work. P410i cannot be flashed to IT mode; would require replacing the controller with an LSI HBA.
  2. mdadm is doing the actual redundancy, inside the guest. The "hardware RAID" controller is doing nothing useful — its protection is single-drive RAID-0 which is no protection at all. If a bay fails, the corresponding /dev/vdX inside the guest goes bad and mdadm degrades the array. Recovery requires swapping the physical drive, re-creating the single-drive RAID-0 on the P410i (using ssacli), reattaching to the VM, and letting mdadm rebuild.
  3. btrfs on top of mdadm is a known "nest doll" choice. btrfs CAN do its own RAID and would prefer to. But with mdadm underneath, btrfs sees a single device per filesystem and can only do single-copy data (no btrfs-level redundancy). The DUP metadata gives some protection against corruption, but if mdadm gives btrfs bad data, btrfs has no second copy of data blocks to fall back on. Architecturally, this means btrfs is doing checksumming / scrub on data that mdadm trusts to be correct; if a drive returns wrong bytes silently, only btrfs notices.
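
The recovery path in item 2 starts with knowing which physical bay sits behind a failed guest device. The lookup below is a sketch that transcribes the bay, host device, guest device and array columns from the storage-path diagram in this section; the type and function names are illustrative. Host /dev/sdX names are not guaranteed to stay stable across reboots, so treat the host column as a snapshot of the May 2026 boot.

```python
from typing import NamedTuple


class DiskPath(NamedTuple):
    bay: int
    host_dev: str
    guest_dev: str
    array: str


# Transcribed from the storage-path diagram in Section D.
DISK_PATHS = [
    DiskPath(2, "/dev/sdb", "/dev/vda", "md0"),
    DiskPath(3, "/dev/sdc", "/dev/vdb", "md0"),
    DiskPath(4, "/dev/sdd", "/dev/vdc", "md0"),
    DiskPath(5, "/dev/sde", "/dev/vdd", "md1"),
    DiskPath(6, "/dev/sdf", "/dev/vde", "md1"),
    DiskPath(7, "/dev/sdg", "/dev/vdf", "md1"),
    DiskPath(8, "/dev/sdh", "/dev/vdg", "md1"),
]


def locate(guest_dev: str) -> DiskPath:
    """Find the bay and host device behind a guest virtio disk."""
    for path in DISK_PATHS:
        if path.guest_dev == guest_dev:
            return path
    raise KeyError(f"{guest_dev} is not a passthrough disk in the diagram")


if __name__ == "__main__":
    hit = locate("/dev/vde")
    print(f"{hit.guest_dev} -> {hit.host_dev} -> bay {hit.bay} ({hit.array})")
```

If mdadm inside VM 189 reports a fault on /dev/vde, this points at host /dev/sdf and bay 6 as the drive to pull.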

Reverse mapping — what each export serves

NFS export | Backing FS | Net dest | Purpose | State today
/export/p-backup | md0 / r-tank | 10.0.10.5/24 | Proxmox vzdump backup target (Backup-NAS storage in PVE) | disabled in PVE storage.cfg; the vzdump job has been failing every Sunday since 2024-03
/export/ext-store | md1 / s-tank | 10.0.10.5/24 | VM disk store (was fast storage in PVE) | disabled in PVE storage.cfg; holds 37 orphaned .qcow2 files from destroyed VMs
/export/k8s-data | md1 / s-tank (bind) | 172.16.10.0/24 | k8s persistent volume (?) | unused; 172.16.10.0/24 is not a current VLAN
/export/k8sdata | md1 / s-tank (bind) | * (everyone!) | k8s persistent volume? | SECURITY ISSUE: exports to any IP that can reach the NFS port. Not actually reachable today because the network is locked down, but the config is sloppy.
/export (pseudo-root) | / | 10.0.10.5/24 + 172.16.10.0/24 + * | NFSv4 pseudo-root | read-only, low risk
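
A wildcard export like the one on /export/k8sdata is easy to check for mechanically. The sketch below scans text in /etc/exports format and returns every path exported to "*" or to an empty host field. The sample lines are hypothetical: the paths and client specs come from the table above, but the option strings are placeholders because the report does not reproduce the real file. Quoted paths containing spaces and backslash line continuations are not handled.

```python
def wildcard_exports(exports_text: str) -> list[str]:
    """Return export paths that are open to any client."""
    flagged = []
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        path, *clients = line.split()
        for client in clients:
            host = client.split("(", 1)[0]
            # "*" matches every host; an empty host field before the
            # options also applies them to everyone.
            if host in ("*", ""):
                flagged.append(path)
                break
    return flagged


# Hypothetical sample; option strings are placeholders, not the real config.
SAMPLE = """\
/export/p-backup  10.0.10.5/24(rw,sync,no_subtree_check)
/export/ext-store 10.0.10.5/24(rw,sync,no_subtree_check)
/export/k8s-data  172.16.10.0/24(rw,sync,no_subtree_check)
/export/k8sdata   *(rw,sync,no_subtree_check)
"""

if __name__ == "__main__":
    print(wildcard_exports(SAMPLE))
```

Run against the real /etc/exports on VM 189, a check like this would also flag the pseudo-root entry, which the table lists as exported to "*" as well.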

Section E: Honest risk assessment

Single points of failure

  1. VM 189 itself. Everything depends on it being up. The OMV install, the NFS exports, the SMB shares all live in this one VM. A guest-OS-level failure (OMV bug, kernel panic, accidental misconfig) takes down ALL data access (family data included, once Path D is live) until the VM is recovered.
  2. homeNas btrfs filesystems. Two single-copy btrfs filesystems, each with one device (md0 / md1). btrfs metadata is DUP-protected; data is not. A silent corruption that mdadm does not catch (rare but possible: bad cache battery + power loss + write-in-flight is the classic case) will be detected by btrfs checksums on read or scrub, but only DUP metadata can be repaired from a second copy; a damaged data block has nothing to fall back on.
  3. P410i controller. Old, capacitors / BBWC suspected aging, replacement is non-trivial (requires identifying a P410i-compatible replacement controller and re-importing the logical volume config). A controller failure with no spare = chassis is down until a replacement arrives.
  4. The chassis itself. A single G7 with no second machine means any whole-server event (motherboard, PSU, fire, theft, flood) loses everything.
  5. Cooling / power. PSUs are the suspected-aging PS-2122-2H pair (homelab-tracker hardware monitoring item). Recent cold-start failures suggest cap drying.

Hardware showing age

Component | Concern | Indicator
Crucial M4 SSDs (bays 5-8) | 2011-era consumer SSDs, ~14 years; SMART blind today | No data; need ssacli to confirm wear
HP EG0450FBDSQ SAS HDDs (bays 2-4) | Enterprise-grade but old (~14 yrs estimated, matches the chassis vintage) | Same SMART blind spot
P410i controller + BBWC | Cap drying suspected; BBWC replacement already queued | Phase 1 PENDING in homelab-tracker
PSUs (PS-2122-2H) | Cold-start failures during May 2026 boot | Documented in CLAUDE.md Phase 5 monitoring
CMOS battery | "[NOT SET]" timestamps in IML suggest dead | Already noted in maintenance bundle

Architectural concerns

  1. Zero working backups. vzdump job exists, fires every Sunday, has been failing for 15+ months (target NFS was unreachable through the cold-start period AND through the network rebuild in progress). No off-site backup at all. No PBS deployed.
  2. Nested storage architecture is fragile (Section D). Five layers between physical bay and end-user data; P410i HW RAID is doing no useful work; SMART is masked.
  3. /export/k8sdata exports to * — sloppy config. Not exploitable today (network not routed) but will be the moment the network rebuild brings exposure.
  4. vzdump email notifications going to [email protected] — Hotmail address. Worth confirming Kay still has access to that mailbox; if not, all those failure notifications from the last 15 months went into a void.
  5. s-tank at 96.58% allocated — needs watching. Cleanup of the 37 orphaned .qcow2 files (when VLAN 10 is up and fast is reachable) will reclaim a substantial chunk.

Data-loss scenarios mapping

Scenario | What is lost | Detected by | Recoverable from
Single drive failure in md0 (3-drive RAID 5) | nothing (array runs degraded; replace + rebuild) | btrfs device stats / mdadm event | Replace drive, re-create logvol on P410i, mdadm rebuild
Single drive failure in md1 (4-drive RAID 5) | nothing immediately; but high correlated-failure risk because 4 identical-vintage drives | same | same, but a second failure during rebuild = total data loss
Two simultaneous drive failures in same array | The entire btrfs filesystem on that array | mdadm | Nothing: no backups. Data is gone.
Silent corruption (bad write, cache battery dying) | Probably nothing visible until btrfs scrub finds it; subtle file corruption | Next btrfs scrub | btrfs metadata DUP may localize; data is single-copy so depends on luck
P410i controller failure | Whole chassis I/O stops; LV config may be lost if controller-side state is not recoverable | Boot or kernel error | Replace controller, re-import. Risk of mis-import losing data if new controller is a different revision.
VM 189 corruption (OMV bug, kernel panic) | Access to data; underlying mdadm + btrfs should survive | Service alerts (none configured) | Repair / reinstall OMV, re-mount existing btrfs
Whole-server event (PSU explosion, fire, theft, water damage) | Everything. No off-site backup. | obvious | Nothing. Total loss.
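
Several rows above depend on someone noticing an mdadm event, and the table notes that no service alerts are configured. As a sketch of what a minimal check inside VM 189 could look like, the function below parses text in /proc/mdstat format and reports arrays that are missing a member. The sample text is hypothetical (it invents a degraded md0 purely for illustration; the report found both arrays clean), and the function name is illustrative.

```python
import re

# Matches the "[expected/active] [UU_]" status pair on an array's detail line.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text: str) -> dict[str, str]:
    """Map array name -> member flags for every array missing a member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        if line.startswith("md"):
            current = line.split()[0]
        match = STATUS_RE.search(line)
        if match and current:
            expected, active, flags = match.groups()
            if int(active) < int(expected) or "_" in flags:
                degraded[current] = flags
    return degraded


# Hypothetical sample: md0 is shown degraded only to exercise the check.
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 vdd[0] vde[1] vdf[2] vdg[4]
      1499823104 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid5 vda[0] vdb[1]
      878753792 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

In real use the input would be the contents of /proc/mdstat read inside the guest; how the result is delivered (mail, webhook, monitoring agent) is left open here.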

How serious is this today?

Today, with no family data and no production workload running on it, the risk is "acceptable" in the sense that "if it breaks, only Kay's old data is lost." The moment Path D goes live and family data starts landing on Nextcloud (which itself stores on storage backed by these same drives, indirectly via PBS or direct mount), the risk profile changes from "I lose my own old stuff" to "I lose my sister's wedding photos."

Path D must not go live until at least: PBS deployed + first backup + restore drill verified + off-site backup target chosen and operating + drive health visibility (ssacli) established.


Section F: Open questions for human decision

  1. ssacli sideload. Without it, we have no view into individual drive SMART. Is this OK to add to homelab-tracker Phase 1 PENDING and execute now (same sideload playbook as ipmitool — pull from HPE MCP or community mirror)?
  2. Crucial M4 replacement strategy. If ssacli reveals wear-level >80% on any of the four, we need a replacement plan. The bay layout suggests replacing one at a time (let mdadm rebuild between swaps), but: are we replacing with M4-compatible drives (other consumer SATA SSDs), enterprise SSDs (which are pickier with P410i), or doing a wholesale array migration to new drives during the foundation rebuild?
  3. HP EG0450FBDSQ replacement strategy. Same questions. These have been less-flagged as aging concerns but are equally old. SAS HDDs have a different replacement market than SATA SSDs.
  4. vzdump email destination — is [email protected] still live? If yes, that mailbox has 15+ months of "backup failed" emails sitting in it. Worth checking even just to confirm the alerting channel works.
  5. Sloppy NFS export /export/k8sdata to * — fix now or after VLAN rebuild? Fixing now is one line in /etc/exports + an exportfs -ra. Risk of doing it now: zero, since no client currently uses it. Doing it later: forgetting and exposing it when the network is opened up.
  6. k8s-data and k8sdata exports — were these for the now-destroyed k8s clusters? If so, can the bind-mounts and exports be removed entirely?
  7. btrfs on mdadm vs btrfs native RAID: is now the time to plan a migration? Doing so requires destroying and recreating the arrays (with a full data evacuation first), which is a multi-day operation. ZFS as an alternative would require swapping the P410i for an HBA. None of these is a call for tonight; just flagging that the current "mdadm-then-btrfs" stack is architecturally suboptimal.
  8. The 32 GB SD card for Proxmox-to-SD migration is empty and sitting in the iLO reader slot. Phase ordering: do we Proxmox-to-SD-migrate before or after the storage rebuild?
  9. Off-site backup target — Cloudflare R2 vs Backblaze B2 — locked architecture says deferred until before D.1, but knowing now informs the PBS configuration we will deploy.

Section G: Priority list

Ordered by risk impact, not implementation effort.

TIER 0 — Address immediately (no good reason to defer)

  1. Sideload ssacli so we can read individual drive SMART through the P410i. ~80 KB .deb, same playbook as ipmitool. Without this, every drive-health decision is blind.
  2. Fix /export/k8sdata NFS export from * to a specific subnet (or remove entirely if the k8s clusters are gone — they are). One line edit, exportfs -ra to reload. Risk: zero.
  3. Verify or replace the vzdump notification email. If [email protected] is alive: check it for 15+ months of failure mails. If dead or unread: update the notification address and add a fallback (multi-recipient).
  4. Reclaim the 37 orphaned .qcow2 files on fast once VLAN 10 is up. Frees substantial space on s-tank (currently 96.58% allocated). This is already tracked in homelab-tracker Phase 4 but worth re-highlighting given how full s-tank is.

TIER 1 — Address as part of Phase 2 / 3 (network + service infrastructure)

  1. Deploy Proxmox Backup Server. This is homelab-tracker.md Phase 3 item 3.1 already. After PBS is running and target storage (homeNas Backup-NAS re-enabled, or a separate target) is reachable, run a full backup of VM 189 and verify it. First successful PBS restore drill is the gate for Path D launch.
  2. Re-evaluate Backup-NAS and fast storage entries. Both are pre-staged for VLAN 10. Decide post-network-rebuild whether to keep + re-enable or pvesm remove entirely.
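  3. Run a btrfs scrub on s-tank after orphan cleanup. Today it was clean, but after deleting ~561 GB of orphaned files we should re-scrub to confirm health. Scrub only verifies checksums; if the freed Data chunks are not returned to unallocated space automatically, a filtered balance (btrfs balance start -dusage=N) is the tool for that.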

TIER 2 — Plan during Phase 2 maintenance window

  1. CMOS battery replacement + P410i BBWC replacement + visual PSU dust inspection — already tracked in homelab-tracker maintenance window bundle.
  2. PSU spares procurement — if cold-start failures recur after the upcoming reboot. Already in CLAUDE.md Section 11 #21 (informational).

TIER 3 — Plan after Phase 2 / before Path D launch (D.1)

  1. Off-site backup target chosen and operating. Cloudflare R2 vs Backblaze B2 decision; PBS configured to push to it; encryption verified.
  2. Full end-to-end restore drill of VM 189 backup, restoring to a throwaway VM on local-lvm. Document RTO. This is the gate for D.1 (Nextcloud) — no family data lands before this passes.

TIER 4 — Plan after Path D is operational

  1. Drive replacement strategy based on ssacli wear data. M4 SSDs probably need to go first; SAS HDDs second. Budget + sourcing.
  2. btrfs native RAID migration OR ZFS migration with HBA controller swap — long discussion, not a tonight item, not even a this-quarter item. Worth keeping on the radar as "the current architecture has limits we will eventually hit."
  3. Second machine for redundancy. Right now the chassis is a true single point of failure. The OPNsense / observability HP EliteDesk boxes (Phase 6 in homelab-tracker) are budgeted; an additional storage-capable second host is not. Probably out of scope for this homelab generation; flagged for completeness.

Appendix: Raw data files (on host)

Collected to /tmp/storage-review-20260531-234421/ on host and /tmp/storage-review-vm189-20260531-234518/ on VM 189. Includes individual smartctl outputs (mostly empty because of the P410i masking), pvesm / lvs / btrfs / mdadm outputs, and dmesg / journalctl extracts.

These get cleared on next reboot of host / VM 189. If long-term retention is desired, copy to /root/ on respective hosts.
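
If retention is wanted, the copy can be scripted so it refuses to overwrite an earlier capture. The helper below is a minimal sketch using only the standard library; the function name is illustrative, and it would need to run as root on each machine for a /root destination.

```python
import shutil
from pathlib import Path


def preserve(src: Path, dest_root: Path) -> Path:
    """Copy a review directory under dest_root, keeping its original name.

    shutil.copytree raises FileExistsError if the destination already
    exists, so an earlier capture is never silently overwritten.
    """
    dest = dest_root / src.name
    shutil.copytree(src, dest)
    return dest
```

On the host this would be called as preserve(Path("/tmp/storage-review-20260531-234421"), Path("/root")), with the equivalent call for the VM 189 directory made inside the guest.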


Report generated 2026-05-31 by read-only investigation. No system state was modified. Counter-check this report against any conflicting recollection or context Kay has — the investigation is technically complete but recommendations are subject to revision based on context I do not have.