Virtualized TrueNAS on Proxmox with motherboard SATA passthrough - constant Command Timeout SMART failures, pool crashes - solved partially but need help
I've been running TrueNAS SCALE virtualized on Proxmox with 4x Seagate IronWolf 4TB (ST4000VN006-3CW104) in a RAIDZ1 pool. I was passing through the motherboard's SATA controller as a PCI device to the TrueNAS VM. Recently Scrutiny flagged three of the drives with SMART failures, and I went down a rabbit hole trying to figure out what was actually wrong.
My setup:
- Proxmox 9.1.9, single node
- TrueNAS SCALE virtualized
- 4x Seagate IronWolf 4TB in RAIDZ1
- Motherboard SATA controller passed through as PCI device to TrueNAS VM
What Scrutiny reported on all 3 drives:
- SMART attribute 188 (Command Timeout) — FAILED
- Attribute 199 (UltraDMA CRC Error Count) — WARN on one drive
- Attribute 183 (Runtime_Bad_Block) — extremely high on one drive (8,653)
The worst drive (WW66E7T3):
- Command Timeout raw value: 17,180,262,774
- Runtime_Bad_Block: 8,653
- Power Cycle Count: 849 (other drives were at ~153-154)
- UltraDMA CRC: 12
The insane power cycle count compared to the others (bought all at the same time, same usage period) was the first red flag — the drive was clearly losing and re-establishing connection constantly.
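For anyone puzzled by the huge Command Timeout number: on Seagate drives, the raw value of attribute 188 is commonly described as three packed 16-bit counters, not one count. A minimal sketch of the decoding, assuming that layout applies to this model (the field meanings are the commonly cited interpretation, not something confirmed for this drive):

```shell
#!/bin/sh
# Raw value of attribute 188 as reported by Scrutiny for WW66E7T3.
raw=17180262774

# Assumed layout (commonly cited for Seagate drives, not verified here):
#   bits  0-15: total command timeouts
#   bits 16-31: commands that took longer than 5 s
#   bits 32-47: commands that took longer than 7.5 s
timeouts=$(( raw & 0xFFFF ))
over_5s=$(( (raw >> 16) & 0xFFFF ))
over_7_5s=$(( (raw >> 32) & 0xFFFF ))

echo "timeouts=$timeouts over_5s=$over_5s over_7_5s=$over_7_5s"
# prints: timeouts=374 over_5s=6 over_7_5s=4
```

If that layout holds, the drive logged 374 command timeouts rather than 17 billion, which is still a lot but fits the picture of a link that keeps dropping.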
What I tried:
- Realized motherboard SATA controller passthrough is fundamentally problematic: it's not a discrete PCIe device, causes shared interrupt/DMA issues with the host, leading to command timeouts across all drives. Switched to disk-by-id passthrough instead (`qm set <vmid> -scsi1 /dev/disk/by-id/ata-...`).
- The worst drive (WW66E7T3) started causing pool crashes immediately after VM start. Kernel log flooded with `critical target error, dev sda` and ZIO errors with `error=121`.
- Removed the drive from VM passthrough, ran `smartctl -t long` from Proxmox host: completed without error, PASSED. No reallocated sectors, no pending sectors, SMART error log clean.
- Changed SATA cable and port for that drive, re-ran extended SMART: still PASSED, Command Timeout and CRC values didn't increase at all.
- Re-added drive to pool via `zpool replace`, but drive keeps causing I/O errors under load (during scrub especially) and crashing the VM. Currently sitting UNAVAIL in the pool.
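To make the "values didn't increase" check repeatable, here is a rough sketch of a helper that pulls one attribute's raw value out of `smartctl -A` output so it can be compared before and after a load test. The `smart_raw` function name is made up, and the sample table below is illustrative: only the raw values come from this post, the other columns are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print the RAW_VALUE column(s) for one attribute ID,
# reading `smartctl -A` style output on stdin.
smart_raw() {
  awk -v id="$1" '$1 == id {
    out = $10
    for (i = 11; i <= NF; i++) out = out " " $i   # raw value may contain spaces
    print out
  }'
}

# Illustrative sample in smartctl -A layout (placeholder columns, raw values from the post).
sample=$(cat <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       849
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       8653
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
EOF
)

before=$(printf '%s\n' "$sample" | smart_raw 199)
after=$(printf '%s\n' "$sample" | smart_raw 199)   # same sample, so the delta is 0
echo "CRC errors before=$before after=$after delta=$(( after - before ))"
```

On the real host the input would come from something like `smartctl -A /dev/disk/by-id/ata-... | smart_raw 199`, captured once before and once after a sustained read of the whole disk. If the CRC or power cycle counters climb during the read, that would point at the link rather than the platters.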
Current state:
- Pool is DEGRADED with WW66E7T3 UNAVAIL
- Other 3 drives ONLINE, no data errors
- Scrub completed: repaired 0B, 0 errors
- Drive is within warranty (Seagate IronWolf, expires June 2028)
My questions:
- SMART extended test passes perfectly but the drive fails under real ZFS load — is this a known failure mode? Can a drive pass long SMART but still be genuinely failing?
- Could this still be a cable/port/controller issue despite the cable swap? The Runtime_Bad_Block of 8,653 and the power cycle anomaly point strongly to physical connection instability.
- Anyone else running TrueNAS virtualized on Proxmox with motherboard SATA passthrough experiencing similar timeout issues? Switching to disk-by-id helped the other drives significantly.
- Should I just RMA the drive given it's under warranty, or are there more diagnostics worth running first?
Thanks in advance.
u/yuaina42 — 4 days ago