u/yuaina42

I've been running TrueNAS SCALE virtualized on Proxmox with 4x Seagate IronWolf 4TB (ST4000VN006-3CW104) in a RAIDZ1 pool. I was passing through the motherboard's SATA controller as a PCI device to the TrueNAS VM. Recently Scrutiny flagged all three drives with SMART failures and I went down a rabbit hole trying to figure out what was actually wrong.

My setup:

Proxmox 9.1.9, single node
TrueNAS SCALE virtualized
4x Seagate IronWolf 4TB in RAIDZ1
Motherboard SATA controller passed through as PCI device to TrueNAS VM

What Scrutiny reported on all 3 drives:

SMART attribute 188 (Command Timeout) — FAILED
Attribute 199 (UltraDMA CRC Error Count) — WARN on one drive
Attribute 183 (Runtime_Bad_Block) — extremely high on one drive (8,653)

The worst drive (WW66E7T3):

Command Timeout raw value: 17,180,262,774
Runtime_Bad_Block: 8,653
Power Cycle Count: 849 (other drives were at ~153-154)
UltraDMA CRC: 12

The power cycle count was the first red flag. All the drives were bought at the same time and have the same usage period, yet this one is at 849 while the others sit at ~153-154. The drive was apparently losing and re-establishing its connection constantly.
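A side note on that Command Timeout number, since it looks far too large to be a plain count. This is only a minimal sketch, assuming (I have not confirmed this against Seagate documentation) that the raw value of attribute 188 packs three 16-bit counters into 48 bits. The variable names are my own.

```shell
# Assumption: the 48-bit raw value of attribute 188 is three packed
# 16-bit counters. This only splits the number; it does not prove
# what each field means.
raw=17180262774   # Command Timeout raw value reported for WW66E7T3

low=$((  raw         & 0xFFFF ))   # bits 0-15
mid=$(( (raw >> 16)  & 0xFFFF ))   # bits 16-31
high=$(( (raw >> 32) & 0xFFFF ))   # bits 32-47

echo "high=$high mid=$mid low=$low"
# prints: high=4 mid=6 low=374
```

If that assumption holds, the low word (374) would be the figure closest to an actual timeout count, which is still high but not in the billions. I am not sure what the other two words represent.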

What I tried:

Concluded that passing through the motherboard SATA controller was itself a problem: it is not a discrete PCIe device, so it shares interrupts/DMA with the host, which I believe caused the command timeouts across all drives. Switched to disk-by-id passthrough instead (qm set <vmid> -scsi1 /dev/disk/by-id/ata-...).
The worst drive (WW66E7T3) started crashing the pool immediately after VM start. The kernel log was flooded with "critical target error, dev sda" messages and ZIO errors with error=121.
Removed the drive from VM passthrough and ran smartctl -t long from the Proxmox host. It completed without error and the drive PASSED: no reallocated sectors, no pending sectors, clean SMART error log.
Changed the SATA cable and port for that drive and re-ran the extended SMART test. Still PASSED, and the Command Timeout and CRC values did not increase at all.
Re-added the drive to the pool via zpool replace, but it keeps causing I/O errors under load (especially during scrub) and crashing the VM. It is currently UNAVAIL in the pool.
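The before/after comparison of the SMART counters in the steps above can be scripted. A rough sketch, assuming the usual smartctl -A table layout where the raw value starts at the 10th column; smart_raw is a hypothetical helper name and /dev/sdX is a placeholder, not my actual device.

```shell
# smart_raw: hypothetical helper. Reads `smartctl -A` output on stdin and
# prints the raw value of the attribute whose ID is given as $1.
# Assumption: the raw value starts at the 10th whitespace-separated column
# and may span several columns, so everything from column 10 on is printed.
smart_raw() {
  awk -v id="$1" '$1 == id {
    out = $10
    for (i = 11; i <= NF; i++) out = out " " $i
    print out
  }'
}

# Intended use on the Proxmox host (placeholder device, needs root):
#   before=$(smartctl -A /dev/sdX | smart_raw 199)
#   ... run a scrub or other heavy I/O ...
#   after=$(smartctl -A /dev/sdX | smart_raw 199)
#   [ "$before" = "$after" ] || echo "attribute 199 changed: $before -> $after"
```

Comparing the values as strings avoids having to interpret vendor-specific raw formats; it only tells you whether a counter moved during the load.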

Current state:

Pool is DEGRADED with WW66E7T3 UNAVAIL
Other 3 drives ONLINE, no data errors
Scrub completed: repaired 0B, 0 errors
Drive is within warranty (Seagate IronWolf, expires June 2028)

My questions:

SMART extended test passes perfectly but the drive fails under real ZFS load — is this a known failure mode? Can a drive pass long SMART but still be genuinely failing?
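Could this still be a cable/port/controller issue despite the cable swap? The Runtime_Bad_Block value of 8,653 and the power cycle anomaly seem to point to physical connection instability, especially since, as far as I can tell, attribute 183 is documented for some vendors as a SATA downshift (link speed drop) counter rather than a literal bad-block count.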
Is anyone else running TrueNAS virtualized on Proxmox with motherboard SATA passthrough seeing similar timeout issues? Switching to disk-by-id helped the other drives significantly.
Should I just RMA the drive given it's under warranty, or are there more diagnostics worth running first?

Thanks in advance.

Virtualized TrueNAS on Proxmox with motherboard SATA passthrough - constant Command Timeout SMART failures, pool crashes - solved partially but need help