Recently, I left for about a week on vacation. After coming home, I noticed this PC had crashed. After restarting, it crashed again an hour later. After restarting, it crashed again… an hour later.
I checked the hot-swap bay on my Corsair 800D (bypassing it by connecting the drive directly), re-arranged the power cables to daisy-chain fewer drives on a single cable, swapped SATA cables between the working and broken drives, upgraded the BIOS, and tried a few other things. Finally, I came across this forum post for my Crucial SSD’s firmware:
Release Date: 01/13/2012
- Changes made in version 0002 (m4 can be updated to revision 0309 directly from either revision 0001, 0002, or 0009)
- Correct a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point. The condition will allow the end user to successfully update firmware, and poses no risk to user or system data stored on the drive.
Apparently, after 5184 hours, the default firmware on these drives causes the drive to “disappear” an hour after booting up – due to some S.M.A.R.T. related bug. This timeout doesn’t reset until you power OFF, so the drive will still be missing from the BIOS settings after a crash and soft reboot.
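For anyone who wants to check whether a drive is close to that point, here is a rough sketch in Python. It assumes you have saved the text output of smartmontools' `smartctl -A` for the drive, and that the drive reports power-on time as SMART attribute 9 (Power_On_Hours), which is the usual convention but is not confirmed by the release notes. The function names and the sample row are made up for illustration; only the 5184-hour figure comes from Crucial's changelog.

```python
import re

# Threshold taken from the Crucial release notes quoted above.
BUG_THRESHOLD_HOURS = 5184


def power_on_hours(smartctl_output):
    """Return the raw Power_On_Hours value from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; 9 is assumed to be Power_On_Hours.
        if len(fields) >= 10 and fields[0] == "9":
            # Some drives print raw values like "5190h+12m"; keep the leading integer.
            match = re.match(r"\d+", fields[9])
            if match:
                return int(match.group())
    return None


def past_threshold(hours):
    """True if the drive has reached the power-on time named in the changelog."""
    return hours is not None and hours >= BUG_THRESHOLD_HOURS


# Illustrative sample row only, not captured from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       5190
"""

hours = power_on_hours(SAMPLE)
print(hours, past_threshold(hours))
print(BUG_THRESHOLD_HOURS / 24)  # threshold expressed in days of power-on time
```

5184 hours works out to 216 days of power-on time, so a machine left running around the clock would reach it roughly seven months after the drive was first powered up.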
Crucial claims there is no risk of data loss, but that’s completely false: the associated BSOD / freeze degraded my RAID mirrors almost every time this happened. It also corrupted the TrueCrypt volume I had mounted, every single time. The failing drive flushing its own caches isn’t enough to prevent data loss on other drives as an indirect result of the failure!
I would never have guessed, at the start of this problem, that hard drive firmware could be the issue. To be honest, I don’t usually even think to upgrade hard drive firmware. Since when did that become a necessary maintenance step?
After a firmware upgrade on both drives (I’m using 2 in a striped RAID), everything is back to stable.