My first disk failure w/ ZFS using RAIDZ1! Panic!

I recently (seriously, less than 36 hours ago) experienced my first disk failure in an array running ZFS (FreeNAS/FreeBSD).  The pool consists of four 3-disk RAIDZ1 vdevs built from 1TB disks.  Many people on the internet will tell you RAID5/RAIDZ1 is dead, but they claim this with no context.  When I’ve mentioned running RAIDZ1 to people, their first reaction is to tell me how RAIDZ2 is better.  Sure, with 3 – 6TB disks I’d probably run RAIDZ2.  However, when dealing with 3-disk vdevs of “small” (by today’s standards) disks, is it worth running double or triple parity, or are you risking your data with single-disk redundancy?  Well, read on.

The volume StoragePool (ZFS) state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.

The message above is not something you want to see on a Monday as you come into work.  Catching this disk failure was actually a bit of a lucky situation.  FreeNAS sent me the failure message above at 2:15AM November 7 and I had missed it when cruising my email.  Coincidentally, a co-worker and I put together a brief shell script (that I will share in a future post) that emails me drive status and I had been tinkering with the script the day prior.  When the script sent me a daily email (at 9:00AM), I noticed that the output was showing bad sectors – lots of them.  I thought for sure that I must have pulled the wrong column from the smartctl output or something:

/dev/da9 status is Passed with 0 bad sectors. Disk temperature is 41.

/dev/da10 status is Passed with 0 bad sectors. Disk temperature is 39.

/dev/da11 status is Passed with 65535 bad sectors. Disk temperature is 38.

/dev/da12 status is Passed with bad sectors. Disk temperature is 31.

The “Passed” figure comes right out of smartctl which is somewhat concerning.  But, wow – 65,535 bad sectors…  it filled the counter up.  That’s not good.
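For anyone who wants to build a similar report, here is a minimal Python sketch of how one of those status lines could be produced from smartctl text. This is not the script mentioned above. It assumes the "bad sectors" figure is the raw value of the Reallocated_Sector_Ct attribute and that the report uses the usual ten-column attribute table. The function name `summarize_smart` and the sample report are made up for illustration.

```python
import re


def summarize_smart(device, report):
    """Build a one-line status summary from smartctl health + attribute text."""
    # Overall health line looks like "... test result: PASSED"
    health = re.search(r"test result:\s*(\w+)", report)
    status = health.group(1).capitalize() if health else "Unknown"

    bad_sectors = None
    temperature = None
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if fields[1] == "Reallocated_Sector_Ct":   # assumed "bad sectors" source
            bad_sectors = int(fields[9])
        elif fields[1] == "Temperature_Celsius":
            temperature = int(fields[9])

    return (f"{device} status is {status} with {bad_sectors} bad sectors. "
            f"Disk temperature is {temperature}.")


# Hypothetical sample text, not captured from the failed drive.
SAMPLE = """\
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   050   050   005    Pre-fail  Always       -       65535
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Min/Max 20/50)
"""

print(summarize_smart("/dev/da11", SAMPLE))
```

Attribute names vary between drive vendors, so a real script would need to check which ones each disk actually reports.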

I checked the report from the day prior and it was 0.  I then manually ran smartctl against the device and it was in fact reporting 65,535 bad sectors and the volume was degraded.  Crap.  I don’t have any hot spares in the array because someone is there 99.9% of the time to swap a drive in but of course no one would be available for over 2 weeks.  I used Storage vMotion to evacuate data I cared about and made sure my Veeam backups had been completing successfully… just in case.

Because I am running FreeNAS 9.10 on my Dell R510, I knew I’d be able to hot-swap the drive with no down time.  It was simply a matter of making the ~1 hour drive and swapping the hardware.  I posted on the FreeNAS forums just to double check the process.  When I built this array I used sas2ircu to map out the serial numbers of the drives to the slot number on the backplane – this is critical to successfully pulling the correct disk:

# sas2ircu 0 DISPLAY…
Device is a Hard disk
  Enclosure # : 2
  Slot # : 9
  SAS Address : 500065b-3-6789-abf1
  State : Ready (RDY)
  Size (in MB)/(in sectors) : 953869/1953525167
  Manufacturer : ATA
  Model Number : Hitachi HUA72101
  Firmware Revision : A74A
  Serial No : GTE000PAJX8NKE
  GUID : N/A
  Protocol : SATA
  Drive Type : SATA_HDD

Note:  If you do run RAIDZ1 or any combination of single-disk redundancy per vdev or span, do realize that pulling the wrong disk out during replacement could result in total pool failure.  Don’t mess this up!
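If you would rather not build the serial-to-slot map by hand, something like the following Python sketch could do it from saved sas2ircu DISPLAY text. It assumes every drive record starts with a "Device is a Hard disk" line, uses "key : value" pairs as in the excerpt above, and is separated from the next record by a blank line. The function name `map_serials_to_slots` is hypothetical.

```python
def map_serials_to_slots(display_text):
    """Return {serial: (enclosure, slot)} parsed from sas2ircu DISPLAY text."""
    mapping = {}
    current = None  # fields of the drive record being read, if any
    for line in display_text.splitlines():
        stripped = line.strip()
        if stripped == "Device is a Hard disk":
            current = {}              # start a new drive record
            continue
        if not stripped:
            current = None            # assumed: a blank line ends the record
            continue
        if current is None or ":" not in stripped:
            continue
        key, _, value = stripped.partition(":")
        current[key.strip()] = value.strip()
        # Enclosure and slot are listed before the serial number in a record.
        if key.strip() == "Serial No" and {"Enclosure #", "Slot #"} <= current.keys():
            mapping[current["Serial No"]] = (
                int(current["Enclosure #"]),
                int(current["Slot #"]),
            )
    return mapping


# The record shown above, used as sample input.
SAMPLE = """\
Device is a Hard disk
  Enclosure # : 2
  Slot # : 9
  SAS Address : 500065b-3-6789-abf1
  State : Ready (RDY)
  Size (in MB)/(in sectors) : 953869/1953525167
  Manufacturer : ATA
  Model Number : Hitachi HUA72101
  Firmware Revision : A74A
  Serial No : GTE000PAJX8NKE
  GUID : N/A
  Protocol : SATA
  Drive Type : SATA_HDD
"""

print(map_serials_to_slots(SAMPLE))
```

Looking up the failed drive's serial number in the resulting dictionary gives the enclosure and slot to pull.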

After correlating the serial number of the failed drive to the sas2ircu output (it was also convenient that slot #9 was not flashing any activity LEDs) I pulled the tray out of slot 9 and none of my VMs on the array exploded.  I then slid a new Dell Enterprise 1TB 7.2k RPM SATA disk into position and clicked the “Replace” button:

[Image: FreeNAS failed disk]

This kicked off the ZFS resilvering process and changed the alert from an “A volume is degraded, fix your stuff” message to:

[Image: FreeNAS failed disk]

I could see in the disk view that the serial number of the disk being “replaced” now had no description which I use to identify what slot it is in.  So, I checked sas2ircu again and updated that:

[Image: FreeNAS failed disk]

Then I waited.  I made sure the resilvering process hit the 50 – 75% mark before heading home.  In all, the resilvering process took 1 hour 18 minutes to scan 1.91TB and resilver 162GB of data:

[Image: FreeNAS failed disk]
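To put those resilver numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes decimal units (1 GB = 1000 MB, 1.91TB = 1910 GB). ZFS may report sizes in binary units, so treat the results as rough figures. The helper name `rate_mb_per_s` is made up for this sketch.

```python
def rate_mb_per_s(gigabytes, hours, minutes):
    """Average throughput in MB/s for a given amount of data and duration."""
    seconds = hours * 3600 + minutes * 60
    return gigabytes * 1000 / seconds


# 162GB resilvered and 1.91TB scanned in 1 hour 18 minutes
resilver = rate_mb_per_s(162, 1, 18)
scan = rate_mb_per_s(1910, 1, 18)

print(f"resilvered at ~{resilver:.1f} MB/s, scanned at ~{scan:.0f} MB/s")
```

That works out to roughly 35 MB/s written to the new disk while the pool as a whole was read at around 400 MB/s.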

Success!  Though, honestly, I didn’t expect this to fail.  The drives I am using are mostly enterprise units and since they’re 1TB in size there’s not a ton of thrashing during the rebuild.  However, people on the internet would have you believe that this was destined for disaster.  I do need to pick up a couple more enterprise class SATA disks since I only have one or two spares now but that’s my own problem.  In a 12-disk configuration, I do think that a pool made up of 4 RAIDZ1 vdevs with 3 disks each is the best compromise for usable space vs. performance.
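The space side of that compromise is easy to sketch in Python. The figures below are raw usable capacity for twelve 1TB disks, ignoring ZFS metadata and padding overhead. The vdev count is included because a common rule of thumb (not something measured here) says random I/O scales roughly with the number of vdevs. The `layout` helper and the two alternative layouts are illustrative assumptions.

```python
def layout(vdevs, disks_per_vdev, redundancy, disk_tb=1):
    """Raw usable TB and vdev count for a pool of identical vdevs.

    redundancy is the number of disks per vdev that hold no unique data:
    1 for RAIDZ1 or a 2-way mirror, 2 for RAIDZ2.
    """
    usable_tb = vdevs * (disks_per_vdev - redundancy) * disk_tb
    return {"vdevs": vdevs, "usable_tb": usable_tb}


layouts = {
    "4 x 3-disk RAIDZ1": layout(4, 3, 1),
    "6 x 2-way mirror": layout(6, 2, 1),
    "2 x 6-disk RAIDZ2": layout(2, 6, 2),
}

for name, info in layouts.items():
    print(f"{name}: {info['usable_tb']} TB usable across {info['vdevs']} vdevs")
```

Under these assumptions the RAIDZ1 layout gives 8 TB across 4 vdevs, mirrors give 6 TB across 6 vdevs, and two RAIDZ2 vdevs give 8 TB across only 2 vdevs.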

Anyway, I am very content with my decision to choose FreeNAS as the solution to my shared storage dilemma in my vSphere cluster.  The product is built on a very reliable, resilient filesystem and offers tons of flexibility (like supporting my switchless 10 GbE configuration).  I’ll post up a couple video clips I have from replacing the disks to share just how simple this system is to use.  It’s always nice when technology works as intended!  I am giving FreeNAS the credit here, but the real hero is ZFS.  It just works.

Thanks for reading and as always please subscribe and feel free to comment!

Author: Jon


7 Comments

  1. A decade ago, I was using the Addonics disk arrays so that I could hotswap bad drives out of my raids array. It worked really well.

    Later, I realized that I could take advantage of it – I replaced all the 250gb drives with 1tb drives, one at a time. After the last one, the entire pool was able to increase. With no downtime.

    • raidz2 – stupid autocorrect

  2. I did this on my raidz2 array just a day before you posted this. Small world

  3. Agreed that a pair of mirrors (software equivalent to raid10) is better for VMs than raidZ1 as it gives better io and some redundancy. And for better redundancy you could do sets of three disks striped. The more stripes you add the better the io. Add to that automated replication to a raidZ volume and a disk failure is nothing to write home about. I run a raidZ3 at home with seven HGST 4TB drives.

  4. I have 3 production and 1 test FreeNAS zfs storage systems at my office. In the last 12 months I have had 12+ failed drives. I use raid 10 and raidz6. The first thing to do is not panic. HDDs are lightbulbs; they will burn out. Finding the correct volume if you don’t have a good identity map can be done with a cat /dev/sdx > /dev/null walk for each drive. Look at the activity light and you’re good. My systems have 20+ disks. I even had the dreaded unrecoverable sector during a resilver. Here is what separates ZFS from standard RAID. It told me the filename that was damaged by the bad sector. I simply deleted that file and it allowed the resilver to complete!! Thus I lost a file not a volume!!!! Read that. Lost a file not the entire 100TB volume because ZFS is superior in a lot of ways.

    To be honest I thought I would lose the entire vol. Here is another tip. Turn on the “SCRUB” schedule. This will read each block and metadata and preemptively fix most failed data blocks and prevent hidden data corruption. This does slow the array down during the scan. But it likely would have prevented my unrecoverable file situation had it been on at the time. Since then I get about one fixed block every few months across 100+ drives.

    In the end I use vmware and FreeNAS with great success.

    In my case the failed file was part of a snapshot on the volume. The other thing I lost was the snapshots: I had to remove all snapshots that included the bad file. About 30 snaps had that file name included and had to be removed, since each had a map to the bad file and the resilver would fail as long as something occupied the bad sector on the remaining disk out of the mirror.

    Thus the authors use of 3 disk raid1 volumes is not a bad idea if the data is important.

    The prior comment about raid10 fails to realize the author’s system is already raid 10. Just each mirror is 3 disks not 2, and it’s actually recommended if you use 2TB+ drives because of the possible issue I had.

    • @Koplin

      I did not “fail to realize” that the author is using striped raidz vdevs.

      Proclaiming “Thus the authors use of 3 disk raid1 volumes is not a bad idea if the data is important” is folly, as RAID is not a backup. Regardless of the pool layout, backups should be implemented.

  5. A better compromise would have been striping multiple mirrors (aka RAID10). Why didn’t you go with this solution, as your needs (typical VM usage, further validated by the fact you only have about 1.2TB of data) depend more on random IO than straight throughput?

    The chance of failure is not much higher. Even at that… raid isn’t a backup. A double disk failure could have led to your entire pool failing, even using striped raidz1 as you have it.

    You would have lost another 2TB due to redundancy, but your pool would have 50% more random IO.
