832 TB – ZFS on Linux – Configuring Storage: Part 3

In this blog post I’ll be continuing the ZFS on Linux project we’ve been going over.  If you’ve found yourself on this page directly and are completely confused, no worries, just check out the earlier articles to get you going in the right direction:

832 TB – ZFS on Linux – Project “Cheap and Deep”: Part 1

832 TB – ZFS on Linux – Setting Up Ubuntu: Part 2

With that out of the way let’s talk about this phase of the project.  If you’re following along then you know we’ve already got the hardware configured, the OS (Ubuntu 16.04 LTS in my case) installed, and we’re ready to actually start setting up the ZFS side of things.

Prepare the OS

The first thing I always, always do is bring the OS up to date and let it install all updates.  For Ubuntu, it looks like this:

sudo apt-get update && sudo apt-get upgrade -y && sudo reboot

I keep my VM templates (relatively) up to date but if you’re installing on bare metal you’ll surely need to update a bunch.  Here’s what I am faced with:

~$ sudo apt-get update && sudo apt-get upgrade
Hit:1 http://us.archive.ubuntu.com/ubuntu xenial InRelease
[...]
Get:12 http://security.ubuntu.com/ubuntu xenial-security/universe i386 Packages [146 kB]
Fetched 3,586 kB in 1s (2,061 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following packages have been kept back:
  linux-generic linux-headers-generic linux-image-generic
The following packages will be upgraded:
  apparmor bind9-host cryptsetup cryptsetup-bin dnsutils grub-legacy-ec2 libapparmor-perl libapparmor1 libbind9-140 libcryptsetup4
  libdns-export162 libdns162 libisc-export160 libisc160 libisccc140 libisccfg140 liblwres141 libpython3.5 libpython3.5-minimal
  libpython3.5-stdlib libxml2 linux-firmware python3.5 python3.5-minimal snapd tcpdump
26 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
Need to get 59.3 MB of archives.
After this operation, 5,327 kB of additional disk space will be used.
Do you want to continue? [Y/n]

For CentOS and RHEL, I’d run:

sudo yum check-update && sudo yum update -y && sudo reboot

For RHEL this will only work with a valid subscription, etc.

The machine should update its sources, then upgrade existing packages, and reboot.  Once it comes back, log back in and we’ll start to install the ZFS packages.

It’s important to note that you should read the ZFS on Linux documentation for CentOS/RHEL before installing, so you can decide whether you’ll install the kABI-tracking kmod packages or the DKMS packages.

 

Get your ZFS on

Ok – time to get down.  If running Ubuntu 16.04 LTS, just run:

sudo apt-get install zfs nfs-kernel-server snmpd snmp mailutils pv lzop mbuffer fio

Let’s take a second to go over what we just did there.  Below is what we installed and why:

  • zfs – obviously we’ll need the ZFS packages to do anything (depending on your Ubuntu release this may be packaged as zfsutils-linux instead)
  • nfs-kernel-server – even though ZFS can share datasets over NFS itself (the sharenfs property), it still relies on the system NFS server to do the actual exporting
  • snmpd/snmp – we are going to want to monitor this thing for disk space, uptime, load, etc.
  • mailutils – this part is optional, but I prefer to set up postfix so this server acts as a satellite SMTP host and relays mail through something else in the environment
  • pv lzop mbuffer – these will be useful later on when we talk about ZFS replication using Sanoid/Syncoid
  • fio – no, this isn’t FIOS spelled wrong; this is the flexible I/O tester for Linux.  You will want to run some sort of benchmark locally vs. over NFS or iSCSI.  Or maybe you don’t.

This should result in ~80MB of downloads.  There’s only one portion of this that I am not going to go into super detail about configuring, and that’s the postfix setup prompt that appears while the packages install.

The reason is that there are just too many environment-specific assumptions to make.  That said, I chose the “Satellite system” option because I relay off of another host.  There’s some postfix post-configuration (say that 80 times fast) that needs to take place too, but unless I receive a ton of flak, I’ll omit that here as well.
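If you blew past the prompt during installation, you can always bring it back and point postfix at your relay afterwards.  A minimal sketch, where smtp.example.com is just a placeholder for whatever relay host exists in your environment:

sudo dpkg-reconfigure postfix
sudo postconf -e 'relayhost = smtp.example.com'
sudo systemctl restart postfix

dpkg-reconfigure re-runs the same wizard you saw at install time (pick “Satellite system” there), and postconf -e simply edits /etc/postfix/main.cf for you.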

For good measure, reboot your system after installing all of those packages.  Then, let’s see if ZFS is ready to get down:

:~$ sudo zpool status
no pools available

Once you’ve got Ubuntu 16.04 recognizing ZFS commands (as above) you can start configuring stuff.  The first thing we need to do is configure our disk layout.  This is where you need to apply what you’ve read about ZFS and RAIDZ-1, RAIDZ-2, etc.  So, as you might recall, I have a bunch of disks involved in this 832TB build: 52 HGST 8TB drives plus two Intel P3700 800GB NVMe devices.

Here’s the deal, as concise as I can make it: pick your RAIDZx vdev configuration, find the disks you want to involve, and then create your zpool by referencing the disk IDs.  Why use the disk ID?  If you reference device names such as /dev/sda, /dev/sdb, etc. when building the pool, you risk the system losing track of which disk is which should it decide to enumerate the disks in a different order on boot.  ZFS metadata should be able to put the pieces back together, but just avoid this altogether.  I have seen this happen!  I have seen various Linux distributions enumerate devices differently across reboots, seemingly at random.  What really, really sucks is when your /etc/fstab file references a /dev/sdX device for a mount point, it gets flip-flopped with another /dev/sdX device, and all of a sudden your application is dumping data onto the wrong disk.

It is for this reason that I ONLY mount disks (yes, even in single-disk environments) by XFS label or by /dev/disk/by-id.  What does this look like?

~$ ll /dev/disk/by-id/
drwxr-xr-x 2 root root 7080 Jul 28 08:57 ./
drwxr-xr-x 7 root root  140 Jul 27 11:57 ../
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G49Y6Y -> ../../sdz
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G4RN5Y -> ../../sdg
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G4RPTY -> ../../sdh
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G4RZ5Y -> ../../sdi
lrwxrwxrwx 1 root root   10 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G52EEY -> ../../sdaf
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G52NWY -> ../../sdc
lrwxrwxrwx 1 root root   10 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G537PY -> ../../sdak
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G548YY -> ../../sdm
lrwxrwxrwx 1 root root   10 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G54NXY -> ../../sdac
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G54UJY -> ../../sdj
[etc...]

You get the idea.  All of those ata-HGST_HUH72808… values are the IDs of the disks and are essentially a concatenation of the interface, model, and serial number all in one.  Unlike /dev/sdX names, these never change.

Once you have that you’re ready to create your zpool!  This is done with the following command:

 sudo zpool create -o ashift=12 [poolname] raidz2 ata-HGST_HUH728080ALE600_R6G49Y6Y ...

The above command would create a zpool with a given name that consists of a RAIDZ-2 vdev with the disks listed thereafter.  If you want to create multiple vdevs in the pool that’s easy, too!  Just list the disks out for each vdev and then throw in another RAIDZx type and the rest of the disks, etc.
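For example, here’s a minimal sketch of a pool built from two RAIDZ-2 vdevs in a single command; the pool name tank and the disk IDs are placeholders, not my actual devices:

sudo zpool create -o ashift=12 tank \
  raidz2 ata-DISK_SERIAL_01 ata-DISK_SERIAL_02 ata-DISK_SERIAL_03 ata-DISK_SERIAL_04 ata-DISK_SERIAL_05 \
  raidz2 ata-DISK_SERIAL_06 ata-DISK_SERIAL_07 ata-DISK_SERIAL_08 ata-DISK_SERIAL_09 ata-DISK_SERIAL_10

Each raidz2 keyword starts a new vdev, and ZFS stripes writes across every vdev in the pool.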

HOLD UP WAIT A MINUTE!

You can see above we’re specifying -o ashift=12, and the reason for this may not be obvious at first.  We’re using modern disks that have 4k sector sizes.  Almost all disks today use 4k sectors but may also report 512b sectors to remain backwards compatible with legacy systems.  If you do not specify the ashift (alignment shift) and ZFS guesses wrong, you will incur significant performance penalties.  I won’t bore you, but the reason is that 2^ashift (2^12 = 4096 bytes here) is the smallest I/O allowed on the vdev.  So, match that to your physical sector size and you’re golden.  The ashift cannot be changed retroactively.  Do this at the creation of each vdev, even when adding a new vdev to an existing pool!

How do you find out what your disks support as far as sector size?  Easy!  Run the two commands below:

~$ sudo fdisk -l
Disk /dev/sdac: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

~$ sudo blockdev --getbsz /dev/sdg
4096

You see above that our disk reports 512 byte logical and 4k physical sectors, and the blockdev command returns a matching 4096 byte block size.
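If you’d rather not scroll through fdisk output for dozens of drives, a couple of other standard util-linux commands report the same thing per device:

~$ sudo blockdev --getpbsz /dev/sdg
~$ lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdg

--getpbsz prints the physical sector size in bytes, and the LOG-SEC/PHY-SEC columns from lsblk show both values; drop the device argument and lsblk lists every disk at once.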

Note:  Depending on what model disk(s) you’re using, ZFS may correctly identify the sector size and create the vdev/zpool with the right alignment shift without you specifying it.  However, do not bet on this.  If you created a zpool with the default ashift for 512b sectors (ashift=9) and future disks stop reporting 512b compatibility, you will not be able to replace failed disks with new ones!  Be super cautious here!
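If you want to double-check what a pool actually ended up with after creation, zdb will show the ashift recorded for each top-level vdev.  This is a commonly used check that reads the default /etc/zfs/zpool.cache, which should exist for a pool created as above:

~$ sudo zdb | grep ashift

You should see an ashift entry for each top-level vdev; for the 4k disks here, that should read ashift: 12.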

But what about SSDs?  Same process!  However, depending on your SSD you may find that the sector size is 8192 bytes, which makes it an 8k device.  This is common on SSDs.  If this is the case, you would want to create your SSD vdev with -o ashift=13 (2^13 = 8192).

A special case with Intel NVMe devices

Ok so we know -o ashift=13 is for SSDs if they show 8192 byte sectors.  However, what does an Intel P3700 800GB PCIe NVMe disk support?  Well, using the Intel isdct command (the Intel SSD Data Center Tool, available from Intel) we can report on the sector size straight from the device:

~$ sudo isdct show -a intelssd
ProductFamily : Intel SSD DC P3700 Series
ProductProtocol : NVME
ProtectionInformation : 0
ProtectionInformationLocation : 0
ReadErrorRecoveryTimer : Device does not support this command set.
SMARTEnabled : True
SMARTHealthCriticalWarningsConfiguration : 0
SMBusAddress : 106
SectorSize : 512

Hrm… 512b sectors.  Oh well..  wait – not so fast!  Intel NVMe devices have variable sector sizes according to this article.  First, update the firmware after downloading the utility to the host:

sudo isdct load -intelssd 0
sudo isdct load -intelssd 1

Once complete, reboot the host.

Then, to set these devices to use 4k sectors, we just issue the following commands:

~$ sudo isdct start -intelssd 0 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetadataSettings=0
~$ sudo isdct start -intelssd 1 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetadataSettings=0

This assumes you have two Intel P3700 NVMe devices that you want to update (hence the 0 and 1 for the index).  We can confirm the settings by checking with:

~$ sudo isdct show -a -intelssd | grep Sec
PhysicalSectorSize : The selected drive does not support this feature.
SectorSize : 4096

Boom – 4k sectors!

Now that we’ve updated the NVMe firmware and set the sectors properly, let’s create the SLOG vdev!  First, get your device id just like previously:

~$ sudo ls -l /dev/disk/by-id/ | grep nvme
lrwxrwxrwx 1 root root 13 Jul 28 08:56 nvme-INTEL_SSDPEDMD800G4_CVFT6484003U800CGN -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Jul 28 08:56 nvme-INTEL_SSDPEDMD800G4_CVFT64840094800CGN -> ../../nvme1n1

Then create the SLOG  vdev as part of the original pool we created:

sudo zpool add -o ashift=12 [poolname] log mirror nvme-INTEL_SSDPEDMD800G4_CVFT6484003U800CGN nvme-INTEL_SSDPEDMD800G4_CVFT64840094800CGN

Finally, let’s look at the zpool as a whole:

~$ sudo zpool status
  pool: [poolname]
 state: ONLINE
  scan: scrub repaired 0 in 0h7m with 0 errors on Sun Sep 10 00:31:16 2017
config:

        NAME                                             STATE     READ WRITE CKSUM
        [poolname]                                          ONLINE       0     0     0
          raidz2-0                                       ONLINE       0     0     0
            ata-HGST_HUH728080ALE600_R6G49Y6Y            ONLINE       0     0     0
            ...
          raidz2-1                                       ONLINE       0     0     0
            ata-HGST_HUH728080ALE600_R6G54V7Y            ONLINE       0     0     0
            ...
          raidz2-2                                       ONLINE       0     0     0
            ata-HGST_HUH728080ALE600_R6G5A7TY            ONLINE       0     0     0
            ...
          raidz2-3                                       ONLINE       0     0     0
            ata-HGST_HUH728080ALE600_R6G5JZRY            ONLINE       0     0     0
            ...
          raidz2-4                                       ONLINE       0     0     0
            ...
        logs
          mirror-5                                       ONLINE       0     0     0
            nvme-INTEL_SSDPEDMD800G4_CVFT6484003U800CGN  ONLINE       0     0     0
            nvme-INTEL_SSDPEDMD800G4_CVFT64840094800CGN  ONLINE       0     0     0

I’ve obviously truncated device IDs from the output above, but you get the idea.  Because I have 52 disks and created 5 RAIDZ-2 vdevs with 10 disks each, I have 2 disks left over.  Let’s add the two remaining disks to the pool as spares:

~$ sudo zpool add [poolname] spare ata-HGST_HUH728080ALE600_VJGRVR1X ata-HGST_HUH728080ALE600_VJGRW57X

Alright!

If you happen to have 50 HGST 8TB disks in this configuration with two Intel P3700 800GB NVMe disks, then your pool should match mine, which you can verify with the following command:

~$ sudo zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
drpool1   362T  1.20T   361T         -     0%     0%  1.00x  ONLINE  -

The command above shows the pool at the zpool level: the total raw capacity across all of the disks in the RAIDZ vdevs, before parity and other ZFS overhead are subtracted.
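To see what you can actually store once RAIDZ-2 parity is accounted for, ask zfs list instead:

~$ sudo zfs list [poolname]

The AVAIL column there is usable space, which will come out noticeably lower than the 362T raw figure above.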

A couple final tweaks

ZFS on Linux is set to use something like 50% of your system RAM for the ARC by default.  If you’re like me and have 256GB of RAM in a box, you don’t want to leave 128GB of that sitting with the OS.  So, instead, we can edit /etc/modprobe.d/zfs.conf and add a line that reads options zfs zfs_arc_max=206158430208, which comes out to 192GB of RAM (the setting is defined in bytes) dedicated to the ARC maximum size.  Granted, even the 64GB of RAM left over (256GB minus 192GB) is a lot for the OS, but I am just being cautious.
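Here’s a minimal sketch of that tweak; the value is just 192 * 1024^3 expressed in bytes, so scale it to your own system:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=206158430208

The module option is read when the zfs module loads, so reboot for it to take effect (and if your system loads ZFS from the initramfs, run sudo update-initramfs -u first).  On a live system the same value can also be written to /sys/module/zfs/parameters/zfs_arc_max to change the limit without a reboot.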

One last thing we absolutely want to configure is ZED!  ZED (the ZFS Event Daemon) runs in the background and alerts us to disk failures, scrub results, etc.  It’s included with ZFS on Linux and the configuration file lives at /etc/zfs/zed.d/zed.rc by default on Ubuntu.  Let’s look at my configuration:

~$ sudo cat /etc/zfs/zed.d/zed.rc
ZED_EMAIL_ADDR="[email address to receive all the stuff]"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
ZED_NOTIFY_INTERVAL_SECS=120
ZED_NOTIFY_VERBOSE=1
I’ve removed all comments from the above output so that you can see what I have set.  You can see that it’s pretty simple overall.  You want to make sure all of the fields above are set and are not commented out so that Zed runs properly.  If you’re unsure of what to set, check your config file as the comments will still be in place explaining what each option does.
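After editing zed.rc, restart the daemon so it re-reads the file.  The exact systemd unit name depends on your packaging, so one of these should do it:

sudo systemctl restart zed    # or zfs-zed, depending on how your packages name the unit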

Assuming you have SMTP relay/postfix/etc. configured properly, you should be able to run the following command:

~$ sudo zpool scrub [poolname]

Because there’s no data on the pool, it should run very quickly (minutes), and shortly afterwards you should receive a scrub notification email from ZED.

The reason we get this email even for a clean scrub is that we set ZED_NOTIFY_VERBOSE=1, which tells ZED to send notifications for all events it handles, not just critical ones.
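If the email never arrives, test the mail path by itself before blaming ZED.  mailutils gives you a quick way to do that (the address below is just a placeholder for your own):

echo "test from the ZFS box" | mail -s "ZFS mail relay test" you@example.com

If that doesn’t land in your inbox either, the problem is your postfix/relay configuration rather than ZED.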

At this point, we’re now ready to start creating datasets (or zvols if that’s your thing).  But, for now, that’s a wrap.  Stay tuned for more on this topic and let me know if you are lost or want to see any aspect of this article highlighted in more detail!  Thanks everyone!

Author: Jon


19 Comments

  1. I built a system like this using RHEL (Red Hat) and ZFS on Linux, but the management of it was a pain. I built another system like it two years later using FreeNAS (for the ZFS) and I must say it was very easy to set up, does all the things I need (60 x 10TB drives), and is faster than the RHEL system. I am moving the data off the RHEL system and plan to reconfigure it as FreeNAS also.

  2. I really like napp-it for ZFS management. I have installed a lot of home/hobby storage boxes with Oracle Solaris 11+ and napp-it on consumer hardware, sometimes literally scrap 🙂 . Rock solid combo. Looks like the lack of ECC RAM is not critical when dealing with media files (photo, video, music). It’s amazing how well ZFS [under Solaris] performs, from house ceilings to data center environments.

  3. Hi Jon,

    Any chance of a follow-up article to show how you set up the NFS share and got it working with VMware? I’ve got a lab happening with a box set up following pretty much what you have here: set jumbo frames on the NICs, bonded two of them on the storage device, got the datastore mounted on VMware ESXi 6.7 hosts, but I can’t create any VMs.

    I can read ISOs off the storage tank, but the VM creation hangs at 99% and then eventually fails with an error about syncing the configurations. I’m close, very close, but figure it’s probably something to do with permissions (although I can see the files being created on the datastore seemingly OK, as root:root). I’ve tried NFS 3 and 4 without success.

    Love to see how you did the next bit.

    P.

  4. WOW, impressive. I’m working in the same business and I’m thankful you’re sharing your experiences! I’m not as fluent as you are in explaining things.

    I just converted a 40TB NetApp appliance into a ZFS filesystem, and since I got the invoice for the system (so I don’t trade in stolen goods), I can say it was a $100k system that I bought for $1k and sold for $10k. Your NEW 800TB system is so inexpensive with its $35k price tag. I really appreciate your willingness to share your experience!

  5. Ever since I updated the firmware and sector size on my Intel SSDPEDMW400G4 NVMe, I am getting a ton of errors in the syslog. Any thoughts?

    [ 555.598324] blk_update_request: I/O error, dev nvme0n1, sector 9120
    [ 585.615895] blk_update_request: I/O error, dev nvme0n1, sector 9128
    [ 615.633482] blk_update_request: I/O error, dev nvme0n1, sector 9136
    [ 645.651046] blk_update_request: I/O error, dev nvme0n1, sector 9144
    [ 675.668626] blk_update_request: I/O error, dev nvme0n1, sector 9152
    [ 705.686197] blk_update_request: I/O error, dev nvme0n1, sector 9160

    • Hi,

      We also got this problem. Did you find any solutions?
      We have one P3700 NVMe (PCI Express) card (as cache). Reverting to 512b sector size makes it stop complaining.

      But it’s kind of interesting that we also have 4 Intel NVMe (P3600) drives where this problem doesn’t exist.

      / Magnus

      • I did not find a fix; I had to revert to 512 myself. Please note I am still on ZoL 0.6.5.9.

        • I just followed this blog to set up a relatively similar server and ran into exactly the errors described above when reformatting my Intel P3700 NVMe drives with 4k sectors. The errors disappeared when falling back to 512b sectors.

          I then realized that Debian stretch provides ZFS on Linux 0.6.5 out of the box, whereas backports already has 0.7.4. Since I am still in the testing phase, I updated, and the problems seem to be gone.

          [Not sure I would dare this on a production system]

  6. Great tip on the NVMe drive sector size. You made me realize that for the past 2 years my Intel NVMe has been set to 512. Now I just need a service window to fix it. Thanks for some great posts.

    • No problem! Just make sure that if you’re running ZFS you rebuild your log vdev afterwards, since you obviously can’t change ashift on the fly.

    • Thanks – do you have any specific loads you want to see run?

      • Hi Jon,

        Yes, I do.
        We use to run this IO benchmark here:

        # fio --ioengine=libaio --name=ZBOOX_BENCHMARK --rw=write --direct=0 --numjobs=1 --bs=1024k --iodepth=64 --ramp_time=15 --size=128G --filename=zbooxbench
        # fio --ioengine=libaio --name=ZBOOX_BENCHMARK --rw=read --direct=0 --numjobs=1 --bs=1024k --iodepth=64 --ramp_time=15 --size=128G --filename=zbooxbench
        # fio --ioengine=libaio --name=ZBOOX_BENCHMARK --rw=randread --direct=0 --numjobs=1 --bs=1024k --iodepth=64 --ramp_time=15 --size=128G --filename=zbooxbench

        Direct=0 because fio can’t do direct IO to a ZFS pool, as far as I know.

        Best,

        Danilo

  7. I was just wondering if you can change the amount of memory a FreeNAS system can use like you did above. If so, do you think the FreeNAS OS needs some memory, or should all of it be dedicated to cache?

    • FreeNAS needs memory, especially if using deduplication. It also needs memory to manage L2ARC devices and basic services. You must also consider the jails/containers you may run.

    • FreeNAS is tuned pretty well to assign the RAM available as necessary. That said, I am sure you can reassign if need be – there’s a whole “tuneables” menu in FreeNAS if I recall correctly.

  8. For your 832TB setup you are using multiple RAIDZ-2 vdevs in a single pool, right? How many disks are you using in each vdev?


