832 TB – ZFS on Linux – Configuring Storage: Part 3

In this blog post I’ll be continuing the ZFS on Linux project we’ve been going over.  If you’ve found yourself on this page directly and are completely confused, no worries, just check out the earlier articles to get you going in the right direction:

832 TB – ZFS on Linux – Project “Cheap and Deep”: Part 1

832 TB – ZFS on Linux – Setting Up Ubuntu: Part 2

With that out of the way let’s talk about this phase of the project.  If you’re following along then you know we’ve already got the hardware configured, the OS (Ubuntu 16.04 LTS in my case) installed, and we’re ready to actually start setting up the ZFS side of things.

Prepare the OS

The first thing I always, always do is bring the OS up to date and let it install all updates.  For Ubuntu, it looks like this:
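
The exact flags are a matter of preference, but it amounts to roughly:

    sudo apt-get update           # refresh the package lists
    sudo apt-get dist-upgrade -y  # pull in all pending updates, kernel included
    sudo reboot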

I keep my VM templates (relatively) up to date but if you’re installing on bare metal you’ll surely need to update a bunch.  Here’s what I am faced with:

For CentOS and RHEL, I’d run:
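
Something along these lines does it on the Red Hat side:

    sudo yum update -y
    sudo reboot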

For RHEL this will only work with a valid subscription, etc.

The machine should update its sources, then upgrade existing packages, and reboot.  Once it comes back, log back in and we’ll start to install the ZFS packages.

It's important to note that you should read the ZFS on Linux documentation regarding installing ZFS on CentOS/RHEL in order to decide whether you'll install the kABI-tracking kmod or DKMS packages.


Get your ZFS on

Ok – time to get down.  If running Ubuntu 16.04 LTS, just run:
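
As a rough sketch (on Ubuntu 16.04 the ZFS tooling comes from the zfsutils-linux package, so substitute that if a plain zfs package isn't offered):

    sudo apt-get install zfsutils-linux nfs-kernel-server snmpd snmp mailutils pv lzop mbuffer fio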

Let’s take a second to go over what we just did there.  Below is what we installed and why:

  • zfs – obviously we'll need the ZFS packages to do anything (on Ubuntu 16.04 the userland tools ship in the zfsutils-linux package)
  • nfs-kernel-server – ZFS can manage NFS exports via its sharenfs property, but it still relies on the system's NFS server to actually serve them
  • snmpd/snmp – we are going to want to monitor this thing for disk space, uptime, load, etc.
  • mailutils – this part is optional, but I prefer to set up postfix so this server relays mail through something else in the environment
  • pv, lzop, mbuffer – these will be useful later on when we talk about ZFS replication using Sanoid/Syncoid
  • fio – no, this isn't FiOS spelled wrong, this is the flexible I/O tester for Linux. You'll want to run some sort of local benchmark to compare against NFS or iSCSI performance (or maybe you won't, but it's handy to have)

This should result in ~80MB of downloads.  There's only one portion of this that I am not going to go into super detail about configuring, and that's the postfix setup prompt:

The reason for this is that there are just too many assumptions to make.  That said, I will choose Satellite system because I relay off of another host.  There's some postfix post-configuration (say that 80 times fast) that needs to take place as well, but unless I receive a ton of flak I will omit that too.
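
If you need to revisit that choice later, the postfix configuration dialog can be re-run at any time:

    sudo dpkg-reconfigure postfix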

For good measure, reboot your system after installing all of those packages.  Then, let’s see if ZFS is ready to get down:
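
A quick sanity check looks something like this; a "no pools available" response from zpool status is exactly what you want at this point:

    sudo modprobe zfs   # load the kernel module if it isn't loaded already
    sudo zpool status   # should report "no pools available"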

Once you’ve got Ubuntu 16.04 recognizing ZFS commands (as above) you can start configuring stuff.  The first thing we need to do is configure our disk layout.  This is where you need to apply what you’ve read about ZFS and RAIDZ-1, RAIDZ-2, etc.  So, as you might recall, I have a bunch of disks involved in this 832TB build:
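
If you want a quick inventory of what the controllers are presenting, something along these lines works (the column selection is just a suggestion):

    lsblk -o NAME,SIZE,MODEL,SERIAL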

Here's the deal, as concise as I can make it: pick your RAIDZx vdev configuration, find the disks you want to involve, and then create your zpool by referencing each disk's id.  Why use the disk id?  If you reference device names such as /dev/sda, /dev/sdb, etc. when building the pool, you risk the system losing track of which disk is which should it decide to enumerate the disks in a different order upon boot.  ZFS metadata should be able to put the pieces back together, but just avoid this altogether.  I have seen this happen!  I have seen, for whatever reason, various Linux distributions enumerate devices differently across reboots, seemingly at random.  What really, really sucks is when your /etc/fstab file references a /dev/sdX device for a mount point, it gets flip-flopped with another device, and all of a sudden your application is dumping data on the wrong disk.

It is for this reason that I ONLY mount (yes, even in single-disk environments) disks by XFS labels or by /dev/disk/by-id.  What does this look like?
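
A listing like the following shows the persistent ids; filtering out the partition entries keeps it readable:

    ls -l /dev/disk/by-id/ | grep -v part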

All of those ata-HGST_HUH72808… values are the ids of the disks; each is usually a concatenation of the manufacturer, model, and serial number all in one.  You get the idea: this never changes.

Once you have that you’re ready to create your zpool!  This is done with the following command:
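
Here is a minimal sketch, assuming a pool named tank and placeholder device ids (substitute your own ids from /dev/disk/by-id):

    sudo zpool create -o ashift=12 tank raidz2 \
      /dev/disk/by-id/ata-DISK01 /dev/disk/by-id/ata-DISK02 \
      /dev/disk/by-id/ata-DISK03 /dev/disk/by-id/ata-DISK04 \
      /dev/disk/by-id/ata-DISK05 /dev/disk/by-id/ata-DISK06 \
      /dev/disk/by-id/ata-DISK07 /dev/disk/by-id/ata-DISK08 \
      /dev/disk/by-id/ata-DISK09 /dev/disk/by-id/ata-DISK10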

The above command would create a zpool with a given name that consists of a RAIDZ-2 vdev with the disks listed thereafter.  If you want to create multiple vdevs in the pool that’s easy, too!  Just list the disks out for each vdev and then throw in another RAIDZx type and the rest of the disks, etc.

HOLD UP WAIT A MINUTE!

You can see above we're specifying -o ashift=12, and the reason for this is important but may not be obvious at first.  We're using modern disks that have 4k sector sizes.  Almost all disks today use 4k sectors but may also report 512b sectors to remain backwards compatible with legacy systems.  That said, if you do not specify the ashift (alignment shift) above, you risk incurring significant performance penalties.  I won't bore you, but the reason is that 2^ashift is the smallest I/O size allowed on the vdev, so match it to your sector size and you're golden.  This cannot be changed retroactively.  Do this at the creation of each vdev, even if adding a new vdev to an existing pool!

How do you find out what your disks support as far as sector size?  Easy!  Run the two commands below:
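
The blockdev call is the one referenced below; lsblk is simply another way to get at the same numbers (swap in your own device for /dev/sda):

    lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda
    sudo blockdev --getss --getpbsz /dev/sda   # logical sector size, then physical block size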

You see above that our disk supports 512-byte logical and 4k physical sectors, which is confirmed by the blockdev command.

Note:  Depending on what model disk(s) you're using, ZFS may correctly identify the sector size and create the vdev/zpool with the right alignment shift without you specifying it.  However, do not bet on this.  If you created a zpool with the default ashift for 512b sectors (ashift=9) and future disks stop reporting 512b compatibility, you will not be able to replace failed disks with new ones!  Be super cautious here!

But what about SSDs?  Same process!  However, depending on your SSD you may find that the sector size is 8192 bytes which is an 8k device.  This is common on SSDs.  If this is the case, you would want to create your SSD vdev with  -o ashift=13 .

A special case with Intel NVMe devices

Ok so we know  -o ashift=13 is for SSDs if they show 8192 byte sectors.  However, what does an Intel P3700 800GB PCIe NVMe disk support?  Well, using the Intel  isdct command (available here) we can report on the sector sizes straight from the device:
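
With the utility installed, a query along these lines reports the drive properties, sector size included (index 0 is the first Intel SSD the tool enumerates; the exact property name varies a bit between isdct versions):

    sudo isdct show -a -intelssd 0 | grep -i sector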

Hrm… 512b sectors.  Oh well..  wait – not so fast!  Intel NVMe devices have variable sector sizes according to this article.  First, update the firmware after downloading the utility to the host:
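
Roughly like so, once per device; the tool prompts for confirmation, and the exact syntax should be checked against the documentation for your isdct version:

    sudo isdct load -intelssd 0
    sudo isdct load -intelssd 1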

Once complete, reboot the host.

Then, to set this device to use 4k sectors, we just issue the following commands:
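
Something along these lines, once per device.  Be warned that this wipes the drive, and the LBAformat index that maps to 4k (along with the other property names) should be double-checked against Intel's documentation for your model and isdct version:

    sudo isdct start -intelssd 0 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetadataSettings=0
    sudo isdct start -intelssd 1 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetadataSettings=0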

This assumes you have two Intel P3700 NVMe devices that you want to update (hence the 0 and 1 for the index).  We can confirm the settings by checking with:
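
Either the same isdct show query from earlier or, from the OS side of things:

    lsblk -o NAME,LOG-SEC,PHY-SEC /dev/nvme0n1 /dev/nvme1n1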

Boom – 4k sectors!

Now that we’ve updated the NVMe firmware and set the sectors properly, let’s create the SLOG vdev!  First, get your device id just like previously:
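
The NVMe devices show up under the same by-id directory:

    ls -l /dev/disk/by-id/ | grep nvme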

Then create the SLOG  vdev as part of the original pool we created:
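
A sketch, again assuming the pool is named tank, that the two NVMe devices are mirrored for the log (a common choice, though they could also be added as separate log devices), and that the ids are placeholders:

    sudo zpool add -o ashift=12 tank log mirror \
      /dev/disk/by-id/nvme-NVME01 /dev/disk/by-id/nvme-NVME02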

Finally, let’s look at the zpool as a whole:
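
Assuming the pool name tank again:

    sudo zpool status tank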

I've obviously truncated device ids from the output above but you get the idea.  Because I have 52 disks and created 5 RAIDZ-2 vdevs with 10 disks each, I have 2 spares.  Let's add the two remaining disks to the pool as spares:
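
Again with placeholder ids and the assumed pool name:

    sudo zpool add tank spare \
      /dev/disk/by-id/ata-DISK51 /dev/disk/by-id/ata-DISK52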

Alright!

If you happen to have 50 HGST 8TB disks in this configuration with two Intel P3700 800GB NVMe disks, then you can see the pool configuration should match mine with the following command:
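
Either of these works, depending on what you want to see; zpool list -v lays out each vdev while zfs list reports usable capacity after parity overhead:

    sudo zpool list -v tank
    sudo zfs list tank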

The output above shows the total zpool size alongside the RAIDZx configuration, accounting for ZFS parity "losses", etc.

A couple final tweaks

ZFS on Linux is set to use something like 50% of your system RAM for the ARC by default.  If you're like me and have 256GB of RAM in a box, you don't want to leave the other 128GB sitting there just for the OS.  So, instead, we can edit /etc/modprobe.d/zfs.conf and add a line that reads options zfs zfs_arc_max=206158430208, which comes out to 192GB of RAM (the setting is defined in bytes) dedicated to the ARC max size.  Granted, even the 64GB of RAM left over (256GB minus 192GB) is a lot for the OS, but I am just being cautious.
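
A sketch of that change:

    # /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_max=206158430208   # 192GB, expressed in bytes

Then rebuild the initramfs so the module option is applied at boot, and reboot:

    sudo update-initramfs -u
    sudo reboot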

One last thing we absolutely want to configure is Zed!  Zed is a daemon that runs and alerts us on disk failures, etc.  It’s included with ZFS on Linux and the configuration file can be found in /etc/zfs/zed.d/zed.rc by default in Ubuntu.  Let’s look at my configuration:

I’ve removed all comments from the above output so that you can see what I have set.  You can see that it’s pretty simple overall.  You want to make sure all of the fields above are set and are not commented out so that Zed runs properly.  If you’re unsure of what to set, check your config file as the comments will still be in place explaining what each option does.
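
If you just need a starting point, a minimal set of zed.rc values along these lines gets mail flowing (the address is a placeholder, and older ZFS on Linux releases name the email variable ZED_EMAIL instead of ZED_EMAIL_ADDR):

    ZED_EMAIL_ADDR="storage-alerts@example.com"   # where notifications are sent
    ZED_EMAIL_PROG="mail"                         # provided by mailutils
    ZED_NOTIFY_INTERVAL_SECS=3600                 # minimum seconds between repeat notifications
    ZED_NOTIFY_VERBOSE=1                          # notify on non-error events too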

Assuming you have SMTP relay/postfix/etc. configured properly, you should be able to run the following command:
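
A scrub is an easy way to generate an event for Zed to report on (again assuming the pool name tank):

    sudo zpool scrub tank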

Because there’s no data on the pool, it should run very quickly (minutes), and you should receive the following email:

The reason we get this email is because we have ZED_NOTIFY_VERBOSE=1 set, which tells Zed to send a notification for every event it handles, even non-critical ones like a successful scrub.

At this point, we’re now ready to start creating datasets (or zvols if that’s your thing).  But, for now, that’s a wrap.  Stay tuned for more on this topic and let me know if you are lost or want to see any aspect of this article highlighted in more detail!  Thanks everyone!

Author: Jon


14 Comments

  1. ever since I updated the firmware and sector size on my Intel SSDPEDMW400G4 NVMe, I am getting a ton of errors in the syslog. Any thoughts?

    [ 555.598324] blk_update_request: I/O error, dev nvme0n1, sector 9120
    [ 585.615895] blk_update_request: I/O error, dev nvme0n1, sector 9128
    [ 615.633482] blk_update_request: I/O error, dev nvme0n1, sector 9136
    [ 645.651046] blk_update_request: I/O error, dev nvme0n1, sector 9144
    [ 675.668626] blk_update_request: I/O error, dev nvme0n1, sector 9152
    [ 705.686197] blk_update_request: I/O error, dev nvme0n1, sector 9160

    Post a Reply
    • Hi,

      We also got this problem. Did you find any solutions?
      We have 1 P3700 NVMe (PCI-Express) card (as cache). Reverting to the 512b sector size stops the complaining.

      But kind of interesting that we also have 4 Intel NVMe (P3600) drives where this problem doesn't exist.

      / Magnus

      Post a Reply
      • I did not find a fix; I had to revert to 512 myself. Please note I am still on ZoL 0.6.5.9.

        Post a Reply
  2. Great tip on the sector size NVMe drives. You made me realize that for the past 2 years my Intel NVMe has been set to 512. Now I just need a service window to fix. Thanks for some great posts.

    Post a Reply
    • No problem! Just make sure that if you're running ZFS you rebuild your LOG vdev after changing the sector size; you obviously can't change ashift on the fly.

      Post a Reply
    • Thanks – do you have any specific loads you want to see run?

      Post a Reply
      • Hi Jon,

        Yes, I do.
        We use to run this IO benchmark here:

        # fio --ioengine=libaio --name=ZBOOX_BENCHMARK --rw=write --direct=0 --numjobs=1 --bs=1024k --iodepth=64 --ramp_time=15 --size=128G --filename=zbooxbench
        # fio --ioengine=libaio --name=ZBOOX_BENCHMARK --rw=read --direct=0 --numjobs=1 --bs=1024k --iodepth=64 --ramp_time=15 --size=128G --filename=zbooxbench
        # fio --ioengine=libaio --name=ZBOOX_BENCHMARK --rw=randread --direct=0 --numjobs=1 --bs=1024k --iodepth=64 --ramp_time=15 --size=128G --filename=zbooxbench

        Direct=0 because fio can’t do direct IO to a ZFS pool, as far as I know.

        Best,

        Danilo

        Post a Reply
  3. I was just wondering if you can change the amount of memory a FreeNAS system can use like you did above. If so, do you think the FreeNAS OS needs some memory or should all of it be dedicated to cache?

    Post a Reply
    • FreeNAS needs memory, especially if using deduplication. It also needs memory to manage L2ARC devices and basic services. You must also consider the jails/containers you may run.

      Post a Reply
    • FreeNAS is tuned pretty well to assign the RAM available as necessary. That said, I am sure you can reassign if need be – there’s a whole “tuneables” menu in FreeNAS if I recall correctly.

      Post a Reply
  4. for your setup of 832TB you are using multiple raidz2 vdevs in a single pool, right? how many disks are you using in each vdev?

    Post a Reply
