832 TB – ZFS on Linux – Project “Cheap and Deep”: Part 1

When looking to store, say, 800 terabytes of slow-tier/archival data, my first instinct is to leverage AWS S3 (and/or Glacier).  It’s hard – if not impossible – to beat the $/GB and durability that Amazon provides with its object storage offering.  In fact, with the AWS Storage Gateway you can get “block storage” access to AWS within your data center for a decent price.  However, sometimes AWS is not an option.  This could be because the application doesn’t know what to do with AWS API calls, or because there is some legal or regulatory reason the data cannot sit there.  After ruling out cloud storage options, your next thought might be to add as much capacity as required, with overhead, to your existing storage infrastructure.  Hundreds of terabytes, however, can result in $500k – $1M+ of expense depending on what system you’re using.  In fact, a lot of the big players in the storage arena who support this kind of scale do so by licensing per terabyte (think Compellent, NetApp, EMC, etc.).  So while the initial hardware purchase from EMC or NetApp may seem acceptable, the licensing fees will surely add up.  In this case, however, the requirement is literally “as much storage as possible, with some redundancy, for as little cost as possible…”  Let’s do it!

Choosing the OS/filesystem

If you follow my blog you may know that I experiment with different storage technology.  I have played around with solutions such as Windows Storage Spaces, Nexenta, FreeNAS, Nutanix, unRAID, ZFS on several different operating systems, Btrfs, Gluster, Ceph, and others.  Given the budget for this project, the first thing that popped into my head was ZFS on Linux.  ZFS stood out to me because of its redundancy and flexibility in storage pool configuration, its inherent (sane) support for rebuilding large disks, its price, and the performance it can offer.  Today, you can run ZFS on Ubuntu 16.04.2 LTS with the standard repositories and Canonical’s Ubuntu Advantage Advanced Support.  That makes the decision easy.  You could also build this on Solaris with the necessary licensing if you wanted to go that route, but it’d be more expensive.  Unfortunately, Red Hat Enterprise Linux does not support ZFS (yet), so that option was not in the running, though I’d have gladly gone that route as well.  ZFS on Linux (ZoL) will also run on CentOS, Fedora, etc.
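
For anyone following along, getting ZoL running on 16.04 is only a couple of commands. A minimal sketch (package names as shipped in Ubuntu's standard repositories):

```shell
# Install the ZFS userland tools; the kernel module ships with
# Ubuntu 16.04's kernel packages.
sudo apt update
sudo apt install -y zfsutils-linux

# Confirm the kernel module is available and check the packaged version
modinfo zfs | head -n 3
dpkg -s zfsutils-linux | grep '^Version'
```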

Hardware selection

After determining how I’d approach this solution from a software perspective, I needed to figure out the hardware component.  The only requirements I have for this project are that it must hold as many disks as possible, support SAS2 or better (for large disks), present the disks directly to the server (no hardware RAID), and be affordable.  So, we’ve pretty much ruled out building a storage node using Dell, IBM, Cisco, HPE, etc., since the hardware will be at a premium, with maintenance plans to match.  So what’s left?  There are a couple of whitebox-type solutions out there, but Supermicro is obviously the “industry standard” for when you don’t want to pay a big name for a server/box.  In fact, more often than not, Supermicro is building the physical boxes that the other manufacturers are selling, anyway.

I spent some time browsing the offerings from Supermicro and came across two solutions that would work for my situation.  I ended up deciding between the Supermicro SSG-6048R-E1CR60L and the SSG-6048R-E1CR90L – the E1CR60L is a 60-bay 4U chassis while the E1CR90L is a 90-bay 4U chassis.  The nice part is that no matter which platform you choose, Supermicro sells these only as pre-configured machines – this means their engineers are going to make sure that the hardware you choose to put in is all from a known compatibility list.  Basically, you cannot buy this chassis empty and jam your own parts in (boo, hiss, I know, but this is for your own good).

For this build I went with two of the SSG-6048R-E1CR60L machines so that I have one in a production environment and one in a second environment that can be used for replication purposes.  The reason for choosing the 60-bay device over the 90-bay is that the 90-bay has no PCIe slots available.  This means that if you outgrow the 90-bay chassis you’ll need to build another, whereas with the 60-bay unit I can add a PCIe HBA with external connectors (such as the Broadcom SAS 9305-16e) and cable it up to an expansion chassis with another 60 disks, etc.

With the chassis selected there are only a few other configuration items I needed to decide on.  These items include spinning disks (that make up the ZFS pool), PCIe NVMe disks (optional, for pool SLOG), solid state disks (for OS install), network interface(s), CPU, and RAM.  I built each system with the following configuration:

  • 2 x Intel E5-2623 v4 – 4C/8T 2.6 GHz CPUs
  • 16 x 16GB DDR4-2400 1Rx4 ECC RDIMMs (256GB total)
  • 2 x Micron 5100 MAX 2.5″ SATA 240GB 5DWPD disks (for OS)
  • 2 x Intel DC P3700 800GB NVMe PCIe 3.0 SSDs (SLOG for ZFS pool)
  • 52 x HGST He8 8TB SATA 7.2K disks
  • 1 x integrated Broadcom 3008 SAS3 IT mode controller
  • 1 x Supermicro SIOM 4-port 10 Gbps SFP+ Intel XL710 NIC
  • 2 x Redundant Supermicro 2000W Power Supplies with PMBus
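
For reference, the raw math behind the title works out from the parts list above (simple arithmetic; no vdev or filesystem overhead accounted for):

```shell
# 52 data disks x 8 TB per node, two nodes total
per_node_tb=$((52 * 8))
total_tb=$((2 * per_node_tb))
echo "per node: ${per_node_tb} TB raw"    # per node: 416 TB raw
echo "both nodes: ${total_tb} TB raw"     # both nodes: 832 TB raw
```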

The reason for the modest CPUs is that I will not be doing anything with deduplication or similar.  Compression in ZFS is almost free in terms of performance impact, so I’ll utilize that.  Deduplication is too memory intensive to be practical, especially for the amount of storage I’ll be using.  So, in all, the CPUs will sit mostly idle and I didn’t see the benefit of using faster or higher core count models.  You’ll notice the machine is equipped with 256GB of RAM, which may sound like a lot but is not too intense considering how much storage the box holds.  If you’re familiar with ZFS you’ll know that this will comprise what is referred to as the ARC (or Adaptive Replacement Cache) – the more the merrier.
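
As a sanity check on why dedup is off the table: a commonly cited rule of thumb (an approximation, not a figure from this post) is roughly 320 bytes of dedup-table RAM per unique block. With the default 128 KiB recordsize:

```shell
# Blocks in one TiB of unique data at 128 KiB recordsize
blocks_per_tib=$(( (1 << 40) / (128 * 1024) ))        # 8388608 blocks
# DDT RAM needed per TiB at ~320 bytes per block
ddt_mib_per_tib=$(( blocks_per_tib * 320 / (1024 * 1024) ))
echo "~${ddt_mib_per_tib} MiB of DDT RAM per TiB"     # ~2560 MiB (2.5 GiB)
# Across hundreds of TiB that's on the order of a terabyte of RAM --
# far beyond the 256GB in this box, hence dedup stays off.
```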

At first I considered partitioning the Intel DC P3700’s and using them for both L2ARC (cache) and SLOG for the ZFS pool.  However, using a dedicated L2ARC device means that I’ll be further dipping into my ARC capacity and this would more than likely have a negative effect considering the workload I’ll be dealing with.

Speaking of which – you’re probably wondering what this machine is going to do!  I’ll be presenting large NFS datastores out of this Supermicro box to a large VMware cluster.  The VMs that will use this storage are going to have faster boot/application volumes on tiered NetApp storage and will use data volumes attached to this storage node for capacity.  Even though this will be the “slow” storage pool, it’s still going to perform pretty well considering it’ll have the PCIe NVMe SSD SLOG device, good ARC capacity, and decent spindle layout.  More on all of this later, though!
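
As a sketch of how a datastore gets presented to the cluster, the workflow looks something like this (pool, dataset, and subnet names here are made up for illustration, not my actual configuration):

```shell
# Create a dataset per datastore with cheap lz4 compression enabled
zfs create -o compression=lz4 tank/vmware-ds1

# Export it over NFS to the ESXi host subnet; ZoL's sharenfs property
# hands these options to the kernel NFS server
zfs set sharenfs="rw=@10.0.0.0/24,no_root_squash" tank/vmware-ds1

# ESXi then mounts the export (e.g. /tank/vmware-ds1) as an NFS datastore
```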

Racking it up

Alright – everyone’s favorite part – putting it together!  This part seemed fun at first, but then the reality of having to rack two 4U chassis, each with 52 disks, sets in.  It’s not actually that bad – the Supermicro hardware is very nice for the price.  I was pleasantly surprised with the build quality of all of the Supermicro components.  They even include cable arms with these devices.

Shown above is the Supermicro SSG-6048R-E1CR60L.  You’ll notice that it has the typical Supermicro coloring and overall look.  One nice feature is the small color LCD screen on the front that displays statistics and informational messages about the hardware inside.

Because this 4U chassis is designed to hold 60 disks, the only real way to accomplish this is by making it a top-loading unit.  As a result, the unit needs to slide out of the rack in full (or about 90% of the way out) so that the top can be opened via the hinge as shown above.

The screen on the front will show you the health and status of all 60 drives, the DIMMs, the CPUs, various temperatures, the fans, the power supplies, etc.  It’s very nice – they could have just put a status LED on the outside and required you to check the IPMI interface for faults, but they went a step further.

The rear of the machine doesn’t have very much going on.  In the image above you can see that there are some hot-swap fans, two redundant hot-swap 2,000W power supplies, the 4-port 10 Gbps SIOM NIC, conventional VGA/USB, a pair of 2.5″ hot-swap bays, and three half-height PCIe slots.  There is also an IPMI interface, located just above the USB ports, for out-of-band management of the device; Supermicro includes it without any additional licensing.

The 2.5″ bays are typical Supermicro trays which hold, in this case, the two 240GB Micron 5100 MAX SSDs that I am running the operating system from.  Because there is no RAID controller anywhere in this machine, I’ll be using mdadm to mirror the OS install.
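
A minimal mdadm sketch of that OS mirror (device and partition names are placeholders – in practice the Ubuntu installer can set this up during install):

```shell
# Create a two-disk RAID1 array from matching partitions on each SSD
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# Put a filesystem on the mirror
mkfs.ext4 /dev/md0

# Persist the array definition so it assembles at boot
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u
```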

The underside of the lid has a nice diagram of the disk numbering and the hinge on the lid is very stiff so that there is no risk of it slamming down or falling over, etc.  It has a very high quality feel to it which, to be honest, I didn’t expect with such a large piece of hinged sheet metal.

Looking into the chassis you can see that there are no integral trays.  Every disk must be held in a 3.5″ tray and dropped into the chassis.  There are guides that keep the disks lined up with the connectors below and a latch mechanism that pushes the drive down the final ~1/4″ into the slot.  One nice part about the trays is that they are hinged/tool-less so replacing/installing disks should be a breeze.

Shown above are the drives as they arrived from Supermicro.  They were nice enough to install each of the 104 disks into the drive tray so that when they arrived all I had to do was remove the anti-static bag and install the disks.

Because I am using a “spin your own” ZFS solution, I want to make sure that any failed disk is absolutely correctly identified before being replaced.  To accomplish this I chose to label each drive tray with the serial number of the installed drive, as shown above.  The drive bay LEDs should identify the disk as well… but it’s better to be certain.
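
One way to pull the serial numbers for labeling, and to double-check a suspect disk before pulling it (standard tools; device names are placeholders):

```shell
# List every disk with its serial, size, and model in one shot
lsblk -d -o NAME,SERIAL,SIZE,MODEL

# Or query a single disk via smartmontools before yanking it
smartctl -i /dev/sda | grep -i serial
```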

 

As mentioned, the Supermicro E1CR60L has 3 PCIe slots inside.  They’re accessible via a relatively small rectangular opening that is revealed once the lid is up.  Inside you’ll find the two Intel DC P3700 800GB NVMe PCIe 3.0 SSDs that I have spec’d.  Each box will have two P3700’s in a mirror for SLOG to assist in sync writes against the pool.  This should improve the performance significantly while also assuring integrity of all writes by maintaining the sync feature.  Since we’re using NFS to connect to the storage it is very much recommended that sync be enabled on the datasets.
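
The SLOG attachment and sync settings would look something like this (pool and device names are placeholders, not my actual configuration – in practice you'd use stable /dev/disk/by-id paths rather than bare device names):

```shell
# Attach the two P3700s as a mirrored log vdev on an existing pool
zpool add tank log mirror nvme0n1 nvme1n1

# 'standard' honors the NFS clients' sync requests, which the
# mirrored SLOG then absorbs at NVMe latency
zfs set sync=standard tank

# The log mirror should now appear under a 'logs' section
zpool status tank
```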

Finally, all racked up!  The two units are in two disparate data centers, partly so that I can replicate data from one storage node to the other, but also to provide storage to both environments.  One unit sits conveniently beneath a Dell M1000E full of ESXi hosts.  The other is lurking beneath a NetApp FAS8020 cluster.  There’s a certain irony about that, considering the whole Sun/Oracle vs. NetApp ZFS lawsuit and all… just thought that’d be extra fun!
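
Replication between the two nodes can be sketched with ZFS's native send/receive (snapshot, dataset, and host names here are hypothetical examples, not my actual setup):

```shell
# Take a recursive snapshot of today's state on the primary
zfs snapshot -r tank/vmware@$(date +%Y%m%d)

# Send only the delta since the previous snapshot to the other node;
# -R preserves the dataset tree and properties, -F rolls the target
# back to the common snapshot before receiving
zfs send -R -i tank/vmware@previous tank/vmware@$(date +%Y%m%d) | \
    ssh backup-node zfs recv -Fdu tank
```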

The Cost

Update:  I am adding this section in because a LOT of people have been messaging me asking what the actual cost is.  I purchased these units through a vendor we like to use and they hooked us up, so I won’t be able to share my specific pricing.  However, if you search the internet for “SSG-6048R-E1CR60L” you can find it at one of Supermicro’s online resellers, www.ThinkMate.com.  I did not end up purchasing through them, but their pricing is pretty accurate.  If you build the systems out there you’ll find that they come in around $35,000 (USD) each, which should give you an idea of what these cost.

Wrapping this post up

Now, I know what you’re all thinking – “There’s no HA in this!” – and you’re correct.  Each location has a single node with a single controller.  However, such is the cost of providing this initial amount of storage.  I will discuss the vdev configuration in the next post, but understand that the disk layout itself is pretty conservative as far as usable space and redundancy are concerned.  That, combined with replication (and backups), will result in the required availability.  If you are using this as a guide to build out your own large, affordable ZFS storage and you require active/passive nodes, you’ll need to adjust accordingly.

That’s it for now!  I will go further into the OS, networking, ZFS, and storage provisioning soon!  Thanks for reading and as always subscribe and feel free to comment or ask questions!

Author: Jon


49 Comments

  1. I have a question about network share of ZFS. I already have a 3TB ZFS vol and would like to share it on my LAN. You’ve chosen to share via NFS. Because I have Linux, FreeBSD, Apple and Windows on the network, there would be less client maintenance using Samba export instead.

    Have you (or anyone else reading this article) evaluated the IO throughput for NFS vs Samba? Let’s assume a 10Gb LAN; my concern is mostly read-only, with some write-back.

    • With my recent experience with Samba, it’s quite difficult to figure out why it’s slow, when it’s slow.
      I had a VM I used as a Samba share for my workstation and it was extremely slow. After trying to fix it for a few days I decided to install Windows 2012R2 instead, which fixed the speed problem right away.
      Based on this I think I’d recommend doing the sharing from a real Windows box – at least if you already have a licence for it (I guess it could run virtual on the Linux box hosting the ZFS).

      • I wouldn’t recommend Windows because Samba share in Linux was slow. You probably needed to tweak some of the many settings. People have samba working very well.

  2. Actually, why not use a cluster solution instead of such a central solution?

    • Cost – it’d mean two head unit nodes w/ HBAs and all

  3. Very nice!

    Minor typos:

    “can result in $500k – $1M+ of expensive” ->
    “can result in $500k – $1M+ of expense”

    “because of it’s redundancy” ->
    “because of its redundancy”

    (there are a couple more it’s vs. its problems).

  4. Isn’t this way too expensive? I mean, that’s only doable for medium to large companies, right? Individuals and smaller companies can hardly justify the investment.

  5. How are your HDD temperatures looking in that chassis? Are you getting any kind of “hot spots” with it?

  6. Hi,

    great article, very informative. Can’t wait for the next in the series. One question: how would you handle the need to reboot this whole machine in case there is a kernel upgrade that requires a reboot? I assume you can’t fail over to the secondary device because it is in another data center and the latency would be too high? Seems like with a lot of VMs using this storage it would not be feasible to shut them all down for storage maintenance.

    Thanks again for a great article!

    • There are solutions for rebootless kernel upgrades, which work just fine, so this one is not a problem.

  7. Have you ever heard of Syneto? It runs a customized SunOS and relies on ZFS. Check it out at http://www.syneto.eu
    BTW I am not a sales representative, just a happy end user 🙂

  8. Thank you for interesting post, can’t wait for the next part!
    Some questions (hope to find answers in the next part):
    – did you consider any distributed FS, for example CephFS?
    – what’s the network speed between locations (how long will replication in case of failure take)?
    – how you will replicate the data? rsync?

  9. Very interesting! A couple of questions:

    1. How do you backup this thing?

    2. Aren’t you afraid of having a single filesystem this size? Filesystem structures become dangerous single point of failure, and they can easily be corrupted due to a faulty RAM or a kernel bug.

  10. What kind of RAID backing did you end up going with, and is any aspect of that setup on the VMware Hardware Compatibility Guide for NFS?

    A while back I put together a comparable setup for a fairly large ESXi cluster, about 80 hardware hosts and 2000+ VMs. OpenSolaris (pre-Oracle days) fibre channel target, had access to the internal team at Sun who was integrating all of the fibre channel target stuff with ZFS before Oracle came in and fired everyone. Qlogic HBAs on Dell MD5000s with flash drives for ARC and zlog, but still had to dump RAID5 for RAID10 on the ZFS backing as the parity hit for RAID5 writes would kill the cluster. The price still ended up being a tenth of what it would have cost for NetApp NFS, even with RAID10 on everything.

    ZFS is really an amazing file system, there is nothing out there like it especially if you are using snapshots. I was able to tear down and rebuild 2K+ VMs in minutes using IPMI interfacing for bare metal power control over the ESXi hosts via the baseboard management controller (hard power operations, changing boot targets on multiple host groups at once, fiber channel boot from SAN with snapshotted boot LUNs for 80 hypervisors and the ability to choose between multiple versions of ESXi etc).

  11. Could you also add some performance benchmarks, like max IOPS and throughput, etc.?

  12. 800GB is massively oversized for a SLOG, isn’t it? I mean, the SLOG only has to hold sync write data until it’s written to the pool’s data disks as part of the next transaction group… hard to see that being more than a handful of GB.

    • Yes it is – it’s about 100x too large lol. The problem is that there is an SSD shortage going on – between the 800GB NVMe and SSDs with 5 DWPD, neither was necessary but both were available through vendors. The nice part is that I can technically partition it and use it for L2ARC or another NVMe pool if needed.

      • Further to this, there’s actually a heap of benefit in “short-stroking” these kinds of SSDs. If you underprovision them, the NAND controller can use the rest of the NAND as spare area, which has the effect of keeping your I/O latency as low as the drive is able to provide. If you’re using them for log devices, this is almost certainly what you want to do.

        On another note, using optane drives for logdevs is also a huge win, the I/O latency on those things is stellar for the price.

  13. Probably a stupid question, but I’m curious, what’s the goal ?

    • I can’t say specifically, but this is to support an environment that needs to store a large amount of data. It doesn’t need to be the highest IOPS or throughput, but just needs the capacity.

  14. Why not Solaris? It’s fairly cheap (like $1500 for a license) and their version of ZFS is vastly improved over OpenZFS.

    People shit on Oracle, but if you are building a system out like this it’s actually a pretty damn good fit.

    • “their version of ZFS is vastly improved over OpenZFS”

      No, not really. OpenZFS has added a lot more than Oracle’s version has, since they diverged.

      • Solaris is quicker for resilver, a *lot* quicker.

        We have about 50 petabytes of storage systems at my work that use ZFS, moved off FreeBSD and Illumos to Oracle Solaris or Oracle ZS branded NAS, and honestly it’s completely worth the cost.

        In our experience we got much more performance from Oracle Solaris than we ever did from FreeBSD or Illumos.

        The licensing for Solaris is not expensive, probably a similar cost to Red Hat or SuSE.

        I get the impression people have dealt with Oracle for databases and got shafted, I’ve only ever dealt with them buying their ZS storage, servers and Solaris and never really found their pricing or licensing to be excessive, at least when you compare them with crooks like EMC or Cisco.

    • You can also use one of the Illumos derivatives – OmniOS, Nexenta, SmartOS, OpenIndiana – to get a rock-solid, battle-tested ZFS implementation for free.

      • You’re forgetting that some of these are not free. Nexenta is free only up to 18TB and licensed after that. OmniOS is on its way out. OpenIndiana is just community Illumos, so no support there, etc.

        • OmniOS is not on its way out, it’s just undergoing a move to community management at omniosce.org

    • But… Oracle. LOL. If you have to deal with their licensing you know what I mean.

      • There is significant uncertainty facing the entire systems group at Oracle. Jon Fowler (ex-Sun, head of Systems) has left. Rumor mill has layoffs happening in the Solaris and ZFS teams. I’m not sure I’d put my money in that bank right now.

  15. I hope you installed the newest version of MDADM. 16.04 comes with a version that does not boot if one of the mirror drives fails.

    This may have changed, but 16.10 was out for a while (with the newest MD build) and 16.04 was still behaving that way.

    Also, ZFS on Linux 0.7 was released not that long ago (10 days ago?). Looks really good.

    • All packages are up to date from the official repos. Yes, ZFS 0.7.0 was released but hasn’t made its way to the repositories yet.

  16. FYI, the TrueNAS X series is limited to 36 bays. The Z series is expandable.
    The Z35 maxes out at 8x ES60 external 60-bay drive bay modules, for a total of up to 4.8PB (if filled with 10TB drives).

  17. You mention “cheap” in the title, but there’s no mention of your actual spend. Are you at liberty to share the prices you paid for your build?

    • I don’t feel comfortable sharing the actual prices I paid through my vendor, but I updated the post with a link to one of the Supermicro online resellers which should give you a fair idea.

  18. Also, why the 5 DWPD OS SSDs? If this is just going to be a storage server, isn’t that a bit excessive?

    • I should mention it in the article – you might be aware there is a flash shortage going on. The drives I originally requested had a 6+ week backorder. So, my vendor was able to swap in the 5 DWPD models… no complaints here!

  19. Great article. Looking forward to seeing how you get to 832TB with your vdev and compression settings.

    • Well it’s 832 TB raw (104 x 8TB disks). With vdev configuration and ZFS overhead it obviously comes down a fair bit.

      • Oh, I see. I wasn’t counting the replicated system. Still interested to hear how you setup your vdevs, etc. in the next article.

        • I see – yes, this is not “all” DR – there is some stuff that will run from either side exclusively. I hope to get the next part up soon.

      • I would hope that you didn’t put all the disks in one giant pool. Years ago I added a 48-disk JBOD to a Solaris 10 box and made one giant RAID-Z2 + two hot-spare pool, and it worked OK until a disk died. It took over a week to resilver the failed disk. These were 1TB disks on an 8-core AMD server with 32GB of memory and a dedicated SSD L2ARC cache.

        http://open-zfs.org/wiki/Performance_tuning#Pool_Geometry_2

        What I should have done is something like this:

        https://everycity.co.uk/alasdair/2008/11/sun-fire-x4500-thumper-recommended-zfs-zpool-layout/

        • Don’t worry – there is no 50-disk raidz2 here. But, it is one pool. I will be documenting my configuration decisions in the next post!

        • Vdev sizing is ZFS 101 – it’s common knowledge that you should stick to smaller vdevs; 50 disks in a vdev is terrible planning.

          Not just from a performance perspective.

          Need to upgrade your pool? You’d need to buy another 50 disks to add more storage to that pool.

  20. Why was ZoL better for you than just loading FreeNAS on these boxes?

    • I went the ZoL route because I can get enterprise support. FreeNAS would work fine, mostly (I worry about some drivers, but not a huge deal), but I can’t run that in production. I looked at TrueNAS through iXsystems, but their largest node is 36 bays, no 10 Gbps support, etc. It’s surely more expensive as well.

      • In the past you could buy software support only – it was a gold account. Surely that is still an option?

      • Is there no enterprise support for FreeBSD (the base used for FreeNAS)? I’d trust that more than Linux for big ZFS storage box like this.

        • No, there isn’t – there are consultants who will support it, but I cannot tell a prospective customer that we have “a guy who knows FreeBSD” on retainer. I can say that I have support through Canonical, though.

