This is Part 1 of my “832 TB – ZFS on Linux” series.
When looking to store, say, 800 terabytes of slow-tier/archival data, my first instinct is to leverage AWS S3 (and/or Glacier). It’s hard, if not impossible, to beat the $/GB and durability that Amazon is able to provide with its object storage offering. In fact, with the AWS Storage Gateway you can get “block storage” access to AWS for a decent price within your data center. However, sometimes AWS is not an option. This could be due to the application not knowing what to do with AWS API calls, or maybe there is some legal or regulatory reason that the data cannot sit there.
After ruling out cloud storage options, your next thought might be to add as much capacity as required, with overhead, to your existing storage infrastructure. Hundreds of terabytes, however, can result in $500k – $1M+ of expense depending on what system you’re using. In fact, a lot of the big players in the storage arena who support this kind of scale do so by licensing per terabyte (think Compellent, NetApp, EMC, etc.). So while the initial hardware purchase from EMC or NetApp may seem acceptable, the licensing fees will surely add up.
In this example, however, the requirement is literally “as much storage as possible, with some redundancy, for as little cost as possible…” Let’s do it!
Choosing the OS/filesystem
If you follow my blog you may know that I experiment with different storage technology. I have played around with different solutions such as Windows Storage Spaces, Nexenta, FreeNAS, Nutanix, unRAID, ZFS on several different operating systems, Btrfs, Gluster, Ceph, and others. Because of the budget for this project, the first thing that popped into my head was ZFS on Linux. ZFS stood out to me because of its redundancy and flexibility in storage pool configuration, its inherent (sane) support for large disk rebuilding, its price, and the performance it can offer. Today, you can run ZFS on Ubuntu 16.04 LTS with standard repositories and Canonical’s Ubuntu Advantage Advanced Support. That makes the decision easy. You could also build this on Solaris with the necessary licensing if you wanted to go that route, but it’d be more expensive. Unfortunately, Red Hat Enterprise Linux does not support ZFS (yet), so that option was not in the running, though I’d have gladly gone that route as well. ZFS on Linux (ZoL) will also run on CentOS, Fedora, etc.
After determining how I’d approach this solution from a software perspective, I needed to figure out the hardware component. The only requirements I have for this project are that the chassis needs to hold as many disks as possible, support SAS2 or better (for large disks), present the disks directly to the server (no hardware RAID), and be affordable. So, we’ve pretty much ruled out building a storage node using Dell, IBM, Cisco, HPE, etc. since the hardware will be at a premium, with maintenance plans to match. So what’s left? There are a couple of whitebox-type solutions out there, but Supermicro is obviously the “industry standard” for when you don’t want to pay a big name for a server/box. In fact, more often than not, Supermicro is building the physical boxes that the other manufacturers are selling, anyway.
I spent some time browsing the offerings from Supermicro and came across two solutions that would work for my situation. I ended up choosing between the Supermicro SSG-6048R-E1CR60L and the SSG-6048R-E1CR90L: the E1CR60L is a 60-bay 4U chassis while the E1CR90L is a 90-bay 4U chassis. The nice part is that no matter which platform you choose, Supermicro sells it only as a pre-configured machine, which means that their engineers are going to make sure that the hardware you choose to put in it is all from a known compatibility list. Basically, you cannot buy this chassis empty and jam your own parts in (boo, hiss, I know, but this is for your own good).
For this build I went with two of the SSG-6048R-E1CR60L machines so that I have one in a production environment and one in a second environment that can be used for replication purposes. The reason for choosing the 60-bay device over the 90-bay is that the 90-bay does not have any PCIe slots available. This means that if you outgrow the 90-bay chassis you’ll need to build another, whereas with the 60-bay unit I could add a PCIe HBA with external connections (such as a Broadcom SAS 9305-16e) and cable it up to an expansion chassis with another 60 disks, etc.
With the chassis selected there are only a few other configuration items I needed to decide on. These items include spinning disks (that make up the ZFS pool), PCIe NVMe disks (optional, for pool SLOG), solid state disks (for OS install), network interface(s), CPU, and RAM. I built each system with the following configuration:
- 2 x Intel E5-2623 v4 – 4C/8T 2.6 GHz CPUs
- 16 x 16GB DDR4-2400 1Rx4 ECC RDIMMs (256GB total)
- 2 x Micron 5100 MAX 2.5″ SATA 240GB 5DWPD disks (for OS)
- 2 x Intel DC P3700 800GB, NVMe PCIe3.0 SSDs (SLOG for ZFS pool)
- 52 x HGST He8 8TB SATA 7.2K disks
- 1 x integrated Broadcom 3008 SAS3 IT mode controller
- 1 x Supermicro SIOM 4-port 10 Gbps SFP+ Intel XL710 NIC
- 2 x Redundant Supermicro 2000W Power Supplies with PMBus
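As a quick sanity check on where the “832 TB” in the series title comes from, here is a small sketch of the raw capacity arithmetic for the parts list above. It only counts raw disk capacity before any redundancy, and the TB-to-TiB conversion is my own addition to show why tools that report binary units will display a smaller number.

```python
# Raw (pre-redundancy) capacity of the build described in the parts list.
DISKS_PER_NODE = 52   # HGST He8 drives per chassis
DISK_SIZE_TB = 8      # decimal terabytes (10**12 bytes) per drive
NODES = 2             # production node plus replication node

raw_tb_per_node = DISKS_PER_NODE * DISK_SIZE_TB
raw_tb_total = raw_tb_per_node * NODES

# Many tools report binary units (1 TiB = 2**40 bytes), so convert as well.
raw_tib_per_node = raw_tb_per_node * 10**12 / 2**40

print(f"Raw per node: {raw_tb_per_node} TB (about {raw_tib_per_node:.0f} TiB)")
print(f"Raw across both nodes: {raw_tb_total} TB")
```

Usable space will be lower once the vdev layout and parity are taken into account.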
I chose modest CPUs because I will not be doing anything with deduplication or similar. Compression in ZFS is almost free in terms of performance impact, so I’ll utilize that. Deduplication is too memory intensive, especially for the amount of storage I’ll be using, to be practical. So, in all, the CPUs will sit mostly idle and I didn’t see the benefit of using faster or higher core count models. You’ll notice the machine is equipped with 256GB of RAM, which may sound like a lot but is not excessive considering the box holds so much storage. If you’re familiar with ZFS you’ll know that most of this memory will serve as the ARC (Adaptive Replacement Cache), ZFS’s in-memory read cache, and the more the merrier.
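To put a rough number on why deduplication is impractical at this scale, here is a back-of-the-envelope sketch. It is an estimate under stated assumptions, not a measurement: the figure of roughly 320 bytes of RAM per unique block is a commonly quoted rule of thumb for the dedup table (DDT), 128 KiB is the default ZFS recordsize, and I use raw capacity as a worst case where the pool is full of unique blocks.

```python
# Rough worst-case RAM estimate for the ZFS dedup table (DDT).
# ASSUMPTIONS (not from the article): ~320 bytes of RAM per unique block is a
# rule of thumb, and every block is a full 128 KiB record (the default).
DDT_BYTES_PER_BLOCK = 320
RECORDSIZE_BYTES = 128 * 1024
RAW_POOL_BYTES = 52 * 8 * 10**12   # 52 x 8 TB disks, ignoring redundancy
INSTALLED_RAM_BYTES = 256 * 2**30  # 256 GB of RAM in each node

def ddt_ram_bytes(pool_bytes, recordsize=RECORDSIZE_BYTES,
                  entry_bytes=DDT_BYTES_PER_BLOCK):
    """Estimate DDT memory if the pool were full of unique blocks."""
    blocks = pool_bytes // recordsize
    return blocks * entry_bytes

estimate = ddt_ram_bytes(RAW_POOL_BYTES)
print(f"Estimated DDT size: {estimate / 2**30:.0f} GiB")
print(f"Multiple of installed RAM: {estimate / INSTALLED_RAM_BYTES:.1f}x")
```

Under these assumptions the table alone would be several times larger than the installed RAM, and smaller block sizes make it worse.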
At first I considered partitioning the Intel DC P3700s and using them for both L2ARC (cache) and SLOG for the ZFS pool. However, every block cached on an L2ARC device needs a header kept in RAM, so adding L2ARC means dipping further into my ARC capacity, and this would more than likely have a negative effect considering the workload I’ll be dealing with.
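The ARC cost of an L2ARC device can be sketched with simple arithmetic: each block held in L2ARC needs a small header in RAM. The per-block header size below is a placeholder assumption on my part; it varies by ZFS release, and the block sizes are only illustrative.

```python
# Estimate how much ARC (RAM) is consumed by headers for blocks held in L2ARC.
# ASSUMPTION: header_bytes is a placeholder; the real per-block header size
# depends on the ZFS release, so treat the result as an order of magnitude.
def l2arc_header_bytes(l2arc_bytes, avg_block_bytes, header_bytes):
    """RAM used by L2ARC headers if the cache device is completely full."""
    blocks = l2arc_bytes // avg_block_bytes
    return blocks * header_bytes

L2ARC_BYTES = 800 * 10**9   # one 800 GB P3700 used entirely as cache
HEADER_BYTES = 70           # hypothetical header size per cached block

for block_kib in (8, 32, 128):
    ram = l2arc_header_bytes(L2ARC_BYTES, block_kib * 1024, HEADER_BYTES)
    print(f"{block_kib:>3} KiB blocks: about {ram / 2**30:.2f} GiB of ARC")
```

The smaller the average cached block, the more RAM the headers take away from the ARC.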
Speaking of which – you’re probably wondering what this machine is going to do! I’ll be presenting large NFS datastores out of this Supermicro box to a large VMware cluster. The VMs that will use this storage are going to have faster boot/application volumes on tiered NetApp storage and will use data volumes attached to this storage node for capacity. Even though this will be the “slow” storage pool, it’s still going to perform pretty well considering it’ll have the PCIe NVMe SSD SLOG device, good ARC capacity, and decent spindle layout. More on all of this later, though!
Racking it up
Alright, everyone’s favorite part: putting it together! This part seemed fun at first, but then the reality of having to rack two 4U chassis with 52 disks each set in. It’s not actually that bad, though; the Supermicro hardware is very nice for the price. I was pleasantly surprised with the build quality of all of the Supermicro components. They even include cable arms with these devices.
Shown above is the Supermicro SSG-6048R-E1CR60L. You’ll notice that it has the typical Supermicro coloring and overall look. One nice feature is the small color LCD screen on the front that displays statistics and informational messages about the hardware inside.
Because this 4U chassis is designed to hold 60 disks, the only real way to accomplish this is by making it a top-loading unit. As a result, the unit needs to slide out of the rack in full (or about 90% of the way out) so that the top can be opened via the hinge as shown above.
The screen on the front will show you the health and status of all 60 drives, the DIMMs, the CPUs, various temperatures, the fans, the power supplies, etc. It’s very nice. They could have just put a status LED on the outside and required you to check the IPMI interface for faults, but they went a step further.
The rear of the machine doesn’t have very much going on. In the image above you can see that there are some hot-swap fans, two redundant hot-swap 2,000W power supplies, the 4-port 10 Gbps SIOM NIC, conventional VGA/USB, a pair of 2.5″ hot-swap bays, and three half-height PCIe slots. There is also an IPMI management interface for out-of-band management of the device that Supermicro includes without any additional licensing located just above the USB ports.
The 2.5″ bays are typical Supermicro trays which hold, in this case, the two 240GB Micron 5100 MAX SSDs that I am running the operating system from. Because there is no RAID controller anywhere in this machine, I’ll be using mdadm to mirror the OS install.
The underside of the lid has a nice diagram of the disk numbering and the hinge on the lid is very stiff so that there is no risk of it slamming down or falling over, etc. It has a very high quality feel to it which, to be honest, I didn’t expect with such a large piece of hinged sheet metal.
Looking into the chassis you can see that there are no integral trays. Every disk must be held in a 3.5″ tray and dropped into the chassis. There are guides that keep the disks lined up with the connectors below and a latch mechanism that pushes the drive down the final ~1/4″ into the slot. One nice part about the trays is that they are hinged/tool-less so replacing/installing disks should be a breeze.
Shown above are the drives as they arrived from Supermicro. They were nice enough to install each of the 104 disks into the drive tray so that when they arrived all I had to do was remove the anti-static bag and install the disks.
Unfortunately, because I am using a “spin your own” ZFS solution, I want to make sure that any failed disk is absolutely correctly identified when being replaced. In order to accomplish this I chose to label each drive tray with the serial number of the drive installed as shown above. The drive bay LEDs should identify the disk as well…but it’s better to be certain.
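If you want to generate the tray labels rather than read serial numbers off each drive, something like the following could help. This is a minimal sketch, assuming that udev exposes SATA disks under /dev/disk/by-id with names of the form `ata-<model>_<serial>`; the sample names and serial numbers are made up for illustration.

```python
import re

# ASSUMPTION: whole-disk links look like "ata-<model>_<serial>" and partition
# links carry a "-partN" suffix. Other link types (wwn-*, scsi-*) are ignored.
BY_ID_PATTERN = re.compile(r"^ata-(?P<model>.+)_(?P<serial>[^_]+)$")

def serials_from_by_id(names):
    """Return {serial: model} for whole-disk ata-* names."""
    result = {}
    for name in names:
        if "-part" in name:   # skip partition links
            continue
        match = BY_ID_PATTERN.match(name)
        if match:
            result[match.group("serial")] = match.group("model")
    return result

# Hypothetical sample; on a live system use os.listdir("/dev/disk/by-id").
sample = [
    "ata-HGST_HUH728080ALE600_EXAMPLE1",
    "ata-HGST_HUH728080ALE600_EXAMPLE1-part1",
    "wwn-0x5000cca000000000",
    "ata-HGST_HUH728080ALE600_EXAMPLE2",
]
labels = serials_from_by_id(sample)
for serial, model in sorted(labels.items()):
    print(f"{serial}  {model}")
```

Building the pool from these by-id names also means the serial number shows up in pool status output, which makes matching a failed disk to its labeled tray easier.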
As mentioned, the Supermicro E1CR60L has 3 PCIe slots inside. They’re accessible via a relatively small rectangular opening that is revealed once the lid is up. Inside you’ll find the two Intel DC P3700 800GB NVMe PCIe 3.0 SSDs that I have spec’d. Each box will have two P3700’s in a mirror for SLOG to assist in sync writes against the pool. This should improve the performance significantly while also assuring integrity of all writes by maintaining the sync feature. Since we’re using NFS to connect to the storage it is very much recommended that sync be enabled on the datasets.
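For reference, attaching a mirrored SLOG and setting the sync property might look roughly like the sketch below. It only builds and prints the commands rather than running them. The pool name, dataset name, and device paths are hypothetical placeholders, and `sync=standard` (the ZFS default, which honors sync requests from clients such as NFS) is my assumption about what “sync enabled” means here.

```python
import shlex

def slog_mirror_commands(pool, dataset, log_devices):
    """Build argv lists to add a mirrored log vdev and set sync on a dataset."""
    if len(log_devices) != 2:
        raise ValueError("expected exactly two devices for a mirrored SLOG")
    add_log = ["zpool", "add", pool, "log", "mirror", *log_devices]
    set_sync = ["zfs", "set", "sync=standard", dataset]
    return [add_log, set_sync]

# Hypothetical names; substitute your own pool, dataset, and NVMe devices.
commands = slog_mirror_commands(
    pool="tank",
    dataset="tank/nfs",
    log_devices=["/dev/nvme0n1", "/dev/nvme1n1"],
)
for argv in commands:
    print(shlex.join(argv))   # review before running anything as root
```

Printing the commands first is deliberate: adding a vdev to a pool is not something you want to get wrong on a production box.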
Finally, all racked up! The two units are in two disparate data centers partly so that I can replicate data from one storage node to the other, but also so as to provide storage out to the two environments. One unit sits conveniently beneath a Dell M1000E full of ESXi hosts. The other unit is lurking beneath a NetApp FAS8020 cluster. There’s a certain irony about that, considering the whole Sun (now Oracle) vs. NetApp lawsuit over ZFS and all… just thought that’d be extra fun!
Update: I am adding this section in because a LOT of people have been messaging me asking what the actual cost is. I purchased these units through a vendor we like to use and they hooked us up, so I won’t be able to share my specific pricing. However, if you search the internet for “SSG-6048R-E1CR60L” you can find it on one of Supermicro’s online resellers www.ThinkMate.com. I did not purchase through them at the end of the day, but the pricing is pretty accurate. If you build the systems out on there you’ll find that they come in around $35,000 (USD) each. This should give you guys an idea of what these cost.
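Using that ballpark reseller figure, the cost per raw terabyte works out as in the short sketch below. It uses the approximate list price mentioned above, not my actual pricing, and raw capacity before any redundancy.

```python
# Approximate cost per raw terabyte, using the ballpark reseller price.
NODE_COST_USD = 35_000       # approximate configured price per node
RAW_TB_PER_NODE = 52 * 8     # 52 x 8 TB disks, before redundancy

usd_per_raw_tb = NODE_COST_USD / RAW_TB_PER_NODE
print(f"About ${usd_per_raw_tb:.0f} per raw TB")
```

The cost per usable terabyte will be higher once parity and spare capacity are factored in.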
Wrapping this post up
Now, I know what you’re all thinking, “There’s no HA in this!” and you’re correct. Each location has a single node, with a single controller. However, such is the cost of having to provide this initial amount of storage. I will discuss the vdev configuration in the next post, but understand that the disks themselves are in a configuration that is in itself pretty conservative as far as usable space and redundancy are concerned. That, combined with replication (and backups), will result in the required availability. If you are using this as a guide to build out your large, affordable ZFS storage and you require Active/Passive nodes, then you’ll need to adjust accordingly.
That’s it for now! I will go further into the OS, networking, ZFS, and storage provisioning soon! Thanks for reading and as always subscribe and feel free to comment or ask questions!