As you’ve probably guessed, this post is going to cover installing the OS in the case of my ZFS on Linux build(s). This is heavily subjective. There are right and wrong ways of doing things, but there is also a ton of “yeah, that’ll work” in the middle. There are Linux admins out there who, first things first, will change the terminal colors, unify UIDs and GIDs across multiple systems, create aliases for their favorite complex commands, and set up symlinks all over the place because “they changed it years ago and all the paths are stupid” – that’s not me. If you have a preference or layout that you prefer, please let me know, as I am always looking to see what other people are doing and am by no means above changing my ways.
In case you missed the first part of this semi-series, here is a link:
Choosing your OS
This topic received a lot of criticism in my first blog post. A lot of people asked, “Why not [insert distribution or OS]?” People questioned why I didn’t use OpenBSD, Nexenta, Illumos, etc. The reason I went with Ubuntu is because I am most familiar with it and it supports ZoL with a kernel module. I have run RHEL and CentOS a bunch as well, but my roots (ha) are with Ubuntu/Debian (OK, that’s technically a lie… I started out on “Linux-Mandrake” and SUSE circa 1998). Also, RHEL won’t support ZFS, and CentOS has community-based support (yes, I know you can get a contract with a third-party support firm). Ubuntu, however, has Canonical support, and since ZFS is in the default repository it won’t be difficult to reach out to support, etc. It’s true that support does not mean they’ll fix your issue, but it’s nice to have a team to reach out to should the need arise. That said, pick whatever distribution you prefer, as ZoL behaves (mostly) the same on any distribution.
As previously mentioned, my decision to use Ubuntu really comes down to Canonical supporting the base install along with what’s in their repositories. Further, when it comes to Linux distributions, I generally regard Ubuntu as the “comfortably progressive” (my term) distribution. Ubuntu has always offered the more current versions of different packages, even in its LTS builds and in its standard repositories. I’ve been an Ubuntu server user since 7.04 (Feisty Fawn, circa 2007) and haven’t been bitten by any “oops, we shouldn’t have released that” pushes to repositories (that I know of!). I am sure there are instances out there where Ubuntu jumped the shark, but I have no personal experience with that. As mentioned, ZoL can be installed on Arch, Debian, Fedora, Gentoo, openSUSE, RHEL, and CentOS… pick what you’re familiar with or what you can get support on!
I am not going to walk you through a mostly vanilla install of Ubuntu 16.04 LTS (Xenial Xerus), as I am sure you can find any number of tutorials on that. Instead, I am only going to highlight the “interesting” or less-common steps of the install that are relevant in this scenario. So, with that said, it all starts like this:
As you can see I am installing via the IPMI remote console from Supermicro. I was pleasantly surprised with the reliability and ease of using the Supermicro remote management stuff. It uses Java, of course, but other than that it was great to deal with (on both systems). Once you pick “Install Ubuntu Server” above there’s no turning back! Actually, there is, but that’s not as dramatic…
Above is the first real decision to be made – fortunately, I’ve already been thinking about this network configuration for a bit. Because the Supermicro SIOM (Super I/O Module… not sure how I feel about that name) I am using is the Intel XL710, which has 4 x 10Gbps SFP+ ports, I have to decide how exactly I want to use them. Most people would probably put two or four of the interfaces in a team for both load-balanced I/O and redundancy. My situation is that I don’t actually need to have redundancy on the “management” or “SSH interface(s)” and, quite frankly, it’d be a shame to waste two 10 Gbps NICs for SSH access. So, what then?
Because I am going to be providing NFS exports out of this server to my cloud vSphere platform over a dedicated NFS network (VLAN), I wanted to put as many NICs as possible in that VLAN. What I decided to do was use enp4s0f0 (the first port on the SIOM) for SSH access and enp4s0f1, enp4s0f2, and enp4s0f3 for NFS connectivity. I also decided to set up an active bond (802.3ad LACP port-channel) for the three NFS interfaces. The reason I could sacrifice redundancy on the SSH interface is that in addition to enp4s0f0 I also have the IPMI interface with direct console access, and since nothing is mounting through that network there’s no loss (except for monitoring) should it go down. Note that I do not believe you can bind NFS to a specific interface, but you can limit what networks are allowed to connect to your NFS export, and so that’s how I am keeping NFS traffic on the bond.
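For the curious, that network restriction lives in /etc/exports. Here’s a sketch of what such an entry might look like – the 10.0.10.0/24 subnet and the /tank/nfs path are placeholders for illustration, not my actual values:

```
# /etc/exports -- allow only hosts on the NFS VLAN to mount the export.
# 10.0.10.0/24 and /tank/nfs are hypothetical; substitute your own
# NFS network and dataset path.
/tank/nfs  10.0.10.0/24(rw,sync,no_subtree_check)
```

Anything arriving on the SSH network simply isn’t in the allowed range, so mounts can only succeed over the bonded NFS VLAN.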
The end result is 30 Gbps of aggregated bandwidth to the storage within the box. As for redundancy, enp4s0f0 and enp4s0f2 are cabled to Switch1 in a stack, and enp4s0f1 and enp4s0f3 are cabled to Switch2. This way, should a switch go down, we still maintain at least 10 Gbps of throughput to the box. The IPMI interface is cabled to a completely different switch stack since it’s a 1 Gbps connection and not SFP+.
I know, I know. You’re out there throwing your hands up! “Why not just trunk the ‘SSH network‘ and ‘NFS network‘ on all interfaces, put them all in a LACP bond, and do tagging from within Ubuntu?!?” The reason is – complexity. I want this to be stupid simple. While tagging VLANs within Ubuntu is not a foreign concept, it requires installing the vlan package along with doing a modprobe 8021q. Further, we then need to use vconfig to create a VLAN interface and echo the module into /etc/modules, etc., etc. Then, we’re still going to have to configure ethX.vlan, and only then can we start building out the bond and hope the networking side is correct, and on we go down Over-complication Avenue. It’s really not a lack of understanding or comfort (I assure you!), as we practically did the configuration in this paragraph! Instead, I want it simple. If this box experiences an issue I want to be able to stand up a new installation as quickly as possible. I want someone else managing it to be able to troubleshoot it. I want other Linux administrators or storage engineers to come in behind me and recognize what’s going on.
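To make the “extra layer” concrete, here’s roughly what the tagged approach would add to /etc/network/interfaces on top of the bond itself. The VLAN IDs (10 and 20) and addresses are hypothetical, purely for illustration:

```
# Hypothetical tagged setup -- each VLAN rides on top of the bond,
# which is one more layer to configure and troubleshoot.
auto bond0.10
iface bond0.10 inet static
        address 192.168.10.5      # placeholder SSH-network address
        netmask 255.255.255.0
        vlan-raw-device bond0

auto bond0.20
iface bond0.20 inet static
        address 192.168.20.5      # placeholder NFS-network address
        netmask 255.255.255.0
        vlan-raw-device bond0
```

Workable, certainly, but every one of those stanzas is something the next admin has to reverse-engineer at 3 AM.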
After you’ve deliberated your network configuration and come to terms with your decisions, you can setup your partitions:
You can see above I am choosing Manual partitioning. Ordinarily, especially with VM configuration, I’d choose Guided – use entire disk or Guided – use entire disk and set up LVM. LVM is not super interesting to me in this situation because, if you recall, we only have two 2.5″ bays for SSDs to install the OS on, and there will be no expansion of the root disks anytime soon. LVM supports snapshots, and that’s cool, but we’re going to be extremely cautious with patching and such, so we’re not too concerned there either. Further, I don’t want the Guided partitioning because I really want to prevent the disk from filling by partitioning /tmp and /var and providing more-than-adequate space there. Finally, choosing Manual will let us configure mdadm software RAID mirroring of the OS disks.
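The installer handles this through its menus, but for reference, the equivalent mirror creation from a shell looks something like this. The device and partition names here are placeholders – yours will depend on your controller layout:

```shell
# Mirror matching partitions from the two OS SSDs into an md device.
# /dev/sda1 and /dev/sdb1 are hypothetical; substitute your SSD partitions.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# Watch the initial resync progress.
cat /proc/mdstat
```

One md device per mount point in the layout below, and the installer takes care of writing /etc/mdadm/mdadm.conf for you.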
Why not put root on ZFS? Again, keeping it simple. You can absolutely configure your Ubuntu 16.04 LTS installation to run from root on ZFS, but there’s obviously additional configuration to account for. Of course, the ZFS on Linux documentation does cover doing this, so you can follow that if you wish!
Above, you’ll see one of the somewhat confusing aspects of having 50+ disks in one system presented straight through to the OS. You’ll notice that our disk labels go all the way up to sdaz. The reason is that after /dev/sdz comes /dev/sdaa! You can also see that our Micron 5100 MAX SSDs show up as /dev/sdb, which is nice (this is because they’re on a different controller). This will make it a little bit easier when keeping track of devices. Next we’ll set up partitions and mdadm.
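The lettering scheme is easy to sanity-check from a bash shell – after sdz the kernel simply rolls over to two-letter suffixes:

```shell
# Print the device-name sequence the kernel uses: sda..sdz, then sdaa..sdaz.
# (Brace expansion requires bash.)
for suffix in {a..z} a{a..z}; do
  printf 'sd%s ' "$suffix"
done
printf '\n'
```

So a 50-plus-disk chassis comfortably fits within sda through sdaz, which is exactly what the installer screen shows.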
In the image above you’ll find my final partition configuration. Again, I am not showing you how to set up mdadm step-by-step because you can find that all over the internet. Instead, I am just showing you the decisions I’ve made for partitioning. Again, this is like a pair of shoes – I may prefer something you find ridiculous and vice versa. As usual, I am always open to suggestions, but this has worked for me in the past.
The partition layout I decided on is pretty legible there but just in case:
- / (root mount) is formatted ext4 and is 99.9GB
- /home is formatted ext4 and is 32GB
- /tmp is formatted ext4 and is 50GB
- /var is formatted ext4 and is 50GB
- swap is swap, and we’re not going to be using any swap (I hope) so we’ve made it 8GB
Back in the day, it was recommended that your swap partition be the size of your RAM x 2 – clearly that made sense when a “fancy” desktop or workstation had 2GB of RAM. This machine has 256GB of RAM – if we start swapping we’re doing it wrong and we’re sure as hell not giving the swap 512GB of space!
That’s that for the partitioning part. The biggest concern I had is that I don’t want some service to run away with logs in /var and put the root device in read-only some day. The same applies to /tmp. We don’t want updates or software to download over the years and build up, filling the root device. Again, we could give / much more space and put /var and /tmp down around 10GB apiece if we really wanted. But this is where I ended up. 100GB should be way more than enough for years and years of service, and 50GB each for /var and /tmp should not fill, either.
If you don’t limit /var you should – I just fixed a VMware vCenter Appliance and two other Linux VMs that were stuck with a read-only file system due to /var filling up because a cron job stopped pruning logs.
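On top of the separate partition, a logrotate policy is cheap insurance against exactly that failure mode. A minimal sketch – the path and retention numbers are placeholders, not from my actual config:

```
# /etc/logrotate.d/example -- rotate weekly, keep four compressed copies.
# /var/log/example/*.log is a hypothetical path; point it at the logs
# your runaway service writes.
/var/log/example/*.log {
        weekly
        rotate 4
        compress
        missingok
        notifempty
}
```

Belt and suspenders: the partition caps the blast radius, and the rotation keeps you from ever hitting the cap.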
Once you’re done partitioning, write the changes to disk. The only remaining option is to install OpenSSH Server on the Software Selection screen.
After the OS installation is complete I always run sudo apt-get update && sudo apt-get upgrade. This ensures that all packages already installed are current. Once that is done, we just need to create bond0 out of the three interfaces as mentioned earlier. You can refer to this link for creating active bonds in Ubuntu. My /etc/network/interfaces configuration file looks like this:
~$ sudo cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto enp4s0f0
iface enp4s0f0 inet static
        address [redacted]
        netmask [redacted]
        gateway [redacted]
        # dns-* options are implemented by the resolvconf package, if installed
        dns-nameservers [redacted]
        dns-search [redacted]

# The bond slave interfaces
auto enp4s0f1
iface enp4s0f1 inet manual
        bond-master bond0

auto enp4s0f2
iface enp4s0f2 inet manual
        bond-master bond0

auto enp4s0f3
iface enp4s0f3 inet manual
        bond-master bond0

# Create bond with LACP mode
auto bond0
iface bond0 inet static
        address [redacted]
        netmask [redacted]
        bond-mode 4
        bond-slaves enp4s0f1 enp4s0f2 enp4s0f3
Anywhere I put [redacted] is either a hostname, domain name, or IP address – so you can figure that out on your own. Because I am using LACP (identified by bond-mode 4 above), each of the three enp4s0fx interfaces gets a bond-master bond0 line, making it a slave of the bond; bond0 itself is the master, and the IP address for the group is configured on bond0 directly. Pretty straightforward, right? Again, you can do this with VLAN tagging inside, but that’s another layer on top of bonding.
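After an ifup (or a reboot), it’s worth confirming the bond actually negotiated LACP before trusting it with storage traffic. The kernel exposes its view of the bond directly:

```shell
# Confirm the bond came up in IEEE 802.3ad mode and that all three
# slaves joined the aggregate.
cat /proc/net/bonding/bond0

# Check the aggregate link speed as the kernel reports it.
ethtool bond0
```

If the switch side of the port-channel is misconfigured, this is where it shows up first – slaves stuck "down" or a bond mode that isn’t 802.3ad.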
Remember, I did this twice because I have two nodes to deal with. Everything went perfectly smoothly during the install of both systems.
Next up I’ll discuss the actual ZFS installation, configuration, and pool creation and configuration. We’ve only discussed things that specifically pertain to the OS installation in this post. As already stated there are still a lot of decisions to be made. If you are the one both designing and supporting the solution then you can be as creative as you’d like. If, however, you are designing the solution with the intent of other people supporting it, you might want to pull back on the creativity handle so that intuition can help people along. Either way, let me know what you think and if you would do something differently I’d like to hear!
Thanks all and stay tuned for Part 3!