DIY SAN/NAS – fast, reliable, shared storage, with FreeNAS and switchless 10 Gbps! (Part 2)

Wow does time fly!  I posted the initial build of this DIY SAN/NAS solution over a year ago and I sincerely apologize for not following up with the details on my solution sooner!  I am providing a link to the original article below.

DIY SAN/NAS – quest for fast, reliable, shared storage with a twist of ZFS! (Part 1)

I have two excuses for such a delay.  First is the typical “work and life has been busy” but the other is a bit more genuine – I didn’t want to publish this article too quickly if the solution did not work reliably.  Granted, 3 – 6 months would have probably established reliability, but refer back to the first excuse.

The hardware

As you may recall from Part 1, I chose a Dell R510 II 12-Bay server for my storage node.  At the time, E5-2670-based machines were still too expensive and acquiring a 24-bay Supermicro SC846 (like I have in my other lab) would have been much more expensive than it is currently.  I wanted something with remote access, like iDRAC, because I’d be experimenting with different storage configurations.  The specifications of my Dell R510 12-bay storage node are as follows:

  • Dell PowerEdge R510 II 2U Server
  • 2 x Intel(R) Xeon(R) CPU E5620 @ 2.40GHz – 4 cores with 8 threads per socket
  • 2 x Dell 80 Plus Gold 750W Switching Power Supply FN1VT
  • 64GB of DDR3L ECC Memory
  • Dell K869T J675T Remote Access Card iDRAC6 Enterprise
  • Chelsio T420-CR Dual Port 10GbE PCI-E Unified Wire Adapter
  • Dell Intel PRO/1000 VT Quad-Port PCI-e Gigabit Card YT674
  • Dell Perc H200 047MCV PCIe SAS/SATA 6GB/s Storage Controller
  • 12 x Western Digital Re 1TB 7200 RPM 3.5″ WD1003FBYZ Enterprise Drive
  • 2 x Samsung 850 Evo 250GB SSDs
  • 2 x SanDisk Cruzer Fit CZ33 32GB USB 2.0 Low-Profile Flash Drive

Whew, I think that about does it.  I’ll explain later why I chose certain pieces of hardware in the list above.

Picking the OS/filesystem

With that out of the way, my pursuit of fast, reliable, shared storage has landed me in a somewhat unexpected position.  I imagined using either Nexenta, NetApp ONTAP Select, heck, even ZFS on Linux for my storage solution.  I even flirted with the idea of presenting the storage via iSCSI out of Windows Server 2012 R2.  For about 20 seconds I even considered using StarWinds on top of Windows.

I tested Nexenta and wasn’t impressed – especially since the community edition was limited in raw capacity and offered no plugins or additional features.  That, and the performance was very, very average.  It was further complicated by the fact that I could not deduce whether or not it would accommodate my somewhat nontraditional 10 Gbps network configuration.

NetApp ONTAP Select was still brand new.  In fact, I don’t know that it was even available then.  I love NetApp and its similarity to ZFS in the way it implements its WAFL (Write Anywhere File Layout) system, snapshots, etc., and I know that NetApp Clustered ONTAP 9 supports pretty funky/unusual networking configurations.  NetApp ONTAP Select was just too new and unavailable to really lean on.

The Windows 2012 R2 solutions were eliminated because I really didn’t want to have the overhead of a full Windows operating system running my storage, and I know that without special considerations I wouldn’t get very good performance.  I’d also be limited to block storage (iSCSI) if dealing with Windows.

All of these considerations left me wanting the features of NetApp ONTAP with the convenience of something pre-built, but it had to be flexible.  While ZFS on Linux was pretty stable, there were (and still are) some limitations.  Where does all this rambling end up?  FreeNAS.  I know I am stating that in a sort of negative fashion – the reason is that I had, for so long, disregarded FreeNAS as a viable solution because I only saw it as a sort of DIY Synology.

The reality is that FreeNAS is built on FreeBSD, so it’s secure and reliable.  It utilizes ZFS, which provides not only redundancy, but also snapshot capability, performance (using ARC and L2ARC cache tiers), and the ability to serve storage via NFS, iSCSI, CIFS, etc.  Further, because FreeBSD is a full-fledged enterprise operating system, there is no real limit on the network configuration underneath.

 

10 Gbps of convenience

I picked the Chelsio T420-CR Dual Port 10GbE PCI-E Unified Wire adapter for my 10 Gbps connectivity because it was supported in FreeBSD, but also because it is actually used in NetApp systems, so I know that it’s an enterprise, reliable part.  The T420-CR has two SFP+ ports that can take transceivers with fiber cable, but since I am doing this on the cheap, I used 2 x 2M DAC (Direct-Attached Cables) from Dell that I got new on eBay for like $15/each.  In each of my ESXi hosts, I installed a Mellanox ConnectX-T2, which is also a 10 Gbps adapter with SFP+ ports.  I went with the Mellanox cards in the ESXi hosts because I know they’re supported by VMware.  I believe the Chelsio T420-CR is as well, but the Mellanox ConnectX-T2 cards, being single-port, are extremely cheap.

The beauty of this FreeNAS/ESXi setup is that I have no 10 Gbps switch.  While this would be an issue ordinarily, I only need 10 Gbps connectivity between the ESXi hosts and the storage node, so there’s really no need for any more than 3 ports in the whole configuration.  Essentially, the FreeNAS node (R510 II) will be my “switch”.  By putting the two ports on the Chelsio T420-CR inside the FreeNAS node in bridge mode, port 1 will forward all frames to port 2, and vice versa.  So, essentially, I’ll have one IP address on the Chelsio T420-CR that is both listening for packets addressed to it and forwarding packets that are not addressed to it.  What does this mean?  Instant 10 Gbps storage connectivity as well as inter-ESXi 10 Gbps connectivity.  Take a look below to understand better:

In the diagram above you’ll see that I am using two ports on each Intel Pro/1000 NIC for iSCSI connectivity.  This is more for the sake of compatibility and flexibility, allowing me to test VAAI and block storage if need be.  The iSCSI configuration here allows for a total of 2 Gbps of throughput.  In practice, I am using NFS for storage since it’s thin-provisioned and allows for compression, etc.  You can see that each iSCSI vmkernel on either ESXi host is configured for a different subnet (everything in the diagram is masked with a /24).  This provides multi-pathing between the FreeNAS node and the ESXi hosts.  One thing I discovered in this configuration is that you cannot assign IPs to two interfaces within the same subnet in FreeNAS/FreeBSD due to the way the TCP stack is designed.  It’s actually improper from a standards perspective to even allow multiple NICs on the same host to have IPs in the same subnet… I learned a lot here!  But, this is all boring stuff.  Read further.

The more important, convenient aspect of this setup is that not only do I have 10 Gbps connectivity from each ESXi host to the FreeNAS box for storage, but because of how the bridge acts, I have 10 Gbps connectivity between hosts as well!  Granted, for this to work, you need the FreeNAS node to be up and available.  If I reboot my FreeNAS node (which would be an issue anyway since it’s not HA and all my VMs run from it), I will get “Network Connectivity Lost” alarms within vCenter because the link goes down since there is no switch between the hosts.  However, by utilizing the same vmkernel for vMotion as I already do for NFS connectivity, I gain vMotion over 10 Gbps.  This performs extremely well and is so simple because there is no switch involved!

Further, jumbo frames between the ESXi hosts and FreeNAS are fully supported so long as everything is configured properly end-to-end.  It’s pretty much the most convenient setup you can accomplish without introducing expensive 10 Gbps switches.  There’s not very much CPU overhead on the FreeNAS server during vMotion events since the NIC is really just forwarding anyway, and even if there were, I have found that the E5620 2.4 GHz CPUs are total overkill for this storage device as-is.  Obviously this switchless 10 Gbps solution will not work as-is if you have a third ESXi host.  I do think you’d be able to add an additional Chelsio T420-CR Dual Port card and bridge all four SFP+ ports, allowing for a total of four ESXi hosts in the setup w/ single SFP+ NICs.
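For reference, the bridge itself is just standard FreeBSD if_bridge configuration.  A minimal sketch from the shell (the interface names cxl0/cxl1 are assumptions – they are what the Chelsio cxgbe driver typically assigns, and yours may differ; in FreeNAS you would persist this via Tunables rather than typing it by hand):

```shell
# Create the bridge and add both 10 GbE ports as members
ifconfig bridge0 create
ifconfig bridge0 addm cxl0 addm cxl1 up

# Assign the storage IP to the bridge itself, not to the member ports
ifconfig bridge0 inet 10.0.0.10/24
```

With both ports as members, frames arriving on one port that are not addressed to the bridge IP are forwarded out the other, which is what makes the FreeNAS node act as the “switch” between the two ESXi hosts.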

You can see that when I vMotion VMs from one host to the other (which has to move the memory of the VM along with the NVRAM file, etc.) I am getting good throughput on the FreeNAS node:

I’ve even created a video to demonstrate what vMotion/Maintenance Mode might look like for you with this network/storage configuration:

(If you find the video above useful, be sure to like and subscribe…)

 

The storage layout

I hate divulging this portion of any storage configuration to people because it’s extremely subjective.  There is always going to be a “better” and “worse” configuration for a specific workload.  For instance, if you look at ZFS storage pool configurations with 12 x 1TB drives, you’ll likely find people recommending a single RAIDZ2 vdev in a pool.  If you find someone with 12 x 4TB drives, people will recommend multiple RAIDZ2 vdevs or even a RAIDZ3 vdev.  The problem is that while these configurations are conservative in terms of data preservation, they also offer the slowest write performance.  Naturally, you could do many mirror vdevs and create, essentially, a RAID10, but your usable space will suffer.

I decided that I needed moderate redundancy and good throughput.  As a result, I ended up with 4 RAIDZ1 vdevs.  The result is basically similar to a “RAID50” in the non-ZFS world.  It looks like this:

Pretty simple overall.  Each vdev has a usable capacity of two disks because of parity.  So, with 12 x 1TB disks, I have a usable capacity of 8TB (before formatting).  This layout should give me better write performance than would a RAIDZ2 because of write penalties, etc.  It also allows me to lose up to 4 disks so long as they’re not part of the same vdev.
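In FreeNAS this layout would be built through the Volume Manager, but expressed as a shell sketch (the device names are hypothetical) it would look roughly like:

```shell
# 4 x 3-disk RAIDZ1 vdevs striped into one pool - a "RAID50"-style layout
zpool create StoragePool \
  raidz1 da1 da2 da3 \
  raidz1 da4 da5 da6 \
  raidz1 da7 da8 da9 \
  raidz1 da10 da11 da12
```

Each 3-disk RAIDZ1 vdev contributes two disks of usable capacity, and writes stripe across all four vdevs, which is where the throughput comes from.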

Because of how ZFS works, performance is oftentimes much better than the spindle layout would suggest.  This is because of what ZFS refers to as ARC and L2ARC.  ARC (adaptive replacement cache) utilizes system RAM as really, really fast buffering/caching space for the most frequently read data; in my case that’s 64GB (less some overhead).  Because the R510 II has only 8 DIMM slots, I could only add 8 x 8GB DIMMs in order to remain affordable.  While 64GB of ARC isn’t bad, more would be better.  That’s where L2ARC comes in.  I am sort of breaking the rules by using cheap SSDs for this.  You really want MLC SSDs (like the Intel S3710) for L2ARC.  However, the Samsung 850 Evo 250GB drives I am using are better than nothing.  When frequently read data doesn’t fit in ARC, it gets put in L2ARC.  You don’t need mirrored L2ARC drives because the data still resides on spinning disk should the L2ARC SSD fail.

There’s another concept in ZFS which can significantly increase performance, and that’s the ZIL (or ZFS Intent Log) drive – loosely speaking, a write-caching drive.  Since ZFS uses HBAs instead of hardware RAID controllers, there is no controller write-cache.  Conventional hardware RAID configurations use ~1-2 GB of DDR3 memory on the controller itself to buffer writes before committing them to spinning disk.  If you want to significantly improve the write performance of your pool and you’re not using SSDs as primary storage, then add a mirrored pair of ZIL drives.  You want the ZIL mirrored because this is the write intent log.  If uncommitted data is lost from the ZIL, those writes are gone forever, which can lead to corruption.
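If you manage the pool from the shell, attaching SSDs as cache and log devices is a one-liner each.  A sketch with hypothetical device names (in FreeNAS you would do this through the Volume Manager):

```shell
# Unmirrored L2ARC (cache) - safe, since a cache failure only loses cached copies
zpool add StoragePool cache da13

# Mirrored ZIL/SLOG (log) - mirrored because uncommitted writes live here
zpool add StoragePool log mirror da14 da15
```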

Sizing L2ARC and ZIL

Since ZFS allows you to assign SSDs for L2ARC and ZIL, respectively, you can pick any size for each!  With hardware RAID controllers you can spend a real pretty penny upgrading to the model that has 2GB of write-cache instead of 512MB, etc.  But how much space do you need?  This is kind of easy, maybe.

For ZIL, consider the maximum write speed of your network and SSD.  ZFS issues a new transaction group (and updates the pool) every 5 seconds (or sooner).  As one transaction group is written to ZIL, the other is likely being written to disk.  So, in the worst case, you have two full transaction groups you need to store in ZIL while the pool finishes.  Simple math says that if 5 seconds is the longest interval between transaction groups, then we need the capacity of a full write stream for 10 seconds.  So, if you find an SSD that can write at 250 MB/s and your network is capable of > 250 MB/s, then you need a total of 2.5GB of ZIL capacity.  Obviously a 2.5GB SSD doesn’t exist today, so almost anything will work.
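The arithmetic above can be sanity-checked with a tiny function; the 5-second transaction group interval and the two-groups-in-flight worst case are the assumptions from the text:

```python
def zil_size_mb(write_speed_mb_s, txg_interval_s=5, groups_in_flight=2):
    """Worst-case ZIL capacity: a full write stream for every in-flight txg."""
    return write_speed_mb_s * txg_interval_s * groups_in_flight

# A 250 MB/s write stream needs 2500 MB (~2.5 GB) of ZIL
print(zil_size_mb(250))
```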

L2ARC is a little more nuanced.  L2ARC, as mentioned earlier, is your read-cache.  There are some clever tools you can use to measure this.  FreeNAS provides an RRD graph you can reference: just log into your WebUI, choose Reporting, then ZFS:

As you can see, my L2ARC is ~267GB in total and my ARC is about 56GB in total.  Ideally, most of your hits would be served out of ARC before ever spilling into L2ARC, but because I am using my FreeNAS solution exclusively for VM storage, there are not a TON of requests for read-cached pieces of data.
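If you prefer the shell over the WebUI, the same numbers can be pulled from the ZFS sysctls on FreeBSD-based systems like FreeNAS (sizes are reported in bytes):

```shell
# ARC size and hit/miss counters
sysctl kstat.zfs.misc.arcstats.size
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses

# L2ARC size and hit/miss counters
sysctl kstat.zfs.misc.arcstats.l2_size
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses
```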

For your L2ARC sizing, I’d recommend about 5 – 10% of your total usable capacity as L2ARC.  Only 11.5% of my IO hits my L2ARC so I am fine.  In a situation where you are storing either more data or have frequent requests for similar files over and over (think about an HR or Finance share on a corporate CIFS share around open-enrollment or tax season) you would likely need closer to 10-20% if you want to keep from hitting the spindles.
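The rule of thumb above as a quick sketch (the 5 – 10% fraction is the guideline from the text, not a hard rule – bump it toward 10-20% for read-heavy file shares):

```python
def l2arc_size_gb(usable_capacity_tb, fraction=0.05):
    """Rule-of-thumb L2ARC size: a fraction of usable pool capacity, in GB."""
    return usable_capacity_tb * 1024 * fraction

# 8 TB usable at 5% suggests ~410 GB of L2ARC
print(l2arc_size_gb(8))
```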

Keeping an eye on things

Naturally you’ll want to set up email/SMTP support in FreeNAS, NTP, etc.  One thing I found a little lacking from the start was reporting.  Because this is now important data, I want to make sure I know when a spindle fails.  I am not local to this storage node, so I will not see the amber lights, etc. on the tray.

With the help of the FreeNAS forums, a co-worker, and some patience, I came up with a modified version of a SMART reporting script that emails out the results.

I setup a cron job that runs the following script:

[root@krcsan1] /mnt/StoragePool/scripts# cat esmart.sh
#!/bin/bash
#
# Place this in /mnt/pool/scripts
# Call: sh esmart.sh
(
echo "To: me@myaddress.com"
echo "Subject: SMART Drive Results for all drives"
echo "Content-Type: text/html"
echo "MIME-Version: 1.0"
echo ""
echo "<html>"
) > /var/cover.html

c=0
for i in /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5 /dev/da6 /dev/da7 /dev/da8 /dev/da9 /dev/da10 /dev/da11 /dev/da12 /dev/da13; do

# Query SMART once per drive and parse the needed fields from the output
smart=$(smartctl -i -H -A -n standby -l error "$i")
results=$(echo "$smart" | grep 'test result')
badsectors=$(echo "$smart" | awk '/Reallocated_Sector_Ct/ {print $10}')
temperature=$(echo "$smart" | awk '/Temperature_Cel/ {print $10}')
((c=c+1))
#echo $c
if [[ $results == *"PASSED"* ]]
then
status[$c]="Passed"
color="green"
else
status[$c]="Failed"
color="red"
fi

echo "$i status is ${status[$c]} with $badsectors bad sectors. Disk temperature is $temperature."
echo "<div style='color: $color'> $i status is ${status[$c]} with $badsectors bad sectors. Disk temperature is $temperature.</div>" >> /var/cover.html
done

echo "</html>" >> /var/cover.html

sendmail -t < /var/cover.html

exit 0
[root@krcsan1] /mnt/StoragePool/scripts#

The result is a very, very basic HTML email that looks like this:

Clearly you can see /dev/da11 doesn’t show a sector count – that’s because da11 is an SSD.  You can beautify the output if you want, but this works for me.  In fact, this very SMART script saved me from data loss when it let me know about a failed disk in my storage node which I made a blog post about some time ago.

One more useful tip is to use the following command to identify which slot has which serial number disk:

[root@krcsan1] ~# sas2ircu 0 DISPLAY | grep -vwE "(SAS|State|Manufacturer|Model|Firmware|Size|GUID|Protocol|Drive)"

Once you have correlated all of your slots to the serial numbers, go ahead and edit your disk info by opening the FreeNAS WebUI, going to Storage, then View Disks.  Select each disk, click Edit, and populate the “description” field with the serial number:

This will make identifying the slot number much easier when you get a report that says /dev/da6 has failed – you can look up your table, see that da6 was slot #10, and be sure to pull the correct disk.  I’ve even made a very, very quick video of replacing a failed disk on this node:

I will be following up, yet again, with some actual storage performance metrics.  Don’t worry, it won’t be another year or more for that post!  Look for it soon – most likely later this week.  Thanks for reading!

 

Author: Jon


14 Comments

  1. I use your script on FreeNAS-11.0 but get wrong output in red:

    SMART Drive Results for all drives
    /dev/da2 status is Failed with bad sectors. Disk temperature is .
    /dev/da3 status is Failed with bad sectors. Disk temperature is .
    /dev/da4 status is Failed with bad sectors. Disk temperature is .
    /dev/da5 status is Failed with bad sectors. Disk temperature is .

    Also no temperature ?

    # zpool status
    pool: freenas-boot
    state: ONLINE
    scan: resilvered 744M in 0h0m with 0 errors on Thu Jun 15 12:28:59 2017
    config:

    NAME          STATE  READ WRITE CKSUM
    freenas-boot  ONLINE    0     0     0
      mirror-0    ONLINE    0     0     0
        ada0p2    ONLINE    0     0     0
        ada1p2    ONLINE    0     0     0

    errors: No known data errors

    pool: tank
    state: ONLINE
    scan: none requested
    config:

    NAME                                            STATE  READ WRITE CKSUM
    tank                                            ONLINE    0     0     0
      raidz2-0                                      ONLINE    0     0     0
        gptid/b4c0c8b5-51b2-11e7-bbfe-d05099c01bd9  ONLINE    0     0     0
        gptid/b585409c-51b2-11e7-bbfe-d05099c01bd9  ONLINE    0     0     0
        gptid/b6463ab5-51b2-11e7-bbfe-d05099c01bd9  ONLINE    0     0     0
        gptid/b70952c9-51b2-11e7-bbfe-d05099c01bd9  ONLINE    0     0     0

    errors: No known data errors

    Any idea what is wrong?

  2. How did you connect the Samsung 850 EVO SSD drives in the R510? Is there a separate drive cage for them inside the server itself considering all 12 SAS bays are already occupied by the Western Digital drives?

    And what is the purpose of the Sandisk Cruzer USB flash drives that you have listed in the hardware section at the beginning of the article?

    Thanks!

  3. Would this work with iSCSI as well?

    • My apologies. Specifically, will the bridge work for iSCSI as well? I’ve been reading some of your posts on the Freenas forum.

      • Or is this idea flawed due to the bridge interface not showing up in the Freenas GUI and therefore unable to configure iSCSI with it?

        • It will not work for iSCSI. Remember, iSCSI will only let you share block storage out through an IP address that the UI can see (when configuring the iSCSI Portal or discovery IP). You can probably hack the config from the back-end but as you may have seen in my FreeNAS threads you need to be careful because FreeNAS might step on this configuration at some point.

          • Thank you so much. I figured as much so I migrated everything to NFS.

  4. So I am looking into doing something similar, and I was wondering if you would share what re-seller you used to get your R520?

  5. WOW, you taught me a lot!

    AND I set up a ZFS HomeLab server too, with 4 x 4TB in RAIDZ1.  Everything is going fine, just one thing…

    When you rsync or copy a big file (~10GB) within the zpool, the speed is much slower than write-only or read-only – about 80MB/s.

    In my case, with 4 x 4TB in RAIDZ1, write speed can reach 300MB/s…

    I notice that, when copying or rsyncing a big file within the same zpool, “iostat -dmx 1” shows the disks are busy:
    https://www.xargs.cn/lib/exe/fetch.php/linux:benchmark:iostat-zfs-raidz-copy.png

    Other filesystems, like XFS, handle the same operation differently: iostat shows they first read all the data and then write it to disk, so the performance is good.

    • I think you are on to something, however you can’t get around facts and probability. RAIDZ1 is not much safer than RAID 5. With the size of today’s drives, the probability of error during rebuilds increases. With drives over 2TB available cheaply, it would be better to invest in enough drives to run RAIDZ2 at minimum, with a preference for RAIDZ3. I was a victim of a RAIDZ1 rebuild failure and always run ECC RAM. http://serverfault.com/questions/369331/is-a-large-raid-z-array-just-as-bad-as-a-large-raid-5-array

      • In all reality RAIDZ1 is basically RAID5. And yes, probability being what it is, it’s not the most robust method of storing data. However, you need to weigh your requirements. If it’s super important/critical data, then yep, totally agree, RAIDZ3 would be great especially with large drives. However, RAIDZ3 is a bit of a waste on a pool like mine. It only took about 35 – 40 minutes to rebuild my one vdev when a 1TB disk failed. If the vdev consisted of a RAIDZ1 with 8 4TB disks yes you’re absolutely right that probability would not be in my favor and I agree RAIDZ2 would be much, much better. NetApp recently introduced RAID-TEC which is (basically) their RAIDZ3 and I believe it’s mandatory on any aggregate involving 6TB+ disks.

    • Do remember that the RAM on the storage node is going to receive the data first. If you have 16GB of RAM in your ZFS server, it’s possible that a 10GB file could almost entirely fit in memory. Granted, we know that ZFS will prepare transactions to disk, but still, the memory can really offset actual spindle performance. Further, you’ll be able to write a large, single file a lot faster than many small files, as I am sure you know. Thanks for following! Good luck on your ZFS build!

