Tea & Tech (🍵)

ZFS File Server: A History and Parts List

September 02, 2020

I'd love to tell you when I built my first file server, but I honestly can't remember. Here's what I do remember:

  • It ran OpenSolaris, which was the only distro supporting ZFS at the time
  • It lived under my bed in college
  • It was cobbled together from used parts (like every other server I'd ever built)

I have memories of putting a failed hard drive in the freezer or the fridge in high school with the hopes that I could recover some data from it (homework and songs from the era of Napster and Kazaa). I couldn't. I think this feeling of unrecoverable loss was what drove me to build my first file server.

I have data dating back to my senior year of high school, but nothing prior to 2004. OpenSolaris came along in June 2005. I donā€™t remember bringing a file server to school my freshman year, but I do remember replacing a failed drive my senior year, so my first file server happened between 2006 and 2008, I think.

That failed drive is what made me a lifelong ZFS fan: a hard drive failure without any data loss? Nice! What's more, I don't need any special RAID hardware to build giant pools of data with ZFS; the file system does all the heavy lifting.
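For anyone who hasn't seen it, building one of those pools is a single command. This is just a sketch; the pool name (tank) and the device names are placeholders, not my actual setup:

    # Create a single-parity (raidz1) pool from four plain disks; no RAID card involved.
    zpool create tank raidz1 /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3

    # ZFS handles redundancy, checksumming, and repair itself.
    zpool status tank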

OpenSolaris was very different from the Linux systems I was used to, and it was difficult to administer. I had to look up the simplest commands, and the internet was much less well-cataloged in the mid-aughts. When I ran out of hard drive space (and SATA connectors) on my first server, I decided it was time to upgrade. Unfortunately (this was around 2012, I think), the only other OS with solid ZFS support was FreeBSD. FreeBSD is not Linux, but it is closer to it than OpenSolaris, so with fresh drives and spare parts, my second (current) file server was born.

The Problem

A couple of months ago, I started getting data errors on my zpool. I decided to ignore them (to my peril). Eventually, one of the drives just gave up and stopped working, taking the pool offline. I could bring the pool back online by clearing the data errors, but it would fall back into a faulted state within a day or two.

It was at this point that I decided to replace the failed drive, and I added a spare disk to the zpool. ZFS worked its magic and began resilvering all the data onto the new drive, but I think all of that restoration activity caused a SECOND drive to fail.
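For reference, the whole dance is just a few zpool commands. The pool and device names below are placeholders, not my actual layout:

    # See which device is unhappy and which files are affected.
    zpool status -v tank

    # Clear the error counters (this is what kept temporarily "reviving" my pool).
    zpool clear tank

    # Swap the dead disk for a new one; ZFS starts resilvering automatically.
    zpool replace tank /dev/ada2 /dev/ada4

    # Or add a hot spare that ZFS can pull in when a drive fails.
    zpool add tank spare /dev/ada5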

ZFS reported that I had 1.8 million data errors and that most of the data was unrecoverable. Sure enough, I was unable to read most of my files, including important things like 2FA recovery codes.

Could ZFS recover from two drive failures in a single-parity zpool?? Surely my neglect of file server maintenance would come back to punish me!

zpool scrub

In a last-ditch attempt to save my files, I ran a full scrub on the pool. It took two full days of mechanical drives grinding away, and it somehow brought the failed drive I had "replaced" back online (in a mirrored sub-pool??), but afterwards... all the errors were gone. All my data was there. I could read every file.
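If you've never run one, a scrub is a single command; progress and error counts show up in the status output (tank is a placeholder pool name again):

    # Walk every block in the pool, verify checksums, and repair whatever it can.
    zpool scrub tank

    # Check on progress while the drives grind.
    zpool status -v tank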

ZFS: Praise be! 🙏

(The bad drive has since fallen back into a "removed" state 😬 but all data remains intact!)

The Solution

At this point I'm thinking that my hardware is in a pretty perilous state, and that it might be time for an upgrade. In the years since I built my last file server, ZFS finally arrived on Linux, which gives me the option of consolidating servers.

I'm currently running two servers in the basement (not counting Raspberry Pis): the FreeBSD file server and a Linux server. The ability to run ZFS on Linux gives me the option of decommissioning both of them and replacing them with a single Linux box.

While we're upgrading, let's talk mechanical hard drives. They're bulky, prone to mechanical failure, noisy, and consume 2-3x the electricity of solid-state drives. They're also too slow to keep up with the Ethereum blockchain if you keep that data on ZFS (something I tried back in 2014, which is why I had "spare drives" floating around).

Let's do some shopping.

Picking Parts

When I was building my "gaming rig" back in 2018, I bought my first NVMe drive, and I was delighted. There were no wires! No moving parts! And it connected directly to the PCIe lanes for blazing speeds! Most motherboards only had a couple of M.2 slots, but that's OK, because you can buy expansion cards that add more M.2 slots to the system.

There's a caveat, though: each NVMe drive takes four PCIe lanes (x4). An NVMe expansion card plugs into the same kind of x16 slot a GPU uses, with four lanes going to each drive on the card. You need to look at your motherboard and chipset and do your homework: entry-level boards have only enough lanes for a single GPU and a single NVMe drive. Even the "gamer" class boards typically have only enough lanes for two NVMe drives plus two GPUs, and adding the second GPU halves the lanes available to one or both GPUs (x8 instead of x16). If an NVMe expansion card doesn't get all 16 lanes, transfer speeds aren't reduced; the card simply "won't see" the drives it has no lanes for (in an x8 slot, a four-drive card can only expose two of its drives, and the slot generally needs to support bifurcation for a passive card to work at all).
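As a rough lane budget (typical numbers, not any specific board):

    1 GPU                   : 16 lanes (8 each if two GPUs share)
    1 NVMe drive / M.2 slot :  4 lanes
    Quad-M.2 carrier card   : 16 lanes (4 per drive)
    Mainstream desktop CPU  : ~20 usable lanes, so one GPU plus one NVMe drive and you're done
    Threadripper (X399)     : 64 lanes, so room for a GPU plus several carrier cards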

I wanted the option of growing my storage pool later. Until this point, I've never been able to add more storage to a file server, because my motherboard has never had the spare SATA ports (nor my case the drive bays) to support more drives. If I get a motherboard with enough PCIe lanes and x16 slots, then (for the first time in my life) I'll be able to add more storage without having to buy or scavenge parts for a whole new system. Not only that, but flash memory prices will continue to fall, so incrementally growing my storage as needed will be more cost-effective than trying to anticipate how much storage I'll need for the next 5-10 years and buying it all upfront.
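Growing the pool later is just a matter of adding another vdev. A sketch, with made-up Linux device names:

    # Add a second mirrored pair of NVMe drives to the pool;
    # capacity grows immediately, with no rebuild required.
    zpool add tank mirror /dev/nvme2n1 /dev/nvme3n1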

I spent hours researching over the course of a few weeks, and the conclusion I arrived at is: the only motherboards with enough PCIe lanes to future-proof my NVMe storage needs are AMD Threadripper boards. And Threadrippers are expensive.

Third-generation Threadrippers recently hit the market, and the "entry level" 3960X processor, a 24-core 3.8 GHz behemoth, will set you back a cool $1350. That's just the processor (and the least expensive one at that). The motherboards start at half a grand and go up from there.

A brand-new blazing-fast Threadripper-based file server would set me back $3000, a cost I can't even justify to myself, much less my wife.

As I continued to chew on the problem over the course of a few days, I realized something else: with a 3rd generation Threadripper, which uses PCIe 4.0, NVMe read speeds reach 7000 MB/s (6800 MB/s write). At that point, disk IO is no longer the bottleneck; network speed is. PCIe 3.0 (half the bandwidth of PCIe 4.0) in a ZFS pool will do just fine for network-attached storage!
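The back-of-the-envelope numbers (theoretical maximums; real-world throughput is lower):

    2.5 Gbps network link  ~= 0.3 GB/s
    PCIe 3.0 x4 NVMe drive ~= 3.9 GB/s
    PCIe 4.0 x4 NVMe drive ~= 7.9 GB/s

Even a single PCIe 3.0 drive can saturate that network link many times over.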

Optimizing Bang for Buck

3rd generation Threadripper boards give me all the lanes I need to upgrade my system in the future, but are way faster than I need, and cost much more than I want to spend.

Non-Threadripper boards don't give me the option of growing my storage pool; they limit me to 4, maybe 5 NVMe drives total.

I wanted to eat my cake and have it, too, and I'm rather proud of the solution I settled on.

AMD is very good about CPU socket backwards-compatibility, and while 3rd generation Threadripper boards are not backwards-compatible with earlier Threadripper processors, 2nd generation boards are.

I decided to go with a 2nd generation Threadripper motherboard (X399 chipset, 64 PCIe lanes), and I found a first-generation Threadripper processor for $170 (shipped!) from an online retailer in the Netherlands. The best of all worlds:

  • All the PCIe lanes I need for current and future endeavors,
  • Support for up to 128GB of RAM,
  • Network speeds of up to 2.5 Gbps,
  • 3 NVMe slots built right into the mobo,
  • 3 PCIe x16 slots for NVMe expansion cards (or GPUs if I want to mine cryptocurrencies), and
  • More processing power than my gaming rig has.

All this for a price that's an order of magnitude less than a third-generation Threadripper!

Parts List

No affiliate links here, just links to the parts I bought. If you find this useful, cryptocurrency addresses are at the bottom of the post.

Total Cost: $840

I am repurposing a 750W power supply from another server, but neither of the cases; both are over 10 years old and sport failing fans, warped plastic, poor airflow, and nonexistent cable management.

You may also notice that I did not buy any storage drives! This is because there are reports of an oversupply of flash memory that may lead to significantly lower prices around Black Friday. I am holding out for deals!

Current Status

All parts except the processor (which I imagine is on a boat in the Atlantic somewhere) have arrived. The parts in my possession are assembled, but I haven't been able to find any information about whether the CPU comes with its own cooling fan, so I may have a surprise expense in the form of a Noctua CPU cooler.

I'm really impressed with the Carbide Air case; I highly recommend that one. I'm withholding judgement on the other parts because I haven't actually seen them in action.

In a future post, I'll talk about bringing everything online and setting up ZFS on Ubuntu.

Thanks for reading! If you learned something useful and want to buy me a few grams of tea, please use one of the addresses below 🤗

  • BTC: 121NtsFHrjjfxTLBhH1hQWrP432XhNG7kB
  • ETH: 0x3e06A19668c14c2695B1BceA32aF293F7d1b68AA
  • LTC: MNsPptYeBejaxSKshnHDeQc8MMHrMe8P5D
  • XLM: GDQP2KPQGKIHYJGXNUIYOMHARUARCA7DJT5FO2FFOOKY3B2WSQHG4W37 ; Memo: 721817113
  • XRP: rw2ciyaNshpHe7bCHo4bRWq6pqqynnWKQg ; Tag: 4804159

Andrew J. Pierce collects Yixing teapots and lives in Virginia with his wife, son, and Ziggy the cat. You can follow him on Twitter.

Ā© 2020 Andrew J. Pierce