ZFS versus RAID: Eight Ironwolf disks, two filesystems, one winner

Neither the stopwatch nor the denim jacket is strictly necessary, if we're being honest about it.
Aurich Lawson / Getty

This has been a long while in the making—it’s test results time. To truly understand the fundamentals of computer storage, it’s important to explore the impact of various conventional RAID (Redundant Array of Inexpensive Disks) topologies on performance. It’s also important to understand what ZFS is and how it works. But at some point, people (particularly computer enthusiasts on the Internet) want numbers.

First, a quick note: This testing, naturally, builds on those fundamentals. We’re going to draw heavily on lessons learned as we explore ZFS topologies here. If you aren’t yet entirely solid on the difference between pools and vdevs or what ashift and recordsize mean, we strongly recommend you revisit those explainers before diving into testing and results.

And although everybody loves to see raw numbers, we urge an additional focus on how these figures relate to one another. All of our charts relate the performance of ZFS pool topologies at sizes from two to eight disks to the performance of a single disk. If you change the model of disk, your raw numbers will change accordingly—but for the most part, their relation to a single disk’s performance will not.

Equipment as tested

We used the eight empty bays in our Summer 2019 Storage Hot Rod for this test. It’s got oodles of RAM and more than enough CPU horsepower to chew through these storage tests without breaking a sweat.

Specs at a glance: Summer 2019 Storage Hot Rod, as tested
OS: Ubuntu 18.04.4 LTS
CPU: AMD Ryzen 7 2700X ($250 on Amazon)
RAM: 64GB ECC DDR4 UDIMM kit ($459 at Amazon)
Storage Adapter: LSI-9300-8i 8-port Host Bus Adapter ($148 at Amazon)
Storage: 8x 12TB Seagate Ironwolf ($320 each at Amazon)
Motherboard: Asrock Rack X470D4U ($260 at Amazon)
PSU: EVGA 850GQ Semi Modular PSU ($140 at Adorama)
Chassis: Rosewill RSV-L4112 (typically $260, currently unavailable due to CV19)

The Storage Hot Rod’s also got a dedicated LSI-9300-8i Host Bus Adapter (HBA) which isn’t used for anything but the disks under test. The first four bays of the chassis have our own backup data on them—but they were idle during all tests here and are attached to the motherboard’s SATA controller, entirely isolated from our test arrays.

How we tested

As always, we used fio to perform all of our storage tests. We ran them locally on the Hot Rod, and we used three basic random-access test types: read, write, and sync write. Each of the tests was run with both 4K and 1M blocksizes, and we ran the tests both with a single process and iodepth=1 and with eight processes and iodepth=8.
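The exact fio command lines aren’t listed here, so the following is only a sketch of how such a test matrix might be expressed. The helper name fio_cmd, the job names, the posixaio ioengine, the 4g file size, and the 60-second runtime are all assumptions made for illustration; only the test types, blocksizes, process counts, and iodepth values come from the description above. The helper prints a command line rather than running it.

```shell
#!/bin/sh
# Print (not run) a fio command line for one test case.
# Usage: fio_cmd <read|write|syncwrite> <blocksize> <numjobs> <iodepth>
fio_cmd() {
    kind=$1 bs=$2 jobs=$3 depth=$4
    case $kind in
        read)      rw=randread;  extra="" ;;
        write)     rw=randwrite; extra=" --end_fsync=1" ;;   # flush once at the end
        syncwrite) rw=randwrite; extra=" --fsync=1" ;;       # fsync after every write
        *) echo "unknown test type: $kind" >&2; return 1 ;;
    esac
    # ioengine, size, and runtime are illustrative assumptions, not the tested values.
    echo "fio --name=$kind-$bs --ioengine=posixaio --rw=$rw --bs=$bs" \
         "--numjobs=$jobs --iodepth=$depth --size=4g --runtime=60 --time_based$extra"
}

fio_cmd read 4k 1 1        # single process, iodepth=1
fio_cmd syncwrite 1m 8 8   # eight processes, iodepth=8
```

Printing the command first makes it easy to review the whole matrix before committing hours of disk time to it.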

For all tests, we’re using ZFS on Linux 0.7.5, as found in main repositories for Ubuntu 18.04 LTS. It’s worth noting that ZFS on Linux 0.7.5 is two years old now—there are features and performance improvements in newer versions of OpenZFS that weren’t available in 0.7.5.

We tested with 0.7.5 anyway—much to the annoyance of at least one very senior OpenZFS developer—because when we ran the tests, 18.04 was the most current Ubuntu LTS and one of the most current stable distributions in general. In the next article in this series—on ZFS tuning and optimization—we’ll update to the brand-new Ubuntu 20.04 LTS and a much newer ZFS on Linux 0.8.3.

Initial setup: ZFS vs mdraid/ext4

When we tested mdadm and ext4, we didn’t really use the entire disk—we created a 1TiB partition at the head of each disk and used those 1TiB partitions. We also had to invoke arcane arguments—mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0—to avoid ext4’s preallocation from contaminating our results.

Using these relatively small partitions instead of the entire disks was a practical necessity, since ext4 needs to grovel over the entire created filesystem and disperse preallocated metadata blocks throughout. If we had used the full disks, the usable space on the eight-disk RAID6 topology would have been roughly 65TiB—and it would have taken several hours to format, with similar agonizing waits for every topology tested.

ZFS, happily, doesn’t need or want to preallocate metadata blocks—it creates them on the fly as they become necessary instead. So we fed ZFS each 12TB Ironwolf disk in its entirety, and we didn’t need to wait through lengthy formatting procedures—each topology, even the largest, was ready for use a second or two after creation, with no special arguments needed.

ZFS vs conventional RAID

A conventional RAID array is a simple abstraction layer that sits between a filesystem and a set of disks. It presents the entire array as a virtual “disk” device that, from the filesystem’s perspective, is indistinguishable from an actual, individual disk—even if it’s significantly larger than the largest single disk might be.

ZFS is an entirely different animal, and it encompasses functions that normally might occupy three separate layers in a traditional Unix-like system. It’s a logical volume manager, a RAID system, and a filesystem all wrapped into one. Merging traditional layers like this has caused many a senior admin to grind their teeth in outrage, but there are very good reasons for it.

There is an absolute ton of features ZFS offers, and users unfamiliar with them are highly encouraged to take a look at our 2014 coverage of next-generation filesystems for a basic overview as well as our recent ZFS 101 article for a much more comprehensive explanation.

Megabytes vs Mebibytes

As in the last article, our units of performance measurement here are kibibytes (KiB) and mebibytes (MiB). A kibibyte is 1,024 bytes, a mebibyte is 1,024 kibibytes, and so forth—in contrast to a kilobyte, which is 1,000 bytes, and a megabyte, which is 1,000 kilobytes.

Kibibytes and their big siblings have long been the working units for computer memory and storage. Prior to the 1990s, computer professionals simply referred to them as K and M, and used the inaccurate metric prefixes when they spelled them out. Even now, when your operating system refers to GB, MB, or KB (in terms of free space or amounts of RAM, for example), it’s often really referring to GiB, MiB, and KiB.

Storage vendors, unfortunately, eventually seized upon the difference between the metrics as a way to more cheaply produce “gigabyte” drives and then “terabyte” drives—so a 500GB SSD is really only 465 GiB, and 12TB hard drives like the ones we’re testing today are really only 10.9TiB each.
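The conversions above are easy to check by hand. This is a minimal sketch using awk for the floating-point math; the helper name to_binary is made up for illustration. The last call reproduces the roughly 65TiB quoted earlier for an eight-disk RAID6, treating it as six data disks of 12TB each and ignoring any filesystem or md metadata overhead.

```shell
#!/bin/sh
# to_binary <count> <decimal_exponent> <binary_exponent>
# Converts <count> * 10^<decimal_exponent> bytes into units of 2^<binary_exponent> bytes.
to_binary() {
    awk -v c="$1" -v d="$2" -v b="$3" 'BEGIN { printf "%.1f\n", c * 10^d / 2^b }'
}

to_binary 500 9 30    # 500GB SSD in GiB           -> 465.7
to_binary 12 12 40    # 12TB disk in TiB           -> 10.9
to_binary 72 12 40    # 6 data disks x 12TB in TiB -> 65.5
```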

Testing and analysis, using ZFS default settings

As we did with the mdraid performance tests, we used fio to test our Ironwolf disks under ZFS. Once again, we’re focusing entirely on random access, in two block sizes: 4KiB and 1MiB, which we test for read, write, and synchronous write on all topologies.

We also have some additional variables in play when testing ZFS. We wanted to show what happens when you misconfigure the ashift value (which sets the sector size) and also what happens when you tune recordsize to better reflect your workload. In order to focus on all this juicy and relevant data, we needed to cut some fluff—so we’re only looking at multi-process operations this time, with fio set to numjobs=8 and iodepth=8.

The major reason we tested single-process operations in the mdraid performance article was to very directly demonstrate that multiple-disk topologies tend not to accelerate single-threaded workloads. That’s still true here—and while we did test single-threaded workloads against ZFS, there was nothing to be seen there that you don’t see in the otherwise much more interesting multi-threaded, 8-process workloads. So that’s what we’re focusing on.

Performance scales with vdevs

One of the most common—and most pernicious—myths I encounter when talking to people about storage is the idea that performance scales well with the number of disks in a vdev, rather than the number of vdevs in a pool.

The very mistaken idea, which seems reasonable on the surface, is that as the number of data chunks in a stripe goes up, performance goes up with it. If you have an eight-disk RAIDz2, you’ve got six “data disks” per stripe, so six times the performance, give or take, right? Meanwhile, an 8-disk pool of mirrors only has four “data disks” per stripe, so: lower performance!

In the above charts, we show performance trends per vdev for single-disk, 2-disk mirror, and 4-disk RAIDz2 vdevs. These are shown in solid lines, and the “joker”—a single RAIDz2 vdev, becoming increasingly wide—is the dark, dashed line. So at n=4, we’re looking at four single-disk vdevs, four 2-disk mirror vdevs—and a single, 6-wide RAIDz2 vdev. Remember, n for the joker isn’t the total number of disks—it’s the total number of disks, minus parity.

Let’s take a look at n=2 on the 1M Async Write chart above. For single disk vdevs, two-wide mirror vdevs, and four-wide RAIDz2 vdevs, we see just the scaling we’d expect—at 202 percent, 203 percent, and 207 percent of the performance of a single vdev, of the same class.

The RAIDz2 “joker” at n=2, on the other hand, is four disks wide, and since RAIDz2 is dual parity, that means it’s got two “data disks,” hence n=2. And we can see that it’s underperforming badly compared to the per-vdev lines, with only 160 percent of the performance of a single “data disk.” The trend only gets worse as the single RAIDz2 gets wider, while the trend-lines for per-vdev scaling keep a clean, positive linear slope.

We see similar trends in 4K writes and 1M reads alike. The closest the increasingly wide single-RAIDz2 vdev comes to linear scale is on 1MiB reads, where at first its per-disk scale appears to be keeping up with per-vdev scale. But it sharply falls off after n=4—and that trend will continue to get worse, as the vdev gets wider.

We’ll talk about how and why those reads are falling off so sharply later—but for now, let’s get into some simple performance tests, with raw numbers.

RAIDz2 vs RAID6—default settings

If you carefully test RAIDz2 versus RAID6, the first thing that stands out is just how fast the writes are. Even sync writes, which you might intuitively expect to be slower on ZFS—since they must often be “double-committed,” once to ZIL and once again to main storage—are significantly faster than they were on mdraid6.

Uncached reads, unfortunately, trend the other way—RAIDz vdevs tend to pay for their fast writes with slow reads. This disadvantage tends to be strongly offset in the real world due to the ARC’s higher cache hit ratio as compared to the simple LRU cache used by the kernel, but that effect is difficult or impossible to estimate in simple, synthetic tests like these.

On servers with heavy, concurrent mixed read/write workloads, the effect of RAIDz’s slow reads can also be offset by how much faster the writes are going—remember, storage is effectively half-duplex; you can’t read and write at the same time. Any decrease in utilization on the write side will show up as increased availability on the read side. Storage is a complex beast!

Caveats, hedges, and weasel words aside, let’s make this clear—RAIDz is not a strong performer for uncached, pure read workloads.

For the most part, we’re seeing the same phenomenon in the 4KiB scale that we did at 1MiB—RAIDz2 handily beats RAID6 at writes, but it gets its butt handed to it on reads.

What might not be quite so obvious is that RAIDz2 is suffering badly from a misconfiguration here. Most users’ real-world experience of 4KiB I/O is due to small files, such as the dotfiles in a Linux user’s home directory, the similar .INIs and what have you in a Windows user’s home directory, and the hordes of small .INI or .conf files most systems are plagued with.

But the 4KiB RAIDz2 performance you’re seeing here is not the RAIDz2 performance you’d be seeing with 4KiB files. You see, fio writes single, very large files and seeks inside them. Since we’re testing on the default recordsize of 128KiB, that means that each RAIDz2 4KiB read is forced to pull in an additional 124KiB of useless, unwanted data.
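The cost of that mismatch can be put in numbers. This is a rough sketch, assuming record-aligned requests and no help from cache; read_amp is a hypothetical helper name, and the amplification figure is a whole-number ratio (rounded down).

```shell
#!/bin/sh
# read_amp <recordsize_KiB> <request_KiB>
# A request smaller than one record still costs one whole record.
read_amp() {
    rec=$1 req=$2
    records=$(( (req + rec - 1) / rec ))   # records touched, rounded up
    fetched=$(( records * rec ))           # KiB actually read from disk
    echo "fetched=${fetched}KiB wasted=$(( fetched - req ))KiB amplification=$(( fetched / req ))x"
}

read_amp 128 4   # default recordsize, 4KiB request
read_amp 4 4     # recordsize matched to the request
```

With the default recordsize the first call reports 124KiB wasted and 32x amplification; with a matched recordsize the waste drops to zero.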

Later, we’ll see what happens when we properly tune our ZFS system for an appropriate recordsize—which much more closely approximates real-world experience with 4KiB I/O in small files, even without tuning. But for now, let’s keep everything on defaults and move ahead to performance when using two-wide mirror vdevs.

ZFS mirror vdevs vs RAID10—default settings

The comparison between mirror vdevs and RAID10 is a fun one, because these are easily the highest-performing topologies. Who doesn’t like big numbers? At first blush, the two seem pretty evenly matched.

In 1MiB write, sync write, and uncached read, both systems exhibit near-linear scale and positive slope, and they tend to be pretty close. RAID10 clearly has the upper hand when it comes to pure uncached reads—but unlike RAIDz2 vs RAID6, the lead isn’t enormous.

We’re leaving a lot of performance on the table, though, by sticking to default settings. We’ll revisit that, but for now let’s move on to see how mirrors and RAID10 fare with 4KiB random I/O.

Although the curve is a little tweaky, we see Linux RAID10 clearly outperform ZFS mirrors on 4KiB writes, a first for mdraid and ext4. But when we move on to sync 4KiB writes, the trend reverses. RAID10 is unable to keep up with even a single ext4 disk, while the pool of mirrors soars to better than 500 percent of a single ext4 disk’s performance.

Moving on to 4KiB reads, we once again see the ZFS topology suffering due to misconfiguration. Having left our recordsize at the default 128KiB, we’re reading in an extra 124KiB with every 4KiB we actually want.

In some cases, we can get some use out of that unnecessary data later—by caching it and servicing future read requests from the cache instead of from the metal. But we’re working with a large dataset, so very few of those cached “extras” are ever of any use.

Retesting ZFS with recordsize set correctly

We believe it’s important to test things the way they come out of the box. The shipping defaults should be sane defaults, and that’s a good place for everyone to start from.

While the ZFS defaults are reasonably sane, fio doesn’t interact with disks in quite the same way most users normally do. Most user interaction with storage can be characterized by reading and writing files in their entirety—and that’s not what fio does. When you ask fio to show you random read and write behavior, it creates one very large file for each testing process (eight of them, in today’s tests), and that process seeks within that large file.

With the default recordsize=128K, ZFS will store a 4KiB file in an undersized record, which only occupies a single 4KiB sector—and reads of that file later will also only need to light up a single 4KiB sector on disk. But when performing 4KiB random I/O with fio, since the 4KiB requests are pieces of a very large file, ZFS must read (and write) the requests in full-sized 128KiB increments.

Although the impact is somewhat smaller, the default 128KiB recordsize also penalizes large file access. After all, it’s not exactly optimal to store and retrieve an 8MiB digital photo in 64 128KiB blocks, rather than only 8 1MiB blocks.
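The block counts for that photo follow from a simple rounded-up division. The helper name records_for is an assumption for illustration, and sizes are in KiB.

```shell
#!/bin/sh
# records_for <file_size_KiB> <recordsize_KiB>
# Number of records needed to hold the file, rounding up for a partial last record.
records_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

records_for 8192 128    # 8MiB photo at recordsize=128K -> 64
records_for 8192 1024   # 8MiB photo at recordsize=1M   -> 8
```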

In this section, we’re going to use zfs set to apply recordsize=4K to the test dataset for the 4KiB random I/O tests, and recordsize=1M for the 1MiB random I/O tests.

Is ZFS getting “special treatment” here?

An experienced sysadmin might reasonably object to ZFS being given special treatment while mdraid is left to its default settings. But there’s a reason for that, and it’s not just “we really like ZFS.” While you can certainly tune chunk size on a kernel RAID array, any such tuning affects the entire device globally.

If you tune a 48TB mdraid10 for 4KiB I/O, it’s going to absolutely suck at 1MiB I/O—and similarly, a 48TB mdraid10 tuned for 1MiB I/O will perform horribly at 4KiB I/O. To fix that, you must destroy the entire array and any filesystems and data on it, recreate everything from scratch, and restore your data from backup—and it can still only be tuned for one performance use case.

In sharp contrast, if you’ve got a 48TB ZFS pool, you can set recordsize per dataset—and datasets can be created and destroyed as easily as folders. If your ZFS server has 20TiB of random user-saved files (most of which are several MiB, such as photos, movies, and office documents) along with a 2TiB MySQL database, each can coexist peacefully and simply:

root@server:~# zfs create pool/samba
root@server:~# zfs set recordsize=1M pool/samba
root@server:~# zfs create pool/mysql
root@server:~# zfs set recordsize=16K pool/mysql

Just like that, you’ve created what look like “folders” on the server which are optimized for the workloads to be found within. If your users create a bunch of 4KiB files, that’s fine—the 4KiB files will still only occupy one sector, while the larger files reap the benefit of similarly large logical block sizes. Meanwhile, the MySQL database gets a recordsize which perfectly matches its own internal 16KiB pagesize, optimizing performance there without hurting it on the rest of the server.

If you install a PostgreSQL instance later, you can tune for its default 8KiB page size just as easily:

root@server:~# zfs create pool/postgres
root@server:~# zfs set recordsize=8K pool/postgres

And if you later re-tune your MySQL instance to use a larger or smaller page size, you can re-tune your ZFS dataset to match. If all you do is change recordsize, the already-written data won’t change, but any new writes to the database will follow the dataset’s new recordsize parameter. (If you want the existing data re-written with the new recordsize, you also need to do a block-for-block copy of it, e.g., by using mv to move it to another dataset and back. An mv within a single dataset is only a rename and does not rewrite any blocks.)

ZFS recordsize=1M—large blocks for large files

We know everybody loves to see big performance numbers, so let’s look at some. In this section, we’re going to re-run our earlier 1MiB read, write, and sync write workloads against ZFS datasets with recordsize=1M set.

We want to reiterate that this is a pretty friendly configuration for any normal “directory full of files” type of situation—ZFS will write smaller files in smaller blocks automatically. You really only need smaller recordsize settings in special cases with a lot of random access inside large files, such as database binaries and VM images.

RAIDz2 vs RAID6—1MiB random I/O, recordsize=1M

RAIDz2 writes really take off when its recordsize is tuned to fio’s workload. Its 1MiB asynchronous writes leap from 568MiB/sec to 950MiB/sec, sometimes higher. These are rewrites of an existing fio workload file: The first fio write test always goes significantly faster on ZFS storage than additional test runs re-using the same file do.

This effect doesn’t increase, however: the second, 20th, and 200th runs all perform the same. But in the interests of fairness, we throw away that first, significantly higher test run for ZFS. In this case, that first “throwaway” async write run was at 1,238MiB/sec.

Sync writes are similarly boosted, with RAIDz2 turning in an additional 54MiB/sec over its untuned results, more than doubling RAID6’s already-lagging performance.

Unfortunately, tuning recordsize didn’t help our 1MiB uncached reads—although the eight-wide RAIDz2 vdev improved by just under 100MiB/sec, RAIDz2 still lags significantly behind mdraid6 with five or more disks in the vdev. A pool with two four-wide RAIDz2 vdevs (not shown above) comes much closer, pulling 406MiB/sec to eight-wide mdraid6’s 485MiB/sec.

RAIDz2 vs RAID6—4KiB random I/O, recordsize=4K

Moving along to 4KiB random access I/O, we see the same broad trends observed above. In both sync and async writes, RAIDz2 drastically outperforms RAID6. When committing small writes, RAID6 arrays fail to keep up with even a single ext4 disk.

This trend reverses, again, when shifting to uncached reads. Wide RAIDz2 vdevs do at least manage to outperform a single ext4 disk, but they lag significantly behind mdraid and ext4, with an eight-wide RAIDz2 vdev being outperformed roughly 4:1.

When deciding between these two topologies on a performance basis, the question becomes whether you’d prefer to have a 20:1 increase in write performance at the expense of a 4:1 decrease in reads, or vice versa. On the surface of it, this sounds like a no-brainer—but different workloads are different.

A cautious—and wise!—admin would be well advised to do workload-specific testing both ways before making a final decision, if performance is the only metric that matters.

ZFS mirror vdevs vs RAID10—1MiB random I/O, recordsize=1M

If you were looking for an unvarnished performance victory for team ZFS, here’s where you’ll find it. RAID10 is the highest performing conventional RAID topology on every metric we test, and a properly-tuned ZFS pool of 2-wide mirror vdevs outperforms it everywhere.

We can already hear conventional RAID fans grumbling that their hardware RAID would outrun that silly ZFS, since it’s got a battery or supercapacitor backed cache and will therefore handle sync writes about as rapidly as standard asynchronous writes—but the governor’s not entirely off on the ZFS side of that race, either. (We’ll cover the use of a LOG vdev to accelerate sync writes in a different article.)

ZFS mirror vdevs vs RAID10—4KiB random I/O, recordsize=4K

Down in the weeds at 4KiB blocksize, our pool of mirrors strongly outperforms RAID10 on both synchronous and asynchronous writes. It does, however, lose to RAID10 on uncached 4KiB reads.

Much like the comparison of RAIDz2 vs RAID6, however, it’s important to look at the ratios. Although a pure 4KiB uncached random read workload performs not quite twice as well on RAID10 as on ZFS mirrors, such a workload is probably fairly rare—and the write performance advantage swings 5:1 in the other direction (or 12:1, for sync writes).

Most 4KiB-heavy workloads will also be constantly saturated workloads, as the disks are constantly thrashing trying to keep up with demand—meaning that it wouldn’t take many write operations to overwhelm RAID10’s 4KiB read performance benefits with its write performance discrepancies.

4KiB random read workloads also tend to heavily favor better cache algorithms. We did not attempt to test cache efficiency here, but an ARC can safely be assumed to strongly outperform a simple LRU on nearly any workload.

Why are RAIDz2 reads so slow?

There’s one burning question that needs to be answered after all this testing: Why are RAIDz reads so much slower than conventional RAID6 reads? With recordsize tuned appropriately for workload, RAIDz2 outperforms RAID6 on writes by as much as 20:1—which makes it that much more confusing why reads would be slower.

The answer is fairly simple, and it largely amounts to the flip side of the same coin. Remember the RAID hole? Conventional RAID6 is not only willing but effectively forced to pack multiple blocks/files into the same stripe, since it doesn’t have a variable stripe width. In addition to opening up the potential for corruption due to partial stripe writes, this subjects RAID6 arrays to punishing read-modify-write performance penalties when writing partial stripes.

RAIDz2, on the other hand, writes every block or file as a full stripe—even very small ones, by adjusting the width of the stripe. We captured the difference between the two topologies’ 1MiB reads in the series of screenshots above.

When recordsize is set to 1M, a 1MiB block gets carved into roughly 176KiB chunks and distributed among six of the eight disks in an eight-wide RAIDz2, with the other two disks each carrying a roughly 176KiB parity chunk. So when we read a 1MiB block from our eight-wide RAIDz2, we light up six of eight disks to do so.

By contrast, the same disks in an eight-wide RAID6 default to a 512KiB chunk size—which means 512KiB of data (or parity) is written to each disk during a RAID6 write. When we go back to read that data from the RAID6, we only need to light up two of our eight disks for each block, as compared to RAIDz2’s six.

In addition, the two disks we light up on RAID6 are performing a larger, higher-efficiency operation. They’re reading 128 contiguous 4KiB on-disk sectors, as compared to RAIDz2’s six disks only reading 44 contiguous 4KiB on-disk sectors for each operation.
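The sector and disk counts in that comparison can be sketched as follows. The helper names are made up for illustration, the RAID6 read is assumed to be chunk-aligned, and the roughly 176KiB RAIDz2 chunk is taken as given from the text above rather than derived.

```shell
#!/bin/sh
SECTOR_KIB=4   # on-disk sector size in KiB

# sectors <chunk_KiB>: contiguous sectors read from each disk
sectors() { echo $(( $1 / SECTOR_KIB )); }

# raid6_disks <block_KiB> <chunk_KiB>: disks touched by one chunk-aligned read
raid6_disks() { echo $(( ($1 + $2 - 1) / $2 )); }

# raidz_data_disks <vdev_width> <parity_disks>: disks holding data for one full-width block
raidz_data_disks() { echo $(( $1 - $2 )); }

echo "RAID6: $(raid6_disks 1024 512) disks x $(sectors 512) sectors"
echo "RAIDz2: $(raidz_data_disks 8 2) disks x $(sectors 176) sectors"
```

The two lines print 2 disks x 128 sectors for RAID6 and 6 disks x 44 sectors for RAIDz2: fewer, larger operations on one side, and more, smaller ones on the other.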

If we want to get deep into the weeds, we could more extensively tune the RAIDz2 to work around this penalty: we could set the module option zfs_max_recordsize=4194304 in /etc/modprobe.d/zfs.conf, export all pools and reload the ZFS kernel module, then zfs set recordsize=3M on the dataset.

Setting a 3MiB recordsize would mean that each disk gets 512KiB chunks, just like RAID6 does, and performance would go up accordingly—if we write a 1MiB record, it gets stored on two disks in 512KiB chunks. And when we read it back, we read that 1MiB record by lighting up only those two disks, just like we did on RAID6.

Unfortunately, that also means storage efficiency goes down—because that 1MiB record was written as an undersized stripe, with two chunks of data and two chunks of parity. So now we’re performing as well or better than RAID6, but we’re at 50 percent storage efficiency while it’s still at 75 percent (six chunks out of every eight are data).
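The two efficiency figures are just the share of data chunks in a stripe. This is a minimal sketch with a hypothetical helper name, using integer percentages.

```shell
#!/bin/sh
# efficiency <data_chunks> <parity_chunks>
# Share of a stripe that holds data rather than parity, as a whole percentage.
efficiency() {
    echo "$(( 100 * $1 / ($1 + $2) ))%"
}

efficiency 2 2   # undersized stripe: two data, two parity -> 50%
efficiency 6 2   # full eight-wide stripe: six data, two parity -> 75%
```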

To be fair, this is a problem for the eight-process, iodepth=8 reads we tested—but not for single-process, iodepth=1 reads, which we tested but did not graph. For single-process reads, RAIDz2 significantly outperforms RAID6 (at 129MiB/sec to 47MiB/sec), and for the exact same reason. It lights up three times as many disks to read the same 1MiB of data.

TANSTAAFL—There Ain’t No Such Thing As A Free Lunch.

Conclusions

If you’re looking for raw, unbridled performance, it’s hard to argue against a properly-tuned pool of ZFS mirrors. RAID10 is the fastest per-disk conventional RAID topology in all metrics, and ZFS mirrors beat it resoundingly (sometimes by an order of magnitude) in every category tested, with the sole exception of 4KiB uncached reads.

ZFS’ implementation of striped parity arrays, the RAIDz vdev type, is a bit more of a mixed bag. Although RAIDz2 decisively outperforms RAID6 on writes, it underperforms it significantly on 1MiB reads. If you’re implementing a striped parity array, 1MiB is hopefully the blocksize you’re targeting in the first place, since those arrays are particularly awful with small blocksizes.

When you add in the wealth of additional features ZFS offers—incredibly fast replication, per-dataset tuning, automatic data healing, high-performance inline compression, instant formatting, dynamic quota application, and more—we think it’s difficult to justify any other choice for most general-purpose server applications.

ZFS still has more performance options to offer—we haven’t yet covered the support vdev classes, LOG, CACHE, and SPECIAL. We’ll cover those—and perhaps experiment with recordsize larger than 1MiB—in another fundamentals of storage chapter soon.

Tech – Ars Technica
