The question for today is – does Linux md-RAID scale to 10 Gbit/s?
I wanted to build a proof of concept for a scalable, highly available, fault tolerant, distributed block storage, which utilizes commodity hardware, runs on a 10 Gigabit Ethernet network, and uses well-tested open-source technologies. This is a simplified version of Ceph. The only single point of failure in this cluster is the client itself, which is inevitable in any solution.
Here is an overview diagram of the setup:
My test lab is hosted on AWS:
- 3x “c4.8xlarge” storage servers
- each of them has 5x 50 GB General Purpose (SSD) EBS volumes attached, which provide up to 160 MiB/s and 3000 IOPS for extended periods of time; practical tests showed 100 MB/s of sustained sequential read/write performance per volume
- each EBS volume is managed via LVM, and a single 15 GB logical volume is created on it (see the sketch after this list)
- each 15 GB logical volume is exported via iSCSI to the client machine
- 1x “c4.8xlarge” client machine
- the client machine initiates an iSCSI connection to each of the 15 GB logical volumes, and thus sees 15 identical iSCSI block devices (3 storage servers x 5 block devices = 15 block devices)
- to achieve a 3x replication factor, the block devices are grouped into 5x mdadm software RAID-1 (mirror) devices; each RAID-1 device “md1” to “md5” contains three disks, one from each storage server, so that the failure of one or even two storage servers does not affect the operation of the RAID-1 device
- all RAID-1 devices “md1” to “md5” are grouped into a single RAID-0 (stripe), in order to combine the bandwidth of all devices into a single block device, namely the “md90” RAID-0 device, which also combines the capacity of all “md1” to “md5” devices, for a total of 75 GB
- 10 Gigabit network in a VPC using Jumbo frames
- “iperf” measured a 10.07 Gbit/s maximum throughput between the servers; at the same time, “iftop” displayed 9.20 Gbit/s
- VPC does not support network broadcasts, so ATA over Ethernet was not an option
- the storage servers and the client machine were limited on boot to 4 CPUs and 2 GB RAM, in order to minimize the effect of the Linux disk cache
- only sequential and random reading were benchmarked
- Linux md RAID-1 (mirror) does not read from all underlying disks by default, so I had to create a RAID-1E (mirror) configuration; more info here and here; the “mdadm --create” options follow (a full assembly sketch is given after this list):
--level=10 --raid-devices=3 --layout=o3
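For illustration, the storage-server side can be set up roughly as follows. This is only a sketch: the device name (/dev/xvdf), the volume-group and logical-volume names, the IQNs and the use of LIO’s “targetcli” are illustrative assumptions, not the exact commands from my lab; any iSCSI target implementation would do, and a single volume group with five logical volumes would work just as well.

# one LVM stack per EBS volume (repeat for each of the 5 volumes); names are illustrative
pvcreate /dev/xvdf
vgcreate vg01 /dev/xvdf
lvcreate --name lv01 --size 15G vg01

# export the logical volume over iSCSI using LIO's "targetcli" (one possible tool)
targetcli /backstores/block create name=lv01 dev=/dev/vg01/lv01
targetcli /iscsi create iqn.2015-06.net.example:storage1-lv01
targetcli /iscsi/iqn.2015-06.net.example:storage1-lv01/tpg1/luns create /backstores/block/lv01
targetcli /iscsi/iqn.2015-06.net.example:storage1-lv01/tpg1/acls create iqn.2015-06.net.example:client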
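On the client side, the assembly looks roughly like this. The portal IP, IQNs and /dev/sdX names are again illustrative; the only parameters taken directly from the setup above are the RAID-1E “mdadm” options and the RAID-0 stripe over the five mirrors.

# discover and log in to each exported logical volume (open-iscsi); portal/IQN are illustrative
iscsiadm --mode discovery --type sendtargets --portal 10.0.0.11
iscsiadm --mode node --targetname iqn.2015-06.net.example:storage1-lv01 --portal 10.0.0.11 --login

# one 3-way mirror per group of three iSCSI disks, one disk from each storage server
mdadm --create /dev/md1 --level=10 --raid-devices=3 --layout=o3 /dev/sdb /dev/sdg /dev/sdl

# stripe over the five mirrors to get the final 75 GB block device
mdadm --create /dev/md90 --level=0 --raid-devices=5 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5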
Performance results for sequential read/write, benchmarked using “dd” (an example invocation is sketched after the list):
- Single EBS volume (disk) on the storage server: 100 MB/s read/write
- Single iSCSI exported disk on the client machine: 100 MB/s read/write (no performance loss)
- “md1” RAID-1 (mirror) on the client machine which reads from all three disks simultaneously: about 300 MB/s read (no performance loss)
- “md90” RAID-0 (stripe) from all RAID-1 (mirror) devices on the client machine: 1.2 GB/s read (saturates the 10 Gigabit network)
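A “dd” invocation along these lines reproduces the sequential read test; the block size and the O_DIRECT flag follow the technical notes below, while the count value is illustrative:

# sequential read from the striped device, bypassing the Linux page cache
dd if=/dev/md90 of=/dev/null bs=200M count=256 iflag=direct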
Performance results for random read, benchmarked using “fio” (an example job is sketched after the list):
- Single EBS volume (disk) on the storage server: 3000 IOPS, as promised by AWS (no separate tests done)
- “md90” RAID-0 (stripe) from all RAID-1 (mirror) devices on the client machine: 47000 IOPS random-read of 4k blocks using an IO depth of 256 and spawning 6 processes (utilizes the maximum combined IOPS of all 15 EBS volumes)
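The random-read numbers above were obtained with “fio” parameters along these lines (a sketch based on the values quoted above; the run time is illustrative):

# 4k random reads, queue depth 256, 6 processes, page cache bypassed via O_DIRECT
fio --name=randread --filename=/dev/md90 --rw=randread --bs=4k \
    --ioengine=libaio --iodepth=256 --numjobs=6 --direct=1 \
    --runtime=60 --time_based --group_reporting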
Technical notes:
- monitor the CPU usage of your benchmarking processes – if they use 100% CPU, they are probably CPU-bound and cannot measure the I/O throughput properly
- use the “direct” (O_DIRECT) option because the Linux block cache does not scale well for a single-threaded process
- our tests were limited to 800-900 MB/s unless the Linux cache was bypassed; two or more processes lift this limit and can saturate the 10 Gigabit network, if your underlying block devices can handle the random I/O load
- random I/O dropped from 47k IOPS to 12k IOPS without using “direct”, as measured by “fio” running 6 simultaneous processes
- use a big enough block size when benchmarking the sequential reading, but not too big; we got 1.2 GB/s reading using a block size of 200 MB; a block size of 700 MB or bigger increased network usage but slightly decreased the overall reading speed
- see where your system limits are; sequential reading from “/dev/zero” (to “/dev/null”) yielded 8.1 GB/s, which is about 7 times more than what we would need in our use-case (see the sketch after this list)
- network utilization for 1.2 GB/s of sequential reading was 9.02 Gbit/s, as measured by “iftop”, which is 98% of the absolute maximum that we can achieve; great efficiency
- “iostat” rMB/s measurements showed 1150 MB/s for “md90”, which is the same number that we got from “dd”
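The “system limits” baseline mentioned above can be reproduced with a trivial copy that touches no real storage or network; block size and count here are illustrative:

# upper bound of what a single "dd" process can push on this machine
dd if=/dev/zero of=/dev/null bs=200M count=100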
Open questions:
- Will TRIM propagate through this nested “mdadm”, LVM, iSCSI stack? (a quick way to check what each layer advertises is sketched below)
- Will barriers be supported properly in this setup? This is required to prevent data loss or complete file-system corruption on sudden server reboots, power loss, etc.
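One quick, non-conclusive check for the TRIM question is to look at what each layer advertises for discard support; non-zero DISC-GRAN/DISC-MAX values mean the layer at least claims to pass discards down. The device names below are illustrative:

# on the client: the stripe, one mirror and one iSCSI disk
lsblk --discard /dev/md90 /dev/md1 /dev/sdb
# on a storage server: the logical volume and the underlying EBS volume
lsblk --discard /dev/vg01/lv01 /dev/xvdf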
Conclusion: Linux “mdadm” software RAID is able to completely utilize the 10 Gigabit network bandwidth, which equals 1.2 GB/s of useful data. Furthermore, it can scale to at least 47k IOPS.
June 25, 2015 at 5:02 am
Hi Ivan,
nice writeup, thanks for sharing your results.
A side note:
The difference in network results reported by iperf vs. iftop is quite striking. So I went and tested it myself, and I could not get them to differ by more than 3%, yet your difference is above 8.6%.
I did a little writeup here, though most of the conclusions are still missing:
http://blog.a2o.si/2015/06/25/network-speed-testing-discrepancy-iperf-vs-iftop-vs-iptraf/
My initial conclusion is: tools should provide raw data (bits/s or bytes/s).
June 25, 2015 at 12:54 pm
Hi, the answer turned out a bit long, so I summarized it in a blog article: https://blog.famzah.net/2015/06/25/iperf-and-iftop-accuracy/
Pingback: “iperf” and “iftop” accuracy | /contrib/famzah
July 14, 2015 at 2:16 am
Extremely cool benchmarks! Thanks for sharing!
Tried smaller IO depths? AWS suggests smaller values and points out the negative impact of bigger depths: http://www.slideshare.net/AmazonWebServices/maximizing-amazon-ec2-and-amazon-ebs-performance (slide 28).
It would also be nice to add AWS’s “recommended” setup of 8 EBS volumes (in RAID-0, on a single instance) as a baseline. It wouldn’t have the fault-tolerance or HA properties, but might offer comparable performance at a lower price.
July 14, 2015 at 8:30 pm
I was able to max out both the bandwidth of 10 Gbit/s (of the client machine) and the IOPS (47k measured, versus the theoretical 15 EBS volumes x 3k IOPS = 45k): http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html#IOcredit
“Each volume (of General Purpose SSD) receives an initial I/O credit balance of 5,400,000 I/O credits, which is enough to sustain the maximum burst performance of 3,000 IOPS for 30 minutes.”
No matter what other block size or queue depth we use, we can’t exceed this total bandwidth and IOPS with this network setup.
In terms of bandwidth, we could theoretically deliver 1200 MB/s using 10 EBS volumes instead of 15: “General Purpose (SSD) volumes have a throughput limit range of 128 MiB/s for volumes less than or equal to 170 GiB”. With 3 replicas, this means delivering about 400 MB/s from each storage server, which requires 4 EBS volumes per server (each working at 128 MB/s). Thus the theoretically required number is at least 12 EBS volumes when we need a replication factor of 3.
My tests show that an EBS volume can deliver 100+ MB/s reading when testing with “dd” using block size of 1 MB. Note that you need to pre-warm the EBS volume as stated in the following documentation: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-prewarm.html
They use “bs=1M” in the examples 🙂
To be honest, I didn’t focus on a non-fault-tolerant setup here. Using a single instance, you would hit the “Max throughput/instance” limit, which is 800 MB/s for all EBS types: http://aws.amazon.com/ebs/details/
Some instances provide a dedicated EBS connection between your EC2 instance and your EBS volume: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-ec2-config.html
I don’t know whether this dedicated connection is used first and, once it is saturated, the normal connection is then shared between regular network and EBS traffic. It seems that with the EBS-optimized instances you get what is advertised in the above “Amazon EC2 Instance Configuration” page, and the 10-gigabit link is used _exclusively_ for non-EBS traffic.
In a nutshell, the greatest bandwidth you can get from a single EC2 instance is 800 MB/s, which is roughly fully utilized when using 8 EBS volumes x 100 MB/s each (they should theoretically deliver 128 MB/s, but in reality we get around 100 MB/s for reading). The “Max IOPS/instance” limit is 48k IOPS.