The question for today is – does Linux md-RAID scale to 10 Gbit/s?
I wanted to build a proof of concept for scalable, highly available, fault-tolerant, distributed block storage which utilizes commodity hardware, runs on a 10 Gigabit Ethernet network, and uses well-tested open-source technologies. This is, in essence, a simplified version of Ceph. The only single point of failure in this cluster is the client itself, which is inevitable in any solution.
My test lab is hosted on AWS:
- 3x “c4.8xlarge” storage servers
- each of them has 5x 50 GB General Purpose (SSD) EBS volumes attached, which provide up to 160 MiB/s and 3000 IOPS for extended periods of time; practical tests showed a sustained sequential read/write performance of 100 MB/s per volume
- each EBS volume is managed via LVM and holds one 15 GB logical volume
- each 15 GB logical volume is exported via iSCSI to the client machine (a command sketch of the whole storage stack follows this list)
- 1x “c4.8xlarge” client machine
- the client machine initiates an iSCSI connection to each 15 GB logical volume, and thus ends up with 15 identical iSCSI block devices (3 storage servers x 5 block devices = 15 block devices)
- to achieve a 3x replication factor, the 15 iSCSI block devices are grouped into 5x mdadm software RAID-1 (mirror) devices; each RAID-1 device (“md1” to “md5”) contains three disks, one from each storage server, so that if one or even two of the storage servers fail, the RAID-1 device keeps operating
- all RAID-1 devices “md1” to “md5” are grouped into a single RAID-0 (stripe) device “md90”, in order to combine the bandwidth of all of them into a single block device; “md90” also combines the capacity of “md1” to “md5”, which equals 75 GB
- 10 Gigabit network in a VPC using Jumbo frames
- the storage servers and the client machine were limited on boot to 4 CPUs and 2 GB RAM, in order to minimize the effect of the Linux disk cache
- only sequential and random reading were benchmarked
- Linux md RAID-1 (mirror) does not read from all underlying disks by default, so I had to create a RAID-1E (mirror) configuration; more info here and here; the “mdadm create” options follow:
--level=10 --raid-devices=3 --layout=o3
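For reference, a minimal sketch of how such a stack can be assembled follows. The LVM names, the choice of iSCSI target implementation (LIO via “targetcli”), the IQNs, IP addresses and disk device names are illustrative assumptions, not the exact values from my lab:

# on each storage server: one 15 GB logical volume per EBS disk
pvcreate /dev/xvdf
vgcreate vg_ebs1 /dev/xvdf
lvcreate -L 15G -n lv1 vg_ebs1

# export the logical volume via iSCSI (LIO target, non-interactive targetcli)
targetcli /backstores/block create name=lv1 dev=/dev/vg_ebs1/lv1
targetcli /iscsi create iqn.2015-01.com.example:storage1
targetcli /iscsi/iqn.2015-01.com.example:storage1/tpg1/luns create /backstores/block/lv1
targetcli /iscsi/iqn.2015-01.com.example:storage1/tpg1/acls create iqn.2015-01.com.example:client

# on the client: discover and log in to the targets of each storage server
iscsiadm -m discovery -t sendtargets -p 10.0.0.11
iscsiadm -m node -p 10.0.0.11 --login

# five RAID-1E mirrors, each built from one disk per storage server
mdadm --create /dev/md1 --level=10 --raid-devices=3 --layout=o3 /dev/sdb /dev/sdg /dev/sdl
# ... md2 to md5 are created the same way from the remaining iSCSI disks ...

# stripe the five mirrors into the final 75 GB block device
mdadm --create /dev/md90 --level=0 --raid-devices=5 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5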
Performance results for sequential read/write benchmarked using “dd”:
- Single EBS volume (disk) on the storage server: 100 MB/s read/write
- Single iSCSI exported disk on the client machine: 100 MB/s read/write (no performance loss)
- “md1” RAID-1 (mirror) on the client machine which reads from all three disks simultaneously: about 300 MB/s read (no performance loss)
- “md90” RAID-0 (stripe) from all RAID-1 (mirror) devices on the client machine: 1.2 GB/s read (saturates the 10 Gigabit network)
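The sequential-read figures above can be reproduced with a plain “dd” run against the raw devices; the exact count is arbitrary, and “iflag=direct” matters, as explained in the notes further down:

dd if=/dev/md90 of=/dev/null bs=200M count=300 iflag=direct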
Performance results for random read benchmarked using “fio”:
- Single EBS volume (disk) on the storage server: 3000 IOPS, as promised by AWS; this was not benchmarked separately
- “md90” RAID-0 (stripe) from all RAID-1 (mirror) devices on the client machine: 47000 IOPS random-read of 4k blocks using an IO depth of 256 and spawning 6 processes (utilizes the maximum combined throughput of all 15 EBS devices)
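The “fio” job looked roughly like this; the run time, I/O engine and job name are assumptions, while the block size, I/O depth, number of processes and the “direct” flag are the ones quoted above:

fio --name=randread --filename=/dev/md90 --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=256 --numjobs=6 \
    --runtime=60 --time_based --group_reporting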
A few notes and lessons learned from the benchmarks:
- monitor the CPU usage of your benchmarking processes: if they use 100% CPU, they are probably starving for CPU and cannot measure the I/O throughput properly
- use the “direct” (O_DIRECT) option because the Linux block cache does not scale well for a single-threaded process
- our tests were limited to 800-900 MB/s unless the Linux cache was bypassed; two or more processes lift this limit and can saturate the 10 Gigabit network, if your underlying block devices can handle the random I/O load
- random I/O dropped from 47k IOPS to 12k IOPS without using “direct”, as measured by “fio” running 6 simultaneous processes
- use a big enough block size when benchmarking sequential reads, but not too big; we got 1.2 GB/s using a block size of 200 MB, while a block size of 700 MB or bigger increased network usage but slightly decreased the overall reading speed
- know where your system limits are; sequential reading from “/dev/zero” yielded 8.1 GB/s, which is about 7 times more than what we would need in our use-case
- network utilization for the 1.2 GB/s sequential reading was 9.02 Gbit/s, as measured by “iftop”, which is 98% of the maximum effective bandwidth that we could achieve; great efficiency
- the “iostat” rMB/s measurements showed 1150 MB/s for “md90”, which matches the figure we got with “dd”
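The monitoring itself needs nothing fancy; for example (the interface name is an assumption):

iostat -xm 1        # per-device throughput; watch the rMB/s column for md90
iftop -i eth0       # total bandwidth on the 10 Gigabit interface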
Open questions which remain to be answered:
- Will TRIM propagate through this nested “mdadm”, LVM, iSCSI stack? (see the “lsblk” note after this list)
- Will barriers be supported properly in this setup? This is required to prevent data loss or complete file-system corruption on sudden server reboots, power loss, etc.
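Regarding the TRIM question, a quick first check is whether discard support is advertised at all through the stack; a real end-to-end test would still be needed (the device name is the one used above):

lsblk --discard /dev/md90    # non-zero DISC-GRAN/DISC-MAX means the device advertises discard support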
Conclusion: Linux “mdadm” software RAID is able to completely utilize a 10 Gigabit network, which equals 1.2 GB/s of useful data. Furthermore, it can scale to at least 47k IOPS.