The question for today is – does Linux md-RAID scale to 10 Gbit/s?
I wanted to build a proof of concept for a scalable, highly available, fault tolerant, distributed block storage, which utilizes commodity hardware, runs on a 10 Gigabit Ethernet network, and uses well-tested open-source technologies. This is a simplified version of Ceph. The only single point of failure in this cluster is the client itself, which is inevitable in any solution.
Here is an overview diagram of the setup:
My test lab is hosted on AWS:
- 3x “c4.8xlarge” storage servers
- each of them has 5x 50 GB General Purpose (SSD) EBS volumes attached, which provide up to 160 MiB/s and 3000 IOPS for extended periods of time; practical tests showed 100 MB/s of sustained sequential read/write performance per volume
- each EBS volume is managed via LVM, and a single 15 GB logical volume is created on it (see the sketch after this list)
- each 15 GB logical volume is exported via iSCSI to the client machine
- 1x “c4.8xlarge” client machine
- the client machine initiates an iSCSI connection to each of the 15 GB logical volumes, and thus sees 15 identical iSCSI block devices (3 storage servers x 5 block devices = 15 block devices)
- to achieve a 3x replication factor, the block devices are grouped into 5x mdadm software RAID-1 (mirror) devices; each RAID-1 device “md1” to “md5” contains three disks, one from each storage server, so that the failure of one or even two storage servers does not affect the operation of the RAID-1 device
- all RAID-1 devices “md1” to “md5” are grouped into a single RAID-0 (stripe), in order to combine the bandwidth of all devices into a single block device, namely the “md90” RAID-0 device, which also combines the capacity of all “md1” to “md5” devices, for a total of 75 GB
- 10 Gigabit network in a VPC using Jumbo frames
- “iperf” measured a 10.07 Gbit/s maximum throughput between the servers; at the same time, “iftop” displayed 9.20 Gbit/s
- VPC does not support network broadcasts, so ATA over Ethernet was not an option
- the storage servers and the client machine were limited on boot to 4 CPUs and 2 GB RAM, in order to minimize the effect of the Linux disk cache
- only sequential and random reading were benchmarked
- Linux md RAID-1 (mirror) does not read from all underlying disks by default, so I had to create a RAID-1E (mirror) configuration; more info here and here; the “mdadm --create” options follow (a full assembly sketch is given after this list):
--level=10 --raid-devices=3 --layout=o3
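For illustration, the storage-server side can be set up roughly as follows. This is only a sketch: the device name (/dev/xvdf), the volume-group and logical-volume names, the IQNs and the use of LIO’s “targetcli” are illustrative assumptions, not the exact commands from my lab; any iSCSI target implementation would do, and a single volume group with five logical volumes would work just as well.

# one LVM stack per EBS volume (repeat for each of the 5 volumes); names are illustrative
pvcreate /dev/xvdf
vgcreate vg01 /dev/xvdf
lvcreate --name lv01 --size 15G vg01

# export the logical volume over iSCSI using LIO's "targetcli" (one possible tool)
targetcli /backstores/block create name=lv01 dev=/dev/vg01/lv01
targetcli /iscsi create iqn.2015-06.net.example:storage1-lv01
targetcli /iscsi/iqn.2015-06.net.example:storage1-lv01/tpg1/luns create /backstores/block/lv01
targetcli /iscsi/iqn.2015-06.net.example:storage1-lv01/tpg1/acls create iqn.2015-06.net.example:client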
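On the client side, the assembly looks roughly like this. The portal IP, IQNs and /dev/sdX names are again illustrative; the only parameters taken directly from the setup above are the RAID-1E “mdadm” options and the RAID-0 stripe over the five mirrors.

# discover and log in to each exported logical volume (open-iscsi); portal/IQN are illustrative
iscsiadm --mode discovery --type sendtargets --portal 10.0.0.11
iscsiadm --mode node --targetname iqn.2015-06.net.example:storage1-lv01 --portal 10.0.0.11 --login

# one 3-way mirror per group of three iSCSI disks, one disk from each storage server
mdadm --create /dev/md1 --level=10 --raid-devices=3 --layout=o3 /dev/sdb /dev/sdg /dev/sdl

# stripe over the five mirrors to get the final 75 GB block device
mdadm --create /dev/md90 --level=0 --raid-devices=5 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5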
Performance results for sequential read/write, benchmarked using “dd” (an example invocation is sketched after the list):
- Single EBS volume (disk) on the storage server: 100 MB/s read/write
- Single iSCSI exported disk on the client machine: 100 MB/s read/write (no performance loss)
- “md1” RAID-1 (mirror) on the client machine which reads from all three disks simultaneously: about 300 MB/s read (no performance loss)
- “md90” RAID-0 (stripe) from all RAID-1 (mirror) devices on the client machine: 1.2 GB/s read (saturates the 10 Gigabit network)
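A “dd” invocation along these lines reproduces the sequential read test; the block size and the O_DIRECT flag follow the technical notes below, while the count value is illustrative:

# sequential read from the striped device, bypassing the Linux page cache
dd if=/dev/md90 of=/dev/null bs=200M count=256 iflag=direct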
Performance results for random read, benchmarked using “fio” (an example job is sketched after the list):
- Single EBS volume (disk) on the storage server: 3000 IOPS, as promised by AWS (no separate tests done)
- “md90” RAID-0 (stripe) from all RAID-1 (mirror) devices on the client machine: 47000 IOPS random-read of 4k blocks using an IO depth of 256 and spawning 6 processes (utilizes the maximum combined IOPS of all 15 EBS volumes)
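The random-read numbers above were obtained with “fio” parameters along these lines (a sketch based on the values quoted above; the run time is illustrative):

# 4k random reads, queue depth 256, 6 processes, page cache bypassed via O_DIRECT
fio --name=randread --filename=/dev/md90 --rw=randread --bs=4k \
    --ioengine=libaio --iodepth=256 --numjobs=6 --direct=1 \
    --runtime=60 --time_based --group_reporting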
Technical notes:
- monitor the CPU usage of your benchmarking processes – if they use 100% CPU, they are probably CPU-bound and cannot measure the I/O throughput properly
- use the “direct” (O_DIRECT) option because the Linux block cache does not scale well for a single-threaded process
- our tests were limited to 800-900 MB/s unless the Linux cache was bypassed; two or more processes lift this limit and can saturate the 10 Gigabit network, if your underlying block devices can handle the random I/O load
- random I/O dropped from 47k IOPS to 12k IOPS without using “direct”, as measured by “fio” running 6 simultaneous processes
- use a big enough block size when benchmarking the sequential reading, but not too big; we got 1.2 GB/s reading using a block size of 200 MB; a block size of 700 MB or bigger increased network usage but slightly decreased the overall reading speed
- see where your system limits are; sequential reading from “/dev/zero” (to “/dev/null”) yielded 8.1 GB/s, which is about 7 times more than what we would need in our use-case (see the sketch after this list)
- network utilization for 1.2 GB/s of sequential reading was 9.02 Gbit/s, as measured by “iftop”, which is 98% of the absolute maximum that we can achieve; great efficiency
- “iostat” rMB/s measurements showed 1150 MB/s for “md90”, which is the same number that we got from “dd”
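The “system limits” baseline mentioned above can be reproduced with a trivial copy that touches no real storage or network; block size and count here are illustrative:

# upper bound of what a single "dd" process can push on this machine
dd if=/dev/zero of=/dev/null bs=200M count=100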
Open questions:
- Will TRIM propagate through this nested “mdadm”, LVM, iSCSI stack? (a quick way to check what each layer advertises is sketched below)
- Will barriers be supported properly in this setup? This is required to prevent data loss or complete file-system corruption on sudden server reboots, power loss, etc.
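One quick, non-conclusive check for the TRIM question is to look at what each layer advertises for discard support; non-zero DISC-GRAN/DISC-MAX values mean the layer at least claims to pass discards down. The device names below are illustrative:

# on the client: the stripe, one mirror and one iSCSI disk
lsblk --discard /dev/md90 /dev/md1 /dev/sdb
# on a storage server: the logical volume and the underlying EBS volume
lsblk --discard /dev/vg01/lv01 /dev/xvdf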
Conclusion: Linux “mdadm” software RAID is able to completely utilize the 10 Gigabit network bandwidth, which equals 1.2 GB/s of useful data. Furthermore, it can scale to at least 47k IOPS.
June 25, 2015 at 5:02 am
Hi Ivan,
nice writeup, thanks for sharing your results.
A side note:
The difference in network results reported by iperf vs. iftop is quite striking. So I went and tested it myself, and I could not get them to differ by more than 3%, yet your difference is above 8.6%.
I did a little writeup here, though most of the conclusions are still missing:
http://blog.a2o.si/2015/06/25/network-speed-testing-discrepancy-iperf-vs-iftop-vs-iptraf/
My initial conclusion is: tools should provide raw data (bits/s or bytes/s).
June 25, 2015 at 12:54 pm
Hi, the answer turned out a bit long, so I summarized it in a blog article: https://blog.famzah.net/2015/06/25/iperf-and-iftop-accuracy/
Pingback: “iperf” and “iftop” accuracy | /contrib/famzah
July 14, 2015 at 2:16 am
Extremely cool benchmarks! Thanks for sharing!
Tried smaller IO depths? AWS suggests smaller values and points out the negative impact of bigger depths: http://www.slideshare.net/AmazonWebServices/maximizing-amazon-ec2-and-amazon-ebs-performance (slide 28).
It would also be nice to add AWS’s “recommended” setup of 8 EBS volumes (in RAID-0, on a single instance) as a baseline. It wouldn’t have the fault-tolerance or HA properties, but might offer comparable performance at a lower price.
July 14, 2015 at 8:30 pm
I was able to max out both the bandwidth of 10 Gbit/s (of the client machine) and the IOPS (47k measured, versus the theoretical 15 EBS volumes x 3k IOPS = 45k): http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html#IOcredit
“Each volume (of General Purpose SSD) receives an initial I/O credit balance of 5,400,000 I/O credits, which is enough to sustain the maximum burst performance of 3,000 IOPS for 30 minutes.”
No matter what other block size or queue depth we use, we can’t exceed this total bandwidth and IOPS with this network setup.
In terms of bandwidth, we could theoretically deliver 1200 MB/s using 10 EBS volumes instead of 15: “General Purpose (SSD) volumes have a throughput limit range of 128 MiB/s for volumes less than or equal to 170 GiB”. With 3 replicas, this means delivering about 400 MB/s from each storage server, which requires 4 EBS volumes per server (each working at 128 MB/s). Thus the theoretically required number is at least 12 EBS volumes when we need a replication factor of 3.
My tests show that an EBS volume can deliver 100+ MB/s reading when testing with “dd” using block size of 1 MB. Note that you need to pre-warm the EBS volume as stated in the following documentation: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-prewarm.html
They use “bs=1M” in the examples 🙂
To be honest, I didn’t focus on a non-fault-tolerant setup here. Using a single instance, you would hit the “Max throughput/instance” limit, which is 800 MB/s for all EBS types: http://aws.amazon.com/ebs/details/
Some instances provide a dedicated EBS connection between your EC2 instance and your EBS volume: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-ec2-config.html
I don’t know whether this dedicated connection is used first and, once it is saturated, the normal connection is then shared between regular network and EBS traffic. It seems that with the EBS-optimized instances you get what is advertised in the above “Amazon EC2 Instance Configuration” page, and the 10-gigabit link is used _exclusively_ for non-EBS traffic.
In a nutshell, the greatest bandwidth you can get from a single EC2 instance is 800 MB/s, which is roughly fully utilized when using 8 EBS volumes x 100 MB/s each (they should theoretically deliver 128 MB/s, but in reality we get around 100 MB/s for reading). The “Max IOPS/instance” limit is 48k IOPS.