/contrib/famzah

Enthusiasm never stops


5 Comments

Linux md-RAID scalability on a 10 Gigabit network

The question for today is – does Linux md-RAID scale to 10 Gbit/s?

I wanted to build a proof of concept for a scalable, highly available, fault tolerant, distributed block storage, which utilizes commodity hardware, runs on a 10 Gigabit Ethernet network, and uses well-tested open-source technologies. This is a simplified version of Ceph. The only single point of failure in this cluster is the client itself, which is inevitable in any solution.

Here is an overview diagram of the setup:
Linux md-RAID scalability on a 10 Gigabit network

My test lab is hosted on AWS:

  • 3x “c4.8xlarge” storage servers
    • each of them has 5x 50 GB General Purpose (SSD) EBS attached volumes which provide up to 160 MiB/s and 3000 IOPS for extended periods of time; practical tests shown 100 MB/s sustained sequential read/write performance per volume
    • each EBS volume is managed via LVM and there is one logical volume with size 15 GB
    • each 15 GB logical volume is being exported by iSCSI to the client machine
  • 1x “c4.8xlarge” client machine
    • the client machine initiates an iSCSI connection to each single 15 GB logical volume, and thus has 15 identical iSCSI block devices (3 storage servers x 5 block devices = 15 block devices)
    • to achieve a 3x replication factor, the block devices from each storage server are grouped into 5x mdadm software RAID-1 (mirror) devices; each RAID-1 device “md1” to “md5” contains three disks from a different storage server, so that if one or two of the storage servers fail, this won’t affect the operation of the whole RAID-1 device
    • all RAID-1 devices “md1” to “md5” are grouped into a single RAID-0 (stripe), in order to utilize the full bandwidth of all devices into a single block device, namely the “md99” RAID-0 device, which also combines the size capacity of all “md1” to “md5” devices and it equals to 75 GB
  • 10 Gigabit network in a VPC using Jumbo frames
  • the storage servers and the client machine were limited on boot to 4 CPUs and 2 GB RAM, in order to minimize the effect of the Linux disk cache
  • only sequential and random reading were benchmarked
  • Linux md RAID-1 (mirror) does not read from all underlying disks by default, so I had to create a RAID-1E (mirror) configuration; more info here and here; the “mdadm create” options follow: --level=10 --raid-devices=3 --layout=o3 Continue reading