10 Gigabit | /contrib/famzah

While working on my latest pet project, which involved 10 GigE transfers, I noticed a significant difference between the results shown by “iperf” and “iftop”. A fellow blogger also noticed this discrepancy. To get to the bottom of this, I ran some additional tests using different MTU sizes and observed the output of “iperf”, “iftop”, “iptraf”, and the raw Linux network device counters as shown by “ifconfig”.

The test results are summarized in an online spreadsheet: https://goo.gl/MvJC8K

Some notes about each application:

iperf – this tool measures TCP performance, as per its documentation; it therefore counts the useful payload in a TCP/IP transfer, which is layer 4 in the OSI model
iftop – this tool counts all IP packets, as per its documentation; my tests show that it also operates on layer 4, just as “iperf”, because ARP traffic (which sits below the IP layer) is not counted at all; the fact that “iftop” tracks connections and ports also suggests that it operates at layer 4
iptraf – this tool seems to be too old now, and its results were off by a factor of 4 to 5
ifconfig – shows the lowest-level statistics, namely the bytes that passed as RX or TX through the network device; this is the most trusted source of performance data
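To make the "raw counters" idea concrete, here is a minimal sketch of how a transfer rate can be derived from two samples of the device byte counters. It assumes a Linux host, where the kernel also exposes the per-device counters under /sys/class/net/<iface>/statistics/. The function names and the sampling interval are my own choices for illustration, not part of any of the tools above.

```python
import time
from pathlib import Path


def read_counter(iface: str, name: str) -> int:
    """Read one raw device counter (e.g. rx_bytes) from Linux sysfs."""
    return int(Path(f"/sys/class/net/{iface}/statistics/{name}").read_text())


def rate_gbit(bytes_before: int, bytes_after: int, seconds: float) -> float:
    """Turn a byte-counter delta into Gbit/s, using decimal units (10**9 bits)."""
    return (bytes_after - bytes_before) * 8 / seconds / 1e9


def sample(iface: str, seconds: float = 1.0):
    """Return the (rx, tx) rates in Gbit/s measured over one interval."""
    rx0 = read_counter(iface, "rx_bytes")
    tx0 = read_counter(iface, "tx_bytes")
    time.sleep(seconds)
    rx1 = read_counter(iface, "rx_bytes")
    tx1 = read_counter(iface, "tx_bytes")
    return rate_gbit(rx0, rx1, seconds), rate_gbit(tx0, tx1, seconds)
```

Calling `sample("eth0")` would report the rates for an interface named eth0, which is only a placeholder name here. Note that the result is in decimal Gigabit, so it is directly comparable with line-rate figures such as 10 Gbit/s.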

We notice that both “iperf” and “iftop” measure the useful payload data that we can transfer per second. Since all OSI layers have some overhead, let’s take a look at what theory says about bandwidth efficiency in Ethernet:

with a standard MTU frame of 1500 bytes, we get 94.93% efficiency (5.07% overhead)
with a jumbo MTU frame of 9000 bytes, we get 99.14% efficiency (0.86% overhead)

Those numbers correspond very closely with the results shown by “iperf”.
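The two efficiency figures can be reproduced with a few lines of arithmetic. The sketch below assumes IPv4 and TCP headers of 20 bytes each with no options, plus the usual per-frame Ethernet overhead (preamble, header, FCS and inter-frame gap); the constant and function names are mine. If TCP options such as timestamps are in use, the payload per frame, and therefore the efficiency, would be slightly lower.

```python
# Per-frame bytes on the wire that never carry application payload.
PREAMBLE_SFD = 8     # preamble + start-of-frame delimiter
ETH_HEADER = 14      # destination MAC, source MAC, EtherType
ETH_FCS = 4          # frame check sequence
INTERFRAME_GAP = 12  # mandatory idle time, expressed in bytes
IP_HEADER = 20       # IPv4 header without options (assumption)
TCP_HEADER = 20      # TCP header without options (assumption)


def tcp_efficiency(mtu: int, tcp_options: int = 0) -> float:
    """Fraction of the wire time spent on TCP payload for a given MTU."""
    payload = mtu - IP_HEADER - TCP_HEADER - tcp_options
    on_wire = mtu + PREAMBLE_SFD + ETH_HEADER + ETH_FCS + INTERFRAME_GAP
    return payload / on_wire


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {tcp_efficiency(mtu) * 100:.2f}% efficiency")
```

For an MTU of 1500 this is 1460 payload bytes out of 1538 bytes on the wire, and for an MTU of 9000 it is 8960 out of 9038.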

It’s only “iftop” which differs a lot. Analysis of its source code reveals the reason for this and how we must interpret the displayed results:

/*
 * ui.c
 */

void ui_print() {
...
    mvaddstr(y, COLS - 8 * HISTORY_DIVISIONS - 8, "rates:");

    draw_totals(&totals);
}

void draw_totals(host_pair_line* totals) {
    for(j = 0; j < HISTORY_DIVISIONS; j++) {
        /* note the 4th argument: ksize is passed as 1024 */
        readable_size((totals->sent[j] + totals->recv[j]) , buf, 10, 1024, options.bandwidth_in_bytes);
...
}

/*
 * ui_common.c
 */

/*
 * Format a data size in human-readable format
 */
void readable_size(float n, char* buf, int bsize, int ksize, int bytes) {
    float size = 1;
...
    while(1) {
      /* each unit step (K, M, G) multiplies the divisor by ksize */
      size *= ksize;
...
        snprintf(buf, bsize, " %4.2f%s", n / size, bytes ? unit_bytes[i] : unit_bits[i]);

The authors of “iftop” decided to scale the values by Gibibit (multiples of 1024) instead of the more common Gigabit (multiples of 1000). This makes the discrepancy in the values shown by “iftop” grow as the transfer rate gets higher. At the Gigabit scale the difference is about 7%.

Once the “iftop” values are converted from Gibibit to Gigabit, they also match the results shown by “iperf” and the raw Linux network device counters.
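The conversion itself is a single multiplication. As a small sketch (the function names are my own), this shows how the gap between binary and decimal units grows with each unit step, and how an “iftop” reading can be converted to decimal Gigabit:

```python
def binary_vs_decimal_gap(power: int) -> float:
    """Relative gap between 1024**power and 1000**power (K=1, M=2, G=3)."""
    return 1024 ** power / 1000 ** power - 1


def gibibit_to_gigabit(value: float) -> float:
    """Convert a rate in Gibibit/s (2**30 bits) to Gigabit/s (10**9 bits)."""
    return value * 1024 ** 3 / 1000 ** 3


for name, power in (("kilo", 1), ("mega", 2), ("giga", 3)):
    print(f"{name}: {binary_vs_decimal_gap(power) * 100:.2f}% difference")

# Example: a reading of 9.20 displayed by iftop, expressed in decimal Gigabit/s.
print(f"{gibibit_to_gigabit(9.20):.2f} Gbit/s")
```

The gap is roughly 2.4% at the kilo scale, 4.9% at the mega scale and 7.4% at the giga scale, which is why the mismatch only becomes obvious on fast links.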

The question for today is – does Linux md-RAID scale to 10 Gbit/s?

I wanted to build a proof of concept for a scalable, highly available, fault tolerant, distributed block storage, which utilizes commodity hardware, runs on a 10 Gigabit Ethernet network, and uses well-tested open-source technologies. This is a simplified version of Ceph. The only single point of failure in this cluster is the client itself, which is inevitable in any solution.

Here is an overview diagram of the setup:

My test lab is hosted on AWS:

3x “c4.8xlarge” storage servers
- each of them has 5x 50 GB General Purpose (SSD) EBS volumes attached, which provide up to 160 MiB/s and 3000 IOPS for extended periods of time; practical tests showed 100 MB/s sustained sequential read/write performance per volume
- each EBS volume is managed via LVM and there is one logical volume with size 15 GB
- each 15 GB logical volume is being exported by iSCSI to the client machine
1x “c4.8xlarge” client machine
- the client machine initiates an iSCSI connection to each single 15 GB logical volume, and thus has 15 identical iSCSI block devices (3 storage servers x 5 block devices = 15 block devices)
- to achieve a 3x replication factor, the block devices from the storage servers are grouped into 5x mdadm software RAID-1 (mirror) devices; each RAID-1 device “md1” to “md5” contains three disks, each from a different storage server, so that if one or two of the storage servers fail, this won’t affect the operation of the whole RAID-1 device
- all RAID-1 devices “md1” to “md5” are grouped into a single RAID-0 (stripe), namely the “md99” device, in order to utilize the full bandwidth of all devices through a single block device; it also combines the capacity of all “md1” to “md5” devices, which adds up to 75 GB
10 Gigabit network in a VPC using Jumbo frames
- “iperf” measured a 10.07 Gbit/s maximum throughput between the servers; at the same time, “iftop” displayed 9.20 Gbit/s
- VPC does not support network broadcasts, so ATA over Ethernet was not an option
the storage servers and the client machine were limited on boot to 4 CPUs and 2 GB RAM, in order to minimize the effect of the Linux disk cache
only sequential and random reads were benchmarked
Linux md RAID-1 (mirror) does not read from all underlying disks by default, so I had to create a RAID-1E (mirror) configuration; more info here and here; the “mdadm --create” options follow: --level=10 --raid-devices=3 --layout=o3
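The device layout described above can be sketched in a few lines. The snippet groups the 15 iSCSI block devices so that every mirror gets one volume from each storage server, prints the matching “mdadm --create” commands, and computes the usable capacity. The device paths are hypothetical placeholders (the real names depend on how the iSCSI devices show up on the client), and the RAID-0 command line is my assumption of how “md99” would be assembled; only the --level=10 --raid-devices=3 --layout=o3 options come from the setup itself.

```python
SERVERS = 3             # storage servers
VOLUMES_PER_SERVER = 5  # exported logical volumes per server
VOLUME_GB = 15          # size of each exported logical volume


def device_name(server: int, volume: int) -> str:
    """Hypothetical path of an iSCSI block device on the client."""
    return f"/dev/iscsi/server{server}-vol{volume}"


def mirror_groups():
    """Mirror N holds volume N of every server, so it spans all servers."""
    return [
        [(server, volume) for server in range(1, SERVERS + 1)]
        for volume in range(1, VOLUMES_PER_SERVER + 1)
    ]


def mdadm_commands():
    """Build the command lines for md1..md5 and the md99 stripe on top."""
    commands = []
    for n, group in enumerate(mirror_groups(), start=1):
        members = " ".join(device_name(s, v) for s, v in group)
        commands.append(
            f"mdadm --create /dev/md{n} --level=10 "
            f"--raid-devices={SERVERS} --layout=o3 {members}"
        )
    mirrors = " ".join(f"/dev/md{n}" for n in range(1, VOLUMES_PER_SERVER + 1))
    commands.append(
        f"mdadm --create /dev/md99 --level=0 "
        f"--raid-devices={VOLUMES_PER_SERVER} {mirrors}"
    )
    return commands


def usable_capacity_gb() -> int:
    """Each mirror stores one volume's worth of data; the stripe sums them."""
    return VOLUMES_PER_SERVER * VOLUME_GB


for command in mdadm_commands():
    print(command)
print(f"usable capacity: {usable_capacity_gb()} GB")
```

The script only prints the commands and does not run them, so it is safe to try on any machine.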


Tag Archives: 10 Gigabit

“iperf” and “iftop” accuracy

Linux md-RAID scalability on a 10 Gigabit network