
Speed up SSH connections by splitting known_hosts per host

For a while, my mpssh runs were getting slow. I use it daily against about 1400 Linux hosts, and a trivial true command across 999 parallel SSH sessions had drifted to roughly two minutes. During the run, my desktop would get a sharp CPU spike, and the mpssh executions started interfering with interactive work. I started wondering whether newer OpenSSH packages, the growing host count, or even ssh-agent were to blame.

It turned out that the biggest win was splitting my 2.1 MB ~/.ssh/known_hosts into one small file per host. The ssh_config(5) documentation says that UserKnownHostsFile accepts runtime tokens such as %h, so a path like ~/.ssh/known_hosts_single/%h is valid.

I did not prove the exact lookup algorithm OpenSSH uses internally, so I will not speculate too much there. But the benchmark was clear enough: once I stopped feeding SSH a monolithic known_hosts file, the runtime dropped from about two minutes to about thirty seconds with the same host list and the same default 50 ms delay between forks.

Benchmark Summary

  • Baseline: default SSH behavior, monolithic known_hosts, parallelism of 999. Best time 2m03.482s. This was the original pain point.
  • Per-host known_hosts, default 50 ms delay. Best time 26.840s. About 4.6x faster without any aggressive client-side tuning.
  • Same per-host setup, but 0 ms delay. Best time 16.228s. Faster again, but much harsher on local CPU.
  • Per-host setup plus agent/key experiments. Roughly 27-32s at 50 ms. Disabling ssh-agent or switching from RSA to Ed25519 did not materially change the result.

The spawn delay also mattered, but in a different way. Reducing it from the default 50 ms to 5 ms or 0 ms shaved off more seconds, but it also pushed much harder on local CPU. In one 0 ms run, CPU idle dropped to 0% for about five seconds. That is why I kept the default 50 ms in normal use. Getting down to about 27 to 30 seconds while keeping the machine responsive was already good enough.

I also chased a couple of dead ends. I saw ssh-agent spike to 100% CPU often enough that it looked suspicious, so I tested a temporary passwordless key and also forced IdentityAgent=none. I also tried Ed25519 instead of my older RSA key. Neither changed the overall picture in a meaningful way.

My ~/.ssh/config is also fairly large. I even tried splitting the alias-heavy part into a separate include file of about 78 KB, guarded by a Match originalhost stanza, because mpssh uses the full hostnames and those aliases are irrelevant for the benchmarked hosts. That did not help either. OpenSSH still reads the included file in order to parse it, even if it does not end up matching the current host. I still keep that Match stanza around, though, because it may become useful in the future if OpenSSH ever starts handling this case more efficiently.

# mpssh uses full hostnames, so this alias file is irrelevant here
Match originalhost ??,???,????
Include config.short-host-aliases

How To Split known_hosts Per Host

I wrote a small helper script for this and put it in the mpssh repository. The script reads hostnames from standard input or from a file, resolves hostnames to IP addresses, extracts matching entries from the monolithic file with ssh-keygen -F, and writes one small file per host into ~/.ssh/known_hosts_single. It also handles custom-port entries such as [git.example.com]:7999.
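The core extraction step can be sketched in a few lines of shell. This is only an illustration of the idea, not the actual convert.py (which additionally resolves hostnames to IPs, handles custom-port entries, and reports progress); the throwaway demo key, the hostname and all paths here are invented for the example:

```shell
# Generate a throwaway key pair so the demo monolithic file contains a valid entry
ssh-keygen -q -t ed25519 -N '' -f ./demo_key

# A miniature "monolithic" known_hosts with one multi-name entry
printf 'example.com,203.0.113.10 %s\n' \
  "$(cut -d' ' -f1,2 ./demo_key.pub)" > ./known_hosts.monolith

# Extract the entries for one host with ssh-keygen -F and store them in a
# per-host file; -F also prints "# Host ... found" comments, which we drop
host=example.com
mkdir -p ./known_hosts_single
ssh-keygen -f ./known_hosts.monolith -F "$host" \
  | grep -v '^#' > "./known_hosts_single/$host"
```

Running this for every name in your server list yields the ~/.ssh/known_hosts_single/ layout described above.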

If HashKnownHosts was enabled in your SSH configuration, converting usually requires a plain-text list of all your servers, because the monolithic file does not contain readable hostnames anymore. If HashKnownHosts was disabled, you can usually extract that list from the existing monolithic known_hosts file with a simple cat and awk pipeline.
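For the unhashed case, that pipeline can be as simple as the following sketch. The sample entries are invented for the illustration; the only real assumption is the known_hosts format, where the first field of each line may hold several comma-separated names:

```shell
# Build a small sample of an unhashed known_hosts file (keys abbreviated)
printf '%s\n' \
  'example.com,203.0.113.10 ssh-ed25519 AAAAC3Nz...' \
  '[git.example.com]:7999 ssh-rsa AAAAB3Nz...' \
  > ./known_hosts.sample

# Field 1 holds the hostname(s); split comma-separated aliases and dedupe
awk '{print $1}' ./known_hosts.sample | tr ',' '\n' | sort -u > ./servers.list
cat ./servers.list
```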

Here is the migration flow I used, rewritten with generic hostnames and paths:

mv ~/.ssh/known_hosts ~/.ssh/known_hosts.monolith
mkdir -p ~/.ssh/known_hosts_single

python3 known_hosts_single/convert.py \
  --known-hosts-file ~/.ssh/known_hosts.monolith \
  --input-file ./servers.list \
  --progress

If you want to test a couple of entries first, the script can also read from standard input:

printf '%s\n' example.com '[git.example.com]:7999' 203.0.113.10 | \
python3 known_hosts_single/convert.py \
  --known-hosts-file ~/.ssh/known_hosts.monolith \
  --progress

Then edit ~/.ssh/config so that SSH uses the per-host files. I explicitly disable GlobalKnownHostsFile because my setup does not rely on a system-wide known_hosts file. If yours does, do not copy that line. I also set HashKnownHosts no, because once the host identity is already visible in the %h filename, hashing the contents of the tiny per-host file no longer buys much. I kept strict host key checking enabled because this was a performance optimization, not a security shortcut:

Host *
    GlobalKnownHostsFile none
    UserKnownHostsFile ~/.ssh/known_hosts_single/%h
    HashKnownHosts no
    StrictHostKeyChecking yes

The important part is %h. SSH expands it to the target hostname, so each connection only opens the tiny file for that host instead of making every connection consult one large shared file.

Reproducing The Benchmark

For an apples-to-apples comparison, these are the important commands. I kept -p 999 because that was the clean baseline I measured before and after the change:

# Baseline
time mpssh -p 999 -u root -f ./servers.list true

# Same host list, but with per-host known_hosts files
time mpssh -p 999 -u root -f ./servers.list \
  -O 'o UserKnownHostsFile=~/.ssh/known_hosts_single/%h' \
  -O 'o StrictHostKeyChecking=yes' \
  true

# More aggressive spawning
time mpssh -p 999 -d 0 -u root -f ./servers.list \
  -O 'o UserKnownHostsFile=~/.ssh/known_hosts_single/%h' \
  -O 'o StrictHostKeyChecking=yes' \
  true

If you want to experiment further, mpssh also lets you adjust the delay between SSH forks with -d MSEC. In my case, lower values were useful for benchmarks but not for everyday use because they pushed too much CPU pressure back onto the local machine.

One more thing worth keeping in mind is ControlMaster with ControlPersist. That OpenSSH feature can reuse an already established connection to the same host for later sessions. I have not benchmarked it for this workload, but for repeated connections to the same machines it has the potential to reduce SSH connection setup overhead a lot.
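If you want to experiment with it, a typical client-side setup looks roughly like this (the socket path and the 10-minute idle timeout are illustrative values, not something I have tuned for this workload):

```
Host *
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
```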

Long story short, if you fan out SSH connections to hundreds or thousands of hosts, do not assume that the network or the private key type is the only thing worth checking. A large known_hosts file can be enough to waste more than a minute and a lot of CPU per batch. Splitting it per host kept host key verification in place and made mpssh feel fast again.



posix_spawn() performance benchmarks and usage examples

The glibc library has an efficient posix_spawn() implementation since glibc version 2.24 (2016-08-05). I have awaited this feature for a long time.

TL;DR: posix_spawn() in glibc 2.24+ is really fast. You should replace the old system() and popen() calls with posix_spawn().

Today I ran all benchmarks of the popen_noshell() library, which basically emulates posix_spawn(). Here are the results:

Test                                            Uses pipes  User CPU  System CPU  Total CPU  Slower with
vfork() + exec(), standard Libc                 No          7.4       1.6         9.0
the new noshell, default clone(), compat=1      Yes         7.7       2.1         9.7        8%
the new noshell, default clone(), compat=0      Yes         7.8       2.0         9.9        9%
posix_spawn() + exec() no pipes, standard Libc  No          9.4       2.0         11.5       27%
the new noshell, posix_spawn(), compat=0        Yes         9.6       2.7         12.3       36%
the new noshell, posix_spawn(), compat=1        Yes         9.6       2.7         12.3       37%
fork() + exec(), standard Libc                  No          40.5      43.8        84.3       836%
the new noshell, debug fork(), compat=1         No          41.6      45.2        86.8       863%
the new noshell, debug fork(), compat=0         No          41.6      45.3        86.9       865%
system(), standard Libc                         No          67.3      48.1        115.4      1180%
popen(), standard Libc                          Yes         70.4      47.1        117.5      1204%

The fastest way to run something externally is to call vfork() and immediately exec() after it. This is the best solution if you don't need to capture the output of the command, nor supply any data on its standard input. As you can see, the standard system() call is about 12 times slower at performing the same operation. The good news is that posix_spawn() + exec() is almost as fast as vfork() + exec(). If we don't care about the 27% slowdown, we can use the standard posix_spawn() interface.

It gets more complicated and slower if you want to capture the output or send data to stdin. In such a case you have to duplicate stdin/stdout descriptors, close one of the pipe ends, etc. The popen_noshell.c source code gives a full example of all this work.

We can see that the popen_noshell() library is still the fastest option to run an external process and be able to communicate with it. The popen_noshell() call is just 8% slower than the absolute ideal result of a simple vfork() + exec().

There is more good news — posix_spawn() is also very efficient! It does lag 36% behind the vfork() + exec() marker, but it is still 12 times faster than the old-school popen() glibc alternative. Using the standard posix_spawn() makes your source code easier to read, better supported by the mainstream glibc library, and free of external library dependencies.

The replacement of system() using posix_spawn() is rather easy as we can see in the “popen-noshell/performance_tests/fork-performance.c” function posix_spawn_test():

/* the same as system() but using posix_spawn(), which is 12 times faster */
void posix_spawn_test() {
	pid_t pid;
	char * const argv[] = { "./tiny2" , NULL };

	if (posix_spawn(&pid, "./tiny2", NULL, NULL, argv, environ) != 0) {
		err(EXIT_FAILURE, "posix_spawn()");
	}

	parent_waitpid(pid);
}

If you want to communicate with the external process, there are a few more steps which you need to perform like creating pipes, etc. Have a look at the source code of “popen_noshell.c“. If you search for the string “POPEN_NOSHELL_MODE”, you will find two alternative blocks of code — one for the standard way to start a process and manage pipes in C, and the other block will show how to perform the same steps using the posix_spawn() family functions.

Please note that posix_spawn() is a completely different implementation than system() or popen(). If it’s not safe to use the faster way, posix_spawn() may fall back to the slow fork().



“dd” sequential write performance tests on a raw block device may be incorrect

…if you use the inappropriate bytes size (bs) option. See the man page of dd for details on this option.

Hard disks have a typical block size of 512 bytes. LVM, on the other hand, creates its block devices with a block size of 4096 bytes. So it's easy to get confused: even if you know that disks should be tested with 512-byte blocks, you shouldn't test LVM block devices with a 512-byte but with a 4096-byte block size.

What happens if you make a write performance test by writing directly on the raw block device and you use the wrong bytes size (bs) option?

If you look at the "iostat" statistics, they will show lots of read requests too, even though you are only writing; that is not what you expect from a write-only test. The problem comes from the fact that when you are not using the proper block size for the raw block device, you are writing partial blocks instead of whole ones. That is physically impossible: the block device can only write one whole block at a time. In order to update only part of a block, that block must first be read back, then modified in memory with the new partial data, and finally written back as a whole block.

The total performance drop is about 3 times on the systems I tested. I’ve tested this on some hard disks and on an Areca RAID-6 volume.

So what’s the lesson here?

When you do sequential write performance tests with “dd” directly on the raw block device, make sure that you use the proper bytes size option, and verify that during the tests you see only write requests in the “iostat” statistics.

Physical hard disk example:

# Here is a bad example for a hard disk device
dd if=/dev/zero of=/dev/sdb1 bs=256 count=5000000

# Here is the proper usage, because the /dev/sdb physical block size is 512 bytes
dd if=/dev/zero of=/dev/sdb1 bs=512 count=5000000 

LVM block device example:

# Another bad example, this time for an LVM block device
dd if=/dev/zero of=/dev/sdb-vol/test bs=512 count=1000000

# Here is the proper usage, because the LVM block size is 4096 bytes
dd if=/dev/zero of=/dev/sdb-vol/test bs=4k count=1000000

Understanding the “iostat” output during a “dd” test:

Here is what “iostat” displays when you are not using the proper bytes size option (lots of read “r/s” and “rsec/s” requests):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  5867.40 3573.20   46.40 28585.60 47310.40 20.97   110.38   30.61   0.28 100.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00 0.00     0.00    0.00   0.00   0.00
sdb2              0.00  5867.40 3572.80   46.40 28582.40 47310.40 20.97   110.38   30.61   0.28 100.00
dm-2              0.00     0.00 3572.80 5913.80 28582.40 47310.40 8.00 13850.92 1465.43   0.11 100.00 

Here is what it should display (no read “r/s” or “rsec/s” requests at all):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00 16510.00    0.00  128.60     0.00 131686.40 1024.00   107.82  840.32   7.78 100.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00 0.00     0.00    0.00   0.00   0.00
sdb2              0.00 16510.00    0.00  128.60     0.00 131686.40 1024.00   107.82  840.32   7.78 100.00
dm-2              0.00     0.00    0.00 16640.00     0.00 133120.00 8.00 13674.86  823.73   0.06 100.00 

How to be safe?

Fortunately, file systems are smart enough and pay attention to the block size of the block devices they were mounted on. So if you do a “dd” write performance test and write to a file, you should be fine. Though in this case there are some other complications like journaling, commit intervals, barriers, mount options, etc.
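A file-based run under those caveats can be sketched like this. The output path and sizes are arbitrary; conv=fdatasync is a standard dd option that makes dd flush the data to disk before it reports the timing, so the page cache does not inflate the result:

```shell
# Write 100 MiB to a regular file on the mounted file system;
# the file system picks the correct block size for the underlying device
dd if=/dev/zero of=./dd-testfile bs=1M count=100 conv=fdatasync
```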



A much faster popen() and system() implementation for Linux

This project is now hosted on GitHub: https://github.com/famzah/popen-noshell


Problem definition
As we already discussed, fork() is slow. What do we do if we want to make many popen() calls and still spend less money on hardware?

The parent process calling the popen() function communicates with the child process by reading its standard output. Therefore, we cannot use vfork() to speed things up, because it doesn’t allow the child process to close its standard output and duplicate the passed file descriptors from the parent to its standard output before exec()’uting the command. A child process created by vfork() can only call exec() right away, nothing more.

If we used threads to re-implement popen(), because the creation of a thread is very light-weight, we couldn’t then use exec(), because invoking exec() from a thread terminates the execution of all other threads, including the parent one.

Problem resolution
We need a fork mechanism which is similar to threads and vfork() but still allows us to execute commands other than just exec().

The system call clone() comes to the rescue. Using clone() we create a child process which has the following features:

  • The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is created. As a result, any change to any non-stack variable made by the child is visible to the parent process. This is similar to threads, therefore completely different from fork(), and also very dangerous – we don't want the child to mess up the parent.
  • The child starts from an entry function which is called right after the child is created. This is like threads, and unlike fork().
  • The child has a separate stack space, which is similar to threads and fork(), but entirely different from vfork().
  • Most important: this thread-like child process can call exec().

In a nutshell, by calling clone in the following way, we create a child process which is very similar to a thread but still can call exec():

pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);

The child starts at the function fn(arg). We have allocated some memory for the stack, which must be aligned. There are some important notes (valid at the time of writing) which I learned by reading the source of glibc and the Linux kernel:

  • On all supported Linux platforms the stack grows down, except for HP-PARISC. You can grep the kernel source for “STACK_GROWSUP”, in order to get this information.
  • On all supported platforms by GNU libc, the stack is aligned to 16 bytes, except for the SuperH platform which is aligned to 8 bytes. You can grep the glibc source for “STACK_ALIGN”, in order to get this information.

Note that this trick is tested only on Linux. I failed to make it work on FreeBSD.

Usage
Once we have this child process created, we carefully avoid touching any global variables of the parent process, do some file descriptor magic so that the standard output of the child can be bound to a file descriptor in the parent, and execute the given command with its arguments.

You will find detailed examples and use-cases in the source code. A very simplified example follows with no error checks:

fp = popen_noshell("ls", (const char * const *)argv, "r", &pclose_arg, 0);
while (fgets(buf, sizeof(buf)-1, fp)) {
    printf("Got line: %s", buf);
}
status = pclose_noshell(&pclose_arg);

There is a more compatible version of popen_noshell() which accepts the command and its arguments as one whole string, but its usage is discouraged, because it tries to very naively emulate simple shell arguments interpretation.

Benchmark results
I’ve done several tests on how fast is popen_noshell() compared to popen() and even a bare fork()+exec(). All the results are similar and therefore I’m publishing only one of the benchmark results:
[Benchmark chart: tested functions on Linux – popen_noshell(), fork(), vfork(), popen(), system()]


The source code, benchmarks and usage examples are available in the GitHub repository linked above.

I will appreciate any comments on the library.



fork() gets slower as parent process uses more memory

Background information
Forking is an important, fundamental part of Unix, critical to the support of its design philosophy. For example, if a process wants to execute a command, it has to fork() a child process which then immediately calls exec(). And since the philosophy of Unix involves executing many small commands/programs, in order to achieve something meaningful, it turns out that fork() is called pretty often.

There are two main fork() patterns when a parent process wants to execute a command:

  • The parent does not need to communicate with the child process in any way – the child process executes, and the parent gets its exit code back. No input/output with the child is done at all, only the initial command line arguments are passed.
  • The parent needs to communicate with the child process – either it needs to supply something at the standard input or some other file descriptor of the child, or it wants to get some information from the standard output or some other file descriptor of the child, or both.

For the case when there is no communication involved, the Unix guys developed the vfork() call. It is a very light-weight version of fork(), very close to threading. The gotcha is that a child process created by vfork() cannot modify any variables or do anything else at all, because it has no stack of its own. The child process is only allowed to call exec() right after it is born, nothing more. This speeds up the usual fork()-then-exec() model, because often the parent does not need to communicate with the child process – the parent just wants the command executed with the given command line arguments.

For all other cases when the parent communicates with the child internally using file descriptors (anonymous pipes, etc.), the standard fork() system call is used.

Problem definition
It turns out that when the parent process allocates some memory, the fork() call takes longer to execute if a bigger amount of this allocated memory is being used, that is – if the parent process writes something there. Linux and probably the other Unix systems employ a copy-on-write feature and don’t physically copy the allocated memory from the parent into the child process initially on each fork(). Not until the child modifies it. Nevertheless, the fork() call gets slower and slower as more and more memory is being used (not just allocated) in the parent process. It seems that even though the data of the allocated/used memory itself is not being copied, thanks to the copy-on-write feature, the internal virtual memory structures in the kernel, which hold the information about how much and what memory the parent process has allocated, are being copied in an inefficient way while the child process is being created by fork().

Currently available options
So why don’t we then just vfork() always? It is very fast. And the answer is – because we cannot communicate with the child process when it was created by vfork().

Okay, so why don’t we use threads then? They are similar to vfork(), only that the child process (thread) has its own stack and shares the data segment (allocated memory) of the parent process. We can even use these shared data variables for inter-process communication. And the answer is – because a thread cannot invoke exec() by definition. This is not supported by the threading libraries, as required by POSIX.1.

Talk to me in numbers, how slower does fork() get in regards to memory allocation and usage
Here are some benchmark results. The forking is done a few thousand times, in order to accumulate a measurable amount of CPU time. The program which is being exec()'ed is a very tiny binary which contains only two system calls – write("Hello world") and then _exit(0). The benchmark results follow:

System info                    Allocated memory  Usage ratio  vfork() + exec()  fork() + exec()
Linux 2.6.28-15-generic i686   20MB              1:2 (10MB)   1.49              12.08
Linux 2.6.28-15-generic i686   20MB              1:1 (20MB)   1.53              21.60
Linux 2.6.28-15-generic i686   40MB              1:2 (20MB)   1.59              21.23
FreeBSD 7.1-RELEASE-p4 i386    20MB              1:2 (10MB)   2.26              20.22
FreeBSD 7.1-RELEASE-p4 i386    40MB              1:2 (20MB)   2.44              33.94

As we can see from the test results, the vfork() call is not affected by the amount of memory usage. This does not apply to fork(), though. On Linux we observe almost two times more CPU usage as the memory usage is doubled. On FreeBSD the results are similar, only a bit better – if the memory usage is doubled, the CPU usage of fork() increases by 50% (vs. 100% on Linux). Even so, the difference in CPU time between the vfork() and fork() calls is significant on both operating systems – fork() is more than 1000% slower.

You can read my next article which describes a solution for Linux which allows a parent process to communicate with its child, similar to fork(), but is as fast as vfork(). The article also contains more detailed information about the benchmark tests we did, in order to populate the above table.