/contrib/famzah

Enthusiasm never stops


5 Comments

posix_spawn() performance benchmarks and usage examples

The glibc library has an efficient posix_spawn() implementation since glibc version 2.24 (2016-08-05). I have awaited this feature for a long time.

TL;DR: posix_spawn() in glibc 2.24+ is really fast. You should replace the old system() and popen() calls with posix_spawn().

Today I ran all benchmarks of the popen_noshell() library, which basically emulates posix_spawn(). Here are the results:

Test Uses pipes User CPU System CPU Total CPU Slower with
vfork() + exec(), standard Libc No 7.4 1.6 9.0
the new noshell, default clone(), compat=1 Yes 7.7 2.1 9.7 8%
the new noshell, default clone(), compat=0 Yes 7.8 2.0 9.9 9%
posix_spawn() + exec() no pipes, standard Libc No 9.4 2.0 11.5 27%
the new noshell, posix_spawn(), compat=0 Yes 9.6 2.7 12.3 36%
the new noshell, posix_spawn(), compat=1 Yes 9.6 2.7 12.3 37%
fork() + exec(), standard Libc No 40.5 43.8 84.3 836%
the new noshell, debug fork(), compat=1 No 41.6 45.2 86.8 863%
the new noshell, debug fork(), compat=0 No 41.6 45.3 86.9 865%
system(), standard Libc No 67.3 48.1 115.4 1180%
popen(), standard Libc Yes 70.4 47.1 117.5 1204%

The fastest way to run something externally is to call vfork() and immediately exec() after it. This is the best solution if you don’t need to capture the output of the command, nor you need to supply any data to its standard input. As you can see, the standard system() call is about 12 times slower in performing the same operation. The good news is that posix_spawn() + exec() is almost as fast as vfork() + exec(). If we don’t care about the 27% slowdown, we can use the standard posix_spawn() interface.

It gets more complicated and slower if you want to capture the output or send data to stdin. In such a case you have to duplicate stdin/stdout descriptors, close one of the pipe ends, etc. The popen_noshell.c source code gives a full example of all this work.

We can see that the popen_noshell() library is still the fastest option to run an external process and be able to communicate with it. The command popen_noshell() is just 8% slower than the absolute ideal result of a simple vfork() + exec().

There is another good news — posix_spawn() is also very efficient! It’s a fact that it lags with 36% behind the vfork() + exec() marker, but still it’s 12 times faster than the popen() old-school glibc alternative. Using the standard posix_spawn() makes your source code easier to read, better supported for bugs by the mainstream glibc library, and you have no external library dependencies.

The replacement of system() using posix_spawn() is rather easy as we can see in the “popen-noshell/performance_tests/fork-performance.c” function posix_spawn_test():

# the same as system() but using posix_spawn() which is 12 times faster
void posix_spawn_test() {
	pid_t pid;
	char * const argv[] = { "./tiny2" , NULL };

	if (posix_spawn(&pid, "./tiny2", NULL, NULL, argv, environ) != 0) {
		err(EXIT_FAILURE, "posix_spawn()");
	}

	parent_waitpid(pid);
}

If you want to communicate with the external process, there are a few more steps which you need to perform like creating pipes, etc. Have a look at the source code of “popen_noshell.c“. If you search for the string “POPEN_NOSHELL_MODE”, you will find two alternative blocks of code — one for the standard way to start a process and manage pipes in C, and the other block will show how to perform the same steps using the posix_spawn() family functions.

Please note that posix_spawn() is a completely different implementation than system() or popen(). If it’s not safe to use the faster way, posix_spawn() may fall back to the slow fork().


Leave a comment

“dd” sequential write performance tests on a raw block device may be incorrect

…if you use the inappropriate bytes size (bs) option. See the man page of dd for details on this option.

Hard disks have a typical block size of 512 bytes. LVM on the other hand creates its block devices with a block size of 4096 bytes. So it’s easy to get confused – even if you know that disks should be tested with blocks of 512 bytes, you shouldn’t test LVM block devices with a 512-bytes but with a 4096-bytes block size.

What happens if you make a write performance test by writing directly on the raw block device and you use the wrong bytes size (bs) option?

If you look at the “iostat” statistics, they will show lots of read requests too, when you are only writing. This is not what is expected when you do only writing.
The problem comes by the fact that when you are not using the proper block size for the raw block device, instead of writing whole blocks, you are writing partial blocks. This is however physically not possible – the block device can only write one whole block at a time. In order to update the data in only a part of a block, this block needs to be read back first, then modified with the new partial data in memory and finally written back as a whole block.

The total performance drop is about 3 times on the systems I tested. I’ve tested this on some hard disks and on an Areca RAID-6 volume.

So what’s the lesson here?

When you do sequential write performance tests with “dd” directly on the raw block device, make sure that you use the proper bytes size option, and verify that during the tests you see only write requests in the “iostat” statistics.

Physical hard disk example:

# Here is a bad example for a hard disk device
dd if=/dev/zero of=/dev/sdb1 bs=256 count=5000000

# Here is the proper usage, because /dev/sda physical block size is 512 bytes
dd if=/dev/zero of=/dev/sdb1 bs=512 count=5000000 

LVM block device example:

# Another bad example, this time for an LVM block device
dd if=/dev/zero of=/dev/sdb-vol/test bs=512 count=1000000

# Here is the proper usage, because the LVM block size is 4096 bytes
dd if=/dev/zero of=/dev/sdb-vol/test bs=4k count=1000000

Understanding the “iostat” output during a “dd” test:

Here is what “iostat” displays when you are not using the proper bytes size option (lots of read “r/s” and “rsec/s” requests):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  5867.40 3573.20   46.40 28585.60 47310.40 20.97   110.38   30.61   0.28 100.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00 0.00     0.00    0.00   0.00   0.00
sdb2              0.00  5867.40 3572.80   46.40 28582.40 47310.40 20.97   110.38   30.61   0.28 100.00
dm-2              0.00     0.00 3572.80 5913.80 28582.40 47310.40 8.00 13850.92 1465.43   0.11 100.00 

Here is what it should display (no read “r/s” or “rsec/s” requests at all):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00 16510.00    0.00  128.60     0.00 131686.40 1024.00   107.82  840.32   7.78 100.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00 0.00     0.00    0.00   0.00   0.00
sdb2              0.00 16510.00    0.00  128.60     0.00 131686.40 1024.00   107.82  840.32   7.78 100.00
dm-2              0.00     0.00    0.00 16640.00     0.00 133120.00 8.00 13674.86  823.73   0.06 100.00 

How to be safe?

Fortunately, file systems are smart enough and pay attention to the block size of the block devices they were mounted on. So if you do a “dd” write performance test and write to a file, you should be fine. Though in this case there are some other complications like journaling, commit intervals, barriers, mount options, etc.


11 Comments

A much faster popen() and system() implementation for Linux

This project is now hosted on GitHub: https://github.com/famzah/popen-noshell


Problem definition
As we already discussed it, fork() is slow. What do we do if we want to make many popen() calls and still spend less money on hardware?

The parent process calling the popen() function communicates with the child process by reading its standard output. Therefore, we cannot use vfork() to speed things up, because it doesn’t allow the child process to close its standard output and duplicate the passed file descriptors from the parent to its standard output before exec()’uting the command. A child process created by vfork() can only call exec() right away, nothing more.

If we used threads to re-implement popen(), because the creation of a thread is very light-weight, we couldn’t then use exec(), because invoking exec() from a thread terminates the execution of all other threads, including the parent one.

Problem resolution
We need a fork mechanism which is similar to threads and vfork() but still allows us to execute commands other than just exec().

The system call clone() comes to the rescue. Using clone() we create a child process which has the following features:

  • The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is created. As a result of this, any change to any non-stack variable made by the child is visible by the parent process. This is similar to threads, and therefore completely different from fork(), and also very dangerous – we don’t want the child to mess up the parent.
  • The child starts from an entry function which is being called right after the child was created. This is like threads, and unlike fork().
  • The child has a separate stack space which is similar to threads and fork(), but entirely different to vfork().
  • The most important: This thread-like child process can call exec().

In a nutshell, by calling clone in the following way, we create a child process which is very similar to a thread but still can call exec():

pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);

The child starts at the function fn(arg). We have allocated some memory for the stack which must be aligned. There are some important notes (valid at the time being) which I learned by reading the source of libc and the Linux kernel:

  • On all supported Linux platforms the stack grows down, except for HP-PARISC. You can grep the kernel source for “STACK_GROWSUP”, in order to get this information.
  • On all supported platforms by GNU libc, the stack is aligned to 16 bytes, except for the SuperH platform which is aligned to 8 bytes. You can grep the glibc source for “STACK_ALIGN”, in order to get this information.

Note that this trick is tested only on Linux. I failed to make it work on FreeBSD.

Usage
Once we have this child process created, we carefully watch not to touch any global variables of the parent process, do some file descriptor magic, in order to be able to bind the standard output of the child process to a file descriptor at the parent, and execute the given command with its arguments.

You will find detailed examples and use-cases in the source code. A very simplified example follows with no error checks:

fp = popen_noshell("ls", (const char * const *)argv, "r", &pclose_arg, 0);
while (fgets(buf, sizeof(buf)-1, fp)) {
    printf("Got line: %s", buf);
}
status = pclose_noshell(&pclose_arg);

There is a more compatible version of popen_noshell() which accepts the command and its arguments as one whole string, but its usage is discouraged, because it tries to very naively emulate simple shell arguments interpretation.

Benchmark results
I’ve done several tests on how fast is popen_noshell() compared to popen() and even a bare fork()+exec(). All the results are similar and therefore I’m publishing only one of the benchmark results:
Tested functions on Linux - popen_noshell(), fork(), vfork(), popen(), system()


Here are the resources which you can download:

I will appreciate any comments on the library.


2 Comments

fork() gets slower as parent process uses more memory

Background information
Forking is an important, fundamental part of Unix, critical to the support of its design philosophy. For example, if a process wants to execute a command, it has to fork() a child process which then immediately calls exec(). And since the philosophy of Unix involves executing many small commands/programs, in order to achieve something meaningful, it turns out that fork() is called pretty often.

There are two main fork() patterns when a parent process wants to execute a command:

  • The parent does not need to communicate with the child process in any way – the child process executes, and the parent gets its exit code back. No input/output with the child is done at all, only the inital command line arguments are passed.
  • The parent needs to communicate with the child process – either it needs to supply something at the standard input or some other file descriptor of the child, or it wants to get some information from the standard output or some other file descriptor of the child, or both.

For the case when there is no communication involved, Unix guys developed the vfork() call. It is a very light-weight version of fork(), very close to threading. The gotcha here is that a child process which was created by vfork() cannot modify any variables in its space at all or do something else, because it has no own stack. The child process is only allowed to call exec() right after it was born, nothing more. This speeds up the usual fork()-then-exec() model, because often the parent does not need to communicate with the child process – the parent just wants the command executed with the given command line arguments.

For all other cases when the parent communicates with the child internally using file descriptors (anonymous pipes, etc.), the standard fork() system call is used.

Problem definition
It turns out that when the parent process allocates some memory, the fork() call takes longer to execute if a bigger amount of this allocated memory is being used, that is – if the parent process writes something there. Linux and probably the other Unix systems employ a copy-on-write feature and don’t physically copy the allocated memory from the parent into the child process initially on each fork(). Not until the child modifies it. Nevertheless, the fork() call gets slower and slower as more and more memory is being used (not just allocated) in the parent process. It seems that even though the data of the allocated/used memory itself is not being copied, thanks to the copy-on-write feature, the internal virtual memory structures in the kernel, which hold the information about how much and what memory the parent process has allocated, are being copied in an inefficient way while the child process is being created by fork().

Currently available options
So why don’t we then just vfork() always? It is very fast. And the answer is – because we cannot communicate with the child process when it was created by vfork().

Okay, so why don’t we use threads then? They are similar to vfork(), only that the child process (thread) has its own stack and shares the data segment (allocated memory) of the parent process. We can even use these shared data variables for inter-process communication. And the answer is – because a thread cannot invoke exec() by definition. This is not supported by the threading libraries, as required by POSIX.1.

Talk to me in numbers, how slower does fork() get in regards to memory allocation and usage
Here are some benchmark results, the forking is done a few thousand times, in order to accumulate an accountable CPU time. The program which is being exec()’ed is a very tiny binary which contains only two system calls – write(“Hello world”) and then _exit(0). The benchmark results follow:

System info Allocated memory Usage ratio vfork() + exec() fork() + exec()
Linux 2.6.28-15-generic i686 20MB 1:2 (10MB) 1.49 12.08
Linux 2.6.28-15-generic i686 20MB 1:1 (20MB) 1.53 21.60
Linux 2.6.28-15-generic i686 40MB 1:2 (20MB) 1.59 21.23
FreeBSD 7.1-RELEASE-p4 i386 20MB 1:2 (10MB) 2.26 20.22
FreeBSD 7.1-RELEASE-p4 i386 40MB 1:2 (20MB) 2.44 33.94

As we can see from the test results, the vfork() call is not affected by the amount of memory usage. This does not apply for fork() though. On Linux we observe almost two times more CPU usage as the memory usage is increased twice. On FreeBSD the results are similar, only a bit better – if the memory usage is increased twice, the CPU usage of fork() is increased with 50% (vs. 100% on Linux). Even though, the difference in CPU time between the vfork() and fork() calls is significant on both operating systems – fork() is more than 1000% slower.

You can read my next article which describes a solution for Linux which allows a parent process to communicate with its child, similar to fork(), but is as fast as vfork(). The article also contains more detailed information about the benchmark tests we did, in order to populate the above table.