Today I’ll share my real-world experience with popen_noshell() on the Nagios monitoring server which we run at work. We actively monitor 1166 hosts and 14250 services. The machine has 6 GB of RAM and a single Intel Core i7-950 CPU with Hyper-Threading enabled (8 threads in total) and a slight overclock. Besides running Nagios, this machine also handles the incoming data from our custom monitoring systems, processes RRD database storage, and generates the status pages and charts for the web interface. So it’s a pretty busy machine which does a lot of network activity, and the Nagios daemon accounts for only a part of the CPU load. For example, since boot the main “nagios3” process has used only 20% of the CPU. The rest has been used by the fork()’ed Perl scripts (we use a lot of them for the active checks), the standard Nagios network checks, and the Apache/PHP web server handling the incoming data.
Recently the machine started to exhaust its CPU resources. First we overclocked it a bit, which gave us 10% more CPU idle time. Then we decided to compile Nagios with the popen-noshell library. This gave us another 10% of CPU idle time, and now the machine is working great again.
I’ll focus on the popen-noshell integration and results, since CPU overclocking is a well-known topic. Here is the chart which shows the CPU usage before and after we re-compiled Nagios with the popen-noshell library:
As we can see, the system-CPU usage dropped from 38% to 31%, which is an 18% improvement. The user-CPU usage dropped from 44% to 41%, which is a 7% improvement. Overall, we gained a 12% speed-up for our workload by just re-compiling Nagios with the popen-noshell library. I should stress that the speed-up depends a lot on your workload. If this machine were busy only with Nagios and the active checks were more CPU efficient (i.e. not written in Perl but in C), then the speed-up could have been much higher, since popen_noshell() is about 10 times faster than the standard popen().
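The percentages above are plain before/after ratios. Here is a minimal sketch of the arithmetic in C; relative_reduction() and print_cpu_savings() are hypothetical helper names used only for illustration, and the input values are the ones read from the chart. The overall figure assumes that system and user CPU are simply added together (82% before, 72% after).

```c
#include <stdio.h>

/* Hypothetical helper: relative reduction, in percent, when a metric
 * drops from `before` to `after`. Not part of popen-noshell or Nagios. */
double relative_reduction(double before, double after)
{
    return (before - after) / before * 100.0;
}

/* Prints the three CPU figures discussed in the text. */
void print_cpu_savings(void)
{
    double sys_before = 38.0, sys_after = 31.0; /* system-CPU usage, % */
    double usr_before = 44.0, usr_after = 41.0; /* user-CPU usage, % */

    printf("system CPU: %.0f%% less\n",
           relative_reduction(sys_before, sys_after));
    printf("user CPU:   %.0f%% less\n",
           relative_reduction(usr_before, usr_after));
    /* Assumption: "overall" means system + user CPU combined. */
    printf("overall:    %.0f%% less\n",
           relative_reduction(sys_before + usr_before,
                              sys_after + usr_after));
}
```

Calling print_cpu_savings() from a main() function reproduces the rounded 18%, 7% and 12% values quoted above.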
Here are the other machine metrics which were also affected by the change:
- Used memory: 39% => 24% (38% less)
- Load average: 39 => 46 (18% higher)
- Fork rate: 8*61 => 8*61 (created processes/second – no change)
Here are the steps you need to perform in order to re-compile the Nagios Debian package with the popen-noshell library integrated:
apt-get install devscripts
apt-get build-dep nagios3-core

# No need to run as "root" from here on
apt-get source nagios3-core
svn checkout http://popen-noshell.googlecode.com/svn/trunk/ popen-noshell
cd nagios3-3.2.1/

# BEGIN: patch Nagios to use popen_noshell_compat()
cp ../popen-noshell/popen_noshell.* base/
vi base/Makefile.in
    OBJS=$(BROKER_O) popen_noshell.o
vi base/utils.c
    #include "popen_noshell.h"
    /* run the command */
    struct popen_noshell_pass_to_pclose pclose_arg;
    fp=(FILE *)popen_noshell_compat(cmd,"r",&pclose_arg);
    /* close the command and get termination status */
    status=pclose_noshell(&pclose_arg);
vi base/checks.c
    2x the same as above
# END: patch Nagios to use popen_noshell_compat()

EDITOR=vim dch -i
    # 3.2.1-2+squeeze1 -> 3.2.1-2+squeeze1-noshell1
    # you must have a trailing number in the added version name
    # after exit, this renames the original directory name
cd ..
mv nagios3_3.2.1.orig.tar.gz nagios3_3.2.1-2+squeeze1.orig.tar.gz
# the source directory was renamed by "dch"
cd nagios3-3.2.1-2+squeeze1/
DEB_BUILD_OPTIONS=nocheck debuild -us -uc
cd ..
sudo dpkg -i nagios3-core_3.2.1-2+squeeze1-noshell1_i386.deb \
    nagios3-common_3.2.1-2+squeeze1-noshell1_all.deb \
    nagios3-cgi_3.2.1-2+squeeze1-noshell1_i386.deb \
    nagios3-doc_3.2.1-2+squeeze1-noshell1_all.deb \
    nagios3_3.2.1-2+squeeze1-noshell1_i386.deb
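To give an idea of what "no shell" means in practice: the standard popen() hands the command string to "/bin/sh -c", so every check starts an extra shell process. The sketch below shows the general idea of running a program directly instead, using only plain POSIX calls (pipe, fork, execvp, waitpid). It is an illustration under my own assumptions and not how popen-noshell is implemented internally; run_noshell() and close_noshell() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0] directly via execvp(), without an
 * intermediate /bin/sh, and return a stream for reading its stdout.
 * argv must be NULL-terminated. The child PID is stored in *pid_out.
 * Returns NULL on failure. */
FILE *run_noshell(char *const argv[], pid_t *pid_out)
{
    int fds[2];
    if (pipe(fds) == -1)
        return NULL;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return NULL;
    }

    if (pid == 0) {
        /* Child: connect the write end of the pipe to stdout, then exec. */
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            if (dup2(fds[1], STDOUT_FILENO) == -1)
                _exit(127);
            close(fds[1]);
        }
        execvp(argv[0], argv);
        _exit(127); /* reached only if execvp() failed */
    }

    /* Parent: keep only the read end. */
    close(fds[1]);
    FILE *fp = fdopen(fds[0], "r");
    if (fp == NULL) {
        close(fds[0]);
        waitpid(pid, NULL, 0); /* reap the child so it does not linger */
        return NULL;
    }
    *pid_out = pid;
    return fp;
}

/* Close the stream and return the child's wait status, or -1 on error. */
int close_noshell(FILE *fp, pid_t pid)
{
    int status;
    fclose(fp);
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return status;
}
```

Note that with this approach the command must already be split into an argument vector, and shell features such as pipes, redirection and wildcard expansion are not available.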