Google App Engine Datastore benchmark

I admire the idea of Google App Engine: a platform as a service where there is “no worrying about DBAs, servers, sharding, and load balancers” and where you can “auto scale to 7 billion requests per day”. I wanted to try the App Engine for a pet project in which I had to collect, process and query a huge amount of time series. The fact that I needed to run fast queries over tens of thousands of records, however, made me wonder if the App Engine Datastore would be fast enough. Note that in order to reduce the number of entities fetched from the database, multiple data entries are consolidated into a single database entity. This, however, imposes another limitation: fetching big data entities uses more memory on the running instance.

My language of choice is Java, because its performance for such computations is great. I am using the Objectify interface (version 4.0rc2), which is also one of the recommended APIs for the Datastore.

Unfortunately, my tests show that the App Engine is not suitable for querying such amounts of data. For example, fetching and updating 1000 entries takes 1.5 seconds and additionally uses a lot of memory on the F1 instance. You can review the Excel sheet file below for more detailed results.

Basically each benchmark test performs the following operations and then exits:

  1. Adds a bunch of entries.
  2. Gets those entries from the database and verifies them.
  3. Updates those entries in the database.
  4. Gets the entries again from the database and verifies them.
  5. Deletes the entries.

All Datastore operations are performed in a batch and thus in an asynchronous, parallel way. Furthermore, no indexes are used; the entities are referenced directly by their key, which is the most efficient way to query the Datastore. The tests were performed on two separate days because I wanted to extend some of the tests; this is indicated in the results. A single warm-up request was made before the benchmarks, so that the App Engine could pre-load our application.
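
To make the access pattern more concrete, here is a minimal sketch of how one benchmark round could look with Objectify 4. The entity name “DataChunk”, the 1 KB payload and the class layout are my illustrative assumptions, not the actual benchmark code (which is available for download below):

import static com.googlecode.objectify.ObjectifyService.ofy;

import com.googlecode.objectify.ObjectifyService;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DatastoreBenchmarkSketch {

    // Hypothetical "consolidated" entity holding a ~1 KB payload.
    @Entity
    public static class DataChunk {
        @Id Long id;
        byte[] payload;

        @SuppressWarnings("unused")
        private DataChunk() {}                       // Objectify needs a no-arg constructor
        DataChunk(long id, byte[] payload) { this.id = id; this.payload = payload; }
    }

    public static void runOneRound() {
        ObjectifyService.register(DataChunk.class);  // normally done once, at application start-up

        List<DataChunk> entries = new ArrayList<DataChunk>();
        List<Long> ids = new ArrayList<Long>();
        for (long i = 1; i <= 1000; i++) {
            entries.add(new DataChunk(i, new byte[1024]));
            ids.add(i);
        }

        // 1. Add: a single batch put; asynchronous until .now() is called.
        ofy().save().entities(entries).now();

        // 2. Get: a single batch load directly by key, no indexes involved.
        Map<Long, DataChunk> fetched = ofy().load().type(DataChunk.class).ids(ids);

        // 3. Update: modify in memory, then one batch put again.
        ofy().save().entities(fetched.values()).now();

        // 4. Get again (verification omitted in this sketch).
        fetched = ofy().load().type(DataChunk.class).ids(ids);

        // 5. Delete: only the keys travel to the Datastore, no payload serialization.
        ofy().delete().type(DataChunk.class).ids(ids).now();
    }
}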

The first observation is that, on the default F1 instance, once we start fetching more than 100 entities, or once we start adding/updating/deleting more than 1000 entities, we saturate the Datastore -> Objectify -> Java throughput and don’t scale any more:
[Chart: App Engine Datastore median time per entity for 1 KB entities @ F1 instance]

The other interesting observation is that the Datastore -> Objectify -> Java throughput depends a lot on the App Engine instance type. That’s not surprising, because the application needs to serialize data back and forth when communicating with the Datastore, and this requires CPU power. The following two charts show that more CPU power speeds up all operations where serialization is involved, that is, all Datastore operations except Delete, which only passes the keys of the entities to the Datastore and transfers no data:
[Chart: App Engine Datastore times per entity for 1000 x 1 KB entities @ F1 instance]

[Chart: App Engine Datastore times per entity for 1000 x 1 KB entities @ F4 instance]

Unexpectedly, the App Engine and the Datastore turn out to have good and bad days: their latency, as well as the CPU accounting, can fluctuate a lot. The following chart shows the benchmark results which we got using an F1 instance. If you compare it to the chart above, where a much more expensive F4 instance was used, you’ll notice that on this day the 4-times cheaper F1 instance performed almost as fast as the F4 instance:
[Chart: App Engine Datastore times per entity for 1000 x 1 KB entities @ F1 instance (test on another day)]

The source code and the raw results are available for download at http://www.famzah.net/download/gae-datastore-performance/

C++: Use the right size for the right job

Do you want to make your C++ application up to 4 times faster? Read on.

Recently I happened to be on a summer holiday with a friend who is a C++ guru. We discussed my language performance benchmark post, which had just received a lot of attention with regard to C++. It took him only a few hours to demonstrate something really important and often underestimated by developers: that we should allocate only as much memory as we actually need, and nothing more.

The C++ algorithm which we use on the benchmark blog page uses an array of “int” (which on Linux happens to be 4 bytes == 32 bits). However, if you review the code you’ll notice that the “int” value is used as a pure boolean flag for the given array index. So my friend proposed the following optimized version which uses “vector<bool>”, a space-optimized specialization that stores each value in a single bit, and thus doesn’t waste the other 31 bits that are left unused when an “int” is used instead:

--- primes.cpp	2013-06-17 17:16:09.000000000 +0300
+++ primes-bool.cpp	2013-06-17 18:43:28.000000000 +0300
@@ -12,9 +12,9 @@
 		res.push_back(2);
 		return res;
 	}
-	vector<int> s;
+	vector<bool> s;
 	for (int i = 3; i < n + 1; i += 2) {
-		s.push_back(i);
+		s.push_back(true);
 	}
 	int mroot = sqrt(n);
 	int half = (int)s.size();
@@ -23,9 +23,9 @@
 	while (m <= mroot) {
 		if (s[i]) {
 			int j = (int)((m*m - 3)/2);
-			s[j] = 0;
+			s[j] = false;
 			while (j < half) {
-				s[j] = 0;
+				s[j] = false;
 				j += m;
 			}
 		}
@@ -33,9 +33,9 @@
 		m = 2*i + 3;
 	}
 	res.push_back(2);
-	for (vector<int>::iterator it = s.begin() ; it < s.end(); ++it) {
-		if (*it) {
-			res.push_back(*it);
+	for (size_t it = 0; it < s.size(); ++it) {
+		if (s[it]) {
+			res.push_back(2*it + 3);
 		}
 	}

The end result is that the original C++ implementation finishes in 18.913 CPU seconds, while the memory-optimized one completes in 5.354 seconds, i.e. in 28% of the original time. Impressive!

The bottom line is that even though RAM is very fast and memory allocation is efficient, using only as much memory as is really needed can vastly improve the performance of your C++ application. And this applies to programs developed in any language.

Using flock() in Bash without invoking a subshell

The flock(1) utility on Linux manages flock(2) advisory locks from within shell scripts or the command line. This lets you synchronize your Bash scripts with all your other applications written in Perl, Python, C, etc.

I’ll focus on the third usage form where flock() is used inside a Bash script. Here is what the man page suggests:

#!/bin/bash

(
flock -s 200

# ... commands executed under lock ...

) 200>/var/lock/mylockfile

Unfortunately, this invokes a subshell which has the following drawbacks:

  • You cannot pass values from the subshell back to variables in the main shell script.
  • There is a performance penalty.
  • The syntax coloring in “vim” does not work properly. :)

This motivated my colleague zImage to come up with a usage form which does not invoke a subshell in Bash:

#!/bin/bash

exec 200>/var/lock/mylockfile || exit 1
flock -n 200 || { echo "ERROR: flock() failed." >&2; exit 1; }

# ... commands executed under lock ...

flock -u 200
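
As an illustration of the first advantage, here is a minimal sketch of my own (reusing the /var/lock/mylockfile example from above) where a value computed under the lock is still visible after the critical section. Note that a plain “flock 200” takes an exclusive lock and waits until it becomes free, while “-n” fails immediately:

#!/bin/bash

exec 200>/var/lock/mylockfile || exit 1
flock 200 || exit 1             # exclusive lock; blocks until the lock is acquired

# ... commands executed under lock ...
VALUE="computed under lock"     # hypothetical result of the critical section

flock -u 200

# no subshell was involved, so the variable is still visible here
echo "Result: $VALUE"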

Google Reader alternative

Google announced that they are shutting down their online web RSS reader on July 1, 2013. What a shame; it was really useful and had a great web design.

After a bit of research, I decided to code one myself, for fun and education. It’s designed to operate in a multi-user way, so if you want to give it a try, go on!

My online implementation is named “xs RSS reader”, short for “extra-simple RSS reader”:

http://www.famzah.net/xs-rss-reader/

Here is a sample demo screenshot:
[Screenshot: xs RSS reader demo]

Nagios: Improve CPU performance with popen_noshell()

Today I’ll share my real-world experience with popen_noshell() on the Nagios monitoring server which we run at work. We are actively monitoring 1166 hosts and 14250 services. The machine has 6 GB RAM and a single Intel Core i7-950 CPU with multi-threading enabled (8 threads in total) and a slight overclock. Besides running Nagios, this machine also handles the incoming data from our custom monitoring systems, processes the RRD database storage, and generates the web interface status and chart output. So it’s a pretty busy machine which does a lot of network activity and where the Nagios daemon is just one part of the CPU load. For example, since boot the main “nagios3” process has used only 20% of the CPU. The rest has been used by the fork()’ed Perl scripts (we use a lot of them for the active checks), the standard Nagios network checks, and the Apache/PHP web server handling the incoming data.

Recently the machine started to exhaust its CPU resources. First we overclocked it a bit which gave us 10% more CPU idle time. Then we decided to try to compile Nagios with the popen-noshell library. This gave us another 10% CPU idle and now the machine is working great again.

I’ll focus on the popen-noshell integration and results, since CPU overclocking is a well-known topic. Here is the chart which shows the CPU usage before and after we re-compiled Nagios with the popen-noshell library:

[Chart: nagios-popen-noshell-benchmark-results]

As we can see, the system CPU usage dropped from 38% to 31%, which is an 18% improvement. The user CPU usage dropped from 44% to 41%, which is a 7% improvement. Overall, we gained a 12% speed-up for our workload just by re-compiling Nagios with the popen-noshell library. I should stress that the speed-up depends a lot on your workload. If this machine were busy only with Nagios and the active checks were more CPU-efficient (i.e. written in C rather than in Perl), the speed-up could have been much higher, since popen_noshell() is about 10 times faster than the standard popen().

Here is a list of the other machine metrics which were also affected by the change:

  • Used memory: 39% => 24% (38% less)
  • Load average: 39 => 46 (18% higher)
  • Fork rate: 8*61 => 8*61 created processes/second (no change)

Here are the steps that you need to perform, in order to re-compile the Nagios Debian package by integrating it with the popen-noshell library:

apt-get install devscripts

apt-get build-dep nagios3-core
# No need to run as "root" from here on
apt-get source nagios3-core

svn checkout http://popen-noshell.googlecode.com/svn/trunk/ popen-noshell

cd nagios3-3.2.1/

# BEGIN: patch Nagios to use popen_noshell_compat()

cp ../popen-noshell/popen_noshell.* base/
vi base/Makefile.in
	OBJS=$(BROKER_O) popen_noshell.o 

vi base/utils.c
	#include "popen_noshell.h"
	
        /* run the command */
        struct popen_noshell_pass_to_pclose pclose_arg;
        fp=(FILE *)popen_noshell_compat(cmd,"r",&pclose_arg);

            /* close the command and get termination status */
            status=pclose_noshell(&pclose_arg);

vi base/checks.c
	# apply the same two changes (popen_noshell_compat() and pclose_noshell()) as in utils.c above

# END: patch Nagios to use popen_noshell_compat()

EDITOR=vim dch -i
	# 3.2.1-2+squeeze1 -> 3.2.1-2+squeeze1-noshell1
	# you must have a trailing number in the added version name
	# after exit, this renames the original directory name

cd ..
mv nagios3_3.2.1.orig.tar.gz nagios3_3.2.1-2+squeeze1.orig.tar.gz

# the source directory was renamed by "dch"
cd nagios3-3.2.1-2+squeeze1/
DEB_BUILD_OPTIONS=nocheck debuild -us -uc

cd ..
sudo dpkg -i nagios3-core_3.2.1-2+squeeze1-noshell1_i386.deb \
	nagios3-common_3.2.1-2+squeeze1-noshell1_all.deb \
	nagios3-cgi_3.2.1-2+squeeze1-noshell1_i386.deb \
	nagios3-doc_3.2.1-2+squeeze1-noshell1_all.deb \
	nagios3_3.2.1-2+squeeze1-noshell1_i386.deb
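
If you are curious how the library calls which we patched into utils.c and checks.c fit together, here is a minimal standalone sketch. It is based only on the snippets above and on the popen()-compatible behaviour which the function names imply; compile and link it together with popen_noshell.c:

#include <stdio.h>

#include "popen_noshell.h"

int main(void)
{
	struct popen_noshell_pass_to_pclose pclose_arg;
	char buf[256];
	FILE *fp;
	int status;

	/* run the command without spawning /bin/sh */
	fp = popen_noshell_compat("/bin/echo hello", "r", &pclose_arg);
	if (fp == NULL) {
		perror("popen_noshell_compat");
		return 1;
	}

	while (fgets(buf, sizeof(buf), fp) != NULL) {
		fputs(buf, stdout);
	}

	/* close the command and get the termination status */
	status = pclose_noshell(&pclose_arg);
	printf("termination status: %d\n", status);

	return 0;
}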

Bash: Split a string into columns by white-space without invoking a subshell

The classical approach is:

RESULT="$(echo "$LINE"| awk '{print $1}')" # executes in a subshell 

Processing thousands of lines this way, however, fork()’s thousands of processes, which affects performance and makes your script CPU hungry.

Here is a more efficient solution which I found together with my colleagues at work:

COLS=( $LINE ); # parses columns without executing a subshell
RESULT="${COLS[0]}"; # returns first column (0-based indexes)

Here is an example:

LINE="col0 col1  col2     col3  col4      " # white-space including tab chars
COLS=( $LINE ); # parses columns without executing a subshell

echo "${COLS[0]}"; # prints "col0"
echo "${COLS[1]}"; # prints "col1"
echo "${COLS[2]}"; # prints "col2"
echo "${COLS[3]}"; # prints "col3"
echo "${COLS[4]}"; # prints "col4"

If you want to split not by white-space but by any other character, you can temporarily change the IFS variable which determines how Bash recognizes fields and word boundaries.
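
For example, here is a minimal sketch (with sample data of my own) which splits a colon-separated line and then restores IFS:

#!/bin/bash

LINE="col0:col1:col2"

OLD_IFS="$IFS"   # remember the default value
IFS=":"          # split by colon instead of white-space
COLS=( $LINE )   # still no subshell involved
IFS="$OLD_IFS"   # restore it, so the rest of the script is unaffected

echo "${COLS[1]}"; # prints "col1"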

Perl Net::Ping not working properly with ICMP by default

If you have tried to ping a host with Perl’s Net::Ping using the ICMP protocol and it failed, even though the “ping” command-line utility can ping the host, you’re not alone :) I had the same problem, and it turned out to be caused by the fact that Net::Ping sends no DATA in the ICMP request by default, and thus its requests are rather short and non-standard. Here are some tcpdump examples:

  • Net::Ping with the ICMP protocol and everything else at its defaults: “$p = new Net::Ping(‘icmp’)”; no replies from the remote host; note that the length is just 8 bytes:
    12:29:02.898083 IP source-addr > dest-addr: ICMP echo request, id 2194, seq 41, length 8
    12:29:03.711595 IP source-addr > dest-addr: ICMP echo request, id 2194, seq 42, length 8
    
  • Linux “ping” command-line utility, remote host replies accordingly, the length is 64 bytes total:
    12:30:18.278865 IP source-addr > dest-addr: ICMP echo request, id 2488, seq 1, length 64
    12:30:18.289922 IP dest-addr > source-addr: ICMP echo reply, id 2488, seq 1, length 64
    12:30:18.790610 IP source-addr > dest-addr: ICMP echo request, id 2488, seq 2, length 64
    12:30:18.811029 IP dest-addr > source-addr: ICMP echo reply, id 2488, seq 2, length 64
    
  • Net::Ping with the ICMP protocol and a user-defined data length, “$p = new Net::Ping(‘icmp’, 1, 56)”; the remote host replies accordingly; the total length is 64 bytes:
    12:30:48.377496 IP source-addr > dest-addr: ICMP echo request, id 2488, seq 6, length 64
    12:30:48.433690 IP dest-addr > source-addr: ICMP echo reply, id 2488, seq 6, length 64
    12:30:48.934310 IP source-addr > dest-addr: ICMP echo request, id 2488, seq 7, length 64
    12:30:48.946152 IP dest-addr > source-addr: ICMP echo reply, id 2488, seq 7, length 64
    

The bottom line is that if you are going to use Net::Ping with ICMP, specify 56 for the “bytes” parameter when creating the Net::Ping instance. This way you will be sending standard ICMP requests with a total length of 64 bytes.
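
For completeness, here is a minimal sketch of the working setup; the host name is just a placeholder, and note that ICMP pings via Net::Ping require root privileges:

#!/usr/bin/perl
use strict;
use warnings;
use Net::Ping;

# proto=icmp, timeout=1 second, 56 data bytes => 64-byte ICMP packets, like ping(1)
my $p = Net::Ping->new('icmp', 1, 56);

my $host = 'dest-addr';   # placeholder host name
if ($p->ping($host)) {
    print "$host is alive\n";
} else {
    print "$host did not reply\n";
}
$p->close();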