Enthusiasm never stops


C++ vs. Python vs. PHP vs. Java vs. Others performance benchmark (2016 Q3)

The benchmarks here do not try to be complete, as they show the performance of the languages in one aspect only, mainly: loops, dynamic arrays with numbers, and basic math operations.

This is an improved redo of the tests done in previous years. You are strongly encouraged to read the additional information about the tests in the article.

Here are the benchmark results:

Language | CPU time: User | CPU time: System | CPU time: Total | Slower than C++ | Slower than previous | Language version | Link
C++ (optimized with -O2) | 0.899 | 0.053 | 0.951 | - | - | g++ 6.1.1 | link
Rust | 0.898 | 0.129 | 1.026 | 7% | 7% | 1.12.0 | link
Java 8 (non-std lib) | 1.090 | 0.006 | 1.096 | 15% | 6% | 1.8.0_102 | link
Python 2.7 + PyPy | 1.376 | 0.120 | 1.496 | 57% | 36% | PyPy 5.4.1 | link
C# .NET Core Linux | 1.583 | 0.112 | 1.695 | 78% | 13% | 1.0.0-preview2 | link
Javascript (nodejs) | 1.371 | 0.466 | 1.837 | 93% | 8% | 4.3.1 | link
Go | 2.622 | 0.083 | 2.705 | 184% | 47% | 1.7.1 | link
C++ (not optimized) | 2.921 | 0.054 | 2.975 | 212% | 9% | g++ 6.1.1 | link
PHP 7.0 | 6.447 | 0.178 | 6.624 | 596% | 122% | 7.0.11 | link
Java 8 (see notes) | 12.064 | 0.080 | 12.144 | 1176% | 83% | 1.8.0_102 | link
Ruby | 12.742 | 0.230 | 12.972 | 1263% | 6% | 2.3.1 | link
Python 3.5 | 17.950 | 0.126 | 18.077 | 1800% | 39% | 3.5.2 | link
Perl | 25.054 | 0.014 | 25.068 | 2535% | 38% | 5.24.1 | link
Python 2.7 | 25.219 | 0.114 | 25.333 | 2562% | 1% | 2.7.12 | link
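The "Slower than" columns can be re-derived from the Total column. The following is a minimal sketch of that arithmetic, not the script used for the article; the function name `slower_than` is made up for illustration. The published percentages appear to be truncated rather than rounded, and were probably computed from unrounded timings, so the last digit may differ by one.

```python
def slower_than(total: float, baseline: float) -> float:
    """Return how much slower `total` is than `baseline`, in percent."""
    return (total / baseline - 1.0) * 100.0


# Total CPU times copied from the results table.
cpp_o2 = 0.951
rust = 1.026
java_nonstd = 1.096

print(f"Rust vs. C++:  {slower_than(rust, cpp_o2):.1f}%")
print(f"Java vs. C++:  {slower_than(java_nonstd, cpp_o2):.1f}%")
print(f"Java vs. Rust: {slower_than(java_nonstd, rust):.1f}%")
```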

The big difference this time is a slightly modified benchmark method. Programs are no longer limited to just 10 loops. Instead they run for 90 wall-clock seconds, and then we divide and normalize their performance as if they had run for only 10 loops. This way we can compare with the previous results. The benefit is that the startup and shutdown times of the interpreters should make almost no difference now. It turned out that the new method doesn’t significantly change the outcome compared to the previous benchmark runs, which is good because it suggests that the old benchmark method was also sound.
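The normalization step could look roughly like the sketch below. It assumes that each program reports how many loops it completed and how much CPU time it used; the function name and the sample numbers are hypothetical and do not come from the raw results.

```python
def normalize_to_loops(cpu_seconds: float, loops_done: int, target_loops: int = 10) -> float:
    """Scale a measured CPU time to what `target_loops` iterations would cost."""
    if loops_done <= 0:
        raise ValueError("loops_done must be positive")
    return cpu_seconds / loops_done * target_loops


# Hypothetical raw measurement: 855 loops completed, 81.3 s of CPU time used.
normalized = normalize_to_loops(cpu_seconds=81.3, loops_done=855)
print(f"{normalized:.3f} s per 10 loops")
```

With this scaling, a one-off startup cost is spread over all completed loops instead of only 10, which is why interpreter startup and shutdown should barely affect the result.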

For the curious readers, the raw results also show the maximum used memory (RSS).

Brief analysis of the results:

  • Rust, which we benchmark for the first time, is very fast. 🙂
  • C# .NET Core on Linux, which we also benchmark for the first time, performs very well: it is about as fast as NodeJS and only 78% slower than C++. Peak memory usage was 230 MB, which is the same as Python 3.5 and PHP 7.0, and half that of Java 8 and NodeJS.
  • NodeJS version 4.3.x initially appeared much slower than the previous version 4.2.x, which was the only surprise. It turned out to be a minor glitch in the parser which was easy to fix. NodeJS 4.3.x performs the same as 4.2.x.
  • Python and Perl seem a bit slower than before, but this is probably because C++ performed even better with the new benchmark method.
  • Contrary to our expectations, Java 8 didn’t perform much faster. Maybe it gets slower as more and more loops are done, which also allocate more RAM.
  • Also review the analysis in the old 2016 tests for more information.

The tests were run on a Debian Linux 64-bit machine.

You can download the source codes, raw results, and the benchmark batch script at:

Update @ 2016-10-15: Added the Rust implementation. The minor versions of some languages were updated as well.
Update @ 2016-10-19: A redo which includes the NodeJS fix.
Update @ 2016-11-04: Added the C# .NET Core implementation.


C++ vs. Python vs. Perl vs. PHP performance benchmark (2016)

There are newer benchmarks: C++ vs. Python vs. PHP vs. Java vs. Others performance benchmark (2016 Q3)

The benchmarks here do not try to be complete, as they show the performance of the languages in one aspect only, mainly: loops, dynamic arrays with numbers, and basic math operations.

This is a redo of the tests done in previous years. You are strongly encouraged to read the additional information about the tests in the article.

Here are the benchmark results:

Language | CPU time: User | CPU time: System | CPU time: Total | Slower than C++ | Slower than previous | Language version | Link
C++ (optimized with -O2) | 0.952 | 0.172 | 1.124 | - | - | g++ 5.3.1 | link
Java 8 (non-std lib) | 1.332 | 0.096 | 1.428 | 27% | 27% | 1.8.0_72 | link
Python 2.7 + PyPy | 1.560 | 0.160 | 1.720 | 53% | 20% | PyPy 4.0.1 | link
Javascript (nodejs) | 1.524 | 0.516 | 2.040 | 81% | 19% | 4.2.6 | link
C++ (not optimized) | 2.988 | 0.168 | 3.156 | 181% | 55% | g++ 5.3.1 | link
PHP 7.0 | 6.524 | 0.184 | 6.708 | 497% | 113% | 7.0.2 | link
Java 8 | 14.616 | 0.908 | 15.524 | 1281% | 131% | 1.8.0_72 | link
Python 3.5 | 18.656 | 0.348 | 19.004 | 1591% | 22% | 3.5.1 | link
Python 2.7 | 20.776 | 0.336 | 21.112 | 1778% | 11% | 2.7.11 | link
Perl | 25.044 | 0.236 | 25.280 | 2149% | 20% | 5.22.1 | link
PHP 5.6 | 66.444 | 2.340 | 68.784 | 6020% | 172% | 5.6.17 | link

The clear winner among the script languages is… PHP 7. 🙂

Yes, that’s not a mistake. Apparently the PHP team did a great job! The rumor that PHP 7 is really fast was confirmed for this particular benchmark test. You can also review the PHP 7 infographic by the Zend Performance Team.

Brief analysis of the results:

  • NodeJS got almost 2x faster.
  • Java 8 seems almost 2x slower.
  • Python has no significant change in the performance. Every new release is a little bit faster but overall Python is steadily 15x slower than C++.
  • Perl has the same trend as Python and is steadily 22x slower than C++.
  • PHP 5.x is the slowest, with results between 47x and 60x behind C++.
  • PHP 7 made the big surprise. It is about 10x faster than PHP 5.x, and about 3x faster than Python which is the next fastest script language.

The tests were run on a Debian Linux 64-bit machine.

You can download the source codes, an Excel results sheet, and the benchmark batch script at:



OpenSSH ciphers performance benchmark (update 2015)

It’s been five years since the last OpenSSH ciphers performance benchmark. There are two fundamentally new things to consider, which also gave me the incentive to redo the tests:

  • Since OpenSSH version 6.7 the default set of ciphers and MACs has been altered to remove unsafe algorithms. In particular, CBC ciphers and arcfour* are disabled by default. This has been adopted in Debian “Jessie”.
  • Modern CPUs have hardware acceleration for AES encryption.

I tested five different platforms: CPUs with and without AES hardware acceleration, different OpenSSL versions, and different environments including dedicated servers, OpenVZ and AWS.

Since the processing power of each platform is different, I had to choose a criterion to normalize the results, in order to be able to compare them. This was a rather confusing decision, and I hope that my conclusion is right. I chose to normalize against the “arcfour*”, “blowfish-cbc”, and “3des-cbc” speeds, because I doubt that their implementation changed over time. They should run equally fast on each platform because they don’t benefit from the AES acceleration, and nobody has bothered to make them faster, because those ciphers have been slated for obsolescence for a long time.
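One possible way to do such a normalization is sketched below. The exact formula used for the chart is not stated here, so this is only an assumption: each platform's reference speed is taken as the mean throughput of the reference ciphers, and every cipher is expressed relative to it. The throughput numbers are hypothetical, and only the plain “arcfour” stands in for the “arcfour*” family.

```python
from statistics import mean

# Ciphers assumed not to benefit from AES acceleration or recent optimizations.
REFERENCE_CIPHERS = ("arcfour", "blowfish-cbc", "3des-cbc")


def normalize_speeds(speeds_mb_s: dict) -> dict:
    """Express every cipher's throughput relative to the platform's reference speed."""
    reference = mean(speeds_mb_s[name] for name in REFERENCE_CIPHERS)
    return {name: speed / reference for name, speed in speeds_mb_s.items()}


# Hypothetical throughput figures in MB/s for one platform (not measured data).
platform = {"arcfour": 120.0, "blowfish-cbc": 60.0, "3des-cbc": 30.0, "aes128-ctr": 140.0}
for name, ratio in sorted(normalize_speeds(platform).items()):
    print(f"{name:14s} {ratio:.2f}")
```

A ratio above 1.0 then means that the cipher is faster than the platform's reference speed, regardless of how powerful the platform itself is.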

A summary chart with the results follows:

You can download the raw data as an Excel file. Here is the command which was run on each server:

# uses "/root/tmp/dd.txt" as a temporary file!
for cipher in aes128-cbc aes128-ctr aes128-gcm@openssh.com aes192-cbc aes192-ctr aes256-cbc aes256-ctr aes256-gcm@openssh.com arcfour arcfour128 arcfour256 blowfish-cbc cast128-cbc chacha20-poly1305@openssh.com 3des-cbc ; do
	for i in 1 2 3 ; do
		echo "Cipher: $cipher (try $i)"
		# stream 4 GiB of zeros through SSH to localhost and time the transfer
		dd if=/dev/zero bs=4M count=1024 2>/root/tmp/dd.txt | pv --size 4G | time -p ssh -c "$cipher" root@localhost 'cat > /dev/null'
		# print the dd summary line, skipping the "records in/out" lines
		grep -v records /root/tmp/dd.txt
	done
done

We can draw the following conclusions:

  • Servers which run a newer CPU with AES hardware acceleration benefit in two ways: (1) AES encryption is a lot faster with the recommended OpenSSH ciphers, and (2) some AES ciphers are now even twice as fast as the old speed champion, namely “arcfour”. I could get those great speeds only using OpenSSL 1.0.1f or newer, but this may need more testing.
  • Servers having a CPU without AES hardware acceleration still get AES encryption that is twice as fast with the newest OpenSSH 6.7 using OpenSSL 1.0.1k, as tested on Debian “Jessie”. Maybe something was optimized in the library.

Test results may vary (a lot) depending on your hardware platform, Linux kernel, OpenSSH and OpenSSL versions.


Google App Engine Datastore benchmark

I admire the idea of Google App Engine: a platform as a service where there is “no worrying about DBAs, servers, sharding, and load balancers”, and where you can “auto scale to 7 billion requests per day”. I wanted to try the App Engine for a pet project where I had to collect, process and query a huge amount of time series data. However, the fact that I needed to do fast queries over tens of thousands of records made me wonder if the App Engine Datastore would be fast enough. Note that in order to reduce the number of entities which are fetched from the database, several data entries are consolidated into a single database entity. This however imposes another limitation: fetching big data entities uses more memory on the running instance.

My language of choice is Java, because its performance for such computations is great. I am using the Objectify interface (version 4.0rc2), which is also one of the recommended APIs for the Datastore.

Unfortunately, my tests show that the App Engine is not suitable for querying such amounts of data. For example, fetching and updating 1000 entries takes 1.5 seconds, and additionally uses a lot of memory on the F1 instance. You can review the Excel sheet file below for more detailed results.

Basically each benchmark test performs the following operations and then exits:

  1. Adds a bunch of entries.
  2. Gets those entries from the database and verifies them.
  3. Updates those entries in the database.
  4. Gets the entries again from the database and verifies them.
  5. Deletes the entries.

All Datastore operations are performed in a batch and thus in an asynchronous parallel way. Furthermore, no indexes are used but the entities are referenced directly by their key, which is the most efficient way to query the Datastore. The tests were performed on two separate days because I wanted to extend some of the tests. This is indicated in the results. A single warmup request was made before the benchmarks, so that the App Engine could pre-load our application.
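The shape of such a benchmark cycle can be sketched as follows. This is not the Java/Objectify code used for the tests: `InMemoryStore` is a hypothetical stand-in for a key-value store with batch operations, so the timings it prints say nothing about the Datastore itself. It only illustrates the add, get, update, get, delete sequence with verification and a warm-up run.

```python
import time


class InMemoryStore:
    """Hypothetical stand-in for a key-value datastore with batch operations."""

    def __init__(self):
        self._data = {}

    def put_many(self, entities: dict) -> None:
        self._data.update(entities)

    def get_many(self, keys) -> dict:
        return {key: self._data[key] for key in keys}

    def delete_many(self, keys) -> None:
        for key in keys:
            del self._data[key]


def run_benchmark(store, count: int, size: int = 1024) -> dict:
    """Run add/get/update/get/delete on `count` entities; return seconds per phase."""
    timings = {}

    def timed(phase, func):
        start = time.perf_counter()
        result = func()
        timings[phase] = time.perf_counter() - start
        return result

    keys = [f"entry-{i}" for i in range(count)]
    original = {key: b"a" * size for key in keys}
    updated = {key: b"b" * size for key in keys}

    timed("add", lambda: store.put_many(original))
    if timed("get", lambda: store.get_many(keys)) != original:
        raise RuntimeError("verification after add failed")
    timed("update", lambda: store.put_many(updated))
    if timed("get_after_update", lambda: store.get_many(keys)) != updated:
        raise RuntimeError("verification after update failed")
    timed("delete", lambda: store.delete_many(keys))
    return timings


store = InMemoryStore()
run_benchmark(store, count=10)  # warm-up run, results discarded
for phase, seconds in run_benchmark(store, count=1000).items():
    print(f"{phase:17s} {seconds / 1000 * 1e6:8.2f} microseconds per entity")
```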

The first observation is that with the default F1 instance, once we fetch more than 100 entities or add/update/delete more than 1000 entities, we saturate the Datastore -> Objectify -> Java throughput and performance no longer scales:
App Engine Datastore median time per entity for 1 KB entities @ F1 instance

The other interesting observation is that the Datastore -> Objectify -> Java throughput depends a lot on the App Engine instance. That’s not a surprising fact because the application needs to serialize data back and forth when communicating with the Datastore. This requires CPU power. The following two charts show that more CPU power speeds up all operations where serializing is involved — that is all Datastore operations but the Delete one which only queries the Datastore by supplying the keys of the entities, no data:
App Engine Datastore times per entity for 1000 x 1 KB entities @ F1 instance

App Engine Datastore times per entity for 1000 x 1 KB entities @ F4 instance

It is unexpected that the App Engine and the Datastore still have good and bad days. Their latency as well as CPU accounting could fluctuate a lot. The following chart shows the benchmark results which we got using an F1 instance. If you compare this to the chart above where a much more expensive F4 instance was used, you’ll notice that the 4-times cheaper F1 instance performed almost as fast as an F4 instance:
App Engine Datastore times per entity for 1000 x 1 KB entities @ F1 instance (test on another day)

The source code and the raw results are available for download at http://www.famzah.net/download/gae-datastore-performance/


iSCSI-over-Internet performance notes

I recently played a bit with iSCSI over Internet, in order to design and implement the Locally encrypted secure remote backup over Internet.

My initial impression was that iSCSI over Internet is not usable as a backup device even though my Internet connection is relatively fast: a simple ext4 file-system format took about 24 minutes. I thought that the connection latency was killing the performance. Well, I was wrong. Even after halving the latency by working on a server which was geographically closer, the ext4 format still took 24 minutes.

Eventually I did some tests and analysis, and finally started to use the iSCSI over Internet volume for backup purposes — and it works flawlessly so far.

Ext4 format benchmark

It turns out that it’s not the latency but my upload bandwidth which was slowing things down:

  • 1 Mbit/s upload Internet connection and Ping latency of 75 ms:
    • Time: 24 minutes.
    • Average transfer rates snapshot:
      • Total rates: 967.7 kbits/sec (212.6 packets/sec).
      • Incoming rates: 83.0 kbits/sec (92.8 packets/sec).
      • Outgoing rates: 884.6 kbits/sec (119.8 packets/sec).
    • About 200 MBytes outgoing transfer; only 12 MBytes incoming transfer (no SSH tunnel compression).
    • About 200,000 packets sent and about 130,000 received.
  • 3 Mbit/s upload Internet connection and Ping latency of 75 ms:
    • Time: 8 minutes.
    • Average transfer rates snapshot:
      • Total rates: 2580.0 kbits/sec (417.8 packets/sec).
      • Incoming rates: 128.5 kbits/sec (149.6 packets/sec).
      • Outgoing rates: 2451.5 kbits/sec (268.2 packets/sec).
    • About 160 MBytes outgoing transfer; only 9 MBytes incoming transfer (with SSH tunnel compression).
    • About 140,000 packets sent and about 80,000 received.

I know I’m missing the two complementary tests (1 Mbit/s with SSH tunnel compression and 3 Mbit/s without it), but it seems compression doesn’t make much of a difference. It’s the upload speed which determines the total completion time.
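A quick sanity check of the bandwidth explanation: dividing the outgoing volume by the average outgoing rate should land near the measured format time, with latency playing no role in the calculation. The sketch below uses the figures from the two runs above and assumes 1 MByte = 10^6 bytes; the helper name is made up for illustration. The volumes are approximate and the rates are snapshots, so only rough agreement can be expected (about 30 and 9 minutes estimated, against 24 and 8 minutes measured).

```python
def transfer_minutes(megabytes: float, kbits_per_sec: float) -> float:
    """Minimum time in minutes to push `megabytes` through a `kbits_per_sec` link."""
    bits = megabytes * 1_000_000 * 8  # assumes 1 MByte = 10^6 bytes
    seconds = bits / (kbits_per_sec * 1000)
    return seconds / 60


# (MBytes sent, outgoing kbits/sec, measured minutes) for the two ext4 format runs.
runs = {
    "1 Mbit/s upload": (200, 884.6, 24),
    "3 Mbit/s upload": (160, 2451.5, 8),
}
for name, (mbytes, rate, measured) in runs.items():
    estimate = transfer_minutes(mbytes, rate)
    print(f"{name}: estimated {estimate:.1f} min, measured {measured} min")
```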

File copy benchmark

All tests were done without SSH compression and lead to the same conclusion, namely that bandwidth determines the total completion time:

  • 1 Mbit/s upload Internet connection and Ping latency of 75 ms:
    • SSH direct file copy to server: 100 seconds (11 MBytes file).
    • File copy to an iSCSI mounted file-system: 105 seconds.
  • 3 Mbit/s upload Internet connection and Ping latency of 75 ms:
    • SSH direct file copy to server: 39 seconds (11 MBytes file).
    • File copy to an iSCSI mounted file-system: 39 seconds.

The SSH direct file copy (SCP) transfer command was “scp testf root@”, and the file copy command was “cp testf /mnt/ ; sync”.

Server and client load during transfer, other benchmarks

During the transfer both the client and server machines were almost idle with regard to CPU. The utilization of the iSCSI block storage device on the server did not even reach 1%.

Note that the iSCSI target was exported via an SSH tunnel, as described here. Ping tests showed no difference between a direct server ping and a ping via the SSH tunnel.

The file copy tests were done on a regular iSCSI mounted volume, and on an iSCSI volume which was encrypted using TrueCrypt. The same speeds were achieved.

Encountered problems

During the backup runs, I got several of the following kernel messages in “dmesg”. This seems like a normal warning for the iSCSI use-case scenario:

[13200.272157] INFO: task jbd2/dm-0-8:1931 blocked for more than 120 seconds.
[13200.272164] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13200.272168] jbd2/dm-0-8 D f2abdc80 0 1931 2 0x00000000


Speed up RRDtool database manipulations via RRDs (Perl)

Use case
You are doing a lot of data operations on your RRD files (create, update, fetch, last), and every update is done by a separate Perl process which lives a very short time – the process is launched, it updates or reads the data, does something else, and then exits.

The problem
If you are using RRDtool and Perl as described, you have surely noticed that running many of these processes wastes a lot of CPU resources. The question is: can we do some performance optimizations, and lessen the performance hit of loading the RRDs library into Perl? We know that launching Perl often is itself quite expensive, but after all, if we chose to work with Perl, this is a price we should be ready to pay.

The RRDtool shared library is a monolithic piece of code which provides ALL functions of the RRDtool suite: data manipulation, graphics and import/export tools. The last two components bring huge dependencies on other shared libraries. The library from RRDtool version 1.4.4 depends on 34 other libraries on my Linux box! This must add to the loading time of the RRDtool library into Perl.

Resolution and benchmarks
In order to prove my theory (actually, it was more a theory of zImage, and I just followed, enhanced and tried it), I commented out the implementation of the “graphics” and “import/export tools” modules from the source code of RRDtool. Then I re-compiled the library and did some performance benchmarks. I also re-implemented the RRDs.pm module by replacing the DynaLoader module with the XSLoader one. This made no difference in performance whatsoever. The re-compiled RRD library depends on only 4 other libraries – linux-gate.so.1, libm.so.6, libc.so.6, and /lib/ld-linux.so.2. I think this is the most we can cut down. 🙂

So here are the benchmark results. They show the accumulated time for 1000 invocations of the Perl interpreter with three different configurations:

  • Only Perl (baseline): 5.454s.
  • With RRDs, no graphics or import/export functions: 9.744s (+4.290s) +78%.
  • With standard RRDs: 11.647s (+6.192s) +113%.

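As you can see, the stripped-down library cuts the start-up overhead of Perl + RRDs from +113% to +78% of the Perl baseline, a difference of 35 percentage points. The speed up for loading RRDs itself is 44% (6.192s vs. 4.290s).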
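For reference, the percentages can be re-derived from the three accumulated timings. This is a small sketch of the arithmetic only; the variable names are chosen for illustration.

```python
baseline = 5.454   # only Perl
stripped = 9.744   # Perl + RRDs without graphics and import/export functions
standard = 11.647  # Perl + standard RRDs

# Time added on top of the Perl baseline by loading each RRDs variant.
overhead_stripped = stripped - baseline
overhead_standard = standard - baseline

print(f"Overhead, stripped RRDs: +{overhead_stripped:.3f}s ({overhead_stripped / baseline:.1%})")
print(f"Overhead, standard RRDs: +{overhead_standard:.3f}s ({overhead_standard / baseline:.1%})")

# Difference between the two overheads, in percentage points of the baseline.
points = (overhead_standard - overhead_stripped) / baseline * 100
print(f"Overhead difference: {points:.1f} percentage points")

# How much faster the stripped library loads compared to the standard one.
print(f"RRDs load speed-up: {overhead_standard / overhead_stripped - 1:.1%}")
```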

Here are the commands I used for the benchmarks:

  • Only Perl (baseline): time ( i=1000 ; while [ "$i" -gt 0 ]; do perl -Mwarnings -Mstrict -e '' ; i=$(($i-1)); done )
  • Perl + RRDs: time ( i=1000 ; while [ "$i" -gt 0 ]; do perl -Mwarnings -Mstrict -MRRDs -e '' ; i=$(($i-1)); done )
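The same kind of measurement can be scripted outside the shell. The sketch below is an assumption-laden equivalent, not the method used for the numbers above: it measures wall-clock time only, and it launches the running Python interpreter so that it works without Perl installed. The helper name `time_invocations` is hypothetical.

```python
import subprocess
import sys
import time


def time_invocations(command: list, runs: int) -> float:
    """Launch `command` `runs` times in sequence; return elapsed wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(command, check=True)
    return time.perf_counter() - start


# Stand-in command; substitute for example
# ["perl", "-Mwarnings", "-Mstrict", "-MRRDs", "-e", ""] to mirror the commands above.
elapsed = time_invocations([sys.executable, "-c", "pass"], runs=20)
print(f"20 invocations took {elapsed:.3f}s")
```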


OpenSSH ciphers performance benchmark

💡 Please review the newer tests.

Ever wondered how to save some CPU cycles on a very busy or slow x86 system when it comes to SSH/SCP transfers?

Here is how we performed the benchmarks, in order to answer the above question:

  • 41 MB test file with random data, which cannot be compressed – GZip makes it only 1% smaller.
  • A slow enough system – Bifferboard. Bifferboard CPU power is similar to a Pentium @ 100 MHz.
  • The other system is using a dual-core Core2 Duo @ 2.26GHz, so we consider it fast enough, in order not to influence the results.
  • SCP file transfer over SSH using OpenSSH as server and client.

As stated at the Ubuntu man page of ssh_config, the OpenSSH client is using the following Ciphers (most preferred go first):


In order to examine their performance, we will transfer the test file twice using each of the ciphers and note the transfer speed and delta. Here are the shell commands that we used:

for cipher in aes128-ctr aes192-ctr aes256-ctr arcfour256 arcfour128 aes128-cbc 3des-cbc blowfish-cbc cast128-cbc aes192-cbc aes256-cbc arcfour ; do
        echo "$cipher"
        for try in 1 2 ; do
                scp -c "$cipher" test-file root@
        done
done

You can review the raw results in the “ssh-cipher-speed-results.txt” file. The difference between two runs of the same benchmark test is within 16%-20%. Not perfect, but still good enough for our tests.

Here is a chart which visualizes the results:

The clear winner is Arcfour, while the slowest are 3DES and AES. Still, the question remains whether all OpenSSH ciphers are strong enough to protect your data.

It’s worth mentioning that the results may be architecture dependent, so test for your platform accordingly.
Also take a look at the comment below for the results of the “i7s and 2012 xeons” tests.