
C++ vs. Python vs. PHP vs. Java vs. Others performance benchmark (2016 Q3)


The benchmarks here do not try to be complete; they show the performance of the languages in one aspect only: loops, dynamic arrays of numbers, and basic math operations.

This is an improved redo of the tests done in previous years. You are strongly encouraged to read the additional information about the tests in the article.
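For reference, the workload in every implementation is the same "primes7" routine: a half-sieve that generates all primes up to 10,000,000 using only loops, a dynamic array of numbers, and basic math. The following Python sketch is reconstructed from the implementations quoted in the comments below; the repository's actual primes.py may differ in small details.

```python
import math

def get_primes7(n):
    """Half-sieve: track only the odd candidates 3, 5, ..., n."""
    if n < 2:
        return []
    if n == 2:
        return [2]
    s = list(range(3, n + 1, 2))  # odd candidates
    mroot = int(math.sqrt(n))
    half = len(s)
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m * m - 3) // 2  # index of m*m among the odd candidates
            s[j] = 0
            while j < half:       # zero out every odd multiple of m
                s[j] = 0
                j += m
        i += 1
        m = 2 * i + 3
    return [2] + [x for x in s if x]

print(len(get_primes7(100)))  # 25 primes below 100
```

The benchmarked programs call this with n = 10,000,000 in a loop, which is what exercises the dynamic-array and arithmetic performance of each language.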

Here are the benchmark results:

Language | CPU time (User / System / Total) | Slower than C++ | Slower than previous | Language version | Source code
C++ (optimized with -O2) 0.899 0.053 0.951 g++ 6.1.1 link
Rust 0.898 0.129 1.026 7% 7% 1.12.0 link
Java 8 (non-std lib) 1.090 0.006 1.096 15% 6% 1.8.0_102 link
Python 2.7 + PyPy 1.376 0.120 1.496 57% 36% PyPy 5.4.1 link
C# .NET Core Linux 1.583 0.112 1.695 78% 13% 1.0.0-preview2 link
Javascript (nodejs) 1.371 0.466 1.837 93% 8% 4.3.1 link
Go 2.622 0.083 2.705 184% 47% 1.7.1 link
C++ (not optimized) 2.921 0.054 2.975 212% 9% g++ 6.1.1 link
PHP 7.0 6.447 0.178 6.624 596% 122% 7.0.11 link
Java 8 (see notes) 12.064 0.080 12.144 1176% 83% 1.8.0_102 link
Ruby 12.742 0.230 12.972 1263% 6% 2.3.1 link
Python 3.5 17.950 0.126 18.077 1800% 39% 3.5.2 link
Perl 25.054 0.014 25.068 2535% 38% 5.24.1 link
Python 2.7 25.219 0.114 25.333 2562% 1% 2.7.12 link

The big difference this time is that we use a slightly modified benchmark method. Programs are no longer limited to just 10 loops; instead, they run for 90 wall-clock seconds, and we then normalize their performance as if they had run for only 10 loops, so that we can compare with the previous results. The benefit of testing this way is that the startup and shutdown times of the interpreters now make almost no difference. It turned out that the new method doesn’t significantly change the outcome compared to the previous benchmark runs, which is reassuring, as it suggests the old methodology was sound as well.
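The normalization step is simple: count the loops completed in the 90-second window, then scale the measured CPU time to a 10-loop equivalent. A sketch (variable names are illustrative, not taken from the actual batch script):

```python
def normalize_to_10_loops(cpu_user, cpu_sys, loops_completed):
    """Scale CPU time measured over loops_completed iterations
    to the 10-loop equivalent shown in the results table."""
    return (cpu_user + cpu_sys) * 10.0 / loops_completed

# Example: a run that completed 600 loops in the 90-second window,
# spending 56.0s user + 1.0s system CPU time:
print(normalize_to_10_loops(56.0, 1.0, 600))  # 0.95
```

Because startup/shutdown costs are paid once but amortized over hundreds of loops, they all but vanish from the normalized figure.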

For the curious readers, the raw results also show the maximum used memory (RSS).

Brief analysis of the results:

  • Rust, which we benchmark for the first time, is very fast. 🙂
  • C# .NET Core on Linux, which we also benchmark for the first time, performs very well, being as fast as NodeJS and only 78% slower than C++. Peak memory usage was 230 MB, which is the same as Python 3.5 and PHP 7.0, and half that of Java 8 and NodeJS.
  • NodeJS version 4.3.x got much slower than the previous major version 4.2.x. This is the only surprise. It turned out to be a minor glitch in the parser which was easy to fix. NodeJS 4.3.x is performing the same as 4.2.x.
  • Python and Perl seem a bit slower than before, but this is probably because C++ performed even better under the new benchmark method, which shifts the relative percentages.
  • Java 8 didn’t perform much faster, as we had expected it to. Maybe it gets slower as more and more loops are done, since each loop also allocates more RAM.
  • Also review the analysis in the old 2016 tests for more information.

The tests were run on a Debian Linux 64-bit machine.

You can download the source codes, raw results, and the benchmark batch script at:
https://github.com/famzah/langs-performance

Update @ 2016-10-15: Added the Rust implementation. The minor versions of some languages were updated as well.
Update @ 2016-10-19: A redo which includes the NodeJS fix.
Update @ 2016-11-04: Added the C# .NET Core implementation.

Author: Ivan Zahariev

An experienced Linux & IT enthusiast, Engineer by heart, Systems architect & developer.

52 thoughts on “C++ vs. Python vs. PHP vs. Java vs. Others performance benchmark (2016 Q3)”

  1. Hi,
    I came across this post and was curious to see why Java does so badly in the data above, i.e. around 12 CPU seconds total for the normalized 10 loops, taken from 90-second runs of your source code.
    However, I got only 4 CPU seconds total using JDK 6 (1.6.0_45) and around 6 using JDK 8 (1.8.0_102) on my Ubuntu 64-bit Linux with 90-second wall-clock runs.

    Notes:
    1. I took your same Java source code, slightly modified to run for a hardcoded 90 seconds of wall-clock time (instead of reading the env var), with the above 2 Java versions, and profiled CPU time using the JDeveloper IDE along with wall-clock time. The code/logic is the same.
    2. I did have a look at your other links to see if I had missed anything, but couldn’t see anything drastic.

    Just sharing my thoughts.
    –Kishore

    • Update: I realized that you use ‘time’ to get the sys/user CPU breakdown from the OS ‘time’ stats of the java command.
      So if one times it via Linux ‘time’, I can see 10-11 CPU seconds per 10 loops, which is close to your figure of around 12 CPU seconds.
      That way, JDK 6 as well as JDK 8 takes 10-11 CPU seconds for 10-loop runs.
      Thanks

    • Hi Kishore,

      Note that CPU time is not “portable”, so to say. One CPU time second on your monster PC could be equivalent to four seconds on my humble laptop, for example.

      That’s why I emphasize the percentage difference between language performances. To be accurate, you always have to compare on the same hardware platform.

      • Yes, I know that, thanks. Mine was a very simple laptop (2 cores). I was initially just curious to know how Java performs, and as I updated above, I can now see 10-11 CPU seconds for the same 10-loop run of your source code. I’ll probably dig deeper to understand where the issues arise for Java, if I have time.
        Thanks for your post.

  2. IDEs don’t take JVM startup or GC-overhead CPU time into account; they sample precisely the user’s Java class code. That explains why one sees 4-6 CPU seconds for 10 loops versus 10-11 CPU seconds overall (which includes mainly GC overhead, apart from JVM startup).
    Perhaps that is why GC tuning is so important for server-side JVM-based Java apps, for both efficient memory and CPU usage.
    My 2 cents

  3. In the Java source linked above, you could, if possible, make a few minor changes, especially avoiding object creation inside loops, so that GC overhead is minimized.
    With those simple changes, CPU time per 10 loops easily falls to 6-7 CPU seconds, basically putting the Java 8 standard library in the same class/bucket as PHP etc., if not better.

    Changed source code link : https://kishorekannan.wordpress.com/2016/09/15/reducing-object-creations/

  4. One last thing: this is not about code, but if you are open to adding a runtime flag to the java command line, it would improve the Java 8 time as well.
    Add the hint/flag -XX:NewRatio=1 to your java command line. This alone improves the result from 10-11 CPU seconds to 7 for 10-loop runs.
    It needs no code change, just running: time java -XX:NewRatio=1 PrimeNumbersBenchmarkApp
    Thanks in advance

  5. Just FYI, the timings I got are as follows:
    1. primes.java (your original file), Java 8 (1.8.0_102): 10-11 CPU sec per 10 loops for 90-sec runs
    2. The same primes.java (no code change, just run with the -XX:NewRatio=1 flag), same Java 8: 6-7 CPU sec per 10 loops for 90-sec runs
    Kindly let me know if you would consider this. By the way, the flag is nothing special; it is well documented for specific test cases like this one, which creates a lot of throwaway dynamic objects that are mostly short-lived in loops.
    This would put standard Java 8 on par with PHP and JavaScript, if not better, without any code change, which as you pointed out matches the nature of your test.
    Thanks

    • I’ve given it a lot of thought, and I still consider this an optimization which is specific to this use-case.

      In order to be more fair, I’ve summarized all Java optimization tips so far and put a visible link in the results table.

      Last but not least, we’ve already demonstrated that Java can be as fast as C++, if it runs a more native Java implementation: “Java 8 (non-std lib)”.

  6. In the Java example, why don’t you use LinkedList for the res variable? (It has O(1) complexity for adding elements.)

    List<Integer> res = new LinkedList<>();

    • I’d appreciate a fork + pull request at GitHub, so that we can easily compare the original code with the code that you propose. Or is the change only in “List<Integer> res = new LinkedList<>()”?

      • Yes, the only change is one line:
        List<Integer> res = new LinkedList<>();

        I get slightly better results on my machine.

      • I made a pull request on Github with a proposed change.

      • Thanks for the pull request: https://github.com/famzah/langs-performance/pull/4

        Unfortunately, this change made Java 8 only 3% faster, while it made Java 7 run 59% slower. Since the Java 8 improvement is marginal and the impact for Java 7 is negative, I won’t merge it into “primes.java”. I’ve documented your idea in the “java-optimizations” section.

        Here is what I see as results:

        ORIG: Java 7 : user_t= 5.110 sys_t= 0.026 cpu_t= 5.136 to_CPP= – to_prev= – version=javac 1.7.0_111
        NEW : Java 7 : user_t= 8.139 sys_t= 0.038 cpu_t= 8.177 to_CPP= – to_prev= – version=javac 1.7.0_111
        diff: +59%

        ORIG: Java 8 : user_t=18.195 sys_t= 0.148 cpu_t=18.343 to_CPP= – to_prev= – version=javac 1.8.0_102
        NEW : Java 8 : user_t=17.594 sys_t= 0.135 cpu_t=17.729 to_CPP= – to_prev= – version=javac 1.8.0_102
        diff: -3%

        Raw results at: https://github.com/famzah/langs-performance/tree/master/results/LinkedList-tests
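One reason the LinkedList change buys so little: a growable array is already amortized O(1) per append, because each reallocation copies at most as many elements as have been appended since the last one. A small Python simulation of a doubling dynamic array (illustrative only; the JDK's ArrayList actually grows by a factor of 1.5) shows the total copy work stays linear:

```python
def count_copies(n, growth=2):
    """Simulate appending n elements to a growing dynamic array
    and count how many element copies the reallocations cause."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size       # reallocate: copy all existing elements
            capacity *= growth
        size += 1
    return copies

n = 1_000_000
print(count_copies(n) / n)  # stays below 2.0: amortized O(1) per append
```

Since appends were never the bottleneck, the linked list mainly trades away cache locality, which is consistent with the Java 7 regression measured above.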

  7. IMHO, the space heaviness of using Integer objects (by a factor of 4-7 compared to int[]) created via ArrayLists in primes.java would overshadow any other change. Unless one replaces it with a specialized int[] as in primes-alt.java, the Java runtimes cannot be improved significantly. Alternatively, you need to at least help the JVM mitigate garbage-collection overhead through the common/basic heap flags documented above. GC overhead should ideally be within 5-15% of total execution time for Java code, especially long-running code as on servers. GC overhead eats CPU and affects overall throughput in general, and can be measured via: time java -Xprof PrimeNumbersBenchmarkApp

  8. Wow. That’s not node.js 4.3.1 that’s slow over 4.2.6 – that’s the impact of your changed test method.
    It’s ridiculous, but node.js 4.x performs the first 10-20 iterations very fast and all subsequent ones very slowly (~10 times slower). And it’s not caused by GC (easily seen in node --prof)!
    It seems it’s caused by some deopt issue: run `RUN_TIME=60 node --trace-opt-verbose primes.js` and you’ll see V8 is recompiling get_primes7() many times, then every time “evicting entry from optimized code map” for some reason (is it a V8 bug?), and after several attempts it gives up optimization and then we see unoptimised code running. Which is, by no surprise, ~10-20 times slower…
    If I run nodejs with --max_opt_count=100000 then it succeeds in calculating primes ~440 times per 60 seconds. The default setting is 10 and gives only ~60 times per 60 seconds. This is the same with nodejs 4.6 and 6.0.
    A much older version (nodejs 0.10) does not have this bug, but makes primes only ~230 times per 60 seconds.
    Unoptimized C++ (g++ 6.1.1) runs it ~230 times per 60 seconds on the same machine. Optimised C++ gives ~630 iterations. Java 8 (without boxing/unboxing) makes it ~650 times 🙂 even slightly better than C++…

    • I discovered what causes this problem. Deopt occurs when you push() a lot of times to `var res = [2];`. It seems V8 tries to optimise this array as a tuple (i.e. a short and typed array) which causes guard check violation and deopt when it happens to turn into a long “vector”. So, if you change the initial line to `var res = []; res.push(2);` the bug goes away and V8 runs fast.

      • Well, I’m glad that it wasn’t “the impact of my changed test method”. 🙂 But a bug in NodeJS. I’ve updated the source code as you suggested, redid the tests + updated the page here, and now NodeJS is back on top. Cheers.

  9. Actually, in my tests[1] Rust comes out on top, and is 10% faster than C++ compiled by g++ 4.8.5. However, I see you are using a much newer gcc.

    [1]: https://blog.ndenev.com/2016/10/15/rust_benchmark_win/index.html

    • +. I get the same. Rust is faster than g++ 6.1.1 on both i386 and amd64. LLVM gets close to it on i386 but not on amd64 (however, it varies by version).

      • 1) The same version of Rust (1.12.0) is faster than the same version of g++ (6.1.1) which I used for the tests? Can you list the exact versions here.
        2) How much faster? Can you post the raw numbers here?

        P.S. To be honest, the 3% difference in my results isn’t significant, so we can easily call both languages equally fast on my platform.

  10. In my tests on a CentOS machine (as I posted already), Rust is consistently faster by 10%, which is quite significant.
    I’ve just run another test on my ancient (Core 2 Duo) OS X laptop, and Rust is 3% faster here: http://pastebin.com/2VERr87a

    x cpp
    + rust
    +------------------------------------------------------------+
    |       x                                               +    |
    |       x                                               +    |
    |x   x  x   x                                    +   +  +   +|
    |  |___AM__|                                                 |
    |                                                  |___AM__| |
    +------------------------------------------------------------+
        N           Min           Max        Median           Avg        Stddev
    x   6           365           368           367     366.66667     1.0327956
    +   6           378           381           380     379.66667     1.0327956
    Difference at 95.0% confidence
    	13 +/- 1.32852
    	3.54545% +/- 0.362324%
    	(Student's t, pooled s = 1.0328)
    

    And, while it is small, 3% is not an insignificant difference.

    Note that this is compared to C++ compiled by Apple LLVM version 8.0.0.

  11. UUuuuummmmmmm….. where’s C#?

    • Hi. I’m doing the benchmarks on Linux, and until now I thought that .NET Core is not yet production ready. What a pleasant surprise to see that Microsoft released it already, and they also claim that “.NET is 8x faster than Node.js and 3x faster than Go” 🙂

      I’ll try to code a C# version but can’t promise when I can do it. If you contribute a C# version, that would be great.

    • I’ve added a C# .NET Core implementation which I tested on Linux.

      • Yesterday I was so eager to test C# .NET Core that I did the benchmarks on my laptop while running on battery. Today I performed them as usual – my laptop was running plugged in the power socket and at full CPU speed. This made a huge difference. I’ve updated the results table.

  12. Have you guys noticed that the number of context switches for Go is too high compared to JS, PHP, and Python? Any ideas why that is?

    • I’ve noticed it too but don’t know the exact reason. All tests run under similar conditions, so it shouldn’t be an error in the benchmarks.

      Go and Java 8 are multi-threaded, and this seems to be causing a lot of context switching (between the threads?):
      – Go – multi-threaded (real_TIME:90.08sec user_CPU:141.80sec)
      – Java 8 – multi-threaded (real_TIME:90.60sec user_CPU:125.76sec)

      Java 8 running the “non-std lib” implementation also has a lot of context switching, even though it seems that not much CPU time was used by the non-main GC thread. Running this implementation doesn’t seem to use the other non-main thread(s) much, but the context-switch counts are still huge (real_TIME:90.13sec user_CPU:89.97sec). Still, we are sure that Java 8 is multi-threaded — probably the GC runs in another thread, and because the “non-std lib” implementation is very efficient, the GC thread gets switched to, notices that there isn’t anything to do, and gives control back to the main thread.

      C# .NET Core seems similar to Java 8 running “non-std lib”. I guess that C# also has a separate GC thread which, in this case, hasn’t got a lot of real work to do.
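For anyone who wants to measure these counters themselves: on Linux, a process's voluntary and involuntary context switches are available via getrusage(2), the same counters that `/usr/bin/time -v` reports. A minimal Python sketch (function name is mine, not from the benchmark scripts):

```python
import resource

def context_switches():
    """Voluntary/involuntary context switches for this process so far,
    read from getrusage(2) -- what /usr/bin/time -v also reports."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw

# Burn a little CPU, then read the counters.
sum(i * i for i in range(200_000))
voluntary, involuntary = context_switches()
print("voluntary:", voluntary, "involuntary:", involuntary)
```

A high voluntary count suggests threads yielding to each other (e.g. a GC thread waking up with nothing to do), while a high involuntary count points at the scheduler preempting busy threads.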

  13. Dear Ivan,

    I appreciate the information you published about language performance.

    I have a big question about Java: what do you mean by “non-std lib”?

    Can you explain it please?

    • “non-std lib” means a “non-standard library”. The benchmarks use a custom Vector class re-implementation specifically for Integer, in order to avoid the unneeded boxing/unboxing. It works the same way but is not part of the standard Java libraries. The text “non-std lib” is a URL link — just click on it, and you’ll be directed to the comment’s discussion about it.
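The cost of boxing is easy to demonstrate in Python as well, where a list stores pointers to individually boxed int objects while the standard array module stores raw machine integers, roughly analogous to the Integer-vs-int difference described above. A rough sketch (exact byte counts vary by platform and interpreter):

```python
import sys
from array import array

n = 100_000
boxed = list(range(n))           # a list of pointers to boxed int objects
unboxed = array('l', range(n))   # contiguous raw machine integers

# The list's pointer array plus every int object it points to:
total_boxed = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
print(total_boxed, sys.getsizeof(unboxed))
```

The boxed total comes out several times larger, which mirrors the 4-7x Integer-vs-int[] overhead discussed in the comments here.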

  14. I’ve modified the sources a bit: removed the timers, and used Igor Pavlov’s timer64 to benchmark.
    So here are my results (best viewed with a monospaced font):

    Windows 10 x64, i7-3632QM;

    Prog lang version and options            used RAM  elapsed time
    ---------------------------------------------------------------
    Lua         LuaJIT 2.1.0-b2 x64          49980 KB     0.135 sec
    JavaScript  MS Chakra Core 1.4.0 x64     79276 KB     0.175 sec
    JavaScript  MS Chakra Core 1.4.0 x86     78852 KB     0.179 sec
    Go          Go 1.7.3 x64                 97460 KB     0.196 sec (uses MT, 0.343 total)
    Python      pypy 5.6.0 x86               47852 KB     0.228 sec
    Java        jdk1.8.0_112 x64 v2          86176 KB     0.240 sec
    JavaScript  C45.0a1 x86                 109716 KB     0.278 sec
    JavaScript  V8 4.10.0 x64               184788 KB     0.360 sec
    Lua         LuaJIT 2.1.0-b2 x64 -joff    49880 KB     0.380 sec
    Java        jdk1.8.0_112 x64 -Xint v2    82880 KB     0.798 sec
    PHP         PHP 7.1.0 x64               199600 KB     0.891 sec
    JavaScript  V8 4.10.0 x64 --turbo       185524 KB     0.936 sec
    JavaScript  C45.0a1 x86 --no-ion        109624 KB     1.108 sec
    Ruby        ruby 2.3.1p112 x64           91088 KB     1.360 sec
    Java        jdk1.8.0_112 x64            174868 KB     1.601 sec (uses MT, 2.734 total)
    Python      python 3.5.2 x64            218900 KB     2.276 sec
    Perl        perl 5.24.0 x64 v2          237852 KB     2.605 sec
    Perl        perl 5.24.0 x64             226612 KB     2.692 sec
    PHP         PHP 5.4.14 x86              473504 KB     2.928 sec
    JavaScript  V8 4.10.0 x64 --ignition    185480 KB     3.217 sec
    Java        jdk1.8.0_112 x64 -Xint      170352 KB     5.331 sec (uses MT, 6.937 total)
    JavaScript  C45.0a1 x86 --no-baseline   109564 KB     7.633 sec
    
  15. It would be nice to include Perl with the PDL library.

  16. Can I contribute a new benchmark? With Nim language

    • Can you please add a Nim benchmark?
      But please, compile using latest compiler!
      https://pastebin.com/p4BVnz9t

      Compile command – “nim c -d:release -o:primes.nim.out primes.nim”

    • Hi, to be honest, I’m a bit overwhelmed right now, and can’t do this myself.

      I’d recommend that for a start you do the benchmark with Nim on your computer, and paste the results here. Also, do the benchmark with some of the other languages, so that we can compare the relative speed.

      • I ran all benchmarks on Intel i5 4460 CPU, using Manjaro 17.0 (based on Arch Linux)
        I did only one run, but it shows some rough data:
        C++ (g++ 6.3.1 with -O2) got 882 lines.
        C++ (g++ 6.3.1 with -O3) got 931 lines.
        But Nim can be compiled to C, C++, ObjC, or JS.
        So Nim with C – nim c -d:release primes.nim – 629 lines (it used gcc 6.3.1 for compiling C)
        Nim with C++ – nim cpp -d:release primes.nim – the same as Nim with C

        RustC 1.15.0 (rustc -C opt-level=3) got 1065 lines

        But honestly, I literally copy+pasted Python source, and made Nim version out of Python version, and I’m new to Nim, so maybe there’s better optimizations possible 🙂

        Maybe I’ll find out if my Nim code is wrong in some way (maybe I’ve done something wrong)

      • Oh wait, I can do some small optimizations

      • This is a better version:


        import os, times, future, math, sequtils, strutils

        proc get_primes7(n: int32): seq[int32] =
          if n < 2:
            return @[]
          result = @[2'i32]
          if n == 2:
            return
          var s = newSeq[int32]()
          for x in countup(3, n + 1, 2):
            s.add(x)
          let
            mroot = int32(sqrt(n.float))
            half = int32(len(s))
          var
            i = 0'i32
            m = 3'i32
          while m <= mroot:
            if s[i] != 0:
              var j = (m * m - 3) div 2 # int div
              s[j] = 0
              while j < half:
                s[j] = 0
                j += m
            inc(i)
            m = 2 * i + 3
          for x in s:
            if x != 0:
              result.add(x)

        let start_time = getTime()
        let period_time = getEnv("RUN_TIME").parseInt()
        while int(getTime() - start_time) < period_time:
          let res = get_primes7(10000000)
          echo("Found $1 prime numbers." % $res.len)


      • So this version is final: https://gist.github.com/def-/fd6f528e51683f7b1baa60518b426d74
        It shows roughly the same speed as C++ does!

      • I’m glad to see that the Nim implementation is valid, and that Nim rocks! 🙂

  17. I was able to improve the JS implementation by about 40%. I used the Java trick and replaced the untyped array with a typed array.

    function get_primes7(n) {
        if (n < 2) { return []; }
        if (n == 2) { return [2]; }

        var s = new Uint32Array(Math.ceil((n + 1) / 2 - 3));
        for (var i = 3, j = 0; i < n + 1; i += 2) {
            s[j++] = i;
        }

        var mroot = Math.floor(Math.sqrt(n));
        var half = s.length;
        var i = 0;
        var m = 3;

        while (m <= mroot) {
            if (s[i]) {
                var j = Math.floor((m*m - 3) / 2); // int div
                s[j] = 0;
                while (j < half) {
                    s[j] = 0;
                    j += m;
                }
            }
            i = i + 1;
            m = 2*i + 3;
        }

        var count = 0;
        for (var x = 0; x < half; x++) {
            if (s[x] !== 0)
                count++;
        }

        var res = new Uint32Array(count + 1);
        res[0] = 2;

        for (var x = 0, j = 0; x < half; x++) {
            if (s[x] !== 0) {
                res[++j] = s[x];
            }
        }
        return res;
    }

    var startTime = Date.now();
    var periodTime = 10 * 1000;

    while ((Date.now() - startTime) < periodTime) {
        var res = get_primes7(10000000);
        console.log("Found " + res.length + " prime numbers.");
    }

    The surprising thing is that if we replace the vars with let or const, performance considerably degrades.

  18. chart of this test, “total” values

  19. A bit late to the party, but I came across this as I am benchmarking an M1 with various things. FYI, pure Perl is not supposed to be fast at math, but you are actually handicapping it further by using a while loop. Perl supports C-style for loops, which are faster; you are only meant to use while, do, etc. when it makes sense for your program, but here it does not.
    You can replace the 6 lines in that if-block with just:

    for (my $j = int(($m*$m - 3) / 2); $j < $half; $j += $m) {
        $s[$j] = 0;
    }

    Much simpler to read and the program becomes 10.3% faster on my Mac. There are many "clever" ways to improve the algorithm, but you can do the same with most languages, so this was just for something that was bad practice.

  20. Hi, thanks for the hint. I can confirm that “for” instead of “while” improves speed by 6% on my laptop.

    Here is more info about the results: https://github.com/famzah/langs-performance/commit/ee4d4440eda8e8b7e9b772334ee39b0a8056521d

    It’s worth mentioning that we now have one less “$s[$j] = 0” for the initial value of “$j”. I don’t know if this contributes significantly to the 6% improvement.

  21. I conducted a similar benchmark (“Performance Comparison C vs. Java vs. Javascript vs. LuaJIT vs. PyPy vs. PHP vs. Python vs. Perl”) with quite similar findings:
    https://eklausmeier.goip.de/blog/2021/07-13-performance-comparison-c-vs-java-vs-javascript-vs-luajit-vs-pypy-vs-php-vs-python-vs-perl/

    Interestingly, the results depend on the underlying CPU architecture.
