The benchmarks here do not try to be complete; they show the performance of the languages in only one aspect: mainly loops, dynamic arrays with numbers, and basic math operations.
This is an improved redo of the tests done in previous years. You are strongly encouraged to read the additional information about the tests in the article.
Here are the benchmark results:
Language | User CPU (s) | System CPU (s) | Total CPU (s) | Slower than C++ | Slower than previous | Language version | Source code |
---|---|---|---|---|---|---|---|
C++ (optimized with -O2) | 0.899 | 0.053 | 0.951 | – | – | g++ 6.1.1 | link |
Rust | 0.898 | 0.129 | 1.026 | 7% | 7% | 1.12.0 | link |
Java 8 (non-std lib) | 1.090 | 0.006 | 1.096 | 15% | 6% | 1.8.0_102 | link |
Python 2.7 + PyPy | 1.376 | 0.120 | 1.496 | 57% | 36% | PyPy 5.4.1 | link |
C# .NET Core Linux | 1.583 | 0.112 | 1.695 | 78% | 13% | 1.0.0-preview2 | link |
Javascript (nodejs) | 1.371 | 0.466 | 1.837 | 93% | 8% | 4.3.1 | link |
Go | 2.622 | 0.083 | 2.705 | 184% | 47% | 1.7.1 | link |
C++ (not optimized) | 2.921 | 0.054 | 2.975 | 212% | 9% | g++ 6.1.1 | link |
PHP 7.0 | 6.447 | 0.178 | 6.624 | 596% | 122% | 7.0.11 | link |
Java 8 (see notes) | 12.064 | 0.080 | 12.144 | 1176% | 83% | 1.8.0_102 | link |
Ruby | 12.742 | 0.230 | 12.972 | 1263% | 6% | 2.3.1 | link |
Python 3.5 | 17.950 | 0.126 | 18.077 | 1800% | 39% | 3.5.2 | link |
Perl | 25.054 | 0.014 | 25.068 | 2535% | 38% | 5.24.1 | link |
Python 2.7 | 25.219 | 0.114 | 25.333 | 2562% | 1% | 2.7.12 | link |
The big difference this time is that we use a slightly modified benchmark method. Programs are no longer limited to just 10 loops. Instead they run for 90 wall-clock seconds, and then we divide and normalize their performance as if they had run for only 10 loops. This way we can compare with the previous results. The benefit of doing the tests like this is that the startup and shutdown times of the interpreters now make almost no difference. It turned out that the new method doesn’t significantly change the outcome compared to the previous benchmark runs, which is reassuring, as it suggests the old benchmark method was also correct.
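For reference, the normalization itself is simple arithmetic; here is a minimal sketch (the variable names and sample values are illustrative, not taken from the actual benchmark script):

```java
public class NormalizeExample {
    public static void main(String[] args) {
        // Hypothetical measurements for one 90-second run of a given language.
        double cpuSeconds = 12.3;   // total CPU time (user + sys) consumed during the run
        int loopsCompleted = 42;    // full algorithm iterations finished within 90 wall-clock seconds

        // Scale the measurement back to a "per 10 loops" figure,
        // so the numbers stay comparable with the previous benchmark rounds.
        double cpuPer10Loops = cpuSeconds * 10.0 / loopsCompleted;
        System.out.printf("CPU time per 10 loops: %.3f s%n", cpuPer10Loops);
    }
}
```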
For the curious readers, the raw results also show the maximum used memory (RSS).
Brief analysis of the results:
- Rust, which we benchmark for the first time, is very fast.
- C# .NET Core on Linux, which we also benchmark for the first time, performs very well: it is as fast as NodeJS and only 78% slower than C++. Peak memory usage was 230 MB, which is the same as Python 3.5 and PHP 7.0, and half that of Java 8 and NodeJS.
- NodeJS version 4.3.x got much slower than the previous major version 4.2.x. This was the only surprise. It turned out to be a minor glitch in the parser which was easy to fix; NodeJS 4.3.x performs the same as 4.2.x.
- Python and Perl seem a bit slower than before, but this is probably because C++ performed even better under the new benchmark method.
- Java 8 didn’t perform much faster, as we had expected it would. Maybe it gets slower as more and more loops are done, which also allocates more RAM.
- Also review the analysis in the old 2016 tests for more information.
The tests were run on a Debian Linux 64-bit machine.
You can download the source codes, raw results, and the benchmark batch script at:
https://github.com/famzah/langs-performance
Update @ 2016-10-15: Added the Rust implementation. The minor versions of some languages were updated as well.
Update @ 2016-10-19: A redo which includes the NodeJS fix.
Update @ 2016-11-04: Added the C# .NET Core implementation.

September 14, 2016 at 5:24 pm
Hi
I came across this post and was curious to see why Java does so badly in the data above, i.e. around 12 CPU seconds total per 10 loops, normalized from 90-second runs of your source code.
However, I got only 4 CPU seconds total using JDK 6 (1.6.0_45) and around 6 CPU seconds using JDK 8 (1.8.0_102) on my Ubuntu 64-bit Linux, using 90-second wall-clock runs.
Notes:
1. I took your same Java source code, slightly modified to run for a hardcoded 90 seconds of wall-clock time (instead of reading the env var), with the above two Java versions, and profiled CPU time using the JDeveloper IDE along with wall-clock time. The code/logic is the same.
2. I did have a look at your other links to see if I missed anything, but couldn’t see anything drastic.
Just sharing my thoughts.
–Kishore
September 14, 2016 at 7:04 pm
Update: I realized that you use ‘time’ to get the sys/user CPU time breakdown from the OS ‘time’ stats of the java command.
So if one times it via the OS/Linux ‘time’, I can see 10-11 CPU seconds per 10 loops, which is much closer to your figure of around 12 seconds of CPU time.
Yes, measured that way, JDK 6 as well as JDK 8 takes 10-11 CPU seconds for 10-loop runs.
Thanks
September 14, 2016 at 9:35 pm
Hi Kishore,
Note that CPU time is not “portable”, so to speak. One second of CPU time on your monster PC could be equivalent to four seconds on my humble laptop, for example.
That’s why I emphasize the percentage difference between language performance. You always have to compare on the same hardware platform in order to be accurate.
September 15, 2016 at 5:45 am
Yes, I know that, thanks. Mine was a very simple laptop (2 cores). I was initially just curious to know how Java performs, and as I updated above, I can now see 10-11 CPU seconds for the same 10-loop data of your source code. I would probably dig deeper to understand where the issues arise, especially for Java, if I have time.
Thanks for your post.
September 14, 2016 at 7:13 pm
IDEs don’t take into account JVM startup or GC overhead CPU time; they sample precisely the user’s Java class code. That explains why one sees 4-6 seconds of CPU time for 10 loops versus 10-11 seconds overall (which includes mainly GC overhead, apart from JVM startup).
Perhaps that is why GC tuning is very important for server-side JVM-based Java apps, for both efficient memory and CPU usage.
My 2 cents.
October 11, 2016 at 9:22 pm
PyPy has a JIT too, and no one takes its JIT warm-up into account.
October 11, 2016 at 10:56 pm
All tests now run for 90 wall-clock seconds, which is more than enough time for JIT warm-up in all languages that have one.
September 15, 2016 at 3:56 pm
In the Java source you linked above, you could make a few minor changes, especially avoiding object creation inside the loop, so that GC overhead is minimized.
If you make those simple changes, the CPU time per 10 loops easily falls to 6-7 CPU seconds, basically putting Java 8 with the standard library in the same class/bucket as PHP etc., if not better.
Changed source code link: https://kishorekannan.wordpress.com/2016/09/15/reducing-object-creations/
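To illustrate the general idea only (a rough, hypothetical sketch, not the actual patched source from the link above): the result container can be created once and reused across iterations, so the benchmark loop produces far fewer short-lived objects for the GC to collect.

```java
import java.util.ArrayList;
import java.util.List;

public class ReusedBufferSketch {

    // Fills 'out' instead of allocating a new list on every call.
    static List<Integer> getPrimes7(int n, List<Integer> out) {
        out.clear();  // keeps the already-grown backing array, so no reallocation next time
        // ... the sieve logic from primes.java would go here, calling out.add(...) ...
        return out;
    }

    public static void main(String[] args) {
        // One buffer, created once and reused by every benchmark iteration.
        // 700_000 is just an illustrative initial capacity.
        List<Integer> buffer = new ArrayList<>(700_000);
        for (int i = 0; i < 10; i++) {
            getPrimes7(10_000_000, buffer);
        }
    }
}
```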
September 17, 2016 at 7:28 am
hi
Kindly have a look at the two links below whenever you have time and let me know (these should help the Java 8 times):
1.https://github.com/Kishore-Kannan/JavaPrimes/blob/master/primes_java8_array.java
2.https://github.com/Kishore-Kannan/JavaPrimes/blob/master/primes_java8_ArrayList.java
thanks
September 17, 2016 at 10:03 am
Both those optimizations have already been discussed:
– primes_java8_array.java — this directly pre-allocates the whole array, because it knows in advance how many numbers this algorithm returns
– primes_java8_ArrayList.java — a similar hint; on the first loop Java allocates the array dynamically (which is what we benchmark), but subsequent runs are done with a pre-allocated ArrayList
I can’t accept this. This has already been discussed a few times. Here is a direct explanation: http://www.famzah.net/download/langs-performance/java-discussion-by-Isaac-Gouy/messages/20120619-Re_C%2B%2B%20vs.%20Python%20vs.%20Perl%20vs.%20PHP%20performance%20benchmark-10.html
September 17, 2016 at 11:13 am
ok.thx np.
September 17, 2016 at 12:54 pm
One last thing: this is not about code, but if you are open to adding this runtime flag to the java command line, it would improve the Java 8 time as well.
That is, add the hint/flag -XX:NewRatio=1 to your java command line. This alone improves the result from 10-11 CPU seconds to 7 CPU seconds for 10-loop runs.
This is no code change and just needs to be run as: time java -XX:NewRatio=1 PrimeNumbersBenchmarkApp
Thanks in advance
September 18, 2016 at 7:38 pm
Just FYI, the timings I got are as follows:
1. primes.java (your original file), Java 8 (1.8.0_102): 10-11 CPU seconds per 10 loops for 90-second runs
2. The same primes.java (no code change, just run with the -XX:NewRatio=1 flag), same Java 8: 6-7 CPU seconds per 10 loops for 90-second runs
Kindly let me know if you would consider this. By the way, the flag is nothing special; it is well documented for use on specific test cases like this one, which creates a lot of throwaway dynamic objects that are mostly short-lived in loops.
This would put standard Java 8 on par with PHP and JavaScript, if not better, and without any code change, since, as you pointed out, that is the nature of your test.
Thanks
September 19, 2016 at 6:23 pm
I’ve given it a lot of thought, and I still consider this an optimization which is specific to this use-case.
In order to be more fair, I’ve summarized all Java optimization tips so far and put a visible link in the results table.
Last but not least, we’ve already demonstrated that Java can be as fast as C++, if it runs a more native Java implementation: “Java 8 (non-std lib)”.
September 21, 2016 at 9:59 am
In the Java example, why don’t you use LinkedList for the res variable? (It has O(1) complexity for adding elements.)
List res = new LinkedList();
September 21, 2016 at 10:02 am
I’ll appreciate a fork + pull request at GitHub, so that we can easily compare the original code with the code that you propose. Or is the change only in “List res = new LinkedList()” ?
September 21, 2016 at 10:09 am
Yes, the only change is one line:
List res = new LinkedList();
I get slightly better results on my machine.
September 22, 2016 at 4:36 pm
I made a pull request on Github with a proposed change.
September 22, 2016 at 10:26 pm
Thanks for the pull request: https://github.com/famzah/langs-performance/pull/4
Unfortunately, this change made Java 8 only 3% faster, while it made Java 7 run 59% slower. Since the Java 8 improvement is marginal and the impact for Java 7 is negative, I won’t merge it into “primes.java”. I’ve documented your idea in the “java-optimizations” section.
Here is what I see as results:
ORIG: Java 7 : user_t= 5.110 sys_t= 0.026 cpu_t= 5.136 to_CPP= – to_prev= – version=javac 1.7.0_111
NEW : Java 7 : user_t= 8.139 sys_t= 0.038 cpu_t= 8.177 to_CPP= – to_prev= – version=javac 1.7.0_111
diff: +59%
ORIG: Java 8 : user_t=18.195 sys_t= 0.148 cpu_t=18.343 to_CPP= – to_prev= – version=javac 1.8.0_102
NEW : Java 8 : user_t=17.594 sys_t= 0.135 cpu_t=17.729 to_CPP= – to_prev= – version=javac 1.8.0_102
diff: -3%
Raw results at: https://github.com/famzah/langs-performance/tree/master/results/LinkedList-tests
September 23, 2016 at 4:50 am
IMHO, the space heaviness of the Integer objects (heavier by a factor of 4-7 compared to int[]) created via the ArrayLists in primes.java would overshadow any other change. Unless one replaces them with a specialized int[] as in primes-alt.java, the Java runtimes cannot be improved significantly. Or you need to at least help the JVM mitigate the garbage-collection overhead through common/basic heap flags, as documented above for Java. GC overhead should ideally be 5-15% of total execution time for Java code, especially long-running code as on a server. GC overhead eats CPU and affects overall throughput in general, and can be measured via: time java -Xprof PrimeNumbersBenchmarkApp
October 12, 2016 at 2:06 pm
Wow. It’s not node.js 4.3.1 that’s slow compared to 4.2.6 – it’s the impact of your changed test method.
It’s ridiculous, but node.js 4.x performs the first 10-20 iterations very fast and all subsequent ones very slowly (~10 times slower). And it’s not caused by GC (easily seen with node --prof)!
It seems to be caused by some deopt issue: run `RUN_TIME=60 node --trace-opt-verbose primes.js` and you’ll see V8 recompiling get_primes7() many times, then every time “evicting entry from optimized code map” for some reason (is it a V8 bug?), and after several attempts it gives up on optimization, and then we see unoptimized code running. Which is, unsurprisingly, ~10-20 times slower…
If I run nodejs with --max_opt_count=100000 then it manages to calculate the primes ~440 times per 60 seconds. The default setting is 10 and gives only ~60 times per 60 seconds. This is the same with nodejs 4.6 and 6.0.
A much older version (nodejs 0.10) does not have this bug, but makes the primes only ~230 times per 60 seconds.
Unoptimized C++ (g++ 6.1.1) runs it ~230 times per 60 seconds on the same machine. Optimized C++ gives ~630 iterations. Java 8 (without boxing/unboxing) makes it ~650 times, even slightly better than C++…
October 17, 2016 at 6:29 pm
I discovered what causes this problem. The deopt occurs when you push() a lot of times to `var res = [2];`. It seems V8 tries to optimise this array as a tuple (i.e. a short, typed array), which causes a guard-check violation and a deopt when it grows into a long “vector”. So, if you change the initial line to `var res = []; res.push(2);` the bug goes away and V8 runs fast.
October 19, 2016 at 10:44 am
Well, I’m glad that it wasn’t “the impact of my changed test method”, but a bug in NodeJS. I’ve updated the source code as you suggested, redid the tests, updated the page here, and now NodeJS is back on top. Cheers.
October 16, 2016 at 9:15 pm
Actually, in my tests [1] Rust comes out on top and is 10% faster than C++ compiled by g++ 4.8.5. However, I see you are using a much newer gcc.
[1]: https://blog.ndenev.com/2016/10/15/rust_benchmark_win/index.html
October 17, 2016 at 6:20 pm
+1. I get the same. Rust is faster than g++ 6.1.1 on both i386 and amd64. LLVM gets close to it on i386 but not on amd64 (however, it varies by version).
October 17, 2016 at 7:13 pm
1) Is the same version of Rust (1.12.0) faster than the same version of g++ (6.1.1) that I used for the tests? Can you list the exact versions here?
2) How much faster? Can you post the raw numbers here?
P.S. To be honest, the 3% difference in my results isn’t significant, so we can easily call both languages equally fast on my platform.
October 17, 2016 at 10:39 pm
In my tests on a CentOS machine (as I posted already) Rust is consistently faster by 10%, which is quite significant.
I’ve just run another test on my ancient (Core 2 Duo) OS X laptop and Rust is 3% faster here: http://pastebin.com/2VERr87a
And while it is small, 3% is not an insignificant difference.
Note that this is compared to C++ compiled by Apple LLVM version 8.0.0.
November 1, 2016 at 1:54 am
UUuuuummmmmmm….. where’s C#?
November 1, 2016 at 6:06 am
Hi. I’m doing the benchmarks on Linux, and until now I thought that .NET Core was not yet production ready. What a pleasant surprise to see that Microsoft has already released it, and they also claim that “.NET is 8x faster than Node.js and 3x faster than Go”.
I’ll try to code a C# version but can’t promise when I can do it. If you contribute a C# version, that would be great.
November 4, 2016 at 12:27 am
I’ve added a C# .NET Core implementation which I tested on Linux.
November 4, 2016 at 10:20 am
Yesterday I was so eager to test C# .NET Core that I did the benchmarks on my laptop while it was running on battery. Today I performed them as usual, with my laptop plugged into the power socket and running at full CPU speed. This made a huge difference. I’ve updated the results table.
November 7, 2016 at 5:18 am
Have you guys noticed that the number of context switches for Go is very high compared to JS, PHP, and Python? Any ideas why that is?
November 7, 2016 at 1:38 pm
I’ve noticed it too but don’t know the exact reason. All tests run under similar conditions, so it shouldn’t be an error in the benchmarks.
Go and Java 8 are multi-threaded, and this seems to be causing a lot of context switching (between the threads?):
– Go – multi-threaded (real_TIME:90.08sec user_CPU:141.80sec)
– Java 8 – multi-threaded (real_TIME:90.60sec user_CPU:125.76sec)
Java 8 running the “non-std lib” implementation also has a lot of context switching, even though it seems that not much CPU time is used by the non-main GC thread. Running this implementation doesn’t seem to use the other non-main thread(s) a lot, but the context-switch counts are still huge (real_TIME:90.13sec user_CPU:89.97sec). Still, we are sure that Java 8 is multi-threaded. Probably the GC is done in another thread, and because the “non-std lib” implementation is very efficient, the GC thread gets switched to, notices that there isn’t anything to do, and gives control back to the main thread.
C# .NET Core seems similar to Java 8 running “non-std lib”. I guess that C# also has a separate GC thread, which in this case doesn’t have a lot of real work to do.
December 14, 2016 at 7:24 pm
Dear Ivan,
I appreciate the information you published about language performance.
I have a big question about Java: what do you mean by “non-std lib”?
Can you explain it, please?
December 14, 2016 at 9:53 pm
“non-std lib” means a “non-standard library”. The benchmark uses a custom Vector class re-implementation specifically for Integer, in order to avoid the unneeded boxing/unboxing. It works the same way but is not part of the standard Java libraries. The text “non-std lib” in the results table is a link; just click on it and you’ll be taken to the comment discussion about it.
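For illustration, a minimal sketch of what an int-specialized growable array could look like (this is only an illustration of the idea, not the exact class used in the benchmark):

```java
// Illustrative int-backed growable array: it stores primitives directly,
// so adding values never creates Integer wrapper objects (no boxing).
public class IntVector {
    private int[] data = new int[16];
    private int size = 0;

    public void add(int value) {
        if (size == data.length) {
            // Grow geometrically, like ArrayList does, but copy primitives instead of references.
            int[] bigger = new int[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, size);
            data = bigger;
        }
        data[size++] = value;
    }

    public int get(int index) {
        return data[index];
    }

    public int size() {
        return size;
    }
}
```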
January 7, 2017 at 12:40 am
I’ve modified the sources a bit: removed the timers, and used Igor Pavlov’s timer64 to benchmark.
So here are my results (view with a monospaced font):
Windows 10 x64, i7-3632QM;
March 28, 2017 at 9:44 am
It would be nice to include Perl with the PDL library.
March 30, 2017 at 12:34 pm
If you provide source code using PDL, I’ll run a benchmark.
March 30, 2017 at 5:01 pm
Can I contribute a new benchmark? With the Nim language.
March 30, 2017 at 5:13 pm
Can you please add a Nim benchmark?
But please, compile it using the latest compiler!
https://pastebin.com/p4BVnz9t
Compile command: nim c -d:release -o:primes.nim.out primes.nim
March 30, 2017 at 5:20 pm
All imports are from the stdlib.
March 31, 2017 at 3:26 am
Hi, to be honest, I’m a bit overwhelmed right now, and can’t do this myself.
I’d recommend that for a start you do the benchmark with Nim on your computer, and paste the results here. Also, do the benchmark with some of the other languages, so that we can compare the relative speed.
March 31, 2017 at 5:04 pm
I ran all benchmarks on an Intel i5 4460 CPU, using Manjaro 17.0 (based on Arch Linux).
I did only one run, but it shows some rough data:
C++ (g++ 6.3.1 with -O2) got 882 lines.
C++ (g++ 6.3.1 with -O3) got 931 lines.
But Nim can be compiled to C, C++, ObjC, or JS.
So Nim with C – nim c -d:release primes.nim – 629 lines (it used gcc 6.3.1 for compiling C)
Nim with C++ – nim cpp -d:release primes.nim – the same as Nim with C
RustC 1.15.0 (rustc -C opt-level=3) got 1065 lines
But honestly, I literally copy-pasted the Python source and made the Nim version out of it, and I’m new to Nim, so maybe better optimizations are possible.
Maybe I’ll find out if my Nim code is wrong in some way (maybe I’ve done something wrong)
March 31, 2017 at 5:08 pm
Oh wait, I can do some small optimizations
March 31, 2017 at 5:32 pm
This is a better version:
(embedded GitHub gist: primes.nim)
March 31, 2017 at 5:35 pm
So this version is final: https://gist.github.com/def-/fd6f528e51683f7b1baa60518b426d74
It shows roughly the same speed as C++ does!
April 4, 2017 at 10:51 pm
I’m glad to see that the Nim implementation is valid, and that Nim rocks!
April 20, 2017 at 12:07 pm
I was able to improve the JS implementation by about 40%. I used the Java trick and replaced the untyped array with a typed array.
function get_primes7(n) {
    if (n < 2) { return []; }
    if (n == 2) { return [2]; }
    var s = new Uint32Array(Math.floor((n - 1) / 2)); // one slot for every odd number in [3, n]
    for (var i = 3, j = 0; i < n + 1; i += 2) {
        s[j++] = i;
    }
    var mroot = Math.floor(Math.sqrt(n));
    var half = s.length;
    var i = 0;
    var m = 3;
    while (m <= mroot) {
        if (s[i]) {
            var j = Math.floor((m*m - 3) / 2); // int div
            s[j] = 0;
            while (j < half) {
                s[j] = 0;
                j += m;
            }
        }
        i = i + 1;
        m = 2*i + 3;
    }
    var count = 0;
    for (var x = 0; x < half; x++) {
        if (s[x] !== 0) {
            count++;
        }
    }
    var res = new Uint32Array(count + 1);
    res[0] = 2;
    for (var x = 0, j = 0; x < half; x++) {
        if (s[x] !== 0) {
            res[++j] = s[x];
        }
    }
    return res;
}

var startTime = Date.now();
var periodTime = 10 * 1000;
while ((Date.now() - startTime) < periodTime) {
    var res = get_primes7(10000000);
    console.log("Found " + res.length + " prime numbers.");
}
A surprising thing is that if we replace the vars with let or const, performance degrades considerably.
May 6, 2017 at 7:00 pm
Chart of this test, “total” values:

April 7, 2021 at 1:17 pm
A bit late to the party, but I came across this as I am benchmarking an M1 with various things. FYI, pure Perl is not supposed to be fast at math, but you are actually handicapping it further by using a while loop. Perl supports plain C-style for loops, which are faster; you are only meant to use while, do, etc. if it makes sense for your program, but here it does not.
You can replace the 6 lines in that if block with just:
for (my $j = int(($m*$m - 3) / 2); $j < $half; $j += $m) {
$s[$j] = 0;
}
It is much simpler to read, and the program becomes 10.3% faster on my Mac. There are many "clever" ways to improve the algorithm, but you can do the same with most languages, so this was just about fixing something that was bad practice.
May 18, 2021 at 9:30 pm
Hi, thanks for the hint. I can confirm that “for” instead of “while” improves speed by 6% on my laptop.
Here is more info about the results: https://github.com/famzah/langs-performance/commit/ee4d4440eda8e8b7e9b772334ee39b0a8056521d
It’s worth mentioning that we now have one less “$s[$j] = 0” assignment, for the initial value of “$j”. I don’t know if this contributes significantly to the 6% improvement.
August 5, 2021 at 9:27 pm
I conducted a similar benchmark (“Performance Comparison C vs. Java vs. Javascript vs. LuaJIT vs. PyPy vs. PHP vs. Python vs. Perl”) with quite similar findings:
https://eklausmeier.goip.de/blog/2021/07-13-performance-comparison-c-vs-java-vs-javascript-vs-luajit-vs-pypy-vs-php-vs-python-vs-perl/
Interestingly, the results depend on the underlying CPU architecture.