C++ vs. Python vs. Perl vs. PHP performance benchmark

Update: There is a part #2 of the benchmark results. This page has been translated into Spanish by Maria Ramos.


This all began when a colleague of mine stated that Python was terribly slow at maths. That really astonished me and made me check it out, as my father once told me he was very satisfied with Python because it was very maths oriented.

The benchmarks here do not try to be complete; they show the performance of the languages in only a few aspects: loops, dynamic arrays of numbers, and basic math operations.

Out of curiosity, Python was also benchmarked with and without the Psyco Python extension (now superseded by PyPy), which people say can greatly speed up the execution of any Python code without any modifications.
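For context, Psyco required no changes to the benchmarked script itself. A minimal sketch of how it was typically enabled, per the Psyco documentation (the try/except guard is my addition, so the snippet also runs where Psyco, a 32-bit CPython 2.x only extension, is not installed):

```python
# Psyco was enabled by adding two lines at the top of an otherwise
# unmodified script. The guard below is only for illustration: it lets
# the same script run unchanged where Psyco is unavailable.
try:
    import psyco
    psyco.full()  # ask Psyco to JIT-compile all functions in the program
    psyco_enabled = True
except ImportError:
    psyco_enabled = False

print("Psyco enabled:", psyco_enabled)
```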

Here are the benchmark results:

CPU times are in seconds (comma is the decimal separator). "vs. C++" and "vs. previous" show how much slower each entry is than optimized C++ and than the row above it; "link" points to the source code.

Language                 | User   | System | Total  | vs. C++ | vs. previous | Version   | Source
C++ (optimized with -O2) |  1,520 | 0,188  |  1,708 | -       | -            | g++ 4.5.2 | link
Java (non-std lib)       |  2,446 | 0,150  |  2,596 | 52%     | 52%          | 1.6.0_26  | link
C++ (not optimized)      |  3,208 | 0,184  |  3,392 | 99%     | 31%          | g++ 4.5.2 | link
Javascript (nodejs)      |  4,068 | 0,544  |  4,612 | 170%    | 36%          | 0.8.8     | link
Java                     |  8,521 | 0,192  |  8,713 | 410%    | 150%         | 1.6.0_26  | link
Python + Psyco           | 13,305 | 0,152  | 13,457 | 688%    | 54%          | 2.6.6     | link
Python                   | 27,886 | 0,168  | 28,054 | 1543%   | 108%         | 2.7.1     | link
Perl                     | 41,671 | 0,100  | 41,771 | 2346%   | 49%          | 5.10.1    | link
PHP 5.3                  | 94,622 | 0,364  | 94,986 | 5461%   | 127%         | 5.3.5     | link

Not measured directly:
* Javascript (SpiderMonkey): see the comments (SpiderMonkey seems as fast as C++ on Windows).
* Ruby: see the comments (Ruby seems 35% faster than standard Python).
* PHP 5.4: see roga’s blog results (PHP 5.4 seems 33% faster than PHP 5.3).

The clear winner among the script languages is… Python. :)

NodeJS JavaScript is pretty fast too, but internally it works more like a compiled language. See the comments below.

Please read the discussion about Java which I had with Isaac Gouy. He accused me of not comparing what I claim to be comparing, and of not wanting to show how slow or how fast the Java example program can be. You deserve the whole story, so please read it if you are interested in Java.

Both PHP and Python take advantage of their built-in range() function, since both languages have one. This speeds up PHP by 5%, and Python by 20%.

The times include the interpretation/parsing phase for each language, but it’s so small that its significance is negligible. The math function is called 10 times, in order to have more reliable results. All scripts are using the very same algorithm to calculate the prime numbers in a given range. The correctness of the implementation is not so important, as we just want to check how fast the languages perform. The original Python algorithm was taken from http://www.daniweb.com/code/snippet216871.html.
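For reference, here is a Python sketch of that "primes7" algorithm, reconstructed to match the JavaScript and Ruby ports quoted in the comments below (consult the linked source files for the exact benchmarked code):

```python
def get_primes7(n):
    """Return all primes <= n, sieving only the odd candidates."""
    if n < 2:
        return []
    if n == 2:
        return [2]
    s = list(range(3, n + 1, 2))  # the built-in range() mentioned above
    mroot = n ** 0.5
    half = len(s)
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:                   # m is still marked prime
            j = (m * m - 3) // 2   # index of m*m within s
            while j < half:        # zero out every odd multiple of m
                s[j] = 0
                j += m
        i += 1
        m = 2 * i + 3
    return [2] + [x for x in s if x]

# The benchmark calls the function 10 times with n = 10,000,000
# and prints the count of primes found on each iteration.
```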

The tests were run on an Ubuntu Linux machine.

You can download the source codes, an Excel results sheet, and the benchmark batch script at:
http://www.famzah.net/download/langs-performance/


Update (Jul/24/2010): Added the C++ optimized values.
Update (Aug/02/2010): Added a link to the benchmarks, part #2.
Update (Mar/31/2011): Using range() in PHP improves performance by 5%.
Update (Jan/14/2012): Re-organized the results summary table and the page. Added Java.
Update (Apr/02/2012): Added a link to PHP 5.4 vs. PHP 5.3 benchmarks.
Update (May/29/2012): Added the results for Java using a non-standard library.
Update (Jun/25/2012): Made the discussion about Java public, as well as added a note that range() is used for PHP and Python.
Update (Aug/31/2012): Updated benchmarks for the latest node.js.
Update (Oct/24/2012): Added the results for SpiderMonkey JavaScript.
Update (Jan/11/2013): Added the results for Ruby vs. Python and Nodejs.


55 thoughts on “C++ vs. Python vs. Perl vs. PHP performance benchmark”

  1. Nice little benchmark. I ran these tests on my iMac with Core i7 2.8 GHz:

    Language | User   | System
    C++      |  0.979 | 0.082
    Python   | 25.934 | 0.191
    Perl     | 32.553 | 0.108

    While the choice between Perl and Python really doesn’t make any significant difference, there is a massive gap between both of them and C++.

    More interestingly, I created a Javascript version of the algorithm and let it run on nodejs:

    Language            | User  | System
    Javascript (nodejs) | 2.812 | 0.599

    This is roughly 9 times faster than Perl or PHP, and only about 3 times slower than C++. Here is the Javascript version:

    function get_primes7(n) {
    	if (n < 2) { return []; }
    	if (n == 2) { return [2]; }
    
    	var s = [];
    	for (var i = 3; i < n + 1; i += 2) {
    		s.push(i);
    	}
    
    	var mroot = Math.floor(Math.sqrt(n));
    	var half = s.length;
    	var i = 0;
    	var m = 3;
    
    	while (m <= mroot) {
    		if (s[i]) {
    			var j = Math.floor((m*m-3)/2);   // int div
    			s[j] = 0;
    			while (j < half) {
    				s[j] = 0;
    				j += m;
    			}
    		}
    		i = i + 1;
    		m = 2*i + 3;
    	}
    
    	var res = [2];
    	for (var x = 0; x < s.length; x++) {
    		if (s[x]) {
    			res.push(s[x]);
    		}
    	}
    	return res;
    }
    
    for (var i = 0; i < 10; i++) {
    	var res = get_primes7(10000000);
    	console.log("Found " + res.length + " prime numbers.");
    }
    

    I will keep a close eye on that Javascript environment ;)

    • This was very insightful, thanks. You actually managed to astonish me – when I tried the nodejs Javascript on my computer, the same one used for the above tests, it ran *faster* than the C++ version. Then I recalled that I could compile C++ with optimizations, and now things seem logical again. Still, nodejs seems to be very fast, at least in this benchmark test.

      Here are my nodejs results (using nodejs v0.1.101) which can be compared directly with the original results in my post:

      Language                 | User  | System | Total | vs. C++ | vs. previous
      C++ (optimized with -O2) | 2.456 | 0.400  | 2.856 | -       | -
      Javascript (nodejs)      | 3.888 | 0.504  | 4.409 | 54%     | 54%
      C++ (not optimized)      | 4.352 | 0.404  | 4.756 | 67%     | 8%

      Though I’m not quite sure if nodejs doesn’t actually do the same as C++ – compile the program, and then execute it. In contrast, the script languages don’t compile their code – they parse it to bytecode and interpret it on-the-fly, which is, of course, slower than a real one-time compilation.

      P.S. I edited your comment, in an effort to make the layout a bit more readable. I hope you don’t mind…

      • Nodejs is V8, which tends to generate a lot of optimised machine code
        (see http://www.youtube.com/watch?v=hWhMKalEicY).

        So (hopefully) the actual implementation is more like a JIT-compiled machine-code version of the program than interpreted bytecode. (Which is good performance-wise.)

        Javascript development is REALLY REALLY interesting to follow. Most of the ideas going into Javascript implementations like V8, as well as the new Apple and Mozilla engines competing to be the fastest Javascript engine, are very applicable to other (scripting) languages as well.

        Javascript has the advantage of being the browser language, which puts it in the middle of the browser wars, and it has therefore had many resources, much money and many engineers thrown at it in the effort to make it faster. (It’s just too bad that they all implement old Javascript versions instead of the newer ones, which are more ‘Pythonic’.)

    • This PHP version is a bit more optimized. There are some inefficiencies in the code above; the following is about 20% faster on my system.

      function get_primes7($n) {
          if ($n < 2) return array();
          if ($n == 2) return array(2);
          $s = range(3, $n, 2);
      
          $mroot = sqrt($n);
          $half = count($s);
          for ($i=0, $m=3; $m <= $mroot; $m = 2*++$i + 3) {
              if ($s[$i]) {
              for ($j = ($m*$m - 3) >> 1; $j < $half; $j += $m) $s[$j] = 0;
              }
          }
          return array(2) + array_filter($s);
      }
      
      • Hm, very strange. The optimized code doesn’t run faster on my system… There are some very good ideas though:

        * range() — faster by 5%
        * “$number / 2” vs. “$number >> 1” — maybe slightly faster
        * while() substituted with for() — no change
        * “$i = $i + 1” vs. “++$i” — no change
        * array_filter() — slower (and it also returns one prime fewer)

        With your help, my tests show 5% faster PHP execution now. Thanks.

  2. Some other comparisons with cython :)

    On my machine the python script takes 28.7s normally and 13.9s with psyco.

    Compiling the unmodified code with Cython shaves almost 8 s off, running in 20.9 s.
    When declaring C types for i, m and j, it goes down to 18.3 s.

  3. Pingback: C++ vs. Python vs. Perl vs. PHP performance benchmark (part #2) « /contrib/famzah

  4. If you are comparing Perl performance for mathematical functions you should really be using the Perl Data Language additions to Perl.

    PDL exists for this very purpose and hugely outperforms basic Perl.

    If you compared PDL I think you would see a very different story.

    • As far as I know, and according to the Wikipedia article about PHP accelerators, they help only “to avoid the overhead of parsing and compiling source code on each request”. Since I’m doing only one request (script execution) which does a lot of work without exiting the PHP script, the PHP interpreter spends most of its time executing the algorithm itself, which is pretty short (only a couple of lines).

      Just in case, I did test the PHP benchmark with and without APC enabled, and it made no difference on my computer.

      Code optimizers are a different story. In theory they should make the PHP code run faster. This is what Zend claim for their well-known PHP Optimizer which for PHP 5.3 is now called Zend Guard:

      Zend Optimizer is a free application that runs the files encoded using Zend Guard and enhances the overall performance of your PHP applications.

      My tests, however, showed no difference in performance. Other tests confirm my results (PHP compiler performance; Impact of Zend Optimizer on PHP Performance).

      I’m not really sure, though, whether Zend Guard does the same thing as Zend Optimizer did for PHP 5.2 and older. In PHP 5.2 we had an INI setting for the optimization level, for example. Zend Guard has almost no settings, and it could be that it is only there to decode Zend-encoded PHP scripts. The Zend Server suite has a Zend Optimizer+ component which may do what Zend Optimizer did for PHP 5.2, but Zend Optimizer+ is not shipped as a separate .so library, so I’m not going to try it.

      • You’re welcome. :)

        Yes, we can say that but for this particular test case. This simple benchmark is pretty limited — review the second, bold paragraph in the beginning of the article. You need to test further for your particular setup (program algorithms, usage pattern, hardware and OS, etc).

      • In the Java version there is a lot of boxing/unboxing. For the second ArrayList, why not use an int[]? This should boost Java by a great factor.

  5. Pingback: roga's blog » PHP 5.4 效能有長足的進步

  6. class PrimeNumbersGenerator {
    
        private IntVector res;
    
        public PrimeNumbersGenerator(int max) {
            res = new IntVector(max);
        }
    
        public IntVector get_primes7(int n) {
            res.clear();
            if (n < 2) {
                return res;
            }
            if (n == 2) {
                res.add(2);
                return res;
            }
            int[] s = new int[(n - 2) / 2];
            for (int i = 3, off = 0; i <= n; i += 2) {
                s[off++] = i;
            }
            int mroot = (int) Math.sqrt(n);
            int half = s.length;
            int i = 0;
            int m = 3;
            while (m <= mroot) {
                if (s[i] != 0) {
                    int j = (int) ((m * m - 3) / 2);
                    s[j] = 0;
                    while (j < half) {
                        s[j] = 0;
                        j += m;
                    }
                }
                i = i + 1;
                m = 2 * i + 3;
            }
            res.add(2);
            for (int it = 0; it < s.length; ++it) {
                if (s[(it)] != 0) {
                    res.add(s[it]);
                }
            }
            return res;
        }
    }
    
    class IntVector {
    
        int[] data;
        int off = 0;
    
        public IntVector(int max) {
            data = new int[max];
        }
    
        public void add(int x) {
            data[off++] = x;
        }
    
        public void clear() {
            off = 0;
        }
    
        public int size() {
            return off;
        }
    
        public void set(int i, int x) {
            data[i] = x;
        }
    
        public int get(int x) {
            return data[x];
        }
    }
    
    class PrimeNumbersBenchmarkApp {
    
        public static void main(String[] args) {
            PrimeNumbersGenerator gen = new PrimeNumbersGenerator(10000000);
            for (int i = 1; i <= 10; ++i) {
                System.out.format("Found %d prime numbers.\n", gen.get_primes7(10000000).size());
            }
        }
    }
    

    This is a much faster Java version… (1.5 seconds on my machine!)

    • Great job! I’ve slightly modified the source code, in order for it to be formatted like the others, so that I can compare it line by line. I’ve saved it online:

      http://www.famzah.net/download/langs-performance/primes-not-same-algo-but-5x-faster.txt

      While your implementation is 5x faster, I cannot accept it for the following reasons:
      * It allocates more memory (a static array of size “n”), which is a fair tradeoff in our case but makes it a different algorithm.
      * It uses background knowledge about the algorithm (the size of “int[] s” is pre-defined), and thus the algorithm is not the same as in the other languages.
      * It does not use the same language features as the other implementations (static vs. variable-length array lists).

      • Here is an “allowed” implementation:
        It doesn’t access the array directly and it uses variable length lists!

        import java.util.*;
        
        class PrimeNumbersGenerator {
        
            private PrimeNumbersGenerator() {
            }
        
            public static IntList getPrimes7(int n) {
                IntList res = new IntList();
                IntList s = new IntList();
        
                if (n < 2) {
                    return res;
                }
                if (n == 2) {
                    res.add(2);
                    return res;
                }
        
                for (int i = 3; i <= n; i += 2) {
                    s.add(i);
                }
        
                int mroot = (int) Math.sqrt(n);
                int half = s.size();
                int i = 0;
                int m = 3;
                while (m <= mroot) {
                    if (s.get(i) != 0) {
                        int j = ((m * m - 3) / 2);
                        s.set(j, 0);
                        while (j < half) {
                            s.set(j, 0);
                            j += m;
                        }
                    }
                    i += 1;
                    m += 2;
                }
        
                res.add(2);
                res.addAllNotNull(s);
                return res;
            }
        }
        
        /*
         * Variable length int ArrayList.
         * 
         * (Like the Java java.util.ArrayList.
         * Just without the slow java.lang.Integer unboxing/boxing)
         */
        class IntList {
        
            private static final int DEFAULT_CAPACITY = 1000;
            public int[] data;
            private int off = 0;
        
            public IntList() {
                data = new int[DEFAULT_CAPACITY];
            }
        
            public void add(int x) {
                // if there is not enough room to store the value a new array is created.
                // (like in java.lang.ArrayList)
                if (off >= data.length) {
                    data = Arrays.copyOf(data, data.length * 5);
                }
                data[off++] = x;
            }
        
            public void clear() {
                off = 0;
            }
        
            public int size() {
                return off;
            }
        
            public void set(int i, int x) {
                data[i] = x;
            }
        
            public void addAllNotNull(IntList b) {
                int[] items = b.data;
                int len = items.length;
                for (int i = 0; i < len; i++) {
                    int val = items[i];
                    if (val != 0) {
                        add(val);
                    }
                }
            }
        
            public int get(int x) {
                return data[x];
            }
        }
        
        class PrimeNumbersBenchmarkApp {
        
            public static void main(String[] args) {
                for (int i = 1; i <= 10; ++i) {
                    System.out.println("Found " + PrimeNumbersGenerator.getPrimes7(10000000).size() + " prime numbers!");
                }
            }
        }
        
      • Hey, I’m very happy that you’re so enthusiastic to re-work the code – it is now the same as the original benchmark. An interesting observation is that with “DEFAULT_CAPACITY = 100”, the program runs 20% faster.

        Sorry that I’m so obsessed with details but I have the following comments and questions about your custom library “IntList”:

        The program gets extremely slow if “DEFAULT_CAPACITY = 10000” (or any other number below, say, 500000) and in IntList.add() we use the following:

        data = Arrays.copyOf(data, data.length + DEFAULT_CAPACITY); // original was "data.length * 5"
        

        * What’s the magic behind the decision to multiply by 5? This may use a lot of memory – if I had 10 million elements and wanted to add just one more, then another 40 million elements would be allocated from the memory pool, for the addition of just one element. Those 40 million elements, each being a 32-bit Java “int”, would cost about 150 MBytes of RAM. You can do the math if we had 100 million elements and wanted to add just one more.
        * Does java.lang.ArrayList use the same exponential memory allocation policy?
        * In a nutshell: this does not scale well, correct?

  7. 1. If you don’t multiply, the array has to be reallocated very often, which takes a (very) long time.
    (To reach 10,000,000 elements with DEFAULT_CAPACITY = 10,000 you would need 1,000 reallocations! But multiplying by a factor, you would need only about log2(10,000,000) reallocations.)

    2. java.lang.ArrayList does the same (its factor is 2, not 5).

    3. The exponential solution is much faster, and the overhead isn’t a very big problem
    (especially if you use the factor 2).
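    The reallocation counts above are easy to verify with a quick sketch (the helper name `count_reallocations` is made up for illustration):

```python
def count_reallocations(target, capacity, grow):
    """Count how many times the backing array must be reallocated
    before it can hold `target` elements."""
    n = 0
    while capacity < target:
        capacity = grow(capacity)
        n += 1
    return n

TARGET = 10_000_000
CHUNK = 10_000  # DEFAULT_CAPACITY

additive = count_reallocations(TARGET, CHUNK, lambda c: c + CHUNK)  # grow by a fixed chunk
doubling = count_reallocations(TARGET, CHUNK, lambda c: c * 2)      # grow by a factor of 2
print(additive, doubling)  # 999 vs. 10
```

    Each reallocation also copies the whole array, so with additive growth the total copying cost is quadratic in the final size, while with a multiplicative factor it stays linear (amortized O(1) per add), at the cost of some over-allocation.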

    • Cool. I’ve changed the factor to 2, in order to be the same as the default Java behavior. This didn’t change the execution speed in our case. I also modified the syntax a bit, in order for it to look the same as the other implementations, which also didn’t affect execution speed.
      Thanks for your thorough explanations and excellent work. I’ve added the benchmark results of your implementation in the comparison table.

  8. Hi, I made a port of the bench in AS3, please try it out on your test system.

    package 
    {
    	import flash.display.Sprite;
    	import flash.events.Event;
    	import flash.text.TextField;
    	import flash.utils.getTimer;
    	
    	/**
    	 * ...
    	 * @author Raphaël Benzazon
    	 */
    	public class Main extends Sprite 
    	{
    		
    		public function Main():void 
    		{
    			if (stage) init();
    			else addEventListener(Event.ADDED_TO_STAGE, init);
    		}
    		
    		private function init(e:Event = null):void 
    		{
    			removeEventListener(Event.ADDED_TO_STAGE, init);
    			// entry point
    			var timeBefore:int = getTimer();
    			for (var i:int = 0; i < 10; i++) {
    				var res:Vector.<int> = get_primes7(10000000);
    			}			
    			
    			trace("Found " + res.length + " prime numbers.");
    				
    			var timeAfter:int = getTimer()-timeBefore;
    			var text:TextField = new TextField();
    			addChild(text);
    			text.text = String(timeAfter);
    		}
    		private function get_primes7(n:int):Vector.<int> {
    			var res:Vector.<int>;
    			var lnt:int;
    			if (n < 2) { return new Vector.<int>(); }
    			if (n == 2) {
    				res = new Vector.<int>();
    				res.push(2);
    				return res;
    			}
    			var j:int;
    			var s:Vector.<int> = new Vector.<int>();
    			lnt = n + 1;
    			for (var i:int = 3; i < lnt; i += 2) {
    				s.push(i);
    			}
    
    			var mroot:int = int(Math.sqrt(n));
    			var half:int = s.length;
    			i = 0;
    			var m:int = 3;
    
    			while (m <= mroot) {
    				if (s[i]) {
    					j = (m*m-3)/2;   // int div
    					s[j] = 0;
    					while (j < half) {
    						s[j] = 0;
    						j += m;
    					}
    				}
    				i++;
    				m = 2*i + 3;
    			}
    
    			res = new Vector.<int>();
    			res.push(2);
    			lnt = s.length;
    			for (var x:int = 0; x < lnt; x++) {
    				if (s[x]) {
    					res.push(s[x]);
    				}
    			}
    			return res;
    		}
    
    
    		
    	}
    	
    }
    
    • Hi, as far as I could understand, there is no AS3 compiler in the Ubuntu packages, and getting one isn’t exactly trivial. I’d appreciate it if you ran a few of the benchmarks on your system (for example the C++ ones), then your AS3 one, and shared the results here.

  9. Here is a version of the JavaScript benchmark, but on SpiderMonkey (the Firefox javascript engine), executed using the spidermonkey ‘js’ command line. I included comparisons against mingw C++ and win64 Python2.7 for context.

    ——————–

    akesterson@andrew-laptop ~/source
    $ bash bench.sh
    == C++ (optimized with -O2) ==

    real 0m2.141s
    user 0m0.015s
    sys 0m0.015s

    real 0m2.163s
    user 0m0.047s
    sys 0m0.000s

    g++.exe (GCC) 4.5.2

    == C++ (not optimized) ==

    real 0m5.032s
    user 0m0.000s
    sys 0m0.000s

    real 0m5.390s
    user 0m0.031s
    sys 0m0.000s

    g++.exe (GCC) 4.5.2

    == Python 2.7 ==

    real 0m41.942s
    user 0m0.031s
    sys 0m0.000s

    real 0m42.127s
    user 0m0.015s
    sys 0m0.031s

    Python 2.7.2

    == JavaScript (SpiderMonkey) ==

    real 0m5.022s
    user 0m0.015s
    sys 0m0.031s

    real 0m4.759s
    user 0m0.000s
    sys 0m0.015s

    JavaScript-C 1.8.5+ 2011-04-16

    —————

    While this initially suggests that SpiderMonkey is only ~20% slower than V8/Node.js, it can be misleading. My C++ benchmarks are ~30% slower than yours with the same code and compiler (g++ 4.5.2), so it is probably safe to assume that my SpiderMonkey numbers are skewed by a similar amount. Adjusting the SpiderMonkey performance by -30%, we get an (average) real runtime of 6.357s, or ~40% slower than Node.js running under V8.

    … But still faster than the other “made for serious work” scripting languages, by an astonishingly large margin.
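    The -30% adjustment in the comment above can be double-checked with some quick arithmetic (assuming a simple linear scaling between the two machines):

```python
# Average of the two SpiderMonkey "real" runs from the listing above
spidermonkey_avg = (5.022 + 4.759) / 2   # 4.8905 s

# The commenter's C++ runs were ~30% slower than the blog's,
# so scale the SpiderMonkey average up by the same factor
adjusted = spidermonkey_avg * 1.30       # ~6.357 s, as stated

# Compare with the blog's node.js total of 4.612 s
ratio = adjusted / 4.612                 # ~1.38, i.e. ~40% slower
print(round(adjusted, 2), round(ratio, 2))
```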

    • Before approving this comment, I wrote to “akesterson” per-email with some questions:

      (begin my questions)

      Thanks for the comment and benchmark results. Something’s strange about the numbers though — they show almost nothing for the “user” and “sys” times… I also see “g++.exe” — did you run the tests on Windows?

      Not knowing the real “user” and “sys” values leaves a small backdoor for the JS compiler to use multi-threading which is “unfair competition” here since all tests are strictly single-threaded. Furthermore, you should also provide the source code.

      (end my questions)

      Here is his very thorough and satisfying reply:

      (begin “akesterson” reply)

      Thanks for getting back to me!

      Yes, I performed the tests on windows using mingw, as pointed out in the first paragraph of the comment (sorry if that wasn’t clear, I really didn’t belabor the point at all).

      The user times on Windows are confusing; the “real” times in mingw are the only ones that can really be trusted or should be considered.

      As far as multithreading, SpiderMonkey is absolutely single threaded (see http://blog.mozilla.org/luke/2012/01/24/jsruntime-is-now-officially-single-threaded/ for a brief report of same; I was using the stock js.exe command line, which does not support the W3 web workers stuff).

      Regarding the source, I did not change the source at all, except to change the javascript console.log to a print statement. here is the diff:

      akesterson@andrew-laptop ~/source/unorganized
      $ wget -O primes-js.txt.orig http://www.famzah.net/download/langs-performance/primes-js.txt
      --2012-10-22 00:19:12--  http://www.famzah.net/download/langs-performance/primes-js.txt
      Resolving www.famzah.net (www.famzah.net)... 64.14.78.45
      Connecting to www.famzah.net (www.famzah.net)|64.14.78.45|:80... connected.
      HTTP request sent, awaiting response... 200 OK
      Length: 656 [text/plain]
      Saving to: `primes-js.txt.orig'
      
           0K                                                       100%  476K=0.001s
      
      2012-10-22 00:19:13 (476 KB/s) - `primes-js.txt.orig' saved [656/656]
      
      akesterson@andrew-laptop ~/source/unorganized
      $ diff primes-js.txt.orig primes-js.txt
      39c39
      <       console.log("Found " + res.length + " prime numbers.");
      ---
      >       print("Found " + res.length + " prime numbers."); 
      

      I hope these explanations are sufficient to clear up any confusion regarding the sanity of my test results. Thanks again.

      (end “akesterson” reply)

  10. Here is the code if you need it for Ruby. What are the benchmark times for nodejs vs. Ruby?

    def get_primes7(n)
    	if n < 2
    		return []
    	end
    	
    	if n == 2
    		return [2]
    	end
    	
    	#s = Range.new(3, n, 2)???
    	s = []
    	i = 3
    	while i < n + 1
    		s.push(i)
    		i += 2
    	end
    	
    	mroot = Math.sqrt(n)
    
    	half = s.length
    	i = 0
    	m = 3
    	while m <= mroot
    		if s[i] != 0
    			j = (m * m - 3) / 2
    			s[j] = 0
    			while j < half
    				s[j] = 0
    				j += m
    			end
    		end
    		i = i + 1
    		m = 2 * i + 3
    	end
    	
    	res = [2]
    	for v in s
    		if v != 0
    			res.push(v);
    		end
    	end
    #	x = 0
    #	while x < s.length
    #		if s[x] != 0
    #			res.push(s[x])
    #		end
    #		x = x + 1
    #	end
    	return res
    end
    
    
    res = [];
    for i in 1..10
    	res = get_primes7 10000000
    	puts "Found #{res.length} prime numbers.\n"
    end
    
    • Hi,

      Here are the benchmark on my system:
      * nodejs (0.6.12): 5,724 s
      * Ruby (1.9.3): 23,085 s
      * Python (2.7.3): 31,294 s

      If we linearly adjust the above times with a coefficient of “0,88” so that they match the already done benchmarks in the blog post, we get the following numbers:
      * nodejs (0.6.12): 5,037 s
      * Ruby (1.9.3): 20,315 s
      * Python (2.7.3): 27,539 s

      This means that standard Python is roughly 35% slower than Ruby. Interesting. :)
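      The rescaling above is just linear scaling by the coefficient; a quick check of the arithmetic (the labels are only for readability):

```python
# Raw timings from this machine, in seconds
raw = {"nodejs 0.6.12": 5.724, "Ruby 1.9.3": 23.085, "Python 2.7.3": 31.294}
coef = 0.88  # aligns this machine with the blog post's original timings

adjusted = {lang: round(t * coef, 3) for lang, t in raw.items()}
print(adjusted)  # nodejs 5.037, Ruby 20.315, Python 27.539

# How much slower Python is relative to Ruby
slowdown = adjusted["Python 2.7.3"] / adjusted["Ruby 1.9.3"] - 1
print(round(slowdown * 100))  # roughly 35-36 percent
```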

  11. Benchmarks are all well and fine but as time has shown such matters are very dependent on exactly what type of application is being developed.

    Getting VERY dirty would mean, guess what, using assembler. A dedicated webserver with your application coded natively in assembler will make C++ look like it’s standing still, but is it practical? Is it practical on a Linux or Windows box? No. And yes. How bi-polar of me. It DEPENDS on what one needs to accomplish. If I need to distribute hundreds of thousands of incoming connections on a single connection server and send them off where they need to go, then yes, assembler is going to perform far better than, say, a C++ solution.

    The same is true of web applications. C++ and Java are clear winners in performance. But are they, or which is, more practical? For a native, high-traffic site, say an auction venue, C++ or Java are up to the task. eBay uses C++ and, of course, oodles of connection servers, database servers etc. on dedicated machines, as does Amazon.

    However, eBay’s or Amazon’s help pages, forums, or even static pages need not get all performance-freaky.

    PHP and the Apache web server are themselves coded in C. That in itself should (but doesn’t) quell the PHP’ers who say, “PHP is the way to go for just about everything.” It’s not.

    Python, Ruby and other more vertical languages also have their areas where they do quite well. The point being: planning ahead of time and understanding WHICH language best suits an application is important. Very important.

    At enterprise level, C++, Java and Microsoft’s C#, with their oodles of consistently supported applications, are the clear choices.

    C++ tends to power the biggest, most complex, high-traffic and query-heavy sites on the net. It’s the worst to maintain but yields the best results. Java is used widely in enterprise applications, LOTS (I mean lots) of point-of-sale systems accessible via terminals, online etc. In your AT&T store, Verizon, Walmart etc.: all Java. Regional inventory push/pull systems for the likes of Walmart, Target and so on: all Java. Java is fast and, yes, tends to be more maintainable than C++.

    Microsoft, as noted, has gobs of consistently supported applications. .NET gives an enterprise desktop and client/server capability, the use of ANY language that supports .NET (VB, C#, many others, even Cobol via other ISVs), not to mention things such as Office (Word, Excel, etc.), Sharepoint, and so on.

    The trade-off in all of this is development time, deployment, scalability, performance etc. We all know this as developers.

    But what we as developers often DON’T take into account is the future of HARDWARE.

    This is actually where Microsoft is smart. Inefficient as the bloated beasts .NET, Windows, SQL Server etc. are… Microsoft is banking on hardware coming to them. That is to say, hardware performance will give them “the edge” in time. I believe they are right, assuming the global economy doesn’t collapse.

    Eventually SSDs will rid the world of mechanical drives, and this will be an enormous leap across the board as we end up with enormous SSDs and enormous data and addressing busses.

    Enterprise-level applications (on web / off web, client/server) will no longer need to concern themselves with efficiency. We have seen this in desktop PC usage for years. Linux is far more efficient than Windows or Windows Server, but people in their homes choose Windows and pay for it vs. free Linux. People pay for Macs with OS X vs. Linux.

    Many development companies have tried to support Linux in commercial applications for home and business, and the sales just don’t make it. As they say, in a community, people who use free things tend to expect free things. Dunno.

    Eventually everything will end up in “the cloud”, happily piped into our home entertainment centers, which are TV, stereo, computer, game machine etc. all in one. People will pay, be happy (perhaps controlled) and then some.

    But the software technology that sits behind that revolution when it happens? More than likely you will see Microsoft’s name on it.

    There comes a time, in other words, where efficiency and all the compu-jibe become immaterial. This is the case with home computing, the Windows or Mac PCs. If I were to count all the PCs in, say, southern California and what OS they run, the numbers would outpace every home Linux installation and Mac install in the world, perhaps more than both combined. Sad perhaps, but surely true.

    In business and webs this is not the case. Not yet. But as the Net connectivity, hardware technologies continue to evolve naturally the same thing will occur. Businesses will go, “We want what works and gives us the most options to grow and is the LEAST complex for US to use daily”. That will be IMHO Microsoft who has continually now for years and years on end been advancing software technologies albeit not caring about efficiency as a “practical goal” as they consider efficiency of the shall we say “server based” technologies to become extinct as they have in home computing and even console gaming systems coding etc. Thus expending oodles of resources to make everything Microsoft efficient is simply wasting ones time as eventually the hardware technologies will make it moot as it has in home computing. Further, the cost of this hardware ALWAYS goes downward in time.

    Hosting companies dont like talking about it. You dont see for example hosts of hosts offering 10 year old dedicated servers at costs of shared hosting packages. No No. Yet, those machines would do more for ones growing and slowing PHP based web(s) than any nifty new shared box they sit upon.

    I recently deployed a site for some folks (PHP CMS deal) to a decent VPS host. Site runs fine. Sits on a machine thats a quad core zip zip machine. Know what? The AMD 1700 (yes you read that right) based machine I have LAMP on (and not loaded over a LAN but the net) runs RINGS around it and that 1700 has considerably more running on it than the slimmed down VPS. Sure… there are all forms of differences at play, but the results are quite dramatic. I’d benchmark it but again, doesnt matter. What matters is they got their site and are happy.

    OK, one last gripe? About some of the “biggie names” in open source. The Magentos and Joomlas, etc.

    Here are these weenies who develop these PHP-based applications that are essentially not any better than the similar applications that abound, in fact worse than many. But they have a community of developers who make the software do lots of nifty things. Modules, plugins, etc. Then THEY capitalize, be it by making editions for “those who need to pay for things” or jetsetting around the globe selling everyone on how wonderful they and their works are, charging for seminars!

    I call them “Open-a__” developers.

    Perhaps brilliant in using others to become millionaires and famous themselves? LOL.

    But ya know, I never see them or hear of ’em banging out, say, open-source C++ frameworks, or Java ones, those that are considered by the WORLD as PERFORMANCE-based technologies. Uh uh. IMHO, and I know PHP, it’s a mess, just a mess.

    For example, look at Liferay (not sure if it’s still free these days) and Joomla or Drupal. Install ’em, use some nice profiling software, and then pick your jaw up off the ground when you see the performance curves.

    But see… again, it doesn’t seem to matter. What’s actually best or most efficient becomes moot in the face of the masses.

  12. You can use multiprocessing; try this Python source:

    import time

    from multiprocessing import Process

    def get_primes7(n):
        """ standard optimized sieve algorithm to get a list of prime numbers
        --- this is the function to compare your functions against! --- """
        if n < 2:  return []
        if n == 2: return [2]
        # do only odd numbers starting at 3
        s = range(3, n+1, 2)
        # n**0.5 simpler than math.sqrt(n)
        mroot = n ** 0.5
        half = len(s)
        i = 0
        m = 3
        while m <= mroot:
            if s[i]:
                j = (m*m-3)/2  # int div
                s[j] = 0
                while j < half:
                    s[j] = 0
                    j += m
            i = i+1
            m = 2*i+3
        res = [2]+[x for x in s if x]
        print "Found", len(res), "prime numbers."
        return res

    t1 = time.time()
    procs = []
    for t in range(10):
        p = Process(target=get_primes7, args=(10000000,))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()  # wait for every worker, otherwise the timing below is meaningless
    t2 = time.time()
    print 'Execution of all processes is complete, time: %.2f' % (t2-t1)
    
  13. Hello,
    I just want to say that your C++ code is really not the way C++ coders implement algorithms. To demonstrate that, I made some really small changes to your code and was able to reduce the runtime by 40% (I use the Intel compiler; this should be checked on gcc)!! Remember: one of the BIGGEST advantages of C++ in comparison to Java (and also Python) is that you can control allocation of heap memory (which is a very slow process). So don't create unnecessary vectors, and don't do unnecessary reallocation if you don't need to. With one call of the vector function reserve(), the use of iterators, and the use of references, my version only allocates memory ONCE!

    Another thing to remember: push_back is slower than operator [] even when there is no reallocation!!!

    Here is my code:

    #include <vector>
    #include <algorithm>
    #include <iostream>
    #include <cmath>
    
    void get_primes7( int n, std::vector<int>& res ) {
    
    	if (n < 2) {
    		res.resize(0);
    		return;
    	}
    
    	if (n == 2) {
    		res.resize(1);
    		res[0] = 2;
    		return;
    	}
    	
    	// we don't need a second vector
    	res.resize( static_cast<int>(n/2.0) );  // resize vector only changes .end() 
                                                    //  iterator. because we have enough capacity
                                                    // it does NOT allocate or deallocate heap memory	
    	size_t j = 0;
    	for (int i = 3; i < n + 1; i += 2, ++j) res[j] = i;  // operator [] is faster than push_back 
    	res.resize(j); // Now we need the correct size for the algorithm below
    
    	int mroot = sqrt(n);
    	int half = static_cast<int>(res.size()); // static_cast is always better than reinterpret_cast!
    	int i = 0;
    	int m = 3;
    
    	while (m <= mroot) {
    		if (res[i]) {
    			int j = static_cast<int>(0.5*(m*m - 3)); // inside a loop, replace division /2 by mult. 0.5
    			res[j] = 0;
    			while (j < half) {
    				res[j] = 0;
    				j += m;
    			}
    		}
    		i++; 
    		m = 2*i + 3;
    	}
    
            res.push_back(2);
    	std::vector<int>::iterator pend  = 
            std::remove (res.begin(), res.end(), 0);  // remove does not change capacity
    			                          // or size. Only reorder the vector and returns the 
    			                          // iterators that points to the 
    		                                  // last non zero position
    	res.resize(pend-res.begin()); // again - no allocation or reallocation of memory!
    }
    
    int main() {
    	std::vector<int> res;
    	res.reserve(10000001);  // after you reserve memory, resize() only plays with its internal 
                                    // pointers as long as it does not exceed vector previous capacity
    	for (int i = 1; i <= 10; ++i) {
    		get_primes7(10000000, res);		
    		std::cout << "Found " << res.size() << " prime numbers.\n"; 
    	}
    
    	return 0;
    }
    
    • Hi, thanks for the comment and source code. You have, however, changed the algorithm, which makes this source code not comparable to the others. I’ve already had such a discussion about Java; please take a look at it, especially my first reply. I’m quoting the first paragraph here:

      [begin quote]
      This is a fair competition with fair rules — every language uses the very same algorithm. I know that the algorithm can be adjusted/optimized specifically for each language, but that’s not the point. The idea is, if we used another algorithm which *did* require that we use dynamic arrays everywhere, to compare how each language does the job in regards to performance. Hence I didn’t try to think if I can optimize the algorithm in any language (not that I can for all of them). I tried to explain that by “The correctness of the implementation is not so important, as we just want to check how fast the languages perform”. Which in plain English means — we try to use the very same usage pattern, as if the task really required it. So let’s pretend that we really needed the boxing/unboxing, just as we did for the other languages — all languages do the same algorithm complexity using the same (dynamic) structures.
      [end quote]

      What I find incompatible in your implementation, compared to the other languages:
      1. Let’s pretend that we needed the second vector — just as we have it in all other implementations. If we weren’t implementing the Sieve of Eratosthenes algorithm, we may have needed it for something else. The idea here is to see how languages operate with two different arrays.
      2. Pre-allocation of memory/heap/array-size is not what we were after. As already mentioned, we want to see how each language manages to allocate memory in a dynamic way.

      The other proposed changes are good points:
      1. static_cast() — if that’s faster, we should use it indeed.
      2. Operator [] is faster than push_back() — ditto. However, this requires an array with a predefined size, correct?
      3. Replace division /2 by mult. 0.5 — does this really make any difference?

      In a nutshell, we could use the static_cast() and the division trick, in order to speed up the original C++ algorithm, and at the same time to keep it comparable with other languages, right?

  14. Hi.

    Nice reply, thanks; I understand your point. However, even if you want a second vector, I made some crucial changes in that implementation that optimize your code a lot (30%). Actually, the pre-allocation of the answer only made the code 10% faster compared to the things I will show in this message.

    “2. Operator [] is faster than push_back() — ditto. However, this requires an array with a predefined size, correct?”.

    That is the most important optimization!!! The difference between push_back and operator [] can be huge, for two reasons. First, push_back does SEVERAL reallocations: you start from size 0, and every time the vector runs out of capacity, C++ doubles its size, so the vector needs ~log2(10000000) allocations to reach 10000000 elements. resize(), on the other hand, allocates memory only once. Second, push_back is a function call and has more calling overhead than operator []. To avoid the first problem, you could reserve memory for s first using the .reserve() member function; then you would only have the push_back calling overhead. Another option, if you don't know how many iterations you will have (in your case it was easy to calculate, but pretend it is not), is to resize the vector to a bigger size first and then shrink it to the final size after the loop, as I did. Shrinking a vector with resize() does not change its capacity; it only moves the .end() pointer.

    About 3: I'm not sure here, but I have seen places in my own research code where it made a huge difference. A division takes almost 30 cycles to calculate; a multiplication can be done in 1.

    Another point: you were hand-coding things that are better optimized in the standard library. For example, if you really want a second vector, then the last step could be something like:

    res.push_back(2);
    std::vector<int>::iterator pend = std::remove (s.begin(), s.end(), 0); 
    res.insert(res.begin()+1, s.begin(), pend);
    

    insert() is much faster than a loop with push_back because, as I said before, the latter will do ~log2(10000000) reallocations; insert(), on the other hand, will only allocate once. But even if you reserve memory first, insert() is faster than a hand-coded loop. The lemma is: the standard library is always better!

    Last, do not pass a vector by value; it is rule 1 of C++. Create a second vector, OK, but pass the first one as a reference argument and resize it to zero at the beginning of each call to reset it (resizing to zero maintains the old capacity!).

    If you follow this, your code will speed up by 25%-30% even with a second vector! ( I tested this)

    Last comment: I completely understand your points, but to be fair to C++ you could add an observation that it is still possible to speed up the code using pre-allocation (10% more on top of the 30% you gain by following the points in this comment). I say this because the other languages you tested don't offer this choice, and this is a great advantage of C++. Java is great because of its libraries; Python is great because it is easy and has awesome libraries; C++ is great because it allows you to control low-level things like memory. If you code C++ like you code Java, you may think that Java is almost as fast as C++, and that is really not the case. Or you may think that C is much faster, which is also not the case. When you are comparing languages, it is important to try to get the best out of each one. And note that I did not make any hard-core template code; I just changed your implementation a little bit.

  15. Here is my code with a second vector. It is already much faster than your implementation.

    #include<vector>
    #include<algorithm>
    #include<iostream>
    #include<cmath>
    
    using namespace std;
    
    void get_primes7( int n, std::vector<int>& res ) {
    
        if (n < 2) {
            res.resize(0);
            return;
        }
    
        if (n == 2) {
            res.resize(1);
            res[0] = 2;
            return;
        }
        
        res.resize(0);   
        std::vector<int> s;
        size_t j = 0;
    	
        s.resize( static_cast<int>(n/2.0) );
        for (int i = 3; i < n + 1; i += 2, ++j) s[j] = i;
        s.resize(j);
            
        /* Alternatively: */
        //s.reserve( static_cast<int>(n/2.0)  );
        //for (int i = 3; i < n + 1; i += 2) s.push_back(i);
    
        int mroot = sqrt(n);
        int half = static_cast<int>(s.size());
        int i = 0;
        int m = 3;
    
        while (m <= mroot) {
            if (s[i]) {
                int j = static_cast<int>(0.5*(m*m - 3));
                s[j] = 0;
                while (j < half) {
                    s[j] = 0;
                    j += m;
                }
            }
            i++;
            m = 2*i + 3;
        }
    
        res.push_back(2);
        std::vector<int>::iterator pend = std::remove (s.begin(), s.end(), 0);
        res.insert(res.begin()+1, s.begin(), pend);               
    }
    
    int main() {
        std::vector<int> res;
        for (int i = 1; i <= 10; ++i) {
            get_primes7(10000000, res);       
            std::cout << "Number of primes " << res.size() << "\n";
        }
        return 0;
    } 
    
    • But look: 10% of the time is lost in allocation.
      real 0m0.700s
      sys 0m0.088s

      That is the 10% you gain by pre-allocation, as I said :)
      best

    • Hey, all this was very enlightening. I’ve done some tests with GCC and here are the speed-up numbers compared to my original implementation (every next case includes the previous):
      * pass by reference: 4,4%
      * static_cast: 4,2% (no speed-up, some statistical error)
      * multiply instead of divide: 4,6% (no speed-up)
      * use std instead of loop: 5,3% (some more speed-up)
      * hint C++ by reserve() about the vector size: 13,0% (much faster)
      * pre-allocate memory by resize(): 16,1% (the fastest)

      By combining all your suggestions, I got at most a 16% speed-up. Great!

      I had a good talk with another C++ guru, and he also suggested that we should always hint at the expected memory usage with reserve(), and use resize() when possible. Which is exactly what you said. This is good practice, but I won’t include it in the blog results, mainly because it is somewhat specific to the algorithm used, and we are benchmarking a completely generic use-case. I’ll add a note about this on the page, though.

      Passing by reference is by far the most obvious optimization, and every C/C++ developer should use it whenever possible. I’ve updated my source code, even though this feature is not present in the script languages.

      • Nice study. It is always interesting to talk about C++. I talked to a lot of people to understand the language better.

        I got more speed-up, but I am using the Intel compiler, which implements better optimizations. With gcc you can try -ffast-math, because it usually makes a significant difference in performance. -ffast-math applies optimizations that are safe unless you get NaN, overflow, or underflow, or you really need the full 16 digits of double precision. The gcc people are kind of purist and don't make this the default in -O2, like Intel does. Because it is just a flag, I consider it a “trivial” way to optimize code.

        static_cast does not speed up the code, but it is the way C++ casts at compile time, and it is safer than a C-style cast. I included it to make the code look more like C++. Multiplying instead of dividing should only make a difference in a very inner loop (like a for loop inside another for loop), but it is a good practice anyway.

        Glad we had this discussion.
        Best

  16. Hi! You can get a 40% boost in the pure Java implementation by predefining the array size for big values. Something like this:

    List<Integer> res = new ArrayList<Integer>(n / 10);
    if (n < 2)
        return res;
    if (n == 2) {
        res.add(2);
        return res;
    }
    List<Integer> s = new ArrayList<Integer>(n / 10);

  17. Hi

    Golang:

    package main

    import (
        "fmt"
        "math"
    )

    func GetPrimes7(n int) []int {
        if n < 2 {
            return []int{}
        }
        if n == 2 {
            return []int{2}
        }

        s := []int{}
        for i := 3; i < n+1; i += 2 {
            s = append(s, i)
        }
        mroot := int(math.Sqrt(float64(n)))
        half := len(s)
        i := 0
        m := 3
        for m <= mroot {
            if s[i] > 0 {
                j := int(float64(m*m-3) * 0.5)
                s[j] = 0
                for j < half {
                    s[j] = 0
                    j += m
                }
            }
            i++
            m = 2*i + 3
        }
        r := []int{2}
        for _, value := range s {
            if value > 0 {
                r = append(r, value)
            }
        }
        return r
    }

    func main() {
        for i := 0; i < 10; i++ {
            fmt.Printf("Found %d prime numbers.\n", len(GetPrimes7(10000000)))
        }
    }

    real 0m1.894s
    user 0m1.789s
    sys 0m0.099s

    Hardware — MacBook Air 2011 1.6 GHz Intel Core i5

  18. Wow! I’ve been wanting to use node for a local project, but felt like I should use C++ for speed. This shows that it’s hardly slower, but development will be much quicker. Thanks!
