
Enthusiasm never stops



PHP non-interactive usage in a cron job

Using a PHP script in a crontab is fairly easy, as stated in the “Using PHP from the command line” documentation… until you start getting the following warning during execution:

No entry for terminal type “unknown”;
using dumb terminal settings.

The script works, but this nasty warning really bothers you.

Here is a sample crontab entry:

* * * * * root sudo -u www-data php -r 'echo "test";'

When executed, it prints the warning on STDERR.

Yes, I know I don’t need “sudo” here, but this was my initial usage pattern when I discovered the problem, and at first I suspected that “sudo” had gone crazy. Well, it wasn’t “sudo” that was to blame, but PHP.

Here is the fixed crontab entry:

* * * * * root sudo -u www-data TERM=dumb php -r 'echo "test";'

The issue was encountered on an Ubuntu 10.04 server. I thought crond usually sets $TERM to something… Anyway, problem solved.



Getting “500 Line too long (limit is 4096)” error in Perl

The error may also be “500 Line too long (limit is 8192)” but the problem is still the same – LWP or SOAP::Lite return this error when you try to POST or GET something very long.

The one to blame is actually Net::HTTP::Methods, which gets pulled in indirectly by LWP’s HTTP protocol handler.

It took me a few hours to get this resolved:

use LWP::Protocol::http; # to suppress the warning "possible typo" in the next statement
push(@LWP::Protocol::http::EXTRA_SOCK_OPTS, MaxLineLength => 0); # to remove the limit

Put the above code in your Perl HTTP client and you’re good to go!
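For example, a complete minimal client with the workaround applied could look like this (just a sketch – the URL and the overly long query string are made up):

#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;
use LWP::Protocol::http; # to suppress the warning "possible typo" in the next statement
push(@LWP::Protocol::http::EXTRA_SOCK_OPTS, MaxLineLength => 0); # to remove the limit

my $ua = LWP::UserAgent->new;
# a made-up request which involves very long lines; without the two lines
# above it could fail with "500 Line too long (limit is 4096)"
my $response = $ua->get('http://www.example.com/api?data=' . ('x' x 100_000));
print $response->status_line, "\n";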


Google App Engine – Datastore performance, and Memcache behavior

Ever since I started working with Google App Engine, two issues have bothered me a lot:

  1. Datastore performance – lots of people have already written about it (see links #1, #2, and #3). Currently, when working with small datasets, it’s far from comparable even with a slow MySQL database, and you may occasionally get internal errors, as well as increased latencies. I contacted Google and asked whether the Business customers of GAE, who pay for it, would get better Datastore performance. Here is the answer I got from Nick Johnson, a GAE developer:

    Business customers will receive paid support, which is prioritized, as well as the extra features we announced at I/O. System latency is not any different, however, as we try and make the system as fast as possible for all our users.

    So the bad news is that you cannot make the Datastore run faster, even if you pay.
    The good news is that we are all getting the same service in terms of speed, which is a good thing – when everybody is having difficulties, then the community will eventually find a solution.

  2. Memcache fairness – what happens if another website (on the same server) uses the Memcache service extensively, making the Memcache entries of my website expire too quickly due to memory pressure? Here is what Nick Johnson from Google replied:

    Memcache is segmented by application. Although there is some variation (so that apps that don’t use any memcache don’t take up usable space), every app is guaranteed a fair share of memcache space.

    Excellent system design, GAE engineers. Keep up the good work!

Update: Google App Engine engineers continue to do very good work indeed! You should take a look at the new features announced with the 1.3.6 release of GAE.



Beware of leading zeros in Bash numeric variables

Suppose you have some (user) value in a numeric variable with leading zeros. For example, you number something with zero-padded numbers consisting of 3 digits: 001, 002, 003, and so on. This label is assigned to a Bash variable, named $N.

As long as the numbers stay below 008, or as long as you use the variable only in text interpolations, you’re safe. For example, the following works just fine:

N=016
echo "Value: $N"
# prints "Value: 016"

However… 🙂
If you start using this variable in arithmetic, then you’re in trouble. Here is an example:

N=016
echo $((N + 2))
# prints 16, not the expected 18!
printf %d "$N"
# prints 14, not the expected 16!

You probably already see the pattern – “016” is treated not as a decimal number but as an octal one, because of the leading zero. This is explained in the man page of bash, section “ARITHMETIC EVALUATION” (aka “Shell Arithmetic”).

In order to force decimal representation and as a side effect also remove any leading zeros for a Bash variable, you need to treat it as follows:

N=016
N=$((10#$N)) # force decimal (base 10)
echo $((N + 2))
# result is 18, ok
printf %d "$N"
# result is 16, ok

Note also that there’s another caveat – forcing the number to decimal (base 10) doesn’t actually validate that it contains only [0-9] characters. Read the very last paragraph of the same man page section for details on how digits can also be represented by letters and symbols. My tests show that you can’t operate with invalid numbers in base 10, though I’m no expert here. To be on the safe side, if you don’t trust the data input, validate your numbers with a strict regular expression such as ^[0-9]+$.



Validator for the Model key_name property in Google App Engine datastore (Python)

The Google App Engine datastore provides convenient data modeling with Python. One important aspect is the validation of the data stored in a Model instance. Each data key-value is stored as a Property which is an attribute of a Model class.

While every Property can be validated automatically by specifying a “validator” function, there is no option for the Model key name to be validated automatically. Note that our own code can manually specify the value of the key name, so the key name can be considered user data and must be validated. The key name is, by the way, the only unique index constraint supported by the Google datastore that can be specified manually – similar to the “primary key” in relational databases.

Here is my first version of a validation function for the Model’s key name:

from google.appengine.ext import db
import re

def ModelKeyNameValidator(self, regexp_string, *args, **kwargs):
	gotKey = None
	className = self.__class__.__name__

	if len(args) >= 2:
		if gotKey: raise Exception('Found key for second time for Model ' + className)
		gotKey = 'args'
		k = args[1] # key_name given as an unnamed argument
	if 'key' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + className)
		gotKey = 'Key'
		k = kwargs['key'].name() # key_name given as Key instance
	if 'key_name' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + className)
		gotKey = 'key_name'
		k = kwargs['key_name'] # key_name given as a keyword argument

	if not gotKey:
		raise Exception('No key found for Model ' + className)

	id = '%s.key_name(%s)' % (self.__class__.__name__, gotKey)
	if (not re.search(regexp_string, k)):
		raise ValueError('(%s) Value "%s" is invalid. It must match the regexp "%s"' % (id, k, regexp_string))

class ClubDB(db.Model):
	# key = url
	def __init__(self, *args, **kwargs):
		ModelKeyNameValidator(self, '^[a-z0-9-]{2,32}$', *args, **kwargs)
		super(self.__class__, self).__init__(*args, **kwargs)

	name = db.StringProperty(required = True)

As you can see, the proposed solution is not versatile enough – it requires you to copy and alter the ModelKeyNameValidator() function again and again for every new validation type. I strictly follow the Don’t Repeat Yourself principle in programming, so after much Googling and struggling with Python, I arrived at the following solution, which I actually use in my projects:

from google.appengine.ext import db
import re

# this helper was referenced but missing in the original listing
def string_type_validator(v):
	if not isinstance(v, basestring):
		raise ValueError('Value "%s" is not a string' % (v,))

def re_validator(id, regexp_string):
	def validator(v):
		string_type_validator(v)
		if (not re.search(regexp_string, v)):
			raise ValueError('(%s) Value "%s" is invalid. It must match the regexp "%s"' % (id, v, regexp_string))
	return validator

def length_validator(id, minlen, maxlen):
	def validator(v):
		string_type_validator(v)
		if minlen is not None and len(v) < minlen:
			raise ValueError('(%s) Value "%s" is invalid. It must be more than %s characters' % (id, v, minlen))
		if maxlen is not None and len(v) > maxlen:
			raise ValueError('(%s) Value "%s" is invalid. It must be less than %s characters' % (id, v, maxlen))
	return validator

def ModelKeyValidator(v, self, *args, **kwargs):
	gotKey = None

	if len(args) >= 2:
		if gotKey: raise Exception('Found key for second time for Model ' + self.__class__.__name__)
		gotKey = 'args'
		k = args[1] # key_name given as unnamed argument
	if 'key' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + self.__class__.__name__)
		gotKey = 'Key'
		k = kwargs['key'].name()
	if 'key_name' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + self.__class__.__name__)
		gotKey = 'key_name'
		k = kwargs['key_name']

	if not gotKey:
		raise Exception('No key found for Model ' + self.__class__.__name__)

	v.execute('%s.key_name(%s)' % (self.__class__.__name__, gotKey), k) # validate the key now

class DelayedValidator:
	''' Validator class which allows you to specify the "id" dynamically on validation call '''
	def __init__(self, v, *args): # specify the validation function and its arguments
		self.validatorArgs = args
		self.validatorFunction = v

	def execute(self, id, value):
		if not isinstance(id, basestring):
			raise Exception('No valid ID specified for the Validator object')
		func = self.validatorFunction(id, *(self.validatorArgs)) # get the validator function
		func(value) # do the validation

class ClubDB(db.Model):
	# key = url
	def __init__(self, *args, **kwargs):
		ModelKeyValidator(DelayedValidator(re_validator, '^[a-z0-9-]{2,32}$'), self, *args, **kwargs)
		super(self.__class__, self).__init__(*args, **kwargs)

	name = db.StringProperty(
		required = True,
		validator = length_validator('ClubDB.name', 1, None))

You probably noticed that in the second example I also added a validator for the “name” property. Note that the re_validator() and length_validator() functions can be re-used. Furthermore, thanks to the DelayedValidator class, which accepts a validator function and its arguments as constructor arguments, the ModelKeyValidator() function can be re-used without any modifications too.

P.S. It seems that all “validator” functions are executed every time a Model class is instantiated. This means that no matter whether you are updating/creating the data object or simply reading it from the datastore, the assigned values are always validated. This surely wastes some CPU cycles, but for now I have no idea how to easily circumvent it.

Disclaimer: I’m new to Python and Google App Engine. But they seem fun! 🙂 Sorry for the long lines…



C++ vs. Python vs. Perl vs. PHP performance benchmark (part #2)

This time we will focus on the startup time. The process start time is important if your processes are not persistent. If you are using FastCGI, mod_perl, mod_php, or mod_python, then these statistics are not so important to you. However, if you are spawning many processes which do something small and live for a very short time, then you should consider the CPU resources which get wasted while the script interpreter is being initialized.

The benchmarked scripts do only one thing – print “Hello, world” on the standard output. They do not include any additional modules in their source code – this may or may not match your use-case. Still, scripting languages have pretty many built-in functions, and for simple tasks you may never need to include other modules.

Here are the benchmark results:

Language                       | User   | System | Total  | Slower than C++ | Slower than previous
C++ (with or w/o optimization) | 2.568  | 3.536  | 6.051  | –               | –
Perl                           | 12.561 | 6.096  | 18.723 | 209%            | 209%
PHP (w/o php.ini)              | 20.473 | 13.877 | 34.918 | 477%            | 86%
Python                         | 27.014 | 11.881 | 39.318 | 550%            | 13%
Python + Psyco                 | 32.986 | 14.845 | 48.132 | 695%            | 22%

(CPU times are in seconds, accumulated over all 3000 invocations.)

The clear winner among the script languages this time is… Perl. 🙂

All scripts were invoked 3000 times using the following Bash loop:

time ( i=3000 ; while [ "$i" -gt 0 ]; do $CMD >/dev/null ; i=$(($i-1)); done )

All tests were done on a Kubuntu Lucid box. The versions of the software packages used follow:

  • g++ (GNU project C and C++ compiler) 4.4.3
  • Python 2.6.5
  • Python Psyco 1.6 (1ubuntu2)
  • Perl 5.10.1
  • PHP 5.3.2 (1ubuntu4.2 with Suhosin-Patch), Zend Engine 2.3.0

The C++ implementation follows:

#include <iostream>
using namespace std;

int main() {
	cout << "Hello, world!\n";
	return 0;
}

The Perl implementation follows:

use strict;
use warnings;

print "Hello, world!\n";

The PHP implementation follows:

<?php
echo "Hello, world!\n";

The Python implementation follows:

#import psyco
#psyco.full()

print 'Hello, world!'

Update (Jan/14/2012): Copied the test environment info here.



Speed up RRDtool database manipulations via RRDs (Perl)

Use case
You are doing a lot of data operations on your RRD files (create, update, fetch, last), and every update is done by a separate Perl process which lives a very short time – the process is launched, it updates or reads the data, does something else, and then exits.
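For illustration, such a short-lived updater process could look roughly like this (a minimal sketch – the RRD file name, data source and values are made up):

#!/usr/bin/perl
use strict;
use warnings;
use RRDs; # loading this library is the expensive part discussed below

my $rrd = '/tmp/example.rrd'; # made-up file name

if (! -e $rrd) {
	RRDs::create($rrd,
		'--step', '300',          # expect one sample every 5 minutes
		'DS:load:GAUGE:600:0:U',  # a single made-up data source
		'RRA:AVERAGE:0.5:1:288'); # keep one day of 5-minute averages
	die('create failed: ' . RRDs::error) if RRDs::error;
}

RRDs::update($rrd, 'N:0.42'); # "N" means "now"
die('update failed: ' . RRDs::error) if RRDs::error;

# ...the process then exits, and the library load cost is paid again next time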

The problem
If you are using RRDtool and Perl as described, you have surely noticed that running many of these processes wastes a lot of CPU resources. The question is – can we do some performance optimizations and lessen the performance hit of loading the RRDs library into Perl? We know that frequently launching Perl itself is quite expensive, but after all, if we chose to work with Perl, this is a price we should be ready to pay.

The RRDtool shared library is a monolithic piece of code which provides ALL functions of the RRDtool suite – data manipulation, graphics, and the import/export tools. The last two components bring huge dependencies on other shared libraries. The library from RRDtool version 1.4.4 depends on 34 other libraries on my Linux box! This must add to the loading time of the RRDtool library into Perl.

Resolution and benchmarks
In order to prove my theory (actually, it was more a theory of zImage, and I just followed, enhanced and tried it), I commented out the implementation of the “graphics” and “import/export tools” modules from the source code of RRDtool. Then I re-compiled the library and did some performance benchmarks. I also re-implemented the RRDs.pm module by replacing the DynaLoader module with the XSLoader one. This made no difference in performance whatsoever. The re-compiled RRD library depends on only 4 other libraries – linux-gate.so.1, libm.so.6, libc.so.6, and /lib/ld-linux.so.2. I think this is the most we can cut down. 🙂
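For reference, the DynaLoader to XSLoader swap in RRDs.pm boils down to something like this (a simplified sketch, not the actual module source):

# before - the classic DynaLoader bootstrap:
package RRDs;
require DynaLoader;
our @ISA = qw(DynaLoader);
bootstrap RRDs;

# after - the lighter XSLoader alternative:
package RRDs;
use XSLoader;
XSLoader::load('RRDs');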

So here are the benchmark results. They show the accumulated time for 1000 invocations of the Perl interpreter with three different configurations:

  • Only Perl (baseline): 5.454s.
  • With RRDs, no graphics or import/export functions: 9.744s (+4.290s) +78%.
  • With standard RRDs: 11.647s (+6.192s) +113%.

As you can see, you can make Perl + RRDs start noticeably faster – 9.744s vs. 11.647s for 1000 invocations, i.e. about 16% less CPU time. The speed-up for loading RRDs itself (the overhead on top of bare Perl) is 44%.


Here are the commands I used for the benchmarks:

  • Only Perl (baseline): time ( i=1000 ; while [ "$i" -gt 0 ]; do perl -Mwarnings -Mstrict -e '' ; i=$(($i-1)); done )
  • Perl + RRDs: time ( i=1000 ; while [ "$i" -gt 0 ]; do perl -Mwarnings -Mstrict -MRRDs -e '' ; i=$(($i-1)); done )



C++ vs. Python vs. Perl vs. PHP performance benchmark

Update: There are newer benchmark results.


This all began when a colleague of mine stated that Python was damn slow at maths. That really astonished me and made me check it out, since my father once told me he was very satisfied with Python, as it is very maths-oriented.

The benchmarks here do not try to be complete – they show the performance of the languages in only one aspect: loops, dynamic arrays with numbers, and basic math operations.

Out of curiosity, Python was also benchmarked with and without the Psyco Python extension (now obsoleted by PyPy), which people say could greatly speed up the execution of any Python code without any modifications.

Here are the benchmark results:

Language                  | User   | System | Total  | Slower than C++ | Slower than previous | Language version | Source code
C++ (optimized with -O2)  | 1.520  | 0.188  | 1.708  | –               | –                    | g++ 4.5.2        | link
Java (non-std lib)        | 2.446  | 0.150  | 2.596  | 52%             | 52%                  | 1.6.0_26         | link
C++ (not optimized)       | 3.208  | 0.184  | 3.392  | 99%             | 31%                  | g++ 4.5.2        | link
Javascript (SpiderMonkey) | see comment (SpiderMonkey seems as fast as C++ on Windows)
Javascript (nodejs)       | 4.068  | 0.544  | 4.612  | 170%            | 36%                  | 0.8.8            | link
Java                      | 8.521  | 0.192  | 8.713  | 410%            | 150%                 | 1.6.0_26         | link
Python + Psyco            | 13.305 | 0.152  | 13.457 | 688%            | 54%                  | 2.6.6            | link
Ruby                      | see comment (Ruby seems 35% faster than standard Python)
Python                    | 27.886 | 0.168  | 28.054 | 1543%           | 108%                 | 2.7.1            | link
Perl                      | 41.671 | 0.100  | 41.771 | 2346%           | 49%                  | 5.10.1           | link
PHP 5.4                   | see roga's blog results (PHP 5.4 seems 33% faster than PHP 5.3)
PHP 5.3                   | 94.622 | 0.364  | 94.986 | 5461%           | 127%                 | 5.3.5            | link

(CPU times are in seconds. The “link” entries point to the source code, available at the download URL below.)

The clear winner among the script languages is… Python. 🙂

NodeJS JavaScript is pretty fast too, but internally it works more like a compiled language. See the comments below.

Please read the discussion about Java which I had with Isaac Gouy. He accused me of not comparing what I say I am comparing, and also of not wanting to show how slow or how fast the Java example program can be. You deserve the whole story, so please read it if you are interested in Java.

Both PHP and Python take advantage of their built-in range() function, since they have one. This speeds up PHP by 5%, and Python by 20%.

The times include the interpretation/parsing phase for each language, but it’s so small that its significance is negligible. The math function is called 10 times, in order to get more reliable results. All scripts use the very same algorithm to calculate the prime numbers in a given range. The correctness of the implementation is not so important, as we just want to check how fast the languages perform. The original Python algorithm was taken from http://www.daniweb.com/code/snippet216871.html.
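To give an idea of the workload, the Perl variant of such a prime-calculation loop could look roughly like this (my own sketch, not the actual benchmarked source – see the download link below for the real thing):

#!/usr/bin/perl
use strict;
use warnings;

sub get_primes { # loops, a dynamic array, basic math - the benchmarked aspects
	my ($max) = @_;
	my @primes;
	CANDIDATE: for my $i (2 .. $max) {
		for my $p (@primes) {
			last if $p * $p > $i;           # no divisor can exceed sqrt($i)
			next CANDIDATE if $i % $p == 0; # $i is not a prime
		}
		push(@primes, $i);
	}
	return \@primes;
}

for (1 .. 10) { # the math function is called 10 times, as described above
	my $primes = get_primes(100_000); # made-up range
	print scalar(@$primes), "\n";     # prints 9592
}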

The tests were run on an Ubuntu Linux machine.

You can download the source codes, an Excel results sheet, and the benchmark batch script at:
http://www.famzah.net/download/langs-performance/


Update (Jul/24/2010): Added the C++ optimized values.
Update (Aug/02/2010): Added a link to the benchmarks, part #2.
Update (Mar/31/2011): Using range() in PHP improves performance by 5%.
Update (Jan/14/2012): Re-organized the results summary table and the page. Added Java.
Update (Apr/02/2012): Added a link to PHP 5.4 vs. PHP 5.3 benchmarks.
Update (May/29/2012): Added the results for Java using a non-standard library.
Update (Jun/25/2012): Made the discussion about Java public, as well as added a note that range() is used for PHP and Python.
Update (Aug/31/2012): Updated benchmarks for the latest node.js.
Update (Oct/24/2012): Added the results for SpiderMonkey JavaScript.
Update (Jan/11/2013): Added the results for Ruby vs. Python and Nodejs.



Filter a character sequence leaving only valid UTF-8 characters

This is my implementation of a Perl regular expression which sanitizes a multi-byte character sequence by filtering only the valid UTF-8 characters in it. Any non-UTF-8 character sequences are deleted and in the end you get a clean, valid UTF-8 multi-byte string.

Note that this works only for a subset of the UTF-8 alphabet, i.e. this is not a general filtering regular expression – it keeps only standard ASCII and the Cyrillic UTF-8 characters. You can easily extend the regular expression and add another UTF-8 subset.

Let’s get to the requirements:

  • Standard ASCII symbols: As described at the Wikipedia UTF-8 page, the ASCII characters from Hex 00-7F are encoded without modification in a UTF-8 sequence, as they are “Single-byte encoding (compatible with US-ASCII)”. Therefore, any character between Hex 00-7F is valid in a UTF-8 sequence. For our current example, though, we will keep only certain ASCII symbols – namely a few of the control ones, plus the printable ones:
    • ASCII control symbols: \t -> Hex 09, \n -> Hex 0A, \r -> Hex 0D.
    • Printable single-byte ASCII symbols: Hex 20-7E.
  • Cyrillic multi-byte UTF-8 characters, only the Russian/Bulgarian ones: If you open the Unicode/UTF-8 character table, and navigate to the “U+0400…U+04FF: Cyrillic” block, you can visually choose which characters you want to allow in your UTF-8 sequence by looking in the “character” column. In my case, I want to allow the characters “А”, “Б”, “В”, “Г” and so on until “ю”, “я”. If you look at the “UTF-8 (hex.)” column, you will notice that the range of these Cyrillic characters is from Hex d0 90 to Hex d0 bf, and from Hex d1 80 to Hex d1 8f. Yes, two ranges.

Therefore, our regular expression has to allow only the following sequences:

  • Single-byte, standard ASCII: \t, \n, \r, and x20-x7E.
  • Multi-byte, Cyrillic UTF-8: xD090-xD0BF, and xD180-xD18F.

Once you have established these rules, it’s very easy to construct the regular expression:

$my_string =~ s/.*?((?:[\t\n\r\x20-\x7E])+|(?:\xD0[\x90-\xBF])+|(?:\xD1[\x80-\x8F])+|).*?/$1/sg;
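Here is a quick demonstration of the regular expression in action (my own example – the input string is made up):

#!/usr/bin/perl
use strict;
use warnings;

# printable ASCII, then a stray invalid byte 0xFF, then the
# Cyrillic letter "ж" (UTF-8: 0xD0 0xB6)
my $my_string = "hi \xFF\xD0\xB6";

$my_string =~ s/.*?((?:[\t\n\r\x20-\x7E])+|(?:\xD0[\x90-\xBF])+|(?:\xD1[\x80-\x8F])+|).*?/$1/sg;

print $my_string, "\n"; # prints "hi ж" - the 0xFF byte is gone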


Update (Nov/19/2010):

If you want to allow some more characters, for example, the German umlaut letters “ä”, “ö”, “ü”, you have to include the following sequence too:

  • Multi-byte, UTF-8 Latin letters with diaeresis, tilde, etc: \xC380-\xC3BF.

The new UTF-8 filtering regular expression then becomes the following:

$my_string =~ s/.*?((?:[\t\n\r\x20-\x7E])+|(?:\xD0[\x90-\xBF])+|(?:\xD1[\x80-\x8F])+|(?:\xC3[\x80-\xBF])+|).*?/$1/sg;


If you are wondering why I would allow only certain ASCII control characters and only the printable ASCII characters, the answer is – because of the XML standard. As the XML W3C Recommendations state, only certain Hex characters and character sequences are valid in an XML document, even as HTML entities: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].

Libexpat is very strict about what you feed it as input, and if your input isn’t a valid UTF-8 sequence, you will end up with the error message “XML parse error: not well-formed (invalid token)”.
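To see this failure mode for yourself, here is a tiny made-up example which feeds a non-UTF-8 byte to libexpat via the XML::Parser Perl module:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new;
# the stray 0xFF byte makes the document an invalid UTF-8 sequence
eval { $parser->parse("<doc>bad \xFF byte</doc>") };
print $@ if $@; # "not well-formed (invalid token) at line 1, ..."

Sanitize the input with the regular expression above before parsing, and this particular error goes away.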



Migrate your TWiki to Google Sites (using Google Sites API and Perl)

If you want to transfer your existing TWiki webs to Google Sites, you can do it automatically with the power of the Google Sites API and Perl.

You can download the Perl script, which I used to export my TWiki webs and then import them in Google Sites, at the following page: http://www.famzah.net/download/google-api/twiki2googlesites.pl

Note that this is in no way a complete migration solution, but you can use it as a demonstration/base on how to interact with the Google Sites API and other Google Data API features using Perl.

Now you know that you can use Perl to interact with Google APIs. Go build your own scripts!
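As a starting point, here is roughly what the authentication part looks like in Perl. This is a sketch from my memory of the now-deprecated ClientLogin/GData protocol – the e-mail, password and site name are made up, so treat the details as assumptions and double-check them against the script above:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# "jotspot" was the historical ClientLogin service code for Google Sites
my $login = $ua->post('https://www.google.com/accounts/ClientLogin', {
	accountType => 'HOSTED_OR_GOOGLE',
	Email       => 'you@example.com', # made-up credentials
	Passwd      => 'your-password',
	service     => 'jotspot',
	source      => 'example-twiki2googlesites',
});
die($login->status_line) unless $login->is_success;
my ($auth) = $login->content =~ /^Auth=(.+)$/m;

# fetch the content feed of a site ("site" is the domain for regular accounts)
my $feed = $ua->get('https://sites.google.com/feeds/content/site/yoursitename', # made-up site name
	'Authorization' => "GoogleLogin auth=$auth",
	'GData-Version' => '1.4');
die($feed->status_line) unless $feed->is_success;
print $feed->content;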

Update: Google released Python command line tools for the Google Data APIs (GoogleCL). They seem promising and very easy to use for simple automation tasks.

P.S. If you’re limited on time and your TWiki has a relatively small page count, chances are that migrating it manually with copy/paste will be faster than writing your own migration scripts. Believe me. 🙂