Google App Engine – Datastore performance, and Memcache behavior

August 31, 2010

Ever since I started working with Google App Engine, two issues have bothered me a lot:

  1. Datastore performance – lots of people have already written about it (see links #1, #2, and #3). Currently, when working with small datasets, it’s far from being comparable even with a slow MySQL database, and you may also occasionally get internal errors, as well as increased latencies. I contacted Google about this, and asked them if the Business customers of GAE who pay for it would get better Datastore performance. Here is what I got as an answer from Nick Johnson, a GAE developer:

    Business customers will receive paid support, which is prioritized, as well as the extra features we announced at I/O. System latency is not any different, however, as we try and make the system as fast as possible for all our users.

    So the bad news is that you cannot make the Datastore run faster, even if you pay.
    The good news is that we are all getting the same service in terms of speed, which is a good thing – when everybody is having difficulties, then the community will eventually find a solution.

  2. Memcache fairness – what happens if another website (on the same server) uses the Memcache service extensively, thus making the Memcache entries of my website expire too quickly because of the memory pressure? Here is what Nick Johnson from Google replied:

    Memcache is segmented by application. Although there is some variation (so that apps that don’t use any memcache don’t take up usable space), every app is guaranteed a fair share of memcache space.

    Excellent system design, GAE engineers. Keep up the good work!

Update: Google App Engine engineers continue to do very good work indeed! You should take a look at the new features announced with the 1.3.6 release of GAE.


Beware of leading zeros in Bash numeric variables

August 7, 2010

Suppose you have some (user) value in a numeric variable with leading zeros. For example, you number something with zero-padded numbers consisting of 3 digits: 001, 002, 003, and so on. This label is assigned to a Bash variable, named $N.

As long as the numbers stay below 008, or as long as you use the variable only in text interpolation, you’re safe. For example, the following works just fine:

N=016
echo "Value: $N"
# result is "Value: 016"

However… :)
If you start using this variable as a numeric variable in arithmetic, then you’re in trouble. Here is an example:

N=016
echo $((N + 2))
# result is 16, not 18, as expected!
printf %d "$N"
# result is 14, not 16, as expected!

You probably already see the pattern: “016” is treated not as a decimal number but as an octal one, because of the leading zero. This is explained in the man page of bash, section “ARITHMETIC EVALUATION” (a.k.a. “Shell Arithmetic”).

In order to force decimal representation and as a side effect also remove any leading zeros for a Bash variable, you need to treat it as follows:

N=016
N=$((10#$N)) # force decimal (base 10)
echo $((N + 2))
# result is 18, ok
printf %d "$N"
# result is 16, ok

Note also that there’s another caveat – forcing the number to decimal base 10 doesn’t actually validate that it contains only [0-9] characters. Read the very last paragraph of the man page of bash, section “ARITHMETIC EVALUATION” (aka. “Shell Arithmetic”), for more details on how digits can be represented by letters and symbols. My tests however show that you can’t operate with invalid numbers in base 10, though I’m no expert here. In order to be on the safe side, I would suggest that you validate your numbers with a strict regular expression, just in case, and if you don’t trust the data input.
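The two readings of “016”, and the strict validation suggested above, can also be illustrated outside Bash. The following is only a minimal Python 3 sketch of the idea, not Bash code; the function name parse_padded_decimal() is hypothetical and does not come from any library.

```python
import re

# Accept only plain decimal digits; leading zeros are allowed.
_DECIMAL_RE = re.compile(r'[0-9]+')

def parse_padded_decimal(text):
    """Return the base-10 value of a zero-padded label such as "016"."""
    if not _DECIMAL_RE.fullmatch(text):
        raise ValueError('not a plain decimal number: %r' % (text,))
    return int(text, 10)  # explicit base 10, similar in spirit to 10#$N in Bash

print(parse_padded_decimal('016') + 2)  # decimal reading of the label, plus 2
print(int('016', 8))                    # octal reading, which is what Bash arithmetic uses
```

Validating first and converting second keeps the two concerns apart: the regular expression rejects anything that is not made of digits, and the explicit base removes the octal ambiguity.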




Validator for the Model key_name property in Google App Engine datastore (Python)

August 4, 2010

The Google App Engine datastore provides convenient data modeling with Python. One important aspect is the validation of the data stored in a Model instance. Each data key-value is stored as a Property which is an attribute of a Model class.

While every Property can be validated automatically by specifying a “validator” function, there is no option for the Model key name to be validated automatically. Note that our code can specify the value of the key name manually, so the key name can be considered user data and must be validated. By the way, the key name is the only unique index constraint supported by the Google datastore, similar to the “primary key” in relational databases, and it can be specified manually.

Here is my version for a validation function for the Model’s key name:

from google.appengine.ext import db
import re

def ModelKeyNameValidator(self, regexp_string, *args, **kwargs):
	gotKey = None
	className = self.__class__.__name__

	if len(args) >= 2:
		if gotKey: raise Exception('Found key for second time for Model ' + className)
		gotKey = 'args'
		k = args[1] # key_name given as an unnamed argument
	if 'key' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + className)
		gotKey = 'Key'
		k = kwargs['key'].name() # key_name given as Key instance
	if 'key_name' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + className)
		gotKey = 'key_name'
		k = kwargs['key_name'] # key_name given as a keyword argument

	if not gotKey:
		raise Exception('No key found for Model ' + className)

	id = '%s.key_name(%s)' % (self.__class__.__name__, gotKey)
	if (not re.search(regexp_string, k)):
		raise ValueError('(%s) Value "%s" is invalid. It must match the regexp "%s"' % (id, k, regexp_string))

class ClubDB(db.Model):
	# key = url
	def __init__(self, *args, **kwargs):
		ModelKeyNameValidator(self, '^[a-z0-9-]{2,32}$', *args, **kwargs)
		super(ClubDB, self).__init__(*args, **kwargs) # name the class explicitly; super(self.__class__, ...) recurses forever in subclasses

	name = db.StringProperty(required = True)

As you can see, the proposed solution is not versatile enough, and requires you to copy and alter the ModelKeyNameValidator() function again and again for every new validation type. I strictly follow the Don’t Repeat Yourself principle in programming, so after much Googling and struggling with Python, I got to the following solution which I actually use in my projects:

from google.appengine.ext import db
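import re

# NOTE: string_type_validator() is called below but is not shown in the original
# post; this minimal version is an assumption.
def string_type_validator(v):
	if not isinstance(v, basestring):
		raise ValueError('Value "%s" is invalid. It must be a string' % (v,))
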
def re_validator(id, regexp_string):
	def validator(v):
		string_type_validator(v)
		if (not re.search(regexp_string, v)):
			raise ValueError('(%s) Value "%s" is invalid. It must match the regexp "%s"' % (id, v, regexp_string))
	return validator

def length_validator(id, minlen, maxlen):
	def validator(v):
		string_type_validator(v)
		if minlen is not None and len(v) < minlen:
			raise ValueError('(%s) Value "%s" is invalid. It must be at least %s characters long' % (id, v, minlen))
		if maxlen is not None and len(v) > maxlen:
			raise ValueError('(%s) Value "%s" is invalid. It must be at most %s characters long' % (id, v, maxlen))
	return validator

def ModelKeyValidator(v, self, *args, **kwargs):
	gotKey = None

	if len(args) >= 2:
		if gotKey: raise Exception('Found key for second time for Model ' + self.__class__.__name__)
		gotKey = 'args'
		k = args[1] # key_name given as unnamed argument
	if 'key' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + self.__class__.__name__)
		gotKey = 'Key'
		k = kwargs['key'].name()
	if 'key_name' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + self.__class__.__name__)
		gotKey = 'key_name'
		k = kwargs['key_name']

	if not gotKey:
		raise Exception('No key found for Model ' + self.__class__.__name__)

	v.execute('%s.key_name(%s)' % (self.__class__.__name__, gotKey), k) # validate the key now

class DelayedValidator:
	''' Validator class which allows you to specify the "id" dynamically on validation call '''
	def __init__(self, v, *args): # specify the validation function and its arguments
		self.validatorArgs = args
		self.validatorFunction = v

	def execute(self, id, value):
		if not isinstance(id, basestring):
			raise Exception('No valid ID specified for the Validator object')
		func = self.validatorFunction(id, *(self.validatorArgs)) # get the validator function
		func(value) # do the validation

class ClubDB(db.Model):
	# key = url
	def __init__(self, *args, **kwargs):
		ModelKeyValidator(DelayedValidator(re_validator, '^[a-z0-9-]{2,32}$'), self, *args, **kwargs)
		super(ClubDB, self).__init__(*args, **kwargs) # name the class explicitly; super(self.__class__, ...) recurses forever in subclasses

	name = db.StringProperty(
		required = True,
		validator = length_validator('ClubDB.name', 1, None))

You probably noticed that in the second example I also added a validator for the “name” property. Note that the re_validator() and length_validator() functions can be re-used. Furthermore, thanks to the DelayedValidator class, which accepts a validator function and its arguments as constructor arguments, the ModelKeyValidator() function can be re-used without any modifications too.
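To make the “delayed” part of the pattern easier to see, here is a stripped-down sketch that runs without App Engine. It is an illustration under assumptions, not the code used in the projects mentioned above: it targets Python 3 (so it checks for str rather than basestring), and the names label and factory are chosen here for readability.

```python
import re

def re_validator(label, regexp_string):
    """Build a validator which requires the value to match the regexp."""
    pattern = re.compile(regexp_string)
    def validator(value):
        if not isinstance(value, str) or not pattern.search(value):
            raise ValueError('(%s) Value %r is invalid. It must match the regexp "%s"'
                             % (label, value, regexp_string))
    return validator

class DelayedValidator(object):
    """Stores a validator factory and its arguments; the label is supplied later."""
    def __init__(self, factory, *args):
        self.factory = factory
        self.args = args

    def execute(self, label, value):
        # Only now, when the label is known, is the real validator built and run.
        self.factory(label, *self.args)(value)

key_check = DelayedValidator(re_validator, r'^[a-z0-9-]{2,32}$')
key_check.execute('ClubDB.key_name(key_name)', 'my-club')  # valid, passes silently
try:
    key_check.execute('ClubDB.key_name(key_name)', 'Bad Key!')
except ValueError as error:
    print(error)
```

The factory is stored without a label, so the same DelayedValidator object can be handed to any model and still produce error messages that name the model it failed in.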

P.S. It seems that all “validator” functions are executed every time a Model class is being instantiated. This means that no matter if you are updating/creating the data object, or you are simply reading it from the datastore, the assigned values are always validated. This surely wastes some CPU cycles, but for now I have no idea how to easily circumvent this.

Disclaimer: I’m new to Python and Google App Engine. But they seem fun! :) Sorry for the long lines…




C++ vs. Python vs. Perl vs. PHP performance benchmark (part #2)

August 2, 2010

This time we will focus on the startup time. The process start time is important if your processes are not persistent. If you are using FastCGI, mod_perl, mod_php, or mod_python, then these statistics are not so important to you. However, if you are spawning many processes which do something small and live for a very short time, then you should consider the CPU resources which get wasted while the script interpreter is being initialized.

The benchmarked scripts do only one thing – say “Hello, world” on the standard output. They do not include any additional modules in their source code, which may or may not match your use case. Scripting languages often have so many built-in functions, though, that simple tasks never need other modules.

Here are the benchmark results:

Language                          CPU time (seconds)          Slower than
                                  User     System   Total     C++     previous
C++ (with or w/o optimization)     2.568    3.536    6.051    -       -
Perl                              12.561    6.096   18.723    209%    209%
PHP (w/o php.ini)                 20.473   13.877   34.918    477%    86%
Python                            27.014   11.881   39.318    550%    13%
Python + Psyco                    32.986   14.845   48.132    695%    22%

The clear winner among the scripting languages this time is… Perl. :)

All scripts were invoked 3000 times using the following Bash loop:

time ( i=3000 ; while [ "$i" -gt 0 ]; do $CMD >/dev/null ; i=$(($i-1)); done )

For complete information about the test environment, please review the previous article.
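A similar measurement can be scripted instead of typed as a shell loop. The sketch below is written in Python 3 and is only an approximation of the Bash loop above, not the method used for the published numbers; the function name startup_cpu_time() is hypothetical, and the children CPU counters of os.times() are only meaningful on Unix-like systems.

```python
import os
import subprocess
import sys

def startup_cpu_time(cmd, runs):
    """Run cmd `runs` times; return (user, system) CPU seconds used by the child processes."""
    before = os.times()
    for _ in range(runs):
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    after = os.times()
    return (after.children_user - before.children_user,
            after.children_system - before.children_system)

# Example: start the Python interpreter itself 20 times with an empty program.
user, system = startup_cpu_time([sys.executable, '-c', 'pass'], 20)
print('user %.3fs, system %.3fs' % (user, system))
```

The measuring script adds a little overhead of its own for spawning each child, just as the Bash loop does, so the numbers are best compared only against other runs made the same way.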


The C++ implementation follows:

#include <iostream>
using namespace std;

int main() {
	cout << "Hello, world!\n";
	return 0;
}

The Perl implementation follows:

use strict;
use warnings;

print "Hello, world!\n";

The PHP implementation follows:

<?php
echo "Hello, world!\n";

The Python implementation follows:

#import psyco
#psyco.full()

print 'Hello, world!'

Speed up RRDtool database manipulations via RRDs (Perl)

August 1, 2010

Use case
You are doing a lot of data operations on your RRD files (create, update, fetch, last), and every update is done by a separate Perl process which lives a very short time – the process is launched, it updates or reads the data, does something else, and then exits.

The problem
If you are using RRDtool and Perl as described, you surely have noticed that running many of these processes wastes a lot of CPU resources. The question is – can we do some performance optimizations, and lessen the performance hit of loading the RRDs library into Perl? We know that frequently launching Perl itself is quite expensive, but after all, if we chose to work with Perl, this is a price we should be ready to pay.

The RRDtool shared library is a monolithic piece of code which provides ALL functions of the RRDtool suite – data manipulation, graphics and import/export tools. The last two components bring in huge dependencies on other shared libraries. The library from RRDtool version 1.4.4 depends on 34 other libraries on my Linux box! This must add to the loading time of the RRDtool library into Perl.

Resolution and benchmarks
In order to prove my theory (actually, it was more a theory of zImage, and I just followed, enhanced and tried it), I commented out the implementation of the “graphics” and “import/export tools” modules from the source code of RRDtool. Then I re-compiled the library and did some performance benchmarks. I also re-implemented the RRDs.pm module by replacing the DynaLoader module with the XSLoader one. This made no difference in performance whatsoever. The re-compiled RRD library depends on only 4 other libraries – linux-gate.so.1, libm.so.6, libc.so.6, and /lib/ld-linux.so.2. I think this is the most we can cut down. :)

So here are the benchmark results. They show the accumulated time for 1000 invocations of the Perl interpreter with three different configurations:

  • Only Perl (baseline): 5.454s.
  • With RRDs, no graphics or import/export functions: 9.744s (+4.290s) +78%.
  • With standard RRDs: 11.647s (+6.192s) +113%.

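As you can see, the startup overhead of Perl + RRDs drops from +113% to +78% over the Perl baseline, that is, by 35 percentage points. Looking at RRDs alone, loading the standard library (6.192s) takes 44% longer than loading the stripped one (4.290s).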
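The percentages can be re-derived from the three timings listed above. The following Python 3 snippet is just a worked calculation; the helper name percent_more() is made up for this illustration.

```python
baseline = 5.454   # only Perl
stripped = 9.744   # Perl + RRDs without graphics and import/export functions
standard = 11.647  # Perl + standard RRDs

def percent_more(value, reference):
    """How many percent `value` is larger than `reference`."""
    return (value / reference - 1.0) * 100.0

overhead_stripped = percent_more(stripped, baseline)   # RRDs overhead over plain Perl
overhead_standard = percent_more(standard, baseline)
points_saved = overhead_standard - overhead_stripped   # difference in percentage points

# Loading time of RRDs alone, with the Perl baseline subtracted.
rrds_only = percent_more(standard - baseline, stripped - baseline)

# Reduction of the total startup time of Perl + RRDs.
total_saved = (1.0 - stripped / standard) * 100.0

print('overhead: +%.1f%% vs +%.1f%% (%.1f points apart)'
      % (overhead_stripped, overhead_standard, points_saved))
print('standard RRDs loads %.1f%% slower than the stripped one' % rrds_only)
print('total startup time is %.1f%% shorter' % total_saved)
```

Note the difference between the two ways of reading the result: measured against the Perl baseline the gain is about 35 percentage points, while the total startup time of Perl + RRDs shrinks by a smaller share.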


Here are the commands I used for the benchmarks:

  • Only Perl (baseline): time ( i=1000 ; while [ "$i" -gt 0 ]; do perl -Mwarnings -Mstrict -e '' ; i=$(($i-1)); done )
  • Perl + RRDs: time ( i=1000 ; while [ "$i" -gt 0 ]; do perl -Mwarnings -Mstrict -MRRDs -e '' ; i=$(($i-1)); done )

C++ vs. Python vs. Perl vs. PHP performance benchmark

July 1, 2010

Update: There is a part #2 of the benchmark results.


This all began when a colleague of mine stated that Python was so damn slow for maths. That really astonished me and made me check it out, as my father once told me that he was very satisfied with Python because it was very maths oriented.

The benchmarks here do not try to be complete, as they show the performance of the languages in one aspect only, namely loops, arrays with numbers, and basic math operations.

Update: Give your ideas and use-cases on what to benchmark, and I’ll try to implement it for you. I.e. “benchmark the languages for reading a file, then splitting it to tokens by white-space and finally outputting all unique elements and their count”.

Out of curiosity, Python was also benchmarked with and without the Psyco Python extension, which people say could greatly speed up the execution of any Python code without any modifications.

Here are the benchmark results:

Language                     CPU time (seconds)          Slower than
                             User     System   Total     C++      previous
C++ (optimized with -O2)      2.456    0.400    2.856    -        -
C++ (not optimized)           4.352    0.404    4.756    67%      67%
Python + Psyco               12.693    0.320   13.013    356%     174%
Python                       28.866    0.208   29.074    918%     123%
Perl                         42.515    0.184   42.699    1395%    47%
PHP                          85.873    0.560   86.433    2926%    102%

The clear winner among the script languages is… Python. :)
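For readers who want to check how the two “Slower than” columns relate to the totals, here is a small Python 3 calculation over the total times from the results above. The helper name slower_than() is only an illustrative choice.

```python
# Total CPU times from the results, fastest first.
totals = [
    ('C++ (optimized with -O2)', 2.856),
    ('C++ (not optimized)', 4.756),
    ('Python + Psyco', 13.013),
    ('Python', 29.074),
    ('Perl', 42.699),
    ('PHP', 86.433),
]

def slower_than(value, reference):
    """Percentage by which `value` exceeds `reference`, rounded to an integer."""
    return int(round((value / reference - 1.0) * 100.0))

fastest = totals[0][1]
rows = []
for index in range(1, len(totals)):
    name, total = totals[index]
    previous = totals[index - 1][1]
    rows.append((name, slower_than(total, fastest), slower_than(total, previous)))

for name, vs_fastest, vs_previous in rows:
    print('%-22s %5d%% %5d%%' % (name, vs_fastest, vs_previous))
```

Both columns use the same formula; only the reference changes, from the fastest entry to the entry directly above.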

The times include the interpretation/parsing phase for each language, but it’s so small that its significance is negligible. The math function is called 10 times, in order to have more reliable results. All scripts are using the very same algorithm to calculate the prime numbers in a given range. The correctness of the implementation is not so important, as we just want to check how fast the languages perform. The original Python algorithm was taken from http://www.daniweb.com/code/snippet216871.html.

All tests were done on a Kubuntu Lucid box. The versions of the used software packages follow:

  • g++ (GNU project C and C++ compiler) 4.4.3
  • Python 2.6.5
  • Python Psyco 1.6 (1ubuntu2)
  • Perl 5.10.1
  • PHP 5.3.2 (1ubuntu4.2 with Suhosin-Patch), Zend Engine 2.3.0

The C++ implementation follows:

#include <cstdio>
#include <cmath>
#include <vector>

using namespace std;

vector<int> get_primes7(int n) { // ugly variable declarations but close to the other lang. syntaxes
	vector<int> res;

	if (n < 2) return res;
	if (n == 2) {
		res.push_back(2);
		return res;
	}
	vector<int> s;
	for (int i = 3; i < n + 1; i += 2) {
		s.push_back(i);
	}
	int mroot = sqrt(n);
	int half = (int)s.size();
	int i = 0;
	int m = 3;
	while (m <= mroot) {
		if (s[i]) {
			int j = (int)((m*m - 3)/2);
			s[j] = 0;
			while (j < half) {
				s[j] = 0;
				j += m;
			}
		}
		i = i + 1;
		m = 2*i + 3;
	}
	res.push_back(2);
	for (vector<int>::iterator it = s.begin() ; it < s.end(); ++it) {
		if (*it) {
			res.push_back(*it);
		}
	}

	return res;
}

int main() {
	vector<int> res;
	for (int i = 1; i <= 10; ++i) {
		res = get_primes7(10000000);
		printf("Found %d prime numbers.\n", (int)res.size());
	}

	return 0;
}

The Python implementation follows:

#import psyco
#psyco.full()

def get_primes7(n):
	"""
	standard optimized sieve algorithm to get a list of prime numbers
	--- this is the function to compare your functions against! ---
	"""
	if n < 2:  return []
	if n == 2: return [2]
	# do only odd numbers starting at 3
	s = range(3, n+1, 2)
	# n**0.5 simpler than math.sqrt(n)
	mroot = n ** 0.5
	half = len(s)
	i = 0
	m = 3
	while m <= mroot:
		if s[i]:
			j = (m*m-3)//2  # int div
			s[j] = 0
			while j < half:
				s[j] = 0
				j += m
		i = i+1
		m = 2*i+3
	return [2]+[x for x in s if x]

for t in range(10):
	res = get_primes7(10000000)
	print "Found", len(res), "prime numbers."

The Perl implementation follows:

use strict;
use warnings;

sub get_primes7($) {
	my ($n) = @_;

	if ($n < 2) { return (); }
	if ($n == 2) { return (2); }
	# do only odd numbers starting at 3
	my @s = ();
	for (my $i = 3; $i < $n + 1; $i += 2) {
		push(@s, $i);
	}
	# square root via ** 0.5, same as in the Python version
	my $mroot = $n ** 0.5;
	my $half = scalar @s;
	my $i = 0;
	my $m = 3;
	while ($m <= $mroot) {
		if ($s[$i]) {
			my $j = int(($m*$m - 3) / 2);
			$s[$j] = 0;
			while ($j < $half) {
				$s[$j] = 0;
				$j += $m;
			}
		}
		$i = $i + 1;
		$m = 2*$i + 3;
	}
	my @res = (2);
	foreach (@s) {
		push(@res, $_) if ($_);
	}
	return @res;
}

my @res;
for (1..10) {
	@res = get_primes7(10000000);
	print "Found ".(scalar @res)." prime numbers.\n";
}

The PHP implementation follows:

<?php
error_reporting(E_ALL);
ini_set('display_errors', '1');

function get_primes7($n) {
	if ($n < 2) return array();
	if ($n == 2) return array(2);
	$s = array();
	for ($i = 3; $i < $n + 1; $i += 2) {
		$s[] = $i;
	}
	$mroot = sqrt($n);
	$half = count($s);
	$i = 0;
	$m = 3;
	while ($m <= $mroot) {
		if ($s[$i]) {
			$j = (int)(($m*$m - 3) / 2);
			$s[$j] = 0;
			while ($j < $half) {
				$s[$j] = 0;
				$j += $m;
			}
		}
		$i = $i + 1;
		$m = 2*$i + 3;
	}
	$res = array(2);
	foreach ($s as $v) {
		if ($v) {
			$res[] = $v;
		}
	}
	return $res;
}

$res = array();
for ($i = 1; $i <= 10; ++$i) {
	$res = get_primes7(10000000);
	print "Found ".count($res)." prime numbers.\n";
}

Update (Jul/24/2010): Added the C++ optimized values.
Update (Aug/02/2010): Added a link to the benchmarks, part #2.


Filter a character sequence leaving only valid UTF-8 characters

July 1, 2010

This is my implementation of a Perl regular expression which sanitizes a multi-byte character sequence by filtering only the valid UTF-8 characters in it. Any non-UTF-8 character sequences are deleted and in the end you get a clean, valid UTF-8 multi-byte string.

Note that this works only for a subset of the UTF-8 alphabet. I.e. this is not a general filtering regular expression, but it leaves the standard ASCII and only the Cyrillic UTF-8 characters. You can easily extend the regular expression and add another UTF-8 subset.

Let’s get to the requirements:

  • Standard ASCII symbols: As described on the Wikipedia UTF-8 page, the ASCII characters from Hex 00-7F are encoded without modification in a UTF-8 sequence, as they are “Single-byte encoding (compatible with US-ASCII)”. Therefore, any character between Hex 00-7F is valid in a UTF-8 sequence. For our current example, though, we will keep only certain ASCII symbols, namely a few of the control ones and the printable ones:
    • ASCII control symbols: \t -> Hex 09, \n -> Hex 0A, \r -> Hex 0D.
    • Printable single-byte ASCII symbols: Hex 20-7E.
  • Cyrillic multi-byte UTF-8 characters, only the Russian/Bulgarian ones: If you open the Unicode/UTF-8 character table, and navigate to the “U+0400…U+04FF: Cyrillic” block, you can visually choose which characters you want to allow in your UTF-8 sequence by looking in the “character” column. In my case, I want to allow the characters “А”, “Б”, “В”, “Г” and so on until “ю”, “я”. If you look at the “UTF-8 (hex.)” column, you will notice that the range of these Cyrillic characters is from Hex d0 90 to Hex d0 bf, and from Hex d1 80 to Hex d1 8f. Yes, two ranges.

Therefore, our regular expression has to allow only the following sequences:

  • Single-byte, standard ASCII: \t, \n, \r, and x20-x7E.
  • Multi-byte, Cyrillic UTF-8: xD090-xD0BF, and xD180-xD18F.

Once you have established these rules, it’s very easy to construct the regular expression:

$my_string =~ s/.*?((?:[\t\n\r\x20-\x7E])+|(?:\xD0[\x90-\xBF])+|(?:\xD1[\x80-\x8F])+|).*?/$1/sg;
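The same whitelist can be expressed in other languages too. Below is a minimal Python 3 sketch of the idea, working on raw bytes; it is not a port of the Perl substitution above but a different technique (collect every allowed sequence and join the pieces), and the function name filter_ascii_cyrillic() is hypothetical.

```python
import re

# One allowed "character" per match:
#   - tab, newline, carriage return, or printable ASCII (0x20-0x7E)
#   - 0xD0 followed by 0x90-0xBF  (Cyrillic, first range)
#   - 0xD1 followed by 0x80-0x8F  (Cyrillic, second range)
ALLOWED = re.compile(rb'[\t\n\r\x20-\x7E]|\xD0[\x90-\xBF]|\xD1[\x80-\x8F]')

def filter_ascii_cyrillic(data):
    """Keep only the whitelisted byte sequences; everything else is dropped."""
    return b''.join(ALLOWED.findall(data))

dirty = 'Привет, world!\n'.encode('utf-8') + b'\xff\x00'
clean = filter_ascii_cyrillic(dirty)
print(clean.decode('utf-8'))
```

Because only complete two-byte sequences are accepted, a lone lead byte such as 0xD0 without a valid continuation byte is dropped along with any other disallowed byte.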


If you are wondering why I would need only certain ASCII control and only the printable ASCII characters, the answer is – because of the XML standard. As the XML W3C Recommendations state, only certain Hex characters and character sequences are valid in an XML document, even as HTML entities: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].

Libexpat is very strict in what you feed as input, and if your input isn’t a valid UTF-8 sequence, you will end up with the error message “XML parse error: not well-formed (invalid token)”.


Migrate your TWiki to Google Sites (using Google Sites API and Perl)

May 30, 2010

If you want to transfer your existing TWiki webs to Google Sites, you can do it automatically with the power of the Google Sites API and Perl.

You can download the Perl script, which I used to export my TWiki webs and then import them in Google Sites, at the following page: http://www.famzah.net/download/google-api/twiki2googlesites.pl

Note that this is in no way a complete migration solution. You can use it as a demonstration, or as a base, of how to interact with Google API features using Perl.

Now you know that you can use Perl to interact with Google APIs. Go build your own scripts!

Update: Google released Python command line tools for the Google Data APIs (GoogleCL). They seem promising and very easy to use for simple automation tasks.

P.S. If you’re limited on time and your TWiki has relatively few pages, you’re probably better off migrating it manually with copy/paste than writing your own migration scripts. Believe me. :)


Perl API Kit for ResellerClub (DirectI)

May 26, 2010

ResellerClub offers a SOAP/WSDL API, in addition to their Online Control Panel, which lets you automate some of your tasks or integrate it directly with your website.

They claim to support a Perl API Kit, but it doesn’t work out of the box for me. Whenever I make an API call, I get the following:

soapenv:Server.userException java.lang.Exception: Body not found.

There is a similar bug report at Web Hosting Talk too.

After a few hours of struggling with SOAP::Lite, reading sources, and some trial and error, I finally was able to make the API work in Perl! :D

If you want to try my version of their Perl API Kit, you have to execute the following:

wget --no-verbose http://www.famzah.net/download/resellerclub/resellerclub-api.tgz
tar -zxf resellerclub-api.tgz
cd resellerclub-api

vi example.pl # edit your username/password
./example.pl

In order to build my version of the Perl API Kit yourself, execute the following commands:

mkdir resellerclub-api
cd resellerclub-api
wget --no-verbose http://www.famzah.net/download/resellerclub/setup.sh
wget --no-verbose http://www.famzah.net/download/resellerclub/example.pl
chmod +x setup.sh example.pl
./setup.sh

vi example.pl # edit your username/password
./example.pl

The scripts use some Debian/Ubuntu specific “apt-get” commands to install the required Perl and system packages, but this can easily be ported to other *nix systems too.

Hot offer: If you want to start selling directly at “Slab 2” discounted pricing with ResellerClub, contact me, and I’ll set up a sub-reseller account for you. You will save $1499 worth of initial investments.


Auto-flush both STDOUT and STDERR in Perl

April 8, 2010

Q: Why is the output of Perl’s warn(), or other STDERR output, not shown/logged/saved/flushed into my log file?
A: You may have encountered the well-known feature of stream buffering which is enabled by default.

An excerpt from the perlvar documentation says that “…STDOUT will typically be line buffered if output is to the terminal and block buffered otherwise”. Thus it is always buffered, also for STDERR.

Usually people remember to set STDOUT as auto-flush, but you should enable this for STDERR as well, or else your messages to STDERR may not appear in your log file immediately, if you are redirecting STDERR to a file.

The following piece of code sets an auto-flush for both STDOUT and STDERR:

select(STDERR);
$| = 1;
select(STDOUT); # default
$| = 1;

The select() function and the $| variable are built-in for Perl and require no additional libraries to be included.

Alternatively, you can also use IO::Handle to achieve the same result:

use IO::Handle;
STDERR->autoflush(1);
STDOUT->autoflush(1);

I never understood why stream buffering for STDOUT and STDERR is enabled by default in most scripting languages… But that’s just me.

