/contrib/famzah

Enthusiasm never stops



Google App Engine Datastore benchmark

I admire the idea of Google App Engine: a platform as a service where there is “no worrying about DBAs, servers, sharding, and load balancers” and where you can “auto scale to 7 billion requests per day”. I wanted to try the App Engine for a pet project where I had to collect, process and query a huge amount of time series. The fact that I needed fast queries over tens of thousands of records, however, made me wonder whether the App Engine Datastore would be fast enough. Note that in order to reduce the number of entities fetched from the database, several data entries are consolidated into a single database entity. This, however, imposes another limitation: fetching big entities uses more memory on the running instance.

My language of choice is Java, because its performance for such computations is great. I am using the Objectify interface (version 4.0rc2), which is also one of the recommended APIs for the Datastore.

Unfortunately, my tests show that the App Engine is not suitable for querying such amounts of data. For example, fetching and updating 1000 entries takes 1.5 seconds and additionally uses a lot of memory on the F1 instance. You can review the Excel sheet file below for more detailed results.

Basically each benchmark test performs the following operations and then exits:

  1. Adds a bunch of entries.
  2. Gets those entries from the database and verifies them.
  3. Updates those entries in the database.
  4. Gets the entries again from the database and verifies them.
  5. Deletes the entries.

All Datastore operations are performed in a batch and thus in an asynchronous, parallel way. Furthermore, no indexes are used; the entities are referenced directly by their key, which is the most efficient way to query the Datastore. The tests were performed on two separate days because I wanted to extend some of them, and this is indicated in the results. A single warmup request was made before the benchmarks so that the App Engine could pre-load the application.
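
The actual benchmark is written in Java with Objectify (its source is at the download link below). As a rough, hypothetical illustration of the same access pattern (batch operations and direct lookups by key, no indexes), here is what the five steps boil down to with the Python datastore API; the entity kind and function names are made up for this sketch:

from google.appengine.ext import db

class BenchEntry(db.Model):
	data = db.BlobProperty()  # holds the ~1 KB payload of each entity

def bench_cycle(count):
	payload = db.Blob('x' * 1024)  # 1 KB of data, as in the benchmark
	key_names = ['entry-%06d' % i for i in xrange(count)]
	keys = [db.Key.from_path('BenchEntry', name) for name in key_names]

	# 1. add a bunch of entries (a single batch RPC)
	db.put([BenchEntry(key_name = name, data = payload) for name in key_names])

	# 2. get them back directly by key (no index is involved) and verify
	entities = db.get(keys)
	assert all(e is not None and e.data == payload for e in entities)

	# 3. update all of them in one batch
	new_payload = db.Blob('y' * 1024)
	for e in entities:
		e.data = new_payload
	db.put(entities)

	# 4. get and verify them again
	assert all(e.data == new_payload for e in db.get(keys))

	# 5. delete by key only -- no entity data travels over the wire
	db.delete(keys)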

The first observation is that on the default F1 instance, once we start fetching more than 100 entities, or start adding/updating/deleting more than 1000 entities, we saturate the Datastore -> Objectify -> Java throughput and don’t scale any more:
[Chart: App Engine Datastore median time per entity for 1 KB entities @ F1 instance]

The other interesting observation is that the Datastore -> Objectify -> Java throughput depends a lot on the App Engine instance class. That’s not surprising, because the application needs to serialize data back and forth when communicating with the Datastore, and this requires CPU power. The following two charts show that more CPU power speeds up all operations where serialization is involved, that is, all Datastore operations except Delete, which only sends the keys of the entities to the Datastore, no data:
[Chart: App Engine Datastore times per entity for 1000 x 1 KB entities @ F1 instance]

[Chart: App Engine Datastore times per entity for 1000 x 1 KB entities @ F4 instance]

Another unexpected finding is that the App Engine and the Datastore have good and bad days. Their latency, as well as the CPU accounting, can fluctuate a lot. The following chart shows the benchmark results we got on an F1 instance. If you compare it to the chart above, where a much more expensive F4 instance was used, you’ll notice that the four-times-cheaper F1 instance performed almost as fast as the F4 instance:
[Chart: App Engine Datastore times per entity for 1000 x 1 KB entities @ F1 instance (test on another day)]

The source code and the raw results are available for download at http://www.famzah.net/download/gae-datastore-performance/



Google App Engine Performance Profiling

When developing under Google App Engine, developers need to pay attention to how fast their code runs and also to how many resources it uses. The first is vital for the user experience; the second determines the hosting expenses, since resource usage costs money.

It turns out that there are quite a few suitable profilers for Google App Engine. At least these are the ones I could find:

  • Appstats – no doubt the most advanced one, giving you information about the timing and costs of your RPC calls to the Datastore, Memcache, etc. (see the snippet after this list).
    Works for both Python and Java. Included in the official SDK as of version 1.3.1.
  • appengine-profiler – besides being an RPC profiler like Appstats, appengine-profiler also lets you profile the CPU usage of your code, so you can easily identify hot spots where your code wastes a lot of CPU resources and wall-clock time. You can define “tracepoints” which surround blocks of your code, and you’ll easily know the resource usage of these blocks for each page load.
    Works for Python.
  • AppWrench – the Java profiler. I don’t code in Java, and I haven’t tested this profiler, but I’m including it here for all you Java gurus.
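
As a quick example of how little setup the SDK-bundled profiler needs, enabling Appstats for a Python application only takes a small appengine_config.py. A minimal sketch, based on the documented WSGI middleware hook:

# appengine_config.py -- loaded automatically by the Python runtime
from google.appengine.ext.appstats import recording

def webapp_add_wsgi_middleware(app):
	# wrap the WSGI application so that every RPC call gets recorded;
	# the collected data can then be browsed in the Appstats web UI
	app = recording.appstats_wsgi_middleware(app)
	return app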


Validator for the Model key_name property in Google App Engine datastore (Python)

The Google App Engine datastore provides convenient data modeling with Python. One important aspect is the validation of the data stored in a Model instance. Each data value is stored as a Property, which is an attribute of the Model class.

While every Property can be validated automatically by specifying a “validator” function, there is no such option for the Model key name. Note that our code can specify the value of the key name manually, so the key name is effectively user data and must be validated. The key name is, by the way, the only unique constraint supported by the Google datastore that can be specified manually, similar to the “primary key” in relational databases.
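
For comparison, here is how a regular Property gets validated automatically; the model and the validator function below are just a made-up illustration:

from google.appengine.ext import db

def url_validator(value):
	# a regular Property validator: db calls it automatically on every assignment
	if value is None:
		return  # let "required" take care of missing values
	if not value.startswith('http://') and not value.startswith('https://'):
		raise ValueError('Value "%s" is not a valid URL' % value)

class SiteDB(db.Model):
	url = db.StringProperty(required = True, validator = url_validator)

site = SiteDB(key_name = 'my-site', url = 'http://example.com')  # "url" is validated for us
# ...but nothing validates the 'my-site' key_name automatically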

Here is my version of a validation function for the Model’s key name:

from google.appengine.ext import db
import re

def ModelKeyNameValidator(self, regexp_string, *args, **kwargs):
	gotKey = None
	className = self.__class__.__name__

	if len(args) >= 2:
		if gotKey: raise Exception('Found key for second time for Model ' + className)
		gotKey = 'args'
		k = args[1] # key_name given as an unnamed argument
	if 'key' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + className)
		gotKey = 'Key'
		k = kwargs['key'].name() # key_name given as Key instance
	if 'key_name' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + className)
		gotKey = 'key_name'
		k = kwargs['key_name'] # key_name given as a keyword argument

	if not gotKey:
		raise Exception('No key found for Model ' + className)

	id = '%s.key_name(%s)' % (self.__class__.__name__, gotKey)
	if (not re.search(regexp_string, k)):
		raise ValueError('(%s) Value "%s" is invalid. It must match the regexp "%s"' % (id, k, regexp_string))

class ClubDB(db.Model):
	# key = url
	def __init__(self, *args, **kwargs):
		ModelKeyNameValidator(self, '^[a-z0-9-]{2,32}$', *args, **kwargs)
		super(self.__class__, self).__init__(*args, **kwargs)

	name = db.StringProperty(required = True)

As you can see, the proposed solution is not versatile enough and requires you to copy and alter the ModelKeyNameValidator() function again and again for every new validation type. I strictly follow the Don’t Repeat Yourself principle in programming, so after much Googling and struggling with Python, I arrived at the following solution, which I actually use in my projects:

from google.appengine.ext import db
import re

def string_type_validator(v):
	# assumed helper (not shown in the original post): the value must be a string
	if not isinstance(v, basestring):
		raise ValueError('Value "%s" is not a string' % v)

def re_validator(id, regexp_string):
	def validator(v):
		string_type_validator(v)
		if (not re.search(regexp_string, v)):
			raise ValueError('(%s) Value "%s" is invalid. It must match the regexp "%s"' % (id, v, regexp_string))
	return validator

def length_validator(id, minlen, maxlen):
	def validator(v):
		string_type_validator(v)
		if minlen is not None and len(v) < minlen:
			raise ValueError('(%s) Value "%s" is invalid. It must be more than %s characters' % (id, v, minlen))
		if maxlen is not None and len(v) > maxlen:
			raise ValueError('(%s) Value "%s" is invalid. It must be less than %s characters' % (id, v, maxlen))
	return validator

def ModelKeyValidator(v, self, *args, **kwargs):
	gotKey = None

	if len(args) >= 2:
		if gotKey: raise Exception('Found key for second time for Model ' + self.__class__.__name__)
		gotKey = 'args'
		k = args[1] # key_name given as unnamed argument
	if 'key' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + self.__class__.__name__)
		gotKey = 'Key'
		k = kwargs['key'].name()
	if 'key_name' in kwargs:
		if gotKey: raise Exception('Found key for second time for Model ' + self.__class__.__name__)
		gotKey = 'key_name'
		k = kwargs['key_name']

	if not gotKey:
		raise Exception('No key found for Model ' + self.__class__.__name__)

	v.execute('%s.key_name(%s)' % (self.__class__.__name__, gotKey), k) # validate the key now

class DelayedValidator:
	''' Validator class which allows you to specify the "id" dynamically on validation call '''
	def __init__(self, v, *args): # specify the validation function and its arguments
		self.validatorArgs = args
		self.validatorFunction = v

	def execute(self, id, value):
		if not isinstance(id, basestring):
			raise Exception('No valid ID specified for the Validator object')
		func = self.validatorFunction(id, *(self.validatorArgs)) # get the validator function
		func(value) # do the validation

class ClubDB(db.Model):
	# key = url
	def __init__(self, *args, **kwargs):
		ModelKeyValidator(DelayedValidator(re_validator, '^[a-z0-9-]{2,32}$'), self, *args, **kwargs)
		super(self.__class__, self).__init__(*args, **kwargs)

	name = db.StringProperty(
		required = True,
		validator = length_validator('ClubDB.name', 1, None))

You probably noticed that in the second example I also added a validator for the “name” property. Note that the re_validator() and length_validator() functions can be re-used. Furthermore, thanks to the DelayedValidator class, which accepts a validator function and its arguments as constructor arguments, the ModelKeyValidator() function can be re-used without any modifications too.
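
For example, a second (hypothetical) model can plug the very same helpers in without modifying them:

class PlayerDB(db.Model):
	# key = nickname
	def __init__(self, *args, **kwargs):
		ModelKeyValidator(DelayedValidator(re_validator, '^[a-z0-9_]{3,16}$'), self, *args, **kwargs)
		super(PlayerDB, self).__init__(*args, **kwargs)

	motto = db.StringProperty(
		required = True,
		validator = length_validator('PlayerDB.motto', 1, 140))

# both the key_name and the "motto" property are validated on instantiation
player = PlayerDB(key_name = 'john_doe', motto = 'Enthusiasm never stops')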

P.S. It seems that all “validator” functions are executed every time a Model class is instantiated. This means that regardless of whether you are creating/updating the data object or simply reading it from the datastore, the assigned values are always validated. This surely wastes some CPU cycles, but for now I have no idea how to easily circumvent it.

Disclaimer: I’m new to Python and Google App Engine. But they seem fun! 🙂 Sorry for the long lines…

