Google App Engine Performance Profiling

September 14, 2010

When developing under Google App Engine, developers need to pay attention to how fast their code works, and also how much resources their code uses. The first parameter is vital for user-experience, the second – for the hosting expenses, as resources usage costs money.

It turns out that there are quite a few suitable profilers for Google App Engine. At least these are the ones I could find:

  • Appstats – no doubt, the most advanced one, giving you information about the timing and costs of your RPC calls to the datastore, Memcache, etc.
    Works for both Python, and Java. Included in the official SDK as of version 1.3.1.
  • appengine-profiler – besides being an RPC profiler like Appstats, appengine-profiler gives you the option to profile the CPU usage of your code, thus you can easily identify hot-spots where your code wastes a lot of CPU resources and wall-clock time. You can define “tracepoints” which surround part of your code blocks, and you’ll easily know the resources usage of these blocks for each page load.
    Works for Python.
  • AppWrench – the Java profiler. I don’t code on Java, and I haven’t tested this profiler, but I’m including it here for all you Java gurus.

Resources:


Speed up RRDtool database manipulations via RRDs (Perl)

August 1, 2010

Use case
You are doing a lot of data operations on your RRD files (create, update, fetch, last), and every update is done by a separate Perl process which lives a very short time – the process is launched, it updates or reads the data, does something else, and then exits.

The problem
If you are using RRDtool and Perl as described, you surely have noticed that running many of these processes wastes a lot of CPU resources. The question is – can we do some performance optimizations, and lessen the performance hit of loading the RRDs library into Perl? We know that launching often Perl itself is quite expensive, but after all, if we chose to work with Perl, this is a price we should be ready to pay.

The RRDtool shared library is a monolithic piece of code which provides ALL functions of the RRDtool suite – data manipulation, graphics and import/export tools. The last two components bring huge dependencies in regards to other shared libraries. The library from RRDtool version 1.4.4 depends on 34 other libraries on my Linux box! This must add up to the loading time of the RRDtool library into Perl.

Resolution and benchmarks
In order to prove my theory (actually, it was more a theory of zImage, and I just followed, enhanced and tried it), I commented out the implementation of the “graphics” and “import/export tools” modules from the source code of RRDtool. Then I re-compiled the library and did some performance benchmarks. I also re-implemented the RRDs.pm module by replacing the DynaLoader module with the XSLoader one. This made no difference in performance whatsoever. The re-compiled RRD library depends on only 4 other libraries – linux-gate.so.1, libm.so.6, libc.so.6, and /lib/ld-linux.so.2. I think this is the most we can cut down. :)

So here are the benchmark results. They show the accumulated time for 1000 invocations of the Perl interpreter with three different configurations:

  • Only Perl (baseline): 5.454s.
  • With RRDs, no graphics or import/export functions: 9.744s (+4.290s) +78%.
  • With standard RRDs: 11.647s (+6.192s) +113%.

As you can see, you can make Perl + RRDs start 35% faster. The speed up for RRDs itself is 44%.


Here are the commands I used for the benchmarks:

  • Only Perl (baseline): time ( i=1000 ; while [ "$i" -gt 0 ]; do perl -Mwarnings -Mstrict -e ” ; i=$(($i-1)); done )
  • Perl + RRDs: time ( i=1000 ; while [ "$i" -gt 0 ]; do perl -Mwarnings -Mstrict -MRRDs -e ” ; i=$(($i-1)); done )

Firefox 3.6 menu and right-click are slow on Kubuntu Lucid

May 18, 2010

If you have installed a fresh copy of Kubuntu Lucid as I did, on two different computers, and your Firefox main menus or right-click context menus are being shown slowly (i.e. take several seconds to appear), then I can tell you what it was not for me. And later I’ll tell you how I fixed it. :)

It was not:

  • Firefox settings in “~/.mozilla”, add-ons or extensions. Disable/enable makes no difference.
  • nVidia drivers, or any other video drivers like ATI, etc, as you’ll see below.
  • GTK theme.
  • Custom fonts.
  • The Xorg server.
  • Anything I could easily spot via strace, because Firefox is very complicated for me.
  • Kubuntu theme.
  • Firefox safe-mode doesn’t help either.

I’ve tried it all. The problem was in the global network settings, and more exactly in “/etc/hosts“.
Logical, no? :)

The fix: You need to make sure that “localhost” is defined properly for “127.0.0.1″ and not defined for the IPv6 configuration like it was by default on my two newly set up Kubuntu boxes.

That’s a bad “/etc/hosts” file:

127.0.0.1       localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
... (and so on)

That’s the fixed “/etc/hosts” file:

127.0.0.1       localhost

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
... (and so on)

Pay attention that on line #1 there is “localhost” defined for “127.0.0.1″, and on line #4 there is no “localhost” for the IPv6 address “::1″.

Enjoy your faster Firefox.

Update: It seems that “localhost” must be defined for IPv6, according to the Ubuntu developers. This is discussed in Ubuntu Bug #301430: ipv6 /etc/hosts missing localhost hostname. I’ll leave it up to you to decide if you want to risk breaking your IPv6 or other functionality.

P.S. Also enjoy Steve Ballmer and his “Developers, developers, developers…”!


Resources:


Why /sys/block/dm-0/queue/scheduler exists on my Linux system?

January 25, 2010

The device-mapper (DM) traditionally didn’t have its own I/O scheduler. Then why suddenly my DM devices have such a scheduler and what does it control?


A new type of device-mapper was introduced recently in the Linux kernel 2.6.31 – the request-based device-mapper. According to the Linux Kernel Newbies changelog for 2.6.31, there is a commit which does “Prepare for request based option”.

The issue is actually not in the new request-based DM option, which is to be used only for multipath block devices. The problem is that when you create a regular LVM device on kernels 2.6.31+, the DM device itself has I/O scheduler parameters. So does the underlying block device on top of which you created the LVM. Thus we are having two I/O schedulers in the path from the LVM device to the physical storage.

According to the author of the kernel patches for the request-based DM device, Kiyoshi Ueda, for a bio-based DM device, only the underlying device’s scheduler should affect performance. This is what my tests shown too, therefore there is no discrepancy.

Let me summarize this:

  • If you are *not* using multipath block devices in your DM/LVM setup, then only the underlying device’s scheduler (i.e. “/sys/block/sda/queue/scheduler”) takes effect. This applies for the trivial LVM setup which many of us used for years.
  • If you are using a multipath DM/LVM setup, then only the DM device’s scheduler (i.e. “/sys/block/dm-0/queue/scheduler”) takes effect.

References:


Bifferboard performance benchmarks

November 25, 2009

The benchmarks were done while Bifferboard was running Linux kernel 2.6.30.5 and Debian Lenny.

Boot time
Total boot time: 1 minute 11 seconds (standard Debian Lenny base installation)

The boot process goes like this:

  • Initial boot. Mounted root device (5 seconds wasted on waiting for the USB mass storage to be initialized). Executing INIT. [+21 seconds elapsed]
  • Waiting for udev to be initialized (most of the time spent here), configuring some misc settings, no dhcp network, entering Runlevel 2. [+41 seconds elapsed]
  • Started “rsyslogd”, “sshd”, “crond”. Got prompt on the serial console. [+9 seconds elapsed]

Therefore, if a very limited and custom Linux installation is used, the total boot time could be reduced almost twice.

CPU speed
Calculated BogoMips: 56.32
According to a quite complete BogoMips list table, this is an equivalent of Pentium@133MHz or 486DX4@100MHz.

According to another CPU benchmarks comparison table for Linux, Bifferboard falls into the category of Pentium@100Mhz.

Memory speed
A “dd” write to “/dev/shm” performs with a speed of 6.3 MB/s.
The MBW memory bandwidth benchmark shows the following results:

bifferboard:/tmp# wget

http://de.archive.ubuntu.com/ubuntu/pool/universe/m/mbw/mbw_1.1.1-1_i386.deb

bifferboard:/tmp# dpkg -i mbw_1.1.1-1_i386.deb
bifferboard:/tmp# mbw 4 -n 20|egrep ^AVG
AVG Method: MEMCPY Elapsed: 0.15602 MiB: 4.00000 Copy:
25.637 MiB/s
AVG Method: DUMB Elapsed: 0.06611 MiB: 4.00000 Copy:
60.502 MiB/s
AVG Method: MCBLOCK Elapsed: 0.06619 MiB: 4.00000 Copy:
60.431 MiB/s

For comparison, my Pentium Dual-Core @ 2.50GHz with DDR2 @ 800 MHz (1.2 ns) shows a “dd” copy speed to “/dev/shm” of 271 MB/s, while the MBW test shows a maximum average speed of 7670.954 MiB/s. Bifferboard is an embedded device after all… :)

Available memory for applications
My base Debian installation with “udevd”, “dhclient3″, “rsyslogd”, “sshd”, “getty” and one tty session running shows 24908 kbytes free memory. You surely cannot put CNN.com on this little machine, but compared to the PIC16F877A which has 368 bytes (yes, bytes) total RAM memory, Bifferboard is a monster.

Disk system
All tests are done on an Ext3 file-system and a very fast USB Flash 8GB A-Data Xupreme 200X.
A “dd” copy to a file completes with a write speed of 6.1 MB/s.
The Bonnie++ benchmark test shows the following results:

Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
bifferboard.lo 300M   822  96  5524  61  4305  42   855  99 16576  67 143.2  12
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP
                 16  1274  94 14830 100  1965  85  1235  90 22519 100 2015  87

bifferboard.localdomain,300M,822,96,5524,61,4305,42,855,99,16576,67,143.2,12,16,1274,94,14830,100,1965,85,1235,90,22519,100,2015,87

Therefore, the sequential write speed is about 5.5 MB/s, while the sequential read speed is about 16.5 MB/s.

It’s worth mentioning that while the write tests were running, there was a very high CPU System load (not CPU I/O waiting) which indicates that the write throughput of Bifferboard may be a bit better if the file-system is not a journaling one. However, the tests for “Memory speed” above show that writing to “/dev/shm” (a memory-based file-system) completes with a rate of 6.3 MB/s. Therefore, this is probably the limit with this configuration.

Network
Both Netperf and Wget show a throughput of 6.5 MB/s.
The packets-per-second tests complete at a rate of 8000 packets/second.
Modern systems can handle several hundred thousand packets-per-second without an issue. However, the measured network performance of Bifferboard is more than enough for trivial network communication with the device. During the network benchmark tests, there was very high CPU System usage, but that was expected.

Encryption and SSH transfers
The maximum encryption rate for an eCryptfs mount with AES cipher and 16 bytes key length is 536 KB/s. The standard SSH Protocol 2 transfer rate using the OpenSSH server is about the same – 587.1 KB/s. If you try to transfer a file over SSH and store it on an eCryptfs mounted volume, the transfer rate is 272.2 KB/s, which is logical, as the processing power is split between the SSH transfer and the eCryptfs encryption.
You can try to tweak your OpenSSH ciphers, in order to get much better performance. The OpenSSH ciphers performance benchmark page will give you a starting point.

Conclusion
Bifferboard performs pretty well for its price. It’s my personal choice over the 8-bit 16F877A and the other 16-bit Microchip / ARM microcontrollers, when a project does not require very fast I/O.


Benchmark the packets-per-second performance of a network device

November 24, 2009

This article is about measuring network throughput and performance using a Linux box, in a very approximate way.

There are many network performance benchmarks and stress-tests which measure the maximum bandwidth transfer rate of a device – that is how many bytes-per-second or bits-per-second a device can handle. Good examples for such benchmark tests are NetPerf, Iperf, or even multiple Wget downloads which run simultaneously and save the downloaded files to /dev/null, so that we eliminate the hard disk throughput of the local machine.

There is however another very important metric for the throughput of a network device – its packets-per-second (pps) maximum rate.

Today I struggled a lot to find out a tool which measures the packets-per-second throughput of a network device and couldn’t find a suitable one. Therefore, I tried two different methods both using the ICMP echo protocol, in order to approximately measure the packets-per-second metric of a remote network device which is directly attached to my network.

Method #1: The old-school “ping“. Execute the following as root:

ping -q -s 1 -f 192.168.100.101

This floods the destination “192.168.100.101″ with the smallest possible PING packets. Here is a sample output:

--- 192.168.100.101 ping statistics ---
20551 packets transmitted, 20551 received, 0% packet loss, time 6732ms

You can calculate the packets-per-second rate by dividing the count of the packets to the elapsed time:

20551 packets / 6.732 secs ~= 3052 packets-per-second

Note that this is the count of the replies at 0% packet loss. Which means that the same count requests were sent too. So the final result is:

3052 x 2 = 6104 packets-per-second

Note that if you run multiple “ping” commands simultaneously, you will get more accurate results. You need to sum the average values from every “ping” command, in order to calculate the overall packets-per-second rate. For example, when I performed the measurement against the very same host but using two or three simultaneous “ping” commands, the average packets-per-second value was settled to 8000.

Method #2: The much faster and extended “hping3“. Execute the following:

taurus:~# time hping3 192.168.100.101 -q -i u20 --icmp|tail -n10

This will bombard the host “192.168.100.101″ with ICMP echo requests. After a few seconds, you can interrupt the ping by pressing CTRL+C. Here is the output:

--- 192.168.100.101 hping statistic ---
67932 packets transmitted, 10818 packets received, 85% packet loss
round-trip min/avg/max = 0.5/7.8/15.6 ms

real 0m2.635s
user 0m0.368s
sys 0m1.160s

If the packet loss is 0%, decrease “-i u20″ to something smaller like “-i u15″ or less. Make sure to not decrease it too much, or you may temporarily lose connectivity to the device (because of switch problems?).

The calculations for the packets-per-second value are the same as for “ping” above. You need to divide the received packets to the time:

10818 packets / 2.635 secs ~= 4105 packets-per-second

Because the calculated packets-per-second represent the rate of successful replies, we need to multiply it by two, in order to get the actual packets-per-second, as the count of the replies is the same as the count of the successfully received requests. The final result is:

4105 x 2 = 8210 packets-per-second

The calculated value by “hping3” is the same as the one by “ping”. We are either correct, or both methods failed… :)


Some notes which apply for both methods:

  • You should run multiple instances of the ping commands, either from the same or from different machines, in order to be able to properly saturate the connection of the machine which is being tested.
  • The machines from which you test have a hardware rate limit too. If the host being tested has a similar rate limit, you surely need to run the ping commands from more than a single machine.
  • You should run more and more simultaneous ping commands until you start to receive similar results for the calculated packets-per-second rate. This indicates a saturation of the network bandwidth, usually at the remote host which is being tested, but may also indicate a saturation somewhere in the route between your tester machines and the tested host…
  • The ethernet switches also have a hardware limit of the bandwidth and the packets-per-second. This may influence the overall results.
  • If the tested device has any firewall rules, you should temporarily remove them, because they may slow down the packets processing.
  • If the tested device is a Linux box, you should temporarily remove the ICMP rate limits by executing “sysctl net.ipv4.icmp_ratelimit=0″ and “sysctl net.ipv4.icmp_ratemask=0″.
  • If the tested device is a router, a better way to test its packets-per-second maximum rate is to flood one of its interfaces and count the output rate at the other one, assuming that you are actually flooding a host which is behind the router according to its routing tables.

Thanks to zImage for his usual willingness to help people out by giving good tips and ideas.


Follow

Get every new post delivered to your Inbox.