Google App Engine Datastore benchmark

I admire the idea of Google App Engine — a platform as a service where there is “no worrying about DBAs, servers, sharding, and load balancers”. And you can “auto scale to 7 billion requests per day”. I wanted to try the App Engine for a pet project where I had to collect, process and query a huge amount of time series. The fact that I needed to do fast queries over tens of 1000′s of records however made me wonder if the App Engine Datastore would be fast enough. Note that in order to reduce the amount of entities which are fetched from the database, couples of data entries are consolidated into a single database entity. This however imposes another limitation — fetching big data entities uses more memory on the running instance.

My language of choice is Java, because its performance for such computations is great. I am using the the Objectify interface (version 4.0rc2), which is also one of the recommended APIs for the Datastore.

Unfortunately, my tests show that the App Engine is not suitable for querying of such amount of data. For example, fetching and updating 1000 entries takes 1.5 seconds, and additionally uses a lot of memory on the F1 instance. You can review the Excel sheet file below for more detailed results.

Basically each benchmark test performs the following operations and then exits:

  1. Adds a bunch of entries.
  2. Gets those entries from the database and verifies them.
  3. Updates those entries in the database.
  4. Gets the entries again from the database and verifies them.
  5. Deletes the entries.

All Datastore operations are performed in a batch and thus in an asynchronous parallel way. Furthermore, no indexes are used but the entities are referenced directly by their key, which is the most efficient way to query the Datastore. The tests were performed at two separate days because I wanted to extend some of the tests. This is indicated in the results. A single warmup request was made before the benchmarks, so that the App Engine could pre-load our application.

The first observation is that using the default F1 instance once we start fetching more than 100 entities or once we start to add/update/delete more than 1000 entities, we saturate the Datastore -> Objectify -> Java throughput and don’t scale any more:
App Engine Datastore median time per entity for 1 KB entities @ F1 instance

The other interesting observation is that the Datastore -> Objectify -> Java throughput depends a lot on the App Engine instance. That’s not a surprising fact because the application needs to serialize data back and forth when communicating with the Datastore. This requires CPU power. The following two charts show that more CPU power speeds up all operations where serializing is involved — that is all Datastore operations but the Delete one which only queries the Datastore by supplying the keys of the entities, no data:
App Engine Datastore times per entity for 1000 x 1 KB entities @ F1 instance

App Engine Datastore times per entity for 1000 x 1 KB entities @ F4 instance

It is unexpected that the App Engine and the Datastore still have good and bad days. Their latency as well as CPU accounting could fluctuate a lot. The following chart shows the benchmark results which we got using an F1 instance. If you compare this to the chart above where a much more expensive F4 instance was used, you’ll notice that the 4-times cheaper F1 instance performed almost as fast as an F4 instance:
App Engine Datastore times per entity for 1000 x 1 KB entities @ F1 instance (test on another day)

The source code and the raw results are available for download at http://www.famzah.net/download/gae-datastore-performance/

C++: Use the right size for the right job

Do you want to speed up your C++ application up to 4 times? Read on.

Recently I happened to be on a summer holiday with a friend who is a C++ guru. We discussed my language performance benchmark post, which had just received a lot of attention in regards to C++. It took him a few hours to demonstrate something really important and often underestimated by developers — that we need to allocate only as much memory as needed, and nothing more.

The C++ algorithm which we use at the benchmark blog page uses an array of “int” (which on Linux also happens to be 4 bytes == 32 bits). However, if you review the code you’ll notice that the “int” value is used as a pure boolean variable for the given array index. So my friend proposed the following optimized version, which uses 1 bit of memory for the boolean value and thus doesn’t waste the other 31-bits left if we used “int” instead:

--- primes.cpp	2013-06-17 17:16:09.000000000 +0300
+++ primes-bool.cpp	2013-06-17 18:43:28.000000000 +0300
@@ -12,9 +12,9 @@
 		res.push_back(2);
 		return res;
 	}
-	vector<int> s;
+	vector<bool> s;
 	for (int i = 3; i < n + 1; i += 2) {
-		s.push_back(i);
+		s.push_back(true);
 	}
 	int mroot = sqrt(n);
 	int half = (int)s.size();
@@ -23,9 +23,9 @@
 	while (m <= mroot) {
 		if (s[i]) {
 			int j = (int)((m*m - 3)/2);
-			s[j] = 0;
+			s[j] = false;
 			while (j < half) {
-				s[j] = 0;
+				s[j] = false;
 				j += m;
 			}
 		}
@@ -33,9 +33,9 @@
 		m = 2*i + 3;
 	}
 	res.push_back(2);
-	for (vector<int>::iterator it = s.begin() ; it < s.end(); ++it) {
-		if (*it) {
-			res.push_back(*it);
+	for (size_t it = 0; it < s.size(); ++it) {
+		if (s[it]) {
+			res.push_back(2*it + 3);
 		}
 	}

The end result is that the original C++ implementation finishes for 18.913 CPU seconds, while the memory-optimized completes in 5.354 seconds (28%). Impressive!

The bottom line is that even RAM is very fast and memory allocation is efficient, using only as few memory as really needed could vastly improve the performance of your C++ application. Which applies for programs developed in any language.

Using flock() in Bash without invoking a subshell

The flock(1) utility on Linux manages flock(2) advisory locks from within shell scripts or the command line. This lets you synchronize your Bash scripts with all your other applications written in Perl, Python, C, etc.

I’ll focus on the third usage form where flock() is used inside a Bash script. Here is what the man page suggests:

#!/bin/bash

(
flock -s 200

# ... commands executed under lock ...

) 200>/var/lock/mylockfile

Unfortunately, this invokes a subshell which has the following drawbacks:

  • You cannot pass values to variables from the subshell in the main shell script.
  • There is a performance penalty.
  • The syntax coloring in “vim” does not work properly. :)

This motivated my colleague zImage to come up with a usage form which does not invoke a subshell in Bash:

#!/bin/bash

exec 200>/var/lock/mylockfile || exit 1
flock -n 200 || { echo "ERROR: flock() failed." >&2; exit 1; }

# ... commands executed under lock ...

flock -u 200

Google Reader alternative

Google announced that they are shutting down their online web RSS reader on July 1, 2013. What a shame, it was really useful and with a great web design.

After short research, I decided to code one — for fun and education. It’s designed to operate in a multi-user way, so if you want to give it a try, go on!

My online implementation is named “xs RSS reader“, short for extra-simple RSS reader:

http://www.famzah.net/xs-rss-reader/

Here is a sample demo screenshot:
xs RSS reader -- demo screenshot

Nagios: Improve CPU performance with popen_noshell()

Today I’ll share my real-world experience with popen_noshell() on the Nagios monitoring server which we run at work. We are actively monitoring 1166 hosts and 14250 services. The machine has 6 GB RAM and a single Intel Core i7-950 CPU with enabled multi-threading (8 total threads) and slight overclock. Besides running Nagios, this machine also handles the incoming data from our custom monitoring systems, processes RRD database storage, and generates web interface status + charts output. So it’s a pretty busy machine which does a lot of network activity and where the Nagios daemon is just a part of the CPU load. For example, since boot the main “nagios3″ process has used only 20% of the CPU. The other part has been used by the fork()’ed Perl scripts (we use a lot of them for the active checks), the Nagios standard network checks, and the Apache/PHP web server handling the incoming data.

Recently the machine started to exhaust its CPU resources. First we overclocked it a bit which gave us 10% more CPU idle time. Then we decided to try to compile Nagios with the popen-noshell library. This gave us another 10% CPU idle and now the machine is working great again.

I’ll focus on the popen-noshell integration and results, since CPU overclocking is a well-known topic. Here is the chart which shows the CPU usage before and after we re-compiled Nagios with the popen-noshell library:

nagios-popen-noshell-benchmark-results

As we can see, the system-CPU usage dropped from 38% to 31%, which is an 18% improvement. The user-CPU usage dropped from 44% to 41%, which is a 7% improvement. Overall, we gained a 12% speed-up for our workload by just re-compiling Nagios with the popen-noshell library. I’m stressing out that the speed-up depends a lot on your workload. If this machine was busy only with Nagios and the active checks were more CPU efficient (i.e. not written in Perl but in C), then the speed-up could have been much higher, since popen_noshell() is about 10 times faster than the standard popen().

A list with the other machine metrics which were also affected by the workload change:

  • Used memory: 39% => 24% (38% less)
  • Load average: 39 => 46 (18% higher)
  • Forks rates: 8*61 => 8*61 (created processes/second – no change)

Here are the steps that you need to perform, in order to re-compile the Nagios Debian package by integrating it with the popen-noshell library:

apt-get install devscripts

apt-get build-dep nagios3-core
# No need to run as "root" from here on
apt-get source nagios3-core

svn checkout http://popen-noshell.googlecode.com/svn/trunk/ popen-noshell

cd nagios3-3.2.1/

# BEGIN: patch Nagios to use popen_noshell_compat()

cp ../popen-noshell/popen_noshell.* base/
vi base/Makefile.in
	OBJS=$(BROKER_O) popen_noshell.o 

vi base/utils.c
	#include "popen_noshell.h"
	
        /* run the command */
        struct popen_noshell_pass_to_pclose pclose_arg;
        fp=(FILE *)popen_noshell_compat(cmd,"r",&pclose_arg);

            /* close the command and get termination status */
            status=pclose_noshell(&pclose_arg);

vi base/checks.c
	2x the same as above

# END: patch Nagios to use popen_noshell_compat()

EDITOR=vim dch -i
	# 3.2.1-2+squeeze1 -> 3.2.1-2+squeeze1-noshell1
	# you must have a trailing number in the added version name
	# after exit, this renames the original directory name

cd ..
mv nagios3_3.2.1.orig.tar.gz nagios3_3.2.1-2+squeeze1.orig.tar.gz

# the source directory was renamed by "dch"
cd nagios3-3.2.1-2+squeeze1/
DEB_BUILD_OPTIONS=nocheck debuild -us -uc

cd ..
sudo dpkg -i nagios3-core_3.2.1-2+squeeze1-noshell1_i386.deb \
	nagios3-common_3.2.1-2+squeeze1-noshell1_all.deb \
	nagios3-cgi_3.2.1-2+squeeze1-noshell1_i386.deb \
	nagios3-doc_3.2.1-2+squeeze1-noshell1_all.deb \
	nagios3_3.2.1-2+squeeze1-noshell1_i386.deb

Bash: Split a string into columns by white-space without invoking a subshell

The classical approach is:

RESULT="$(echo "$LINE"| awk '{print $1}')" # executes in a subshell 

Processing thousands of lines this way however fork()’s thousands of processes, which affects performance and makes your script CPU hungry.

Here is the effective solution which I found with my colleagues at work:

COLS=( $LINE ); # parses columns without executing a subshell
RESULT="${COLS[0]}"; # returns first column (0-based indexes)

Here is an example:

LINE="col0 col1  col2     col3  col4      " # white-space including tab chars
COLS=( $LINE ); # parses columns without executing a subshell

echo "${COLS[0]}"; # prints "col0"
echo "${COLS[1]}"; # prints "col1"
echo "${COLS[2]}"; # prints "col2"
echo "${COLS[3]}"; # prints "col3"
echo "${COLS[4]}"; # prints "col4"

If you want to split not by white-space but by any other character, you can temporarily change the IFS variable which determines how Bash recognizes fields and word boundaries.

Perl Net::Ping not working properly with ICMP by default

If you tried to ping a host with Perl Net::Ping using the ICMP protocol and that failed, even though the “ping” command-line utility can ping the host, you’re not alone :) I had the same problem and it turned out to be due to the fact that Net::Ping by default sends no DATA in the ICMP request and thus its requests are rather short and non-standard. Here are some tcpdump examples:

  • Net::Ping with ICMP protocol, everything else is defaults: “$p = new Net::Ping(‘icmp’)“, no replies from remote host, note that the length is just 8 bytes:
    12:29:02.898083 IP source-addr > source-addr: ICMP echo request, id 2194, seq 41, length 8
    12:29:03.711595 IP source-addr > dest-addr: ICMP echo request, id 2194, seq 42, length 8
    
  • Linux “ping” command-line utility, remote host replies accordingly, the length is 64 bytes total:
    12:30:18.278865 IP source-addr > dest-addr: ICMP echo request, id 2488, seq 1, length 64
    12:30:18.289922 IP dest-addr > source-addr: ICMP echo reply, id 2488, seq 1, length 64
    12:30:18.790610 IP source-addr > dest-addr: ICMP echo request, id 2488, seq 2, length 64
    12:30:18.811029 IP dest-addr > source-addr: ICMP echo reply, id 2488, seq 2, length 64
    
  • Net::Ping with ICMP protocol with user-defined length, “$p = new Net::Ping(‘icmp’, 1, 56)“, remote host replies accordingly, the length is 64 bytes total:
    12:30:48.377496 IP source-addr > dest-addr: ICMP echo request, id 2488, seq 6, length 64
    12:30:48.433690 IP dest-addr > source-addr: ICMP echo reply, id 2488, seq 6, length 64
    12:30:48.934310 IP source-addr > dest-addr: ICMP echo request, id 2488, seq 7, length 64
    12:30:48.946152 IP dest-addr > source-addr: ICMP echo reply, id 2488, seq 7, length 64
    

Bottom line is that if you are going to use Net::Ping with ICMP, specify 56 for the “bytes” parameter when creating an instance of the Net::Ping object. This way you will be sending standard ICMP requests with total length of 64 bytes.

Auto screenshot on Windows

I recently migrated my desktop back to Windows, and while I’m at work I need to have regular screenshots of my monitor, for investigation and other purposes. I easily found a solution to record desktop activity by making regular screenshots on Ubuntu, and I thought that Windows solutions will be even more. It turned out to be the opposite — they were all either paid or not working/lacking features.

Here is how “Auto Screen Capture FV” was born. Two screenshots of the interface follow:



It has the following features:

  • Runs on Windows
  • Free as speech; open-source, developed with Microsoft Visual C# 2010 Express
  • Captures a screenshot automatically without disrupting user activity
  • Saves the snapshot images as compressed JPEG files, in order to save disk space
  • The destination directory where the images are saved is selected by the user
  • Rotates too old image files by deleting them, in order to save disk space
  • All settings are permanently saved in the registry, so next program starts remember what you configured
  • Auto screen capture can be easily temporarily suspended
  • Program can run in background; it minimizes to system tray

Old image files are actually moved to “Recycle bin”, in order to be on the safe side — if we have a bug, no files are lost. Auto Screen Capture FV has been tested on Windows 7.

Download links:


Resources:

DIY temperature and humidity wireless data logger

Monitoring and controlling relative humidity is important for humans health. Too low or too high humidity feels uncomfortable, but most importantly high moisture is a factor for growing mold in your home, which could be health threatening (according to EPA and CDC). I will not go into details on how to control humidity. Instead I’ll describe what motivated me to design and create my own temperature and humidity sensor which reports its readings every minute to a central Linux server.

The main requirements for my design were the following:

  • Affordable price, as I wanted to install four sensors.
  • Great accuracy both for temperature and humidity readings.
  • Over-the-air communication, as I wanted to be able to install a sensor even in my bathroom, where I can’t run data or power wires. Support for wired communication too, so that we can reduce the overall price by not installing the wireless module.
  • Data logging to a computer, because both temperature and humidity change with time, for example when you sleep in the room, and you can’t look at a mechanical temperature or humidity meter every minute, in order to write down the results.
  • Battery operated, in order to avoid any wiring.
  • Open-source hardware and software toolchain, so I chose Atmel AVR microcontrollers. I got sick of Microchip and their commercial C compilers.
  • To have fun with electronics but at the same time create the device as fast as possible, as free time turned out to be a pretty limited resource recently.

I managed to accomplish most of the requirements I set with two exceptions: the device operates only a month on batteries, and cumulatively I spent almost a week to design, solder, develop the firmware, and test the device. Now all the sensors operate from a wall-plug power adapter, and my hunger for environmental control in my house is satisfied.

I’ll now try to describe the whole process and the reasons behind my engineer decisions. Note that I’m an amateur hobbyist.

Idea and requirements
I wrote down all my thoughts in a text editor. Then re-designed all the sticky notes into requirements, and did so a few more times, in order to finally decide what I want to design and not get distracted by new random ideas in my head.

Power supply
I wanted the device to be able to operate both via USB, and thus be powered by 5V, as well as to be powered by an accumulator or a battery with an input voltage up to 12V, so that it could be used in a car too. I put a polarity protection diode D1 in series with the power line, so that an accidental polarity mismatch doesn’t burn out the power regulator. Such a protection diode must have very low voltage drop and thus low power loss, and the Schottky diode 1N5819 seemed like a good match.

Operating from a battery also means that the voltage regulator must be extremely efficient and with a low bias current consumption, which means that it should draw almost nothing while there is nothing connected to it at its output as a load. Most battery operated devices “sleep” during most of their life cycle, so their consumption is close to zero. I used the ultra low-dropout fixed voltage regulator LP2986-33, marked as U1 in the schematics. The whole circuit operates at 3.3V because of the XBee wireless modules, and also because operating at a lower voltage usually gives lower power consumption.

Since we can have two different power sources, there must be a way to choose which one is active. You can switch between the power sources using the PWR_SELECT jumpers.

Wired communication via USB
I wanted to have the option to use the sensors by directly connecting them to a computer. This way we could save the money for an XBee wireless module. I used the classical USB-to-Serial solution FT232R, which is also quite inexpensive and requires almost no external components. You can see it in the schematics as U2. Note that the I/O lines of FT232R must be configured to operate at 3.3V too. This is done by connecting pin 17, which is the internal 3.3V regulator of FT232R, to pin 4. The internal 3.3V regulator is not used for anything else, and in theory I could have powered the I/O lines, pin 4, directly from the main voltage regulator U1.

Wireless communication
The XBee modules is something I wanted to play with for a long time. They seem very easy to work with and are packed with all kind of features. Though in my case I’m not using almost any of them, not even the AES encryption which could secure the data channel. I’m using the Series 1 XBee low-power embedded RF modules (XB24), which have a power of 1 mW and 30 m indoor range. There are many comments in Internet that the indoor range of the XBee modules is poor and I can confirm that. The range really depends on what the signal must travel through. Sometimes you lose the link even through one wall, sometimes it can go through a few walls. The XBee & XBee-PRO OEM RF Module Antenna Considerations is a great article by the XBee manufacturers. After all, probably by using such a low-power module, we shouldn’t expect so great results. It works well in my apartment though — all rooms report to the central XBee module successfully. On the server’s side, the receiver, I first had an XBee with chip antenna, which I replaced with an XBee-PRO with whip antenna. This made no difference.

Wiring the XBee module is very easy. It requires no external components. If you read the PDF datasheet, you’ll see how many great features an XBee has. I’m using only three of them:

  • Sleep mode — the microcontroller puts the XBee to sleep by controlling the SLEEP_RQ pin 9.
  • Networking addressing — each XBee is configured with a unique address, so that the receiver on the server side knows which reading belongs to which sensor probe.
  • API operation — the receiver XBee module operates in an API mode, which is a frame-based protocol that provides greater flexibility and more control. For example, besides the received data payload, an API frame gives information about the sender’s address and the signal quality.

Temperature and humidity sensor
I wanted to interface the sensor directly using a digital protocol, so that we can minimize the ADC stuff and errors. The SHT11 turned out to be the sensor I was looking for:

  • Relative humidity accuracy: +/- 3% in the range 20% to 80% RH. The sensor comes fully calibrated.
  • Temperature accuracy: +/- 1.5 degrees Celsius in the range -15 to +65 degrees Celsius.
  • Digital two-wire interface.
  • Very low energy consumption: 80uW (at 12bit, 3V, 1 measurement/s).

The SHT11 is a bit pricey but works very easily and accurately out of the box, so I decided to go with it. There is a very good alternative at Sparkfun — the RHT03 humidity and temperature sensor (also known as “RHT-22″). There were some contradictive comments by Sparkfun users — some say it works very well, some doubt its accuracy. I haven’t tried it but have left space JP7 on the current board, so that at some later time I could solder one RHT03 and use it with the existing schematics.

One note about the SHT11 two-wire interface. Definitely use a pull-up resistor on the DATA wire, as advised in the PDF! I tried to do some magic by the microcontoller and failed. With the exception of the pull-up resistor, everything else worked with no other problems with the SHT11 sensor. The manufacturer Sensirion provides Sample code for the SHTxx sensors which turned out to be very useful. I was able to re-code it for the AVR GNU C compiler in a couple of minutes.

The CRC calculation got me a bit confused. There are multiple different ways to calculate a CRC checksum, and they all provide different results. Each CRC calculation depends on the selected CRC polynomial, which is something like a bit-mask that defines the algorithm for the CRC calculation. After lots of struggle, I finally found an excellent Online CRC Calculation web wizard, which also includes a hardware implementation example, and sample C and VERILOG implementations, which you can copy-paste in your program. Thank you Kay Gorontzi!

Microcontroller
Initially I worked with ATmega8. Then I switched to ATmega168 because of the much lower power consumption. I could have used any other Atmel AVR microcontroller which has USART, internal oscillator, and sleep mode. Though ATmega8 or ATmega168 are always available in my local electronics shop, so I chose one of them. Besides the lower power consumption, ATmega168 has one other major advantage for my application — the watchdog timer can wake the chip from sleep mode and directly execute an interrupt, thus not re-starting the program from the very beginning.

Firmware
I’m working on Windows 7 64-bit and used a USBasp programmer to download the code into the microcontroller. The whole development toolchain is packaged into the WinAVR suite. It includes the AVR GCC compiler and the avrdude programmer. I also downloaded a sample Makefile which makes compilation and firmware download easy.

The main loop of the program does two tasks — measures and displays the readings over the serial port (which goes to the USB or over-the-air via XBee), and sleeps for about 60 seconds. As already mentioned, I use the new feature of ATmega168 which allows for the Watchdog timer to generate an interrupt, which wakes the chip from sleep mode. This is very handy as it allows you to continue the program at the point where you put it to sleep. The sleep mode was something new for me; there are some URLs in the source code which show what online articles helped me to master it. Note that the XBee RF transmitter is also put into sleep mode, in order to save battery.

Data collector
All the sensor readings are collected to a Linux server over-the-air. I use an XBee Explorer USB by Sparkfun to connect the XBee receiver with the Linux server. The XBee is seen as a serial device on the Linux box. The frame protocol of the XBee API is easy to understand and I implemented a Perl script to parse it. Here is a sample reading which is received from one of my wireless sensors (0×0001 is the address of the probe standing outside of my apartment):

[Mon Dec 26 17:32:11 2011] RX_packet: source=0×0001, rssi=-55dBm (opt=0×00): 4.24;69.04

As you can see, now it’s winter here — 4.24 degrees Celsius temperature; 69.04% relative humidity (RH).

Board design
Both the schematics and PCB board were designed using Eagle PCB by CadSoft. This is a great piece of software. Most PCB factories accept Eagle board files directly. You’ll find my Eagle files in the Resources section at the end of this article.

Lessons learned
There are a few things which I discovered only once I already built and tested the schematics:

  • Battery-operated devices are hard to design — in theory my sensors were supposed to last for about 3 months with a 9V battery. In practice only one of them lasted for a month, the others – for a week.
  • Electronics components, boards and/or assembly could differ a lot — see above. Also one of the SHT11 sensors is sometimes giving CRC errors.
  • XBee indoor range is not excellent.
  • Research and development takes a lot of time, usually 2x or 3x the time you planned. Furthermore, building something with love takes even more time, but in the end it pays off with great results and satisfaction.
  • You can create an electronics device with a lower price than what is currently offered on the marked. But this has its price too — your time, and you get no guarantee whatsoever.

I had different plans for this blog article but it got so lengthy that I wrote it in four different days (and it’s Christmas now). The main idea was to sketch the device and all its components, and to show that they can work together as a finished product. If there is any interest by other people, I’m happy to answer to any questions.

Happy holidays and best of luck in 2012!!! :)

Resources:

URL escape in Bash

I recently needed to escape some user-supplied input for an URL address variable, in a Bash script. This is what the PHP urlencode(), and Perl URI::Escape::uri_escape() functions do, for example. My initial approach was to call Perl from the Bash script:

#!/bin/bash
function urlencode() {
	echo -n "$1" | perl -MURI::Escape -ne 'print uri_escape($_)'
}

Though I wanted to optimize the Bash script by not having to fork() a Perl interpreter every time, which could be CPU intensive if you execute the Bash script often. So I ended up with the following solution, entirely coded in Bash, using Bash string manipulation and Bash hash arrays:

#!/bin/bash
set -u

declare -A ord_hash # associative hash; requires Bash version 4

function init_urlencode() {
	# this is the whole ASCII set, without the chr(0) and chr(255) characters
	ASCII='...!"#$%&'\''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ЂЃ‚ѓ„…†‡€‰Љ‹ЊЌЋЏђ‘’“”•–—˜™љ›њќћџ ЎўЈ¤Ґ¦§Ё©Є«¬­®Ї°±Ііґµ¶·ё№є»јЅѕїАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэю...'
	# download the script, don't copy paste it from the blog page!

	# chr(0) cannot be stored in a Bash variable

	local idx
	for idx in {0..253}; do # 0..253 = 254 elements = length($ASCII)
		local c="${ASCII:$idx:1}" # VERY SLOW
		local store_idx=$(($idx+1))
		ord_hash["$c"]="$store_idx"
		# chr(255) cannot be used as a key
	done
}

function urlencode() {
	local inp="$1"
	local len="${#inp}"
	local n=0
	local val
	while [ "$n" -lt "$len" ]; do
		local c="${inp:$n:1}" # VERY SLOW
		if [ "$c" == "я" ]; then # chr(255) cannot be used as a key
			val=255
		else
			val="${ord_hash[$c]}"
		fi
		printf '%%%02X' "$val"
		n=$((n+1))
	done
}

init_urlencode # call only once
urlencode 'some^fancy#text'

The logic works pretty well, but the performance is terrible. It turned out that the Bash string manipulation methods are rather slow. So I finally ended up by using Perl, the same way I did it initially. For very small strings in the order of a few characters, you should be fine. But for anything else, this implementation is not recommended.

If you still want to use the Bash code, please download it directly from here, because the blog page messed up some of the special ASCII characters.