
Properly stop a Firebase app in Node.js

Normally, the Node.js process will exit when there is no work scheduled (docs). You shouldn’t call process.exit() unless it is necessary to terminate the Node.js process immediately due to an error condition. Calling process.exit() doesn’t let pending events complete, which may lead to unpredictable results, as demonstrated in the docs.

Now that we know how to naturally terminate a Node.js application, how do we achieve it if we are using the Firebase JavaScript SDK?

First you need to cancel any asynchronous listeners. For example, if you subscribed to data changes, you need to unsubscribe:

// on() returns the callback function that you passed, so keep a reference to it
let func = firebase.database().ref(userDB).on("value", myHandler);
...
firebase.database().ref(userDB).off("value", func);

Some people suggest that you also call firebase.database().goOffline() in the final stage.

Additionally, as described in these bug reports (#1 and #2), if you used firebase.auth() you need to call firebase.auth().signOut().

And finally, you also need to destroy the Firebase application by calling app.delete().
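
Putting it all together, here is a minimal shutdown sketch, assuming “app” is the object returned by firebase.initializeApp() and reusing the “userDB” and “myHandler” names from the example above:

async function shutdownFirebase(app) {
  // 1. cancel asynchronous listeners
  firebase.database().ref(userDB).off("value", myHandler);
  // 2. sign out, if firebase.auth() was used
  await firebase.auth().signOut();
  // 3. optionally disconnect from the Realtime Database
  firebase.database().goOffline();
  // 4. destroy the Firebase application
  await app.delete();
  // with no pending work left, the Node.js process exits naturally
}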

This has worked for me using Node.js version 10 and Firebase JS SDK version 8.



Debug the usage of anonymously shared memory regions

PHP-FPM keeps a shared Opcache memory between the parent process and all its child processes in a pool. The idea is to compile source code once and then reuse it in all child processes directly as byte code. This is efficient, but as a system administrator I recently stumbled across a problem: how do you find out the real memory usage of the Opcache in the operating system?

I thought a simple “ps” listing would reveal the memory usage, accounted to the parent process, since the parent process created the anonymously shared mmap() region. Linux doesn’t work this way. In order to debug this easily, I created a simple program (sketched below) which does the following:

  • The parent process creates a shared anonymous memory region using mmap() with a size of 2000 MB. The parent process does not use the memory region in any way. It doesn’t change any data in it.
  • Two child processes are fork()’ed and then:
    • The first child process writes 500 MB of data at the beginning of the shared memory region created by the parent process.
    • The second child process writes 1000 MB of data at the beginning of the same shared memory region.
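
Here is a minimal version of the test program (a sketch with no error handling beyond mmap(), not the exact code I used):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define REGION_SIZE (2000UL * 1024 * 1024) /* 2000 MB */

int main(void) {
	/* the parent creates the shared anonymous region but never touches it */
	char *mem = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
	                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	if (fork() == 0) { /* child 1: writes 500 MB at the beginning */
		memset(mem, 'A', 500UL * 1024 * 1024);
		pause(); /* stay alive, so that we can inspect it with ps/pmap */
	} else if (fork() == 0) { /* child 2: writes 1000 MB at the beginning */
		memset(mem, 'B', 1000UL * 1024 * 1024);
		pause();
	} else { /* parent: just wait for the children */
		wait(NULL);
		wait(NULL);
	}
	return 0;
}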

Here is what the process list looks like:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
famzah 9256 0.0 0.0 2052508 816 pts/15 S+ 18:00 0:00 ./a.out # parent
famzah 9257 10.7 10.1 2052508 512008 pts/15 S+ 18:00 0:01 \_ ./a.out # child 1
famzah 9258 29.0 20.2 2052508 1023952 pts/15 S+ 18:00 0:03 \_ ./a.out # child 2
famzah@vbox64:~$ free -m
total used free shared buff/cache available
Mem: 4940 549 1943 1012 2447 3097

A quick explanation of this process list snapshot:

  • VSZ (virtual size) of the parent and child processes is 2000 MB because the parent process has allocated 2000 MB of anonymous memory using mmap(). No additional allocations were made by the child processes as they were passed a reference to the anonymously shared memory in the parent process. Therefore the virtual memory footprint of all processes is the same.
  • RSS (resident set size, or simply “the real usage”) is:
    • Almost none for the parent process because it never used any of the memory. It only “requested” the memory block via mmap().
    • 500 MB for the first child process because it wrote 500 MB of data at the beginning of the shared memory region.
    • 1000 MB for the second child process because it wrote 1000 MB of data at the beginning of the shared memory region.
  • The “free -m” command shows that 1012 MB of anonymously shared memory is being used.

So far things seem kind of logical. We can roughly determine the real usage of the shared memory region by looking at the child processes. This, however, is not completely reliable either: if the children write to completely different parts of the anonymous memory, we would need to sum their usage, while if they write to the very same memory region, we need to look at the max() value.

The pmap command doesn’t provide any additional information and shows the same values that we see in the “ps” output:

famzah@vbox64:~$ pmap -XX 9256
9256: ./a.out
Address Perm Offset Device Inode Size KernelPageSize MMUPageSize Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked VmFlags Mapping
7f052ea9b000 rw-s 00000000 00:05 733825 2048000 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 rd wr sh mr mw me ms sd zero (deleted)
famzah@vbox64:~$ pmap -XX 9257
9257: ./a.out
Address Perm Offset Device Inode Size KernelPageSize MMUPageSize Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked VmFlags Mapping
7f052ea9b000 rw-s 00000000 00:05 733825 2048000 4 4 512000 256000 0 512000 0 0 512000 0 0 0 0 0 0 0 0 0 rd wr sh mr mw me ms sd zero (deleted)
famzah@vbox64:~$ pmap -XX 9258
9258: ./a.out
Address Perm Offset Device Inode Size KernelPageSize MMUPageSize Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked VmFlags Mapping
7f052ea9b000 rw-s 00000000 00:05 733825 2048000 4 4 1024000 768000 0 512000 512000 0 1024000 0 0 0 0 0 0 0 0 0 rd wr sh mr mw me ms sd zero (deleted)

Things get even more messy when the child processes terminate (and get replaced by new ones which never touched the shared anonymous memory). Here is what the process list looks like then:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
famzah 9256 0.0 0.0 2052508 816 pts/15 S+ 18:00 0:00 ./a.out # parent
famzah@vbox64:~$ free -m
total used free shared buff/cache available
Mem: 4940 549 1943 1012 2447 3097

The RSS (resident set size, or simply “the real usage”) of the parent process still shows no usage. But the anonymous memory region is actually used, because the child processes wrote data into it. And the region is not automatically freed, because the parent process is still alive. The “free -m” command clearly shows that about 1000 MB of data are still stored in anonymous shared memory.

How can we reliably find out the memory usage of the anonymous shared region and account it to a given process?

We will look at /proc/[pid]/maps:

A file containing the currently mapped memory regions and their access permissions. See mmap(2) for some further information about memory mappings.

If the pathname field is blank, this is an anonymous mapping as obtained via mmap(2). There is no easy way to coordinate this back to a process’s source, short of running it through gdb(1), strace(1), or similar.

Wikipedia gives the following additional information:

When “/dev/zero” is memory-mapped, e.g., with mmap(), to the virtual address space, it is equivalent to using anonymous memory; i.e. memory not connected to any file.

Now we know how to find the virtual address of the anonymously memory-mapped region. Here I demonstrate two different ways of obtaining the address:

famzah@vbox64:~$ cat /proc/9256/maps | grep /dev/zero
7f052ea9b000-7f05aba9b000 rw-s 00000000 00:05 733825 /dev/zero (deleted)
famzah@vbox64:~$ ls -la /proc/9256/map_files/ | grep /dev/zero # same region of 7f052ea9b000-7f05aba9b000
lrw------- 1 famzah famzah 64 Nov 11 18:21 7f052ea9b000-7f05aba9b000 -> '/dev/zero (deleted)'

The man page of tmpfs gives further insight:

An internal shared memory filesystem is used for […] shared anonymous mappings (mmap(2) with the MAP_SHARED and MAP_ANONYMOUS flags).

The amount of memory consumed by all tmpfs filesystems is shown in the Shmem field of /proc/meminfo and in the shared field displayed by free(1).
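
As a quick cross-check, that Shmem field can be read directly. Given the “free -m” output above (1012 MB shared), it should report about 1036288 kB:

famzah@vbox64:~$ grep Shmem: /proc/meminfo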

We verify that the memory-mapped region is a “tmpfs” file:

famzah@vbox64:~$ sudo stat -Lf /proc/9256/map_files/7f052ea9b000-7f05aba9b000
File: "/proc/9256/map_files/7f052ea9b000-7f05aba9b000"
ID: 0 Namelen: 255 Type: tmpfs

💚 We can then finally get the real memory usage of this shared anonymous memory block in terms of VSS (virtual memory size) and RSS (resident set size, or simply “the real usage”):

# stat doesn't give us the real usage, only the virtual
famzah@vbox64:~$ sudo stat -L /proc/9256/map_files/7f052ea9b000-7f05aba9b000
File: /proc/9256/map_files/7f052ea9b000-7f05aba9b000
Size: 2097152000 Blocks: 2048000 IO Block: 4096 regular file
Device: 5h/5d Inode: 733825 Links: 0
Access: (0777/-rwxrwxrwx) Uid: ( 1000/ famzah) Gid: ( 1000/ famzah)
# VSS (virtual memory size)
famzah@vbox64:~$ sudo du --apparent-size -BM -D /proc/9256/map_files/7f052ea9b000-7f05aba9b000
2000M /proc/9256/map_files/7f052ea9b000-7f05aba9b000
# RSS (resident set size, or simply "the real usage")
famzah@vbox64:~$ sudo du -BM -D /proc/9256/map_files/7f052ea9b000-7f05aba9b000
1000M /proc/9256/map_files/7f052ea9b000-7f05aba9b000

Since we have access to the memory region as a file, we can even read this memory-mapped region:

famzah@vbox64:~$ sudo cat /proc/9256/map_files/7f052ea9b000-7f05aba9b000 | wc -c
2097152000



Google Cloud API and Python

Confusion is what I got the first time I wanted to automate Google Cloud using Python. While the documentation of Google Cloud is not bad, it’s far from ideal. Let me try to clarify things if you’re struggling with this like I did.

At its very core, the Google Cloud API is an HTTP REST service (or gRPC for selected APIs). And you have two ways of accessing the Google Cloud API using Python:

  • the “Client Libraries” (the google-cloud-* packages), or
  • the Google API Client Library for Python (google-api-python-client), which accesses the HTTP REST API directly.

While the “Client Libraries” are Google’s recommended option, my own experience shows that you’d better use the second option, which accesses the HTTP REST API directly. That is because:

  • The Google Cloud HTTP REST API is documented better. I guess the reason for that is because it’s used by all the languages, not only Python, and because Google developers use it internally, too.
  • The official documentation of the Google Cloud services always gives an example for the REST APIs along with the other options like the web console, gcloud/gsutil, and the “Client Libraries” (where they are applicable).
  • The Google Cloud web console shows the resulting HTTP REST API call or command-line code for what you populated in the forms online. You can see this at the very bottom of the page. Therefore, you can use the web interface to enter what you intend to do and then easily see what you need to send to the HTTP REST API, in order to achieve the same as you’d do if you clicked it online.
  • Not all Google Cloud APIs are wrapped in “Client Libraries” for a specific language like Python. For example, Compute Engine is not, so you would most probably need to use the HTTP REST API directly anyway. If that’s the case, why learn two libraries when you can learn only one and use only the HTTP REST API?

The installation of the Python library for Google Cloud direct HTTP REST API access is easy. There is documentation but the process is as straightforward as “pip install google-api-python-client”. You can then access any Google Cloud HTTP API and here is an oversimplified example:

import googleapiclient.discovery

# build a client for the Compute Engine API, version "v1"
compute = googleapiclient.discovery.build('compute', 'v1')
result = compute.regions().list(project='my_project_id', maxResults=2).execute()

Before you can execute the example, you have to set up your authentication. I personally find Authenticating as a service account very convenient but you can also review the other Google Cloud API authentication options.

Please note that even the simple Python library for Google Cloud direct HTTP REST API access has its minor peculiarities and is not a 1:1 mapping of the API. For example, when you call an update() API, the HTTP REST API documentation names the selector field “resourceId”, while the Python library’s update() method names this field “healthCheck” in the case of the healthChecks().update() method. Therefore, you always need to consult both sets of documentation when you develop your scripts.

Here is what I keep in my bookmarks when working with the Google Cloud API using Python:

  • Google APIs Explorer — the core documentation of every service.
  • Python library for Google Cloud direct HTTP REST API access — I consult it for the named arguments of the methods if they differ from the official API (see the example in the previous paragraph). Additionally, if you use the list() methods, the results are usually paginated and you have to call list_next(), which is well documented (sketched below).
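
Here is a short sketch of that pagination pattern, reusing the Compute Engine example from above ('my_project_id' is a placeholder):

import googleapiclient.discovery

compute = googleapiclient.discovery.build('compute', 'v1')

# list() returns one page; list_next() builds the request for the next page
request = compute.regions().list(project='my_project_id', maxResults=2)
while request is not None:
    response = request.execute()
    for region in response.get('items', []):
        print(region['name'])
    request = compute.regions().list_next(request, response)  # None when no more pages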



Oculus Quest VR headset review

Here is a quick recap of my short experience with Oculus Quest:

  • It’s a wonderful device to use without a computer. The immersion is incredible, the controllers are easy to use, the interface is easy and intuitive, and it’s comfortable to wear even for my 8-year-old daughter who played Beat Saber a few times.
  • The resolution is much better than that of the previous generation of VR headsets. There is noticeable flickering in bright scenes, though. The overall brightness of the screen is high and I had to lower it from the settings of the PC games.
  • The Oculus Link works with my Dell G5 15 (5587) laptop, which has an NVIDIA GeForce GTX 1060M (Max-Q design) video card. This card is listed as not supported, but I saw no problems whatsoever. I had to install the latest video drivers and update all other drivers of my laptop.
  • For the Oculus Link connection I used a $14 USB-A to USB-C cable, 2.5 m long, manufactured by Vivanco. The cable test by the Oculus app resulted in 1.7 Gbps transfer via USB 3. The original USB-C to USB-C cable that came with the Oculus Quest resulted in a USB 2 connection with about 0.3 Gbps transfer rate, and I couldn’t get the Oculus Link to work with it.
  • The battery life when used with a computer via the Oculus Link is practically unlimited because the device is simultaneously charged while being used. The same applies if you just use the cable to charge the device.
  • I had no problems playing the Oculus Rift compatible game Dirt Rally, and I had no problems playing Euro Truck Simulator 2 via SteamVR. I’m pleasantly surprised how well SteamVR and my already installed games via Steam integrated with Oculus Quest via the Oculus Link.
  • You need a powerful CPU and video card to play games at the highest graphics details and the highest FPS. With my Dell G5 15 I had to downgrade to lower graphics settings in order to sustain 36 FPS. If I wanted the full 72 FPS, I had to run the graphics at the lowest detail.
  • I couldn’t get Firefox to play 3D 360° content via the Oculus Link.
  • Uploading a 3D movie to Oculus Quest via the USB cable is very fast. The Gallery app lets you easily play the movie.
  • Playing car racing simulators is absolutely more realistic in VR mode! Very enjoyable! It’s actually a bit too realistic, because I got motion sickness after a dozen seconds. I overcame this problem by shaking my head a bit while driving, like you’d do if you were driving on an uneven road. Therefore, I suspect that having a motion seat platform would eliminate my motion sickness entirely.
  • The field of view (FOV) while driving in a car sim game is enough. I guess it’s the same as when you wear a helmet. I’m used to driving go-karts with a helmet.
  • I didn’t test many Oculus VR apps but they seem promising – interactive 3D movies, many games, landmark tours, etc.
  • The guardian system is easy to set up and does its job.
  • My WiFi router wasn’t discovered until I disabled WiFi channels 12 and 13. Then I could pair and set up the Oculus Quest easily using my phone.

My final verdict: Oculus Quest is an awesome product and you will definitely enjoy VR! Using it without a PC in standalone mode is easy and there are enough games and apps to enjoy. Using it with a PC via the Oculus Link requires a powerful PC to play at the highest graphics details, but at lower graphics settings it works flawlessly, too.

In the end, I returned the Oculus Quest headset for the following reasons:

  • I bought it mainly to play games on my laptop but it’s not powerful enough to support the highest graphic details in VR like it does on a monitor. If I’m about to buy a gaming rig, then maybe I’ll opt for the Oculus Rift S (or similar) because it was designed to be connected to a PC, while the Oculus Link could introduce problems (compatibility, video compression artefacts, etc.).
  • I was hoping for a slightly better resolution. It’s not that the resolution is noticeably bad now. Compared to the first-gen VR headsets, it’s much better. But I think I’ll just wait for another generation of VR hardware. Or maybe I’ll try the Pimax 8K VR headset.
  • The noticeable flickering in bright scenes and the slight motion sickness in games could be caused by the relatively low refresh rate of 72 Hz.



Unexpected issues with the AWS opt-in regions

AWS cloud services operate from many different regions (18 as of today).

It wasn’t long before I stumbled across the first problem — not all of them are enabled by default. The documentation says “Regions introduced after March 20, 2019, such as Asia Pacific (Hong Kong) and Middle East (Bahrain), are disabled by default. You must enable these Regions before you can use them.”

Enabling the Hong Kong (ap-east-1) and Bahrain (me-south-1) regions was super easy by following the documentation. I could manage all resources from the AWS web console.

Today I tried some operations from the AWS Command line interface (CLI) and got the following errors:

An error occurred (IllegalLocationConstraintException) when calling the DeleteBucket operation: The ap-east-1 location constraint is incompatible for the region specific endpoint this request was sent to.

An error occurred (IllegalLocationConstraintException) when calling the DeleteBucket operation: The me-south-1 location constraint is incompatible for the region specific endpoint this request was sent to.

fatal error: An error occurred (IllegalLocationConstraintException) when calling the ListObjects operation: The ap-east-1 location constraint is incompatible for the region specific endpoint this request was sent to.

fatal error: An error occurred (IllegalLocationConstraintException) when calling the ListObjects operation: The me-south-1 location constraint is incompatible for the region specific endpoint this request was sent to.

fatal error: An error occurred (InvalidToken) when calling the ListObjectsV2 operation: The provided token is malformed or otherwise invalid.

It turns out that the CLI authenticates using the “global” endpoint of the AWS Security Token Service (AWS STS). And by default, the “global” STS endpoint will not work with the two new regions: Hong Kong (ap-east-1) and Bahrain (me-south-1). There is an official documentation on how to fix this compatibility issue by making the “global” STS endpoint “Valid in all AWS Regions”.

If you do a lot of AWS API calls, it’s probably worth considering the new default of AWS and trying the “regional” STS endpoints: “AWS recommends using Regional AWS STS endpoints instead of the global endpoint to reduce latency, build in redundancy, and increase session token validity.” This is already supported in the CLI, too; see the snippet below.
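
For example, a recent AWS CLI can be switched to the regional STS endpoints like this (the equivalent environment variable is AWS_STS_REGIONAL_ENDPOINTS=regional):

# persist the setting in the default profile (~/.aws/config)
aws configure set sts_regional_endpoints regional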



Make your Dell G5 15 fast on battery

I replaced my old laptop with the incredible Dell G5 5587 because I wanted a faster system. This new gaming laptop doesn’t disappoint and is very fast, even suitable for modern games. But… the moment you unplug the power adapter, the system becomes very slow. Sometimes the CPUs won’t go over 800 MHz; sometimes they will for a while, but the overall experience is very sluggish.

What did I try to fix this on my Windows 10? I changed the power plan to “High performance”. I customized the power plan and configured maximum performance for the Intel graphics card and the Processor power management. I updated the BIOS. I disabled Battery Boost™ in the NVIDIA GeForce Experience control panel. I even spoke with Dell technical support. Nothing helped!

Long live “juscheng” who finally suggested a working solution: “Go into your BIOS settings and turn off Intel Speed Shift and Intel SpeedStep”.

Battery life is the same after this change. It takes 2:15h to discharge the battery to 10% with office work (browser, mail client, Excel). Screen brightness was at 40%.



ESP32 custom NodeMCU firmware build and flash

Here is my success story on how to compile and flash your own firmware for NodeMCU running on the ESP32 microcontroller. Although there is official documentation on how to achieve this, there were some minor issues that I stumbled across.

Using the Docker image is the recommended and easiest way to proceed. I didn’t have a Docker server, so I started a cheap VPS from Digital Ocean. They have a one-click installer for Docker. Once the VPS machine is running, execute the following, which covers the step “Clone the NodeMCU firmware repository” from the docs:

### log in via SSH to the Docker host machine

cd /opt

git clone --recurse-submodules https://github.com/nodemcu/nodemcu-firmware.git

cd nodemcu-firmware

git checkout dev-esp32
git submodule update --recursive
# https://github.com/marcelstoer/docker-nodemcu-build/issues/58
git submodule update --init

Then follow the instructions for the step “Build for ESP32” from the docs:

docker run --rm -ti -v `pwd`:/opt/nodemcu-firmware marcelstoer/nodemcu-build configure-esp32
docker run --rm -ti -v `pwd`:/opt/nodemcu-firmware marcelstoer/nodemcu-build build

Note: If you use the Heltec ESP32 WiFi Kit 32 with OLED, when configuring the firmware options you have to change the Main XTAL frequency from the default 40 MHz to either “Autodetect” or 26 MHz, as explained here. I selected “Autodetect” (26 MHz) under “Component config” -> “ESP32-Specific” -> “Main XTAL frequency”.

For the impatient — most probably the only configuration you need to do is in the section “Component config” -> “NodeMCU modules”.

I encountered a compilation error which was caused by a defined-but-not-used function:

make[2]: Entering directory '/opt/nodemcu-firmware/build/modules'
CC build/modules/mqtt.o
CC build/modules/ucg.o
/opt/nodemcu-firmware/components/modules/ucg.c:598:12: error: 'ldisplay_hw_spi' defined but not used [-Werror=unused-function]
 static int ldisplay_hw_spi( lua_State *L, ucg_dev_fnptr device, ucg_dev_fnptr extension )
            ^
cc1: some warnings being treated as errors
/opt/nodemcu-firmware/sdk/esp32-esp-idf/make/component_wrapper.mk:285: recipe for target 'ucg.o' failed
make[2]: *** [ucg.o] Error 1
make[2]: Leaving directory '/opt/nodemcu-firmware/build/modules'
/opt/nodemcu-firmware/sdk/esp32-esp-idf/make/project.mk:505: recipe for target 'component-modules-build' failed
make[1]: *** [component-modules-build] Error 2

You can simply comment out the function in the firmware sources, or configure the compiler to ignore the warning. Start a shell in the Docker image and edit the sources:

docker run --rm -ti -v `pwd`:/opt/nodemcu-firmware marcelstoer/nodemcu-build bash
# you are in the Docker image now; fix the source code

vi /opt/nodemcu-firmware/components/modules/ucg.c # add the following line (including the leading '#' sign):
#pragma GCC diagnostic ignored "-Wunused-function"

exit

# you are in the Docker host; redo the build as usual
docker run --rm -ti -v `pwd`:/opt/nodemcu-firmware marcelstoer/nodemcu-build build

Because it took me a few attempts to configure everything properly and to finally build the image, I used the following alternative commands to make the process easier:

docker run --rm -ti -v `pwd`:/opt/nodemcu-firmware marcelstoer/nodemcu-build bash
# you are in the Docker image now
cd /opt/nodemcu-firmware
make menuconfig
make

Once the build completes it produces three files on the Docker host:

/opt/nodemcu-firmware/build/bootloader/bootloader.bin
/opt/nodemcu-firmware/build/NodeMCU.bin
/opt/nodemcu-firmware/build/partitions_singleapp.bin

Copy those three files to your local development machine where you are about to flash the NodeMCU controller. My first attempt was rather naive and the chip wouldn’t boot. It issued the following error on the USB console:

Early flash read error: "flash read err, 1000"

It turned out that I had accidentally overwritten the bootloader and the partition table. Fortunately, the ESP32 official docs explain this in detail here and here.

Here is what worked for me to flash my custom NodeMCU firmware into ESP32 from my Windows workstation:

esptool.py.exe --chip esp32 -p COM5 erase_flash

esptool.py.exe --chip esp32 -p COM5 write_flash 0x1000 .\bootloader.bin 0x8000 .\partitions_singleapp.bin 0x10000 .\NodeMCU.bin



posix_spawn() performance benchmarks and usage examples

The glibc library has had an efficient posix_spawn() implementation since glibc version 2.24 (2016-08-05). I have awaited this feature for a long time.

TL;DR: posix_spawn() in glibc 2.24+ is really fast. You should replace the old system() and popen() calls with posix_spawn().

Today I ran all benchmarks of the popen_noshell() library, which basically emulates posix_spawn(). Here are the results:

Test                                             Uses pipes   User CPU   System CPU   Total CPU   Slower with
vfork() + exec(), standard Libc                  No           7.4        1.6          9.0
the new noshell, default clone(), compat=1       Yes          7.7        2.1          9.7         8%
the new noshell, default clone(), compat=0       Yes          7.8        2.0          9.9         9%
posix_spawn() + exec() no pipes, standard Libc   No           9.4        2.0          11.5        27%
the new noshell, posix_spawn(), compat=0         Yes          9.6        2.7          12.3        36%
the new noshell, posix_spawn(), compat=1         Yes          9.6        2.7          12.3        37%
fork() + exec(), standard Libc                   No           40.5       43.8         84.3        836%
the new noshell, debug fork(), compat=1          No           41.6       45.2         86.8        863%
the new noshell, debug fork(), compat=0          No           41.6       45.3         86.9        865%
system(), standard Libc                          No           67.3       48.1         115.4       1180%
popen(), standard Libc                           Yes          70.4       47.1         117.5       1204%

The fastest way to run something externally is to call vfork() and immediately exec() after it. This is the best solution if you don’t need to capture the output of the command, nor do you need to supply any data to its standard input. As you can see, the standard system() call is about 12 times slower at performing the same operation. The good news is that posix_spawn() + exec() is almost as fast as vfork() + exec(). If we don’t care about the 27% slowdown, we can use the standard posix_spawn() interface.
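
For illustration, here is a minimal vfork() + exec() sketch; “./tiny2” is the small helper binary used by the benchmarks:

#include <err.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

void vfork_exec_test(void) {
	pid_t pid = vfork();
	if (pid == 0) {
		/* after vfork() the child may only call exec*() or _exit() */
		execl("./tiny2", "./tiny2", (char *)NULL);
		_exit(127); /* reached only if exec fails */
	} else if (pid == -1) {
		err(EXIT_FAILURE, "vfork()");
	}
	waitpid(pid, NULL, 0);
}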

It gets more complicated and slower if you want to capture the output or send data to stdin. In such a case you have to duplicate stdin/stdout descriptors, close one of the pipe ends, etc. The popen_noshell.c source code gives a full example of all this work.

We can see that the popen_noshell() library is still the fastest option to run an external process and be able to communicate with it. The popen_noshell() call is just 8% slower than the absolute ideal result of a simple vfork() + exec().

There is more good news — posix_spawn() is also very efficient! It’s a fact that it lags 36% behind the vfork() + exec() marker, but it’s still 12 times faster than the old-school popen() glibc alternative. Using the standard posix_spawn() makes your source code easier to read, better supported (bugs are fixed in the mainstream glibc library), and free of external library dependencies.

The replacement of system() using posix_spawn() is rather easy as we can see in the “popen-noshell/performance_tests/fork-performance.c” function posix_spawn_test():

/* the same as system() but using posix_spawn(), which is 12 times faster */
void posix_spawn_test() {
	pid_t pid;
	char * const argv[] = { "./tiny2" , NULL };

	/* "environ" is the global environment pointer: extern char **environ */
	if (posix_spawn(&pid, "./tiny2", NULL, NULL, argv, environ) != 0) {
		err(EXIT_FAILURE, "posix_spawn()");
	}

	parent_waitpid(pid);
}

If you want to communicate with the external process, there are a few more steps you need to perform, like creating pipes, etc. Have a look at the source code of “popen_noshell.c”. If you search for the string “POPEN_NOSHELL_MODE”, you will find two alternative blocks of code — one for the standard way to start a process and manage pipes in C, while the other block shows how to perform the same steps using the posix_spawn() family of functions.
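
To illustrate the idea, here is a sketch (not the popen_noshell implementation) of capturing a child’s stdout with the posix_spawn() family of functions; “echo hello” stands in for the real command:

/* capture the stdout of a child process started via posix_spawnp() */
#include <err.h>
#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

int main(void) {
	int pipefd[2]; /* [0] = read end, [1] = write end */
	if (pipe(pipefd) != 0)
		err(1, "pipe()");

	posix_spawn_file_actions_t fa;
	posix_spawn_file_actions_init(&fa);
	/* in the child: close the read end and make stdout the write end */
	posix_spawn_file_actions_addclose(&fa, pipefd[0]);
	posix_spawn_file_actions_adddup2(&fa, pipefd[1], STDOUT_FILENO);
	posix_spawn_file_actions_addclose(&fa, pipefd[1]);

	pid_t pid;
	char * const argv[] = { "echo", "hello", NULL };
	if (posix_spawnp(&pid, "echo", &fa, NULL, argv, environ) != 0)
		err(1, "posix_spawnp()");
	posix_spawn_file_actions_destroy(&fa);

	close(pipefd[1]); /* the parent keeps only the read end */

	char buf[4096];
	ssize_t n;
	while ((n = read(pipefd[0], buf, sizeof(buf))) > 0)
		fwrite(buf, 1, (size_t)n, stdout);

	close(pipefd[0]);
	waitpid(pid, NULL, 0);
	return 0;
}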

Please note that posix_spawn() is a completely different implementation than system() or popen(). If it’s not safe to use the faster way, posix_spawn() may fall back to the slow fork().



Amazon EFS benchmarks

The Amazon Elastic File System (EFS) is a very intriguing storage product. It provides simple, scalable, elastic file storage for use on an EC2 virtual machine. The file system can be mounted over NFS at one or more EC2 machines simultaneously, and it also supports file locking.

Here are some important facts which I found out while doing my tests:

  • I/O operations per second (IOPS) are not the same metric that we’re used to measuring when dealing with block devices like HDD or SSD disks. When working with EFS, we measure the NFS I/O operations per second. These correspond 1:1 to the read() or write() system calls that your applications make.
  • The size of the issued I/O requests is another very important metric for EFS. This is the real bytes transfer between your EC2 instance and the NFS server.
  • Therefore, we’re limited both by the number of NFS I/O requests per second and by the total transferred bytes per second.
  • The EFS performance and EFS limits documentation pages give a lot of insight. You have to monitor your EFS metrics using CloudWatch.
  • NFS I/O requests smaller than 4096 bytes are accounted as 4096 bytes. Regardless of whether you request 1 byte, 1000 bytes, or 4096 bytes, you will get 4096 bytes accounted. Once you request more than 4096 bytes, they are accounted correctly.
  • You need more than one reader/writer thread or program, in order to achieve the full IOPS potential. One writer thread in my tests did 130 op/s, while 20 writer threads did 1500 op/s, for example.
  • The documentation says: “In General Purpose mode, there is a limit of 7000 file system operations per second. This operations limit is calculated for all clients connected to a single file system”. Our tests confirm this — we could do 3500 reading or 3000 writing operations per second.
  • CloudWatch has different aggregation functions for the *IOBytes metrics: min/max/average; sum; count. They represent different aspects of your EFS metrics, namely: the min/max/average I/O operation size in bytes; the total transferred bytes in a minute (divide by 60 to get the “per second” value); and the total operations in a minute (divide by 60 to get the “per second” value).
  • The CloudWatch EFS metrics “DataReadIOBytes” and “DataWriteIOBytes” reflect exactly what we see on the Linux system for “kB/s” and “ops/s” by the nfsiostat program. The transferred bytes reflect exactly the used bandwidth on the Linux network interfaces.
  • The “Metered size” in the AWS Console, which is the same value as what the “df” command shows, is not updated in real time. It could take more than an hour to reflect the real disk usage.
  • There is plenty of initial burst credit balance which lets you do some heavy I/O on your freshly created EFS file system. Our benchmark tests ran for hours with block sizes between 1 byte and 10k bytes, and we still had some positive burst credit balance left at the end.

I’m using the default NFS settings of the NFS mount helper provided in the “Amazon Linux 2” OS:

[root@ip-172-31-11-75 ~]# mount -t efs fs-7513e02c:/ /efs

[root@ip-172-31-11-75 ~]# mount
fs-7513e02c.efs.eu-central-1.amazonaws.com:/ on /efs type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.31.11.75,local_lock=none,addr=172.31.15.76)

The tests were performed using two “m4.xlarge” EC2 instances in the “eu-central” AWS region. This EC2 instance type provides “High” network performance.

The NFS I/O operations per second limits were tested using a simple C program which basically does the following:

fd = open(testfile, O_RDWR|O_DIRECT|O_SYNC); /* O_DIRECT requires a properly aligned buffer */

while (1) {
  lseek(fd, 0, SEEK_SET); /* rewind to the beginning of the file; note the (offset, whence) argument order */

  read(fd, buf, sizeof(buf));
  // or
  write(fd, buf, sizeof(buf));
}

I created 40 different files, so that I can run 40 different single benchmark programs on an EC2 instance – one for each file (see the launcher sketch below). This increases concurrency and lets the total throughput scale better.
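
The launcher itself is trivial; “./nfs-bench” and the “testfile.N” naming are hypothetical stand-ins for the benchmark binary and its per-process test file:

for i in $(seq 1 40); do
  ./nfs-bench "testfile.$i" &   # one benchmark process per file
done
wait   # block until all 40 benchmark processes finish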

Sequential writing and reading

Sequential writing and reading performed as expected – up to the “PermittedThroughput” limit shown in the CloudWatch metrics. In my case, for such a small EFS file system, the limit was 105 MB/s.

Writing: NFS I/O operations per second

Here are the results:

  • Writing from one EC2 instance using 1 byte, 1k bytes, or 10k bytes: regardless of the request size, we get up to 2000 IOPS. Typically the IOPS are between 1400 and 1700.
  • Writing from two EC2 instances using 1 byte, 1k bytes, or 10k bytes: regardless of the request size, we get up to 3000 IOPS in total which are equally spread across the two EC2 instances.
  • The “PercentIOLimit” CloudWatch metric shows 84% when we do 2880 ops/s, for example. Therefore, the total IOPS limit for writing is about 3500 ops/s.
  • When doing only write() system calls with 1 byte data, only “DataWriteIOBytes” is accounted by EFS which is an advantage for us. A real block file system needs to read the block (usually 4k bytes), update 1 byte in it, and then write it back on disk. I feel like this needs additional testing with more random data, so test for yourself, too. Note that the minimum accounted request size in EFS is 4kB.

Reading: NFS I/O operations per second

Here are the results:

  • Reading from one EC2 instance using 1 byte or 10k bytes: regardless of the request size, we get up to 3500 IOPS. One EC2 instance is enough to saturate the EFS limit.
  • Reading from two EC2 instances using 1 byte or 10k bytes: regardless of the request size, we get up to 3500 IOPS in total which are equally spread across the two EC2 instances.
  • The “PercentIOLimit” CloudWatch metric shows 100% when we do 3500 ops/s. Therefore, the total IOPS limit for reading is 3500 ops/s.



HTTP Keep-Alive timeout values used by popular websites

Here is the command to test the Keep-Alive timeout of a website. The connection stays open until the server closes it, so the wall-clock time reported by “time” approximates the server’s Keep-Alive timeout:

VHOST="www.google.bg"; time openssl s_client -ign_eof -connect "$VHOST:443" <<< "$( echo -ne "GET / HTTP/1.1\r\nHost: $VHOST\r\nConnection: Keep-Alive\r\n\r\n" )"

And here are today’s results for some popular websites:

slashdot.org: 0s
LWN.net: 5s
snag.gy: 5s
yahoo.com: 10s
readthedocs.org: 10s
www.superhosting.bg: 15s
httpd.apache.org: 30s
nginx.org: 1m
en.wikipedia.org: 1m
famzah.wordpress.com: 1m15s
aws.amazon.com: 2m50s
www.facebook.com: 3m
www.google.bg: 4m
cloudplatform.googleblog.com: 4m
www.cloudflare.com: 6m40s
www.mozilla.org: 6m40s
www.tagesschau.de: 8m
access.redhat.com: 8m20s
stackoverflow.com: 10m
www.timeanddate.com: 10m
www.dreamhost.com: 10m
www.reddit.com: 10m
twitter.com: 15m