/contrib/famzah

Enthusiasm never stops



MQTT QoS level between publisher and subscribers

Quality of Service (QoS) in MQTT is well explained by the HiveMQ Team, with the exception of one subtle detail: messages are never “upgraded” to a higher QoS. So if the original publisher sent a message with QoS 0, a subscriber with a QoS 2 subscription will still receive that message as QoS 0. That’s what Dominik from the HiveMQ Team explained in a comment, and it was also reiterated by his colleague Dasha in another comment.

This applies to regular messages, as well as to the Last Will and Testament (LWT) message.

Another discussion about this explains the same thing but points to an interesting feature of the Mosquitto MQTT broker:

upgrade_outgoing_qos [ true | false ]

The MQTT specification requires that the QoS of a message delivered to a subscriber is never upgraded to match the QoS of the subscription.

Enabling this option changes this behavior. If "upgrade_outgoing_qos" is set "true", messages sent to a subscriber will always match the QoS of its subscription. This is a non-standard option not provided for by the spec.

Defaults to "false".
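To see the “no upgrade” behavior in practice, here is a minimal sketch using the Python paho-mqtt client (1.x API; the broker host and topic below are made-up placeholders). The subscription requests QoS 2, yet a message published with QoS 0 arrives with QoS 0:

import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # msg.qos is the QoS of the delivered message, not of the subscription
    print("received '%s' on '%s' with QoS %d" % (msg.payload.decode(), msg.topic, msg.qos))

client = mqtt.Client()  # paho-mqtt 1.x style constructor
client.on_message = on_message
client.connect("broker.example.com", 1883)

client.subscribe("test/qos-demo", qos=2)         # the subscriber asks for QoS 2
client.publish("test/qos-demo", "hello", qos=0)  # the publisher sends with QoS 0

client.loop_forever()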



Properly stop a Firebase app in Node.js

Normally, the Node.js process will exit when there is no work scheduled (docs). You shouldn’t call process.exit() unless it is necessary to terminate the Node.js process immediately due to an error condition. Calling process.exit() doesn’t let pending events complete, which may lead to unpredictable results, as demonstrated in the docs.

Now that we know how to naturally terminate a Node.js application, how do we achieve it if we are using the Firebase JavaScript SDK?

First, you need to cancel any asynchronous listeners. For example, if you subscribed to data changes, you need to unsubscribe:

// subscribe to data changes; on() returns the callback so that it can be detached later
let func = firebase.database().ref(userDB).on("value", myHandler);
...
// when shutting down, detach the very same callback
firebase.database().ref(userDB).off("value", func);

Some people suggest that you also call firebase.database().goOffline() in the final stage.

Additionally, as described in these bug reports (#1 and #2), if you used firebase.auth() you need to call firebase.auth().signOut().

And finally, you also need to destroy the Firebase application by calling app.delete().

This has worked for me using Node.js version 10 and Firebase JS SDK version 8.



Debug the usage of anonymously shared memory regions

PHP-FPM keeps a shared Opcache memory region between the parent process and all its child processes in a pool. The idea is to compile source code once and then reuse it in all child processes directly as byte code. This is efficient, but as a system administrator I recently stumbled across a problem: how do you find out the real memory usage of the Opcache in the operating system?

I thought a simple “ps” listing would reveal the memory usage, expecting it to be accounted to the parent process since the parent created the anonymously shared mmap() region. Linux doesn’t work this way. In order to debug this easily, I created a simple program which does the following (a rough Python sketch of the same setup is shown right after the list):

  • The parent process creates a shared anonymous memory region using mmap() with a size of 2000 MB. The parent process does not use the memory region in any way and doesn’t change any data in it.
  • Two child processes are fork()’ed and then:
    • The first child process writes 500 MB of data at the beginning of the shared memory region passed by the parent process.
    • The second child process writes 1000 MB of data at the beginning of the shared memory region passed by the parent process.
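Here is the promised rough Python sketch of the same setup (my actual test program was a small C binary, hence the “./a.out” in the listings below). Python’s mmap with fileno=-1 creates the same kind of shared anonymous mapping as mmap() with MAP_SHARED|MAP_ANONYMOUS in C:

# Rough Python equivalent of the C test program ("./a.out" in the output below).
# mmap.mmap(-1, ...) on Linux creates an anonymous mapping which is shared with
# the fork()'ed children.
import mmap
import os
import time

MB = 1024 * 1024
shm = mmap.mmap(-1, 2000 * MB)  # the parent only reserves the region; it never writes to it

def child(write_mb):
    chunk = b'x' * MB
    for i in range(write_mb):
        shm[i * MB:(i + 1) * MB] = chunk  # dirty pages at the beginning of the region
    time.sleep(600)  # stay alive so that we can inspect the process with ps/pmap

for mb in (500, 1000):
    if os.fork() == 0:  # child process
        child(mb)
        os._exit(0)

time.sleep(600)  # the parent also stays alive for inspection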

Here is what the process list looks like:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
famzah 9256 0.0 0.0 2052508 816 pts/15 S+ 18:00 0:00 ./a.out # parent
famzah 9257 10.7 10.1 2052508 512008 pts/15 S+ 18:00 0:01 \_ ./a.out # child 1
famzah 9258 29.0 20.2 2052508 1023952 pts/15 S+ 18:00 0:03 \_ ./a.out # child 2
famzah@vbox64:~$ free -m
total used free shared buff/cache available
Mem: 4940 549 1943 1012 2447 3097

A quick explanation of this process list snapshot:

  • VSZ (virtual size) of the parent and child processes is 2000 MB because the parent process has allocated 2000 MB of anonymous memory using mmap(). No additional allocations were made by the child processes as they were passed a reference to the anonymously shared memory in the parent process. Therefore the virtual memory footprint of all processes is the same.
  • RSS (resident set size, or simply “the real usage”) is:
    • Almost none for the parent process because it never used any memory. It only “requested” the memory block by mmap().
    • 500 MB for the first child process because it wrote 500 MB of data at the beginning of the shared memory region.
    • 1000 MB for the second child process because it wrote 1000 MB of data at the beginning of the shared memory region.
  • The “free -m” command shows that 1012 MB of anonymously shared memory is being used.

So far things seem kind of logical. We can roughly determine the real usage of the shared memory region by looking at the child processes. This, however, is not really reliable either: if the children write to completely different parts of the anonymous memory, we would need to sum their usage, while if they write to the very same part, we would need to take the maximum value.

The pmap command doesn’t provide any additional information and shows the same values that we see in the “ps” output:

famzah@vbox64:~$ pmap -XX 9256
9256: ./a.out
Address Perm Offset Device Inode Size KernelPageSize MMUPageSize Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked VmFlagsMapping
7f052ea9b000 rw-s 00000000 00:05 733825 2048000 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 rd wr sh mr mw me ms sd zero (deleted)
famzah@vbox64:~$ pmap -XX 9257
9257: ./a.out
Address Perm Offset Device Inode Size KernelPageSize MMUPageSize Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked VmFlagsMapping
7f052ea9b000 rw-s 00000000 00:05 733825 2048000 4 4 512000 256000 0 512000 0 0 512000 0 0 0 0 0 0 0 0 0 rd wr sh mr mw me ms sd zero (deleted)
famzah@vbox64:~$ pmap -XX 9258
9258: ./a.out
Address Perm Offset Device Inode Size KernelPageSize MMUPageSize Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked VmFlagsMapping
7f052ea9b000 rw-s 00000000 00:05 733825 2048000 4 4 1024000 768000 0 512000 512000 0 1024000 0 0 0 0 0 0 0 0 0 rd wr sh mr mw me ms sd zero (deleted)

Things get even messier when the child processes terminate (and get replaced by new ones which never touched the shared anonymous memory). Here is what the process list looks like then:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
famzah 9256 0.0 0.0 2052508 816 pts/15 S+ 18:00 0:00 ./a.out # parent
famzah@vbox64:~$ free -m
total used free shared buff/cache available
Mem: 4940 549 1943 1012 2447 3097

The RSS (resident set size, or simply “the real usage”) of the parent process continues to show no usage. But the anonymous memory region is actually used because the child processes wrote data in it. And the region is not automatically free()’d, because the parent process is still alive. The “free -m” command clearly shows that there are 1000 MB of data stored in anonymous shared memory.

How can we reliably find out the memory usage of the anonymous shared region and account it to a given process?

We will look at /proc/[pid]/maps:

A file containing the currently mapped memory regions and their access permissions. See mmap(2) for some further information about memory mappings.

If the pathname field is blank, this is an anonymous mapping as obtained via mmap(2). There is no easy way to coordinate this back to a process’s source, short of running it through gdb(1), strace(1), or similar.

Wikipedia gives the following additional information:

When “/dev/zero” is memory-mapped, e.g., with mmap(), to the virtual address space, it is equivalent to using anonymous memory; i.e. memory not connected to any file.

Now we know how to find out the virtual address of the anonymous memory-mapped region. Here I demonstrate two different ways of obtaining the address:

famzah@vbox64:~$ cat /proc/9256/maps | grep /dev/zero
7f052ea9b000-7f05aba9b000 rw-s 00000000 00:05 733825 /dev/zero (deleted)
famzah@vbox64:~$ ls -la /proc/9256/map_files/ | grep /dev/zero # same region of 7f052ea9b000-7f05aba9b000
lrw------- 1 famzah famzah 64 Nov 11 18:21 7f052ea9b000-7f05aba9b000 -> '/dev/zero (deleted)'

The man page of tmpfs gives further insight:

An internal shared memory filesystem is used for […] shared anonymous mappings (mmap(2) with the MAP_SHARED and MAP_ANONYMOUS flags).

The amount of memory consumed by all tmpfs filesystems is shown in the Shmem field of /proc/meminfo and in the shared field displayed by free(1).

We verify that the memory-mapped region is a “tmpfs” file:

famzah@vbox64:~$ sudo stat -Lf /proc/9256/map_files/7f052ea9b000-7f05aba9b000
File: "/proc/9256/map_files/7f052ea9b000-7f05aba9b000"
ID: 0 Namelen: 255 Type: tmpfs

💚 We can then finally get the real memory usage of this shared anonymous memory block in terms of VSS (virtual memory size) and RSS (resident set size, or simply “the real usage”):

# stat doesn't give us the real usage, only the virtual
famzah@vbox64:~$ sudo stat -L /proc/9256/map_files/7f052ea9b000-7f05aba9b000
File: /proc/9256/map_files/7f052ea9b000-7f05aba9b000
Size: 2097152000 Blocks: 2048000 IO Block: 4096 regular file
Device: 5h/5d Inode: 733825 Links: 0
Access: (0777/-rwxrwxrwx) Uid: ( 1000/ famzah) Gid: ( 1000/ famzah)
# VSS (virtual memory size)
famzah@vbox64:~$ sudo du --apparent-size -BM -D /proc/9256/map_files/7f052ea9b000-7f05aba9b000
2000M /proc/9256/map_files/7f052ea9b000-7f05aba9b000
# RSS (resident set size, or simply "the real usage")
famzah@vbox64:~$ sudo du -BM -D /proc/9256/map_files/7f052ea9b000-7f05aba9b000
1000M /proc/9256/map_files/7f052ea9b000-7f05aba9b000
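The same two numbers can also be obtained programmatically. Here is a small Python sketch (run as root, like the commands above; the PID and address range are the ones from this example) which relies on the fact that st_size of the mapped “file” is its virtual size, while st_blocks counts only the really allocated pages:

# Get the virtual and resident size of the shared anonymous region via its
# /proc/[pid]/map_files entry (PID and address range are from this example).
import os

path = "/proc/9256/map_files/7f052ea9b000-7f05aba9b000"
st = os.stat(path)  # follows the symlink to the mapped "file"

print("virtual size: %d MB" % (st.st_size / 1024 / 1024))          # 2000 MB
print("real usage:   %d MB" % (st.st_blocks * 512 / 1024 / 1024))  # 1000 MB, st_blocks is in 512-byte units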

Since we have access to the memory region as a file, we can even read its contents:

famzah@vbox64:~$ sudo cat /proc/9256/map_files/7f052ea9b000-7f05aba9b000 | wc -c
2097152000



Google Cloud API and Python

Confusion is what I got the first time I wanted to automate Google Cloud using Python. While the documentation of Google Cloud is not bad, it’s far from ideal. Let me try to clarify things if you’re struggling with this like I did.

At its very core, the Google Cloud API is an HTTP REST service (or gRPC for selected APIs). And you have two ways of accessing the Google Cloud API using Python:

  • the official “Client Libraries” (the google-cloud-* Python packages), which are hand-crafted wrappers for selected services;
  • the google-api-python-client library, which accesses the HTTP REST API directly.

While the “Client Libraries” are Google’s recommended option, my own experience shows that you’d better use the second option which accesses the HTTP REST API directly. That is because:

  • The Google Cloud HTTP REST API is documented better. I guess the reason for that is because it’s used by all the languages, not only Python, and because Google developers use it internally, too.
  • The official documentation of the Google Cloud services always gives an example for the REST APIs along with the other options like the web console, gcloud/gsutil, and the “Client Libraries” (where they are applicable).
  • The Google Cloud web console interface gives the resulting HTTP REST API or command line code of what you populated in the forms online. You can see this at the very bottom of the page. Therefore, you can use the web interface to enter what you intend to do and then easily see what you need to send to the HTTP REST API, in order to achieve the same as you’d do if you clicked it online.
  • Not all Google Cloud APIs are wrapped in “Client Libraries” for a specific language like Python. For example, Compute Engine is not, so you would most probably need to use the HTTP REST API directly anyway. If that’s the case, why learn two libraries when you can learn only one and use only the HTTP REST API?

The installation of the Python library for Google Cloud direct HTTP REST API access is easy. There is documentation but the process is as straightforward as “pip install google-api-python-client”. You can then access any Google Cloud HTTP API and here is an oversimplified example:

import googleapiclient.discovery
compute = googleapiclient.discovery.build('compute', 'v1')
result = compute.regions().list(project='my_project_id', maxResults=2).execute()

Before you can execute the example, you have to set up your authentication. I personally find Authenticating as a service account very convenient but you can also review the other Google Cloud API authentication options.
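For example, if you authenticate as a service account with an explicit key file, the setup could look roughly like this (the key file path and scope below are placeholders; alternatively, just point the GOOGLE_APPLICATION_CREDENTIALS environment variable to the key file and call build() as in the example above):

# A hedged example of explicit service-account authentication; the key file path
# and project ID are placeholders. Requires the "google-auth" package.
import googleapiclient.discovery
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    '/path/to/service-account-key.json',
    scopes=['https://www.googleapis.com/auth/cloud-platform'],
)
compute = googleapiclient.discovery.build('compute', 'v1', credentials=creds)
result = compute.regions().list(project='my_project_id', maxResults=2).execute()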

Please note that even the simple Python library for Google Cloud direct HTTP REST API access has its minor peculiarities and is not a 1:1 mapping of the API. For example, for the healthChecks().update() method, the HTTP REST API documentation names the selector field “resourceId”, while the Python library’s update() method names this argument “healthCheck”. Therefore, you always need to consult both sets of documentation when you develop your scripts.

Here is what I keep in my bookmarks when working with the Google Cloud API using Python:

  • Google APIs Explorer — the core documentation of every service.
  • Python library for Google Cloud direct HTTP REST API access — I consult it for the named arguments of the methods if they differ from the official API (see the example in the previous paragraph). Additionally, if you use the list() methods, the results are usually paginated and you have to call list_next(), which is well documented (see the sketch below).
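Here is roughly what that pagination pattern looks like, reusing the “compute” service object and the placeholder project ID from the earlier example:

# Typical pagination loop with googleapiclient: list() + list_next().
request = compute.regions().list(project='my_project_id', maxResults=2)
while request is not None:
    response = request.execute()
    for region in response.get('items', []):
        print(region['name'])
    # list_next() returns None when there are no more pages
    request = compute.regions().list_next(previous_request=request,
                                          previous_response=response)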



Unexpected issues with the AWS opt-in regions

AWS cloud services operate from many different regions (18 as of today).

It wasn’t long before I stumbled across the first problem — not all of them are enabled by default. The documentation says “Regions introduced after March 20, 2019, such as Asia Pacific (Hong Kong) and Middle East (Bahrain), are disabled by default. You must enable these Regions before you can use them.”

Enabling the Hong Kong (ap-east-1) and Bahrain (me-south-1) regions was super easy by following the documentation. I could manage all resources from the AWS web console.

Today I tried some operations from the AWS Command line interface (CLI) and got the following errors:

An error occurred (IllegalLocationConstraintException) when calling the DeleteBucket operation: The ap-east-1 location constraint is incompatible for the region specific endpoint this request was sent to.

An error occurred (IllegalLocationConstraintException) when calling the DeleteBucket operation: The me-south-1 location constraint is incompatible for the region specific endpoint this request was sent to.

fatal error: An error occurred (IllegalLocationConstraintException) when calling the ListObjects operation: The ap-east-1 location constraint is incompatible for the region specific endpoint this request was sent to.

fatal error: An error occurred (IllegalLocationConstraintException) when calling the ListObjects operation: The me-south-1 location constraint is incompatible for the region specific endpoint this request was sent to.

fatal error: An error occurred (InvalidToken) when calling the ListObjectsV2 operation: The provided token is malformed or otherwise invalid.

It turns out that the CLI authenticates using the “global” endpoint of the AWS Security Token Service (AWS STS). And by default, the “global” STS endpoint will not work with the two new regions: Hong Kong (ap-east-1) and Bahrain (me-south-1). There is an official documentation on how to fix this compatibility issue by making the “global” STS endpoint “Valid in all AWS Regions”.

If you do a lot of AWS API calls, it’s probably worth considering the new AWS default and trying the “regional” STS endpoints: “AWS recommends using Regional AWS STS endpoints instead of the global endpoint to reduce latency, build in redundancy, and increase session token validity.” This is already supported in the CLI, too.
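If you are scripting this in Python with boto3, pinning a regional STS endpoint looks roughly like this (the region and endpoint URL are just an example):

# A hedged boto3 sketch: authenticate against a regional STS endpoint instead of
# the global one, so that the session token is also valid in opt-in regions.
import boto3

sts = boto3.client(
    'sts',
    region_name='ap-east-1',
    endpoint_url='https://sts.ap-east-1.amazonaws.com',
)
print(sts.get_caller_identity()['Account'])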



posix_spawn() on Linux

Many years ago I wrote the library popen_noshell, which improves the speed of the popen() call significantly. It seems that now there is a standard and very efficient way to achieve the same: use the posix_spawn() call. Its interface is a bit grumpy and complicated, but it can hardly be much simpler, because posix_spawn() provides both great efficiency and lots of flexibility.

UPDATE: Here are some benchmarks for posix_spawn().

Let us first examine the different ways of spawning a process on Linux 4.10. Here are the different implementations of the following functions:

  • fork(): _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
  • vfork(): _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0, 0, NULL, NULL, 0);
  • clone(): _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
  • posix_spawn(): implemented by using clone(); no native Linux kernel syscall, yet

In the latest versions of the GNU libc, posix_spawn() uses a clone() call with arguments equivalent to those used by vfork(). Therefore, a logical question pops up: why not use vfork() directly? “The problem are the atfork handlers which can be registered. In the child process they can modify the address space.”

Of course, it would be best if posix_spawn() were implemented as a system call in the Linux kernel. Then we wouldn’t need to depend on the GNU libc implementations, which by the way differ between glibc versions. Additionally, the Linux kernel could spawn processes even faster.

The current implementation of posix_spawn() in the GNU libc is basically a vfork() with a limited, safe set of functions which can be executed inside the vfork()’ed child. When using vfork(), the child shares the memory and the stack of the parent process, so we need to be extra careful indeed. There are plenty of warnings in the man pages about the usage of vfork().

I am glad that my implementation and that of the GNU libc developers are very similar. They did a better job though, because they handle a few corner cases like custom signal handlers in the parent, etc. It’s worth reviewing the comments and the source code of the patch which introduces the new, very efficient posix_spawn() implementation in the GNU libc.

The above patch got into mainstream with glibc 2.24 on 2016-08-05.

Once glibc 2.24 makes it into the most mainstream Linux distributions, we can start using posix_spawn(), which should be as efficient as my popen_noshell implementation.
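As a side note for Python users: Python 3.8+ exposes this call as os.posix_spawn(), which is implemented on top of the C library function. A minimal sketch:

# Minimal sketch of spawning a child via posix_spawn() through Python's wrapper.
import os

pid = os.posix_spawn(
    '/bin/echo',                       # program to execute (full path)
    ['echo', 'hello from the child'],  # argv
    os.environ,                        # environment for the child
)
_, status = os.waitpid(pid, 0)
print('child exited with status', os.WEXITSTATUS(status))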

P.S. If you want to read even more technical details about the *fork() calls, try this and this page.



Two AWS CLI tips for S3 — UTF-8 when piping, and migrating the Storage Class

While working on the “youtube-mp3-archive” project, I stumbled across two issues which are worth documenting for future use.

“aws s3 ls” shows “???” instead of the UTF-8 key names of the S3 objects

On my machine this happens when I pipe the output of “aws s3 ls” to another program. Here is an example:

$ aws s3 ls --recursive s3://youtube-mp3.famzah/ | tee | grep 4185710
2016-10-30 08:08:49    4185710 mp3/Youtube/??????? - ?? ???? ?????-BF6KuR8vWN0.mp3

There is already a discussion about this at the AWS CLI project. The solution in my case was to tamper with the PYTHONIOENCODING environment variable and force UTF-8:

$ PYTHONIOENCODING=utf8 aws s3 ls --recursive s3://youtube-mp3.famzah/ | tee | grep 4185710
2016-10-30 08:08:49    4185710 mp3/Youtube/Аналгин - Тя беше ангел-BF6KuR8vWN0.mp3

How to convert all stored S3 objects to another Storage Class

As already explained, the Storage Class cannot be set on a per-bucket basis. It must be specified with each upload operation in your client.

The migration procedure is already documented at the AWS CLI project. Here are the commands to check the current Storage Class of all objects in an S3 bucket, and how to convert them to a different Storage Class:

# all our S3 objects are using the "Standard" Storage Class
$ aws s3api list-objects --bucket youtube-mp3.famzah | grep StorageClass | sort | uniq -c
749  "StorageClass": "STANDARD"

# convert without re-uploading the objects from your computer
aws s3 cp --recursive --storage-class STANDARD_IA s3://youtube-mp3.famzah/ s3://youtube-mp3.famzah/

# all our S3 objects are now using the "Standard-Infrequent" Storage Class
$ aws s3api list-objects --bucket youtube-mp3.famzah | grep StorageClass | sort | uniq -c
749  "StorageClass": "STANDARD_IA"

The reason to use a different Storage Class is pricing.




Dynamic DNS using AWS Route 53

The Internet ecosystem and technologies have advanced so much lately that you can rebuild an entire business from scratch in a few hours of coding and at pretty acceptable costs. I’m referring to the dynamic DNS (a.k.a. DDNS or DynDNS) service which was a hit a few years back. It took me less than a hundred lines of code to create a simple dynamic DNS using AWS Route 53. The AWS API and backend provide the DNS service, while the free service “ipify” lets you look up your real remote IP address. While this solution is not free as in speech, it’s nearly free as in beer: it costs less than a dollar per month.
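The core of such a setup boils down to roughly the following sketch: look up the public IP via the “ipify” service and UPSERT an A record through the Route 53 API (the hosted zone ID and hostname below are made-up placeholders):

# A hedged sketch of a minimal dynamic DNS updater using AWS Route 53 and "ipify".
# The hosted zone ID and record name are placeholders.
import boto3
import urllib.request

my_ip = urllib.request.urlopen('https://api.ipify.org').read().decode()

route53 = boto3.client('route53')
route53.change_resource_record_sets(
    HostedZoneId='Z0000000EXAMPLE',
    ChangeBatch={
        'Comment': 'dynamic DNS update',
        'Changes': [{
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': 'home.example.com.',
                'Type': 'A',
                'TTL': 60,
                'ResourceRecords': [{'Value': my_ip}],
            },
        }],
    },
)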




Bash: Process null-terminated results piped from external commands

Usually when working with filenames we need to terminate each result record uniquely using the special null-character. That’s because filenames may contain special symbols, including white-space and even the newline character “\n”.

There is already a great answer on how to do this in the StackOverflow topic “Capturing output of find . -print0 into a bash array”. The proposed solution doesn’t invoke any sub-shells, which is great, and also explains all caveats in detail. In order to become really universal, this solution must not rely on the static file descriptor “3”. Another great answer at SO gives an example of how to dynamically use the next available file descriptor.

Here is the solution which works without using sub-shells and without depending on a static FD:

a=()
while IFS='' read -r -u"$FD" -d $'\0' file; do
  # note that $IFS still has its default value here; the IFS='' above applies only to the "read" command
  a+=("$file") # or however you want to process each file
done {FD}< <(find /tmp -type f -print0)
exec {FD}<&- # close the file descriptor

# the result is available outside the loop, too
echo "${a[0]}" # 1st file
echo "${a[1]}" # 2nd file




C++ vs. Python vs. PHP vs. Java vs. Others performance benchmark (2016 Q3)

The benchmarks here do not try to be complete, as they show the performance of the languages in only one aspect, mainly: loops, dynamic arrays with numbers, and basic math operations.

This is an improved redo of the tests done in previous years. You are strongly encouraged to read the additional information about the tests in the article.

Here are the benchmark results:

Language                 | CPU time (user) | CPU time (system) | CPU time (total) | Slower than C++ | Slower than previous | Language version | Source code
C++ (optimized with -O2) | 0.899           | 0.053             | 0.951            | -               | -                    | g++ 6.1.1        | link
Rust                     | 0.898           | 0.129             | 1.026            | 7%              | 7%                   | 1.12.0           | link
Java 8 (non-std lib)     | 1.090           | 0.006             | 1.096            | 15%             | 6%                   | 1.8.0_102        | link
Python 2.7 + PyPy        | 1.376           | 0.120             | 1.496            | 57%             | 36%                  | PyPy 5.4.1       | link
C# .NET Core Linux       | 1.583           | 0.112             | 1.695            | 78%             | 13%                  | 1.0.0-preview2   | link
Javascript (nodejs)      | 1.371           | 0.466             | 1.837            | 93%             | 8%                   | 4.3.1            | link
Go                       | 2.622           | 0.083             | 2.705            | 184%            | 47%                  | 1.7.1            | link
C++ (not optimized)      | 2.921           | 0.054             | 2.975            | 212%            | 9%                   | g++ 6.1.1        | link
PHP 7.0                  | 6.447           | 0.178             | 6.624            | 596%            | 122%                 | 7.0.11           | link
Java 8 (see notes)       | 12.064          | 0.080             | 12.144           | 1176%           | 83%                  | 1.8.0_102        | link
Ruby                     | 12.742          | 0.230             | 12.972           | 1263%           | 6%                   | 2.3.1            | link
Python 3.5               | 17.950          | 0.126             | 18.077           | 1800%           | 39%                  | 3.5.2            | link
Perl                     | 25.054          | 0.014             | 25.068           | 2535%           | 38%                  | 5.24.1           | link
Python 2.7               | 25.219          | 0.114             | 25.333           | 2562%           | 1%                   | 2.7.12           | link

The big difference this time is that we use a slightly modified benchmark method. Programs are no longer limited to just 10 loops. Instead, they run for 90 wall-clock seconds, and then we divide and normalize their performance as if they had run for only 10 loops. This way we can compare with the previous results. The benefit of doing the tests like this is that the startup and shutdown times of the interpreters should now make almost no difference. It turned out that the new method doesn’t significantly change the outcome compared to the previous benchmark runs, which is reassuring, as it suggests the old benchmark method was also sound.

For the curious readers, the raw results also show the maximum used memory (RSS).

Brief analysis of the results:

  • Rust, which we benchmark for the first time, is very fast. 🙂
  • C# .NET Core on Linux, which we also benchmark for the first time, performs very well by being as fast as NodeJS and only 78% slower than C++. Memory usage peak was at 230 MB which is the same as Python 3.5 and PHP 7.0, and two times less than Java 8 and NodeJS.
  • NodeJS version 4.3.x initially appeared much slower than the previous major version 4.2.x. This was the only surprise. It turned out to be a minor glitch in the parser which was easy to fix. NodeJS 4.3.x actually performs the same as 4.2.x.
  • Python and Perl seem a bit slower than before but this is probably due to the fact that C++ performed even better because of the new benchmark method.
  • Java 8 didn’t perform much faster, as we had expected it would. Maybe it gets slower as more and more loops are done, which also allocates more RAM.
  • Also review the analysis in the old 2016 tests for more information.

The tests were run on a Debian Linux 64-bit machine.

You can download the source codes, raw results, and the benchmark batch script at:
https://github.com/famzah/langs-performance

Update @ 2016-10-15: Added the Rust implementation. The minor versions of some languages were updated as well.
Update @ 2016-10-19: A redo which includes the NodeJS fix.
Update @ 2016-11-04: Added the C# .NET Core implementation.