Linux | /contrib/famzah

October 15, 2022
by Ivan Zahariev Leave a comment

Force Exim to process outgoing queue quickly

There are times when a lot of messages queue up in Exim. For example, it could be due to an intermittent network problem.

I always struggled to force Exim to process the outgoing queue with lots of parallel connections to the remote SMTP servers. Note that my use-case is rather special where all mails are delivered to the same recipient domain.

There are suggestions to use "queue_run_max = 30" or "remote_max_parallel = 30" to increase the maximum parallel outgoing SMTP connections. When I execute "exim -qf" or "exim -Rf domain" to process the mail queue immediately, the parallel SMTP connections are still capped to just about 5.

Today I found a way to control the parallelism for SMTP deliveries when we want to process the queue manually (forced):

exiqgrep -ir example.com|xargs -P 30 -n 10 exim -M

This effectively launches 30 parallel SMTP connections and a queue with thousands of messages gets processed in a few minutes. If you want to process all messages regardless of their domain name, use only "exiqgrep -i". The command “exiqgrep” has other filtering capabilities to help you with the selection of messages to process.

Note that if the messages have queued up because the remote MX server was marked as unreachable, you will have to clear the “retry” database before you start to process the queue manually. Additionally, I also clear the database about messages waiting for remote SMTP hosts:

exim_tidydb -t 1s /var/spool/exim4 retry
exim_tidydb -t 1s /var/spool/exim4 wait-remote_smtp

The man page of “exim_tidydb” explains in detail what this command does.

December 4, 2020
by Ivan Zahariev Leave a comment

Debug the usage of anonymously shared memory regions

PHP-FPM keeps a shared Opcache memory between the parent process and all its child processes in a pool. The idea is to compile source code once and then reuse it in all child processes directly as byte code. This is efficient but as a System administrator I recently stumbled across a problem – how to find out the real memory usage by the Opcache in the operating system?

I thought a simple “ps” list would reveal the memory usage because it would be accounted to the parent process, because the parent process created the anonymously shared mmap() region. Linux doesn’t work this way. In order to debug this easily, I created a simple program which does the following:

The parent process creates a shared anonymous memory region using mmap() with a size of 2000 MB. The parent process does not use the memory region in any way. It doesn’t change any data in it.
Two child processes are fork()’ed and then:
- The first process writes 500 MB of data in the beginning of shared memory region passed by the parent process.
- The second process writes 1000 MB of data in the beginning of the shared memory region passed by the parent process.

Here is how the process list looks like:

	USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
	famzah 9256 0.0 0.0 2052508 816 pts/15 S+ 18:00 0:00 ./a.out # parent
	famzah 9257 10.7 10.1 2052508 512008 pts/15 S+ 18:00 0:01 \_ ./a.out # child 1
	famzah 9258 29.0 20.2 2052508 1023952 pts/15 S+ 18:00 0:03 \_ ./a.out # child 2

	famzah@vbox64:~$ free -m
	total used free shared buff/cache available
	Mem: 4940 549 1943 1012 2447 3097

view raw ps output hosted with ❤ by GitHub

A quick explanation of this process list snapshot:

VSZ (virtual size) of the parent and child processes is 2000 MB because the parent process has allocated 2000 MB of anonymous memory using mmap(). No additional allocations were made by the child processes as they were passed a reference to the anonymously shared memory in the parent process. Therefore the virtual memory footprint of all processes is the same.
RSS (resident set size, or simply “the real usage”) is:
- Almost none for the parent process because it never used any memory. It only “requested” the memory block by mmap().
- 500 MB for the first child processes because it wrote 500 MB of data at the beginning of the shared memory region.
- 1000 MB for the second child processes because it wrote 1000 MB of data at the beginning of the shared memory region.
The “free -m” command shows that 1012 MB of anonymously shared memory is being used.

So far things seem kind of logical. We can roughly determine the real usage of the shared memory region by looking at the child processes. This however is also not really true because if they write at completely different regions in the anonymous memory, we would need to sum their usage. If they write to the very same memory region, we need to look at the max() value.

The pmap command doesn’t provide any additional information and shows the same values that we see in the “ps” output:

	famzah@vbox64:~$ pmap -XX 9256
	9256: ./a.out
	Address Perm Offset Device Inode Size KernelPageSize MMUPageSize Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked VmFlagsMapping
	7f052ea9b000 rw-s 00000000 00:05 733825 2048000 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 rd wr sh mr mw me ms sd zero (deleted)

	famzah@vbox64:~$ pmap -XX 9257
	9257: ./a.out
	Address Perm Offset Device Inode Size KernelPageSize MMUPageSize Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked VmFlagsMapping
	7f052ea9b000 rw-s 00000000 00:05 733825 2048000 4 4 512000 256000 0 512000 0 0 512000 0 0 0 0 0 0 0 0 0 rd wr sh mr mw me ms sd zero (deleted)

	famzah@vbox64:~$ pmap -XX 9258
	9258: ./a.out
	Address Perm Offset Device Inode Size KernelPageSize MMUPageSize Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked VmFlagsMapping
	7f052ea9b000 rw-s 00000000 00:05 733825 2048000 4 4 1024000 768000 0 512000 512000 0 1024000 0 0 0 0 0 0 0 0 0 rd wr sh mr mw me ms sd zero (deleted)

view raw pmap output hosted with ❤ by GitHub

Things get even more messy when the child processes terminate (and get replaced by new ones which never touched the shared anonymous memory). Here is how the process list looks like:

	USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
	famzah 9256 0.0 0.0 2052508 816 pts/15 S+ 18:00 0:00 ./a.out # parent

	famzah@vbox64:~$ free -m
	total used free shared buff/cache available
	Mem: 4940 549 1943 1012 2447 3097

view raw ps output hosted with ❤ by GitHub

The RSS (resident set size, or simply “the real usage”) of the parent process continues to show no usage. But the anonymous memory region is actually used because the child processes wrote data in it. And the region is not automatically free()’d, because the parent process is still alive. The “free -m” command clearly shows that there are 1000 MB of data stored in anonymous shared memory.

How can we reliably find out the memory usage of the anonymous shared region and account it to a given process?

We will look at /proc/[pid]/maps:

A file containing the currently mapped memory regions and their access permissions. See mmap(2) for some further information about memory mappings.
…
If the pathname field is blank, this is an anonymous mapping as obtained via mmap(2). There is no easy way to coordinate this back to a process’s source, short of running it through gdb(1), strace(1), or similar.

Wikipedia gives the following additional information:

When “/dev/zero” is memory-mapped, e.g., with mmap(), to the virtual address space, it is equivalent to using anonymous memory; i.e. memory not connected to any file.

Now we know how to find out the virtual address of the anonymously memory-mapped region. Here I demostrate two different ways of obtaining the address:

	famzah@vbox64:~$ cat /proc/9256/maps \| grep /dev/zero
	7f052ea9b000-7f05aba9b000 rw-s 00000000 00:05 733825 /dev/zero (deleted)

	famzah@vbox64:~$ ls -la /proc/9256/map_files/ \| grep /dev/zero # same region of 7f052ea9b000-7f05aba9b000
	lrw——- 1 famzah famzah 64 Nov 11 18:21 7f052ea9b000-7f05aba9b000 -> '/dev/zero (deleted)'

view raw virtual address of the mmap()'ed region hosted with ❤ by GitHub

The man page of tmpfs gives further insight:

An internal shared memory filesystem is used for […] shared anonymous mappings (mmap(2) with the MAP_SHARED and MAP_ANONYMOUS flags).
…
The amount of memory consumed by all tmpfs filesystems is shown in the Shmem field of /proc/meminfo and in the shared field displayed by free(1).

We verify that the memory-mapped region is a “tmpfs” file:

	famzah@vbox64:~$ sudo stat -Lf /proc/9256/map_files/7f052ea9b000-7f05aba9b000
	File: "/proc/9256/map_files/7f052ea9b000-7f05aba9b000"
	ID: 0 Namelen: 255 Type: tmpfs

view raw Prove that the memory region is a tmpfs file hosted with ❤ by GitHub

💚 We can then finally get the real memory usage of this shared anonymous memory block in terms of VSS (virtual memory size) and RSS (resident set size, or simply “the real usage”):

	# stat doesn't give us the real usage, only the virtual
	famzah@vbox64:~$ sudo stat -L /proc/9256/map_files/7f052ea9b000-7f05aba9b000
	File: /proc/9256/map_files/7f052ea9b000-7f05aba9b000
	Size: 2097152000 Blocks: 2048000 IO Block: 4096 regular file
	Device: 5h/5d Inode: 733825 Links: 0
	Access: (0777/-rwxrwxrwx) Uid: ( 1000/ famzah) Gid: ( 1000/ famzah)

	# VSS (virtual memory size)
	famzah@vbox64:~$ sudo du –apparent-size -BM -D /proc/9256/map_files/7f052ea9b000-7f05aba9b000
	2000M /proc/9256/map_files/7f052ea9b000-7f05aba9b000

	# RSS (resident set size, or simply "the real usage")
	famzah@vbox64:~$ sudo du -BM -D /proc/9256/map_files/7f052ea9b000-7f05aba9b000
	1000M /proc/9256/map_files/7f052ea9b000-7f05aba9b000

view raw The virtual and real memory usage of mmap()'ed memory hosted with ❤ by GitHub

Since we have access to the memory region as a file, we can even read this memory mapped region:

	famzah@vbox64:~$ sudo cat /proc/9256/map_files/7f052ea9b000-7f05aba9b000 \| wc -c
	2097152000

view raw Read the content of a memory mapped region hosted with ❤ by GitHub

December 19, 2018
by Ivan Zahariev 5 Comments

posix_spawn() performance benchmarks and usage examples

The glibc library has an efficient posix_spawn() implementation since glibc version 2.24 (2016-08-05). I have awaited this feature for a long time.

TL;DR: posix_spawn() in glibc 2.24+ is really fast. You should replace the old system() and popen() calls with posix_spawn().

Today I ran all benchmarks of the popen_noshell() library, which basically emulates posix_spawn(). Here are the results:

Test	Uses pipes	User CPU	System CPU	Total CPU	Slower with
vfork() + exec(), standard Libc	No	7.4	1.6	9.0	–
the new noshell, default clone(), compat=1	Yes	7.7	2.1	9.7	8%
the new noshell, default clone(), compat=0	Yes	7.8	2.0	9.9	9%
posix_spawn() + exec() no pipes, standard Libc	No	9.4	2.0	11.5	27%
the new noshell, posix_spawn(), compat=0	Yes	9.6	2.7	12.3	36%
the new noshell, posix_spawn(), compat=1	Yes	9.6	2.7	12.3	37%
fork() + exec(), standard Libc	No	40.5	43.8	84.3	836%
the new noshell, debug fork(), compat=1	No	41.6	45.2	86.8	863%
the new noshell, debug fork(), compat=0	No	41.6	45.3	86.9	865%
system(), standard Libc	No	67.3	48.1	115.4	1180%
popen(), standard Libc	Yes	70.4	47.1	117.5	1204%

The fastest way to run something externally is to call vfork() and immediately exec() after it. This is the best solution if you don’t need to capture the output of the command, nor you need to supply any data to its standard input. As you can see, the standard system() call is about 12 times slower in performing the same operation. The good news is that posix_spawn() + exec() is almost as fast as vfork() + exec(). If we don’t care about the 27% slowdown, we can use the standard posix_spawn() interface.

It gets more complicated and slower if you want to capture the output or send data to stdin. In such a case you have to duplicate stdin/stdout descriptors, close one of the pipe ends, etc. The popen_noshell.c source code gives a full example of all this work.

We can see that the popen_noshell() library is still the fastest option to run an external process and be able to communicate with it. The command popen_noshell() is just 8% slower than the absolute ideal result of a simple vfork() + exec().

There is another good news — posix_spawn() is also very efficient! It’s a fact that it lags with 36% behind the vfork() + exec() marker, but still it’s 12 times faster than the popen() old-school glibc alternative. Using the standard posix_spawn() makes your source code easier to read, better supported for bugs by the mainstream glibc library, and you have no external library dependencies.

The replacement of system() using posix_spawn() is rather easy as we can see in the “popen-noshell/performance_tests/fork-performance.c” function posix_spawn_test():

# the same as system() but using posix_spawn() which is 12 times faster
void posix_spawn_test() {
	pid_t pid;
	char * const argv[] = { "./tiny2" , NULL };

	if (posix_spawn(&pid, "./tiny2", NULL, NULL, argv, environ) != 0) {
		err(EXIT_FAILURE, "posix_spawn()");
	}

	parent_waitpid(pid);
}

If you want to communicate with the external process, there are a few more steps which you need to perform like creating pipes, etc. Have a look at the source code of “popen_noshell.c“. If you search for the string “POPEN_NOSHELL_MODE”, you will find two alternative blocks of code — one for the standard way to start a process and manage pipes in C, and the other block will show how to perform the same steps using the posix_spawn() family functions.

Please note that posix_spawn() is a completely different implementation than system() or popen(). If it’s not safe to use the faster way, posix_spawn() may fall back to the slow fork().

April 29, 2018
by Ivan Zahariev Leave a comment

Amazon EFS benchmarks

The Amazon Elastic File System (EFS) is a very intriguing storage product. It provides simple, scalable, elastic file storage for use on an EC2 virtual machine. The file system can be mounted over NFS at one or more EC2 machines simultaneously, and it also supports file locking.

Here are some important facts which I found out while doing my tests:

I/O operations per second (IOPS) are not the same metric that we’re used to measure when dealing with block devices like HDD or SSD disks. When working with EFS, we measure the NFS I/O operations per second. These correspond 1:1 to the read() or write() system calls that your applications make.
The size of the issued I/O requests are another very important metric for EFS. This is the real bytes transfer between your EC2 instance and the NFS server.
Therefore, we’re limited by both the NFS I/O requests per second, and the total transferred bytes for those NFS I/O per second.
The EFS performance and EFS limits documentation pages give a lot of insight. You have to monitor your EFS metrics using CloudWatch.
NFS I/O requests smaller than 4096 bytes are accounted as 4096 bytes. Regardless if you request 1 bytes, 1000 bytes, or 4096 bytes, you will get 4096 bytes accounted. Once you request more than 4096 bytes, they are accounted correctly.
You need more than one reader/writer thread or program, in order to achieve the full IOPS potential. One writer thread in my tests did 130 op/s, while 20 writer threads did 1500 op/s, for example.
The documentation says: “In General Purpose mode, there is a limit of 7000 file system operations per second. This operations limit is calculated for all clients connected to a single file system”. Our tests confirm this — we could do 3500 reading or 3000 writing operations per second.
CloudWatch has different aggregation functions for the *IOBytes metrics: min/max/average; sum; count. They represent different aspects of your EFS metrics, namely: the min/max/average IO operation size in bytes; the total transferred bytes in a minute (you need to divide to 60 to get the “per second” value); the total operations in a minute (you need to divide to 60 to get the “per second” value).
The CloudWatch EFS metrics “DataReadIOBytes” and “DataWriteIOBytes” reflect exactly what we see on the Linux system for “kB/s” and “ops/s” by the nfsiostat program. The transferred bytes reflect exactly the used bandwidth on the Linux network interfaces.
The “Metered size” in the AWS Console which is the same value as what you see by the “df” command is not updated in real-time. It could take more than an hour to reflect the real disk usage.
There is plenty of initial burst credit balance which lets you do some heavy I/O on your freshly created EFS file system. Our benchmark tests ran for hours with block sizes between 1 byte and 10k bytes, and we still had some positive burst credit balance left at the end.

I’m using the default NFS settings by the NFS mount helper provided in the “Amazon Linux 2” OS:

[root@ip-172-31-11-75 ~]# mount -t efs fs-7513e02c:/ /efs

[root@ip-172-31-11-75 ~]# mount
fs-7513e02c.efs.eu-central-1.amazonaws.com:/ on /efs type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.31.11.75,local_lock=none,addr=172.31.15.76)

The tests were performed using two “m4.xlarge” EC2 instances in the “eu-central” AWS region. This EC2 instance type provides “High” network performance.

The NFS I/O operations per second limits were tested using a simple C program which basically does the following:

fd = open(testfile, O_RDWR|O_DIRECT|O_SYNC);

while (1) {
  lseek(fd, SEEK_SET, 0);

  read(fd, buf, sizeof(buf));
  // or
  write(fd, buf, sizeof(buf));
}

I created 40 different files, so that I can run 40 different single benchmark programs on an EC2 instance – one for each file. This increases concurrency and lets the total throughput scale better.

Sequential writing and reading

Sequential writing and reading performed as expected – up to the “PermittedThroughput” limit shown in the CloudWatch metrics. In my case, for such a small EFS file system, the limit was 105 MB/s.

Writing: NFS I/O operations per second

Here are the results:

Writing from one EC2 instance using 1 byte, 1k bytes, or 10k bytes: regardless of the request size, we get up to 2000 IOPS. Typically the IOPS are between 1400 and 1700.
Writing from two EC2 instances using 1 byte, 1k bytes, or 10k bytes: regardless of the request size, we get up to 3000 IOPS in total which are equally spread across the two EC2 instances.
The “PercentIOLimit” CloudWatch metric shows 84% when we do 2880 ops/s, for example. Therefore, the total IOPS limit for writing is about 3500 ops/s.
When doing only write() system calls with 1 byte data, only “DataWriteIOBytes” is accounted by EFS which is an advantage for us. A real block file system needs to read the block (usually 4k bytes), update 1 byte in it, and then write it back on disk. I feel like this needs additional testing with more random data, so test for yourself, too. Note that the minimum accounted request size in EFS is 4kB.

Reading: NFS I/O operations per second

Here are the results:

Reading from one EC2 instance using 1 byte or 10k bytes: regardless of the request size, we get up to 3500 IOPS. One EC2 instance is enough to saturate the EFS limit.
Reading from two EC2 instances using 1 byte or 10k bytes: regardless of the request size, we get up to 3500 IOPS in total which are equally spread across the two EC2 instances.
The “PercentIOLimit” CloudWatch metric shows 100% when we do 3500 ops/s. Therefore, the total IOPS limit for reading is 3500 ops/s.

April 24, 2018
by Ivan Zahariev Leave a comment

HTTP Keep-Alive timeout values used by popular websites

Here is the command to test the Keep-Alive timeout of a website:

VHOST="www.google.bg"; time openssl s_client -ign_eof -connect "$VHOST:443" <<< "$( echo -ne "GET / HTTP/1.1\r\nHost: $VHOST\r\nConnection: Keep-Alive\r\n\r\n" )"

And here are the today’s results for some popular websites:

slashdot.org: 0s
LWN.net: 5s
snag.gy: 5s
yahoo.com: 10s
readthedocs.org: 10s
www.superhosting.bg: 15s
httpd.apache.org: 30s
nginx.org: 1m
en.wikipedia.org: 1m
famzah.wordpress.com: 1m15s
aws.amazon.com: 2m50s
www.facebook.com: 3m
www.google.bg: 4m
cloudplatform.googleblog.com: 4m
www.cloudflare.com: 6m40s
www.mozilla.org: 6m40s
www.tagesschau.de: 8m
access.redhat.com: 8m20s
stackoverflow.com: 10m
www.timeanddate.com: 10m
www.dreamhost.com: 10m
www.reddit.com: 10m
twitter.com: 15m

April 24, 2018
by Ivan Zahariev 2 Comments

Find the repository of all installed packages on Debian or Ubuntu

It turns out that there is no standard “apt” command which lists where a package was installed from. You may need this information if you have added additional APT repositories to your Debian/Ubuntu installation. I see a lot of questions at the forums (1, 2, 3, 4) and the proper solution tends to be “parse apt-cache output yourself”. Here is my solution which is very similar to this one:

#!/bin/bash
set -u

errors=0

for PKGNAME in $(dpkg -l|grep ^i|awk '{print $2}'); do
        INFO="$(apt-cache policy "$PKGNAME")"
        IVER="$(echo "$INFO" | grep Installed: | awk '{print $2}')"
        IPRIO="$(echo "$INFO" | fgrep "*** $IVER" | awk '{print $3}')"
        REPO="$(echo "$INFO" | fgrep -A1 "*** $IVER" | tail -n+2 | head -n1 | awk '{print $2 " " $3}')"

        echo "$PKGNAME repo=$REPO"

        if [ "$REPO" == '' ]; then
                errors=$(( $errors + 1 ))
                echo "ERROR: Unable to find the repo for package \"$PKGNAME\"" >&2
        fi
done

if [ "$errors" -ne 0 ]; then
        echo "ERROR: $errors errors encountered" >&2
        exit 1
else
        exit 0
fi

October 1, 2017
by Ivan Zahariev 2 Comments

Postfix: Redirect all local and remote mails to a single email address

Virtual servers like EC2 usually get a random external IP address which is not suitable for outgoing SMTP. That’s because these “pool” IP addresses lack reverse DNS resolving, and their spam reputation is unknown because somebody before you may have used them to send out spam.

Still you need to be able to get email notifications from these machines because many vital services like the crontab, for example, send diagnostic emails to “root” or other local mailboxes, depending on the user that a cron job is being executed with.

One possible solution is to catch all mail sent to any email address (local or remote), forward it to Amazon Simple Email Service (SES), and let SES do the actual SMTP delivery for you.

Open the file “/etc/postfix/main.cf” and add the following two statements there:

smtp_generic_maps = regexp:/etc/postfix/email_rewrites
alias_maps = regexp:/etc/postfix/email_rewrites

The first directive ensures that the “From” address is being rewritten to your single external destination email (read the docs), while the second directive forwards all locally delivered mail to the same single external email address (SF article). Note that if “alias_maps” directive already exists in the “main.cf” file, you need to comment it out.

You can configure the single external email address to forward to by creating the file “/etc/postfix/email_rewrites” and then putting the following in it:

/.+/ mailbox@example.com

Finally, execute the following commands, so that Postfix picks up the new configuration:

postmap /etc/postfix/email_rewrites
/etc/init.d/postfix restart

If you decided to use Amazon SES for email delivery, there are a few additional steps to do:

Verify the destination email address in Amazon SES.
Integrate Amazon SES with Postfix.
White-list the email in your mail provider (Gmail, for example), because the crontab system emails often get classified to the Junk folder.

If you are not using Postfix, then review the Amazon SES documentation about integration with other mail servers like Exim, Sendmail, Microsoft Exchange, etc.

April 29, 2017
by Ivan Zahariev Leave a comment

posix_spawn() on Linux

Many years ago I wrote the library popen_noshell which improves the speed of the popen() call significantly. It seems that now there is a standard and very efficient way to achieve the same. Use the posix_spawn() call. Its interface is a bit grumpy and complicated, but it can’t be very simple after all, because posix_spawn() provides both great efficiency and lots of flexibility.

UPDATE: Here are some benchmarks for posix_spawn().

Let us first examine the different ways of spawning a process on Linux 4.10. Here are the different implementations of the following functions:

fork(): _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
vfork(): _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0, 0, NULL, NULL, 0);
clone(): _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
posix_spawn(): implemented by using clone(); no native Linux kernel syscall, yet

In the latest versions of the GNU libc, posix_spawn() uses a clone() call which is equivalent to the vfork() arguments of clone(). Therefore, a logical question pops up – why not use vfork() directly. “The problem are the atfork handlers which can be registered. In the child process they can modify the address space.”

Of course, it would be best if posix_spawn() was implemented as a system call in the Linux kernel. Then we wouldn’t need to depend on the GNU libc implementations, which by the way differ with the different versions of glibc. Additionally, the Linux kernel could spawn processes even faster.

The current implementation of posix_spawn() in the GNU libc is basically a vfork() with a limited, safe set of functions which can be executed inside the vfork()’ed child. When using vfork(), the child shares the memory and the stack of the parent process, so we need to be extra careful indeed. There are plenty of warnings in the man pages about the usage of vfork().

I am glad that my implementation and this of the GNU libc guys is very similar. They did a better job though, because they handle a few corner cases like custom signal handlers in the parent, etc. It’s worth to review the comments and the source code of the patch which introduces the new, very efficient posix_spawn() implementation in the GNU libc.

The above patch got into mainstream with glibc 2.24 on 2016-08-05.

When glibc 2.24 gets into the most mainstream Linux distributions, we can start to use posix_spawn() which should be as efficient as my popen_noshell implementation.

P.S. If you want to read even more technical details about the *fork() calls, try this and this pages.

February 23, 2017
by Ivan Zahariev 2 Comments

MySQL Galera Cluster: How many nodes do you really need?

The MySQL Galera Cluster is a fine piece of software which brings synchronous multi-master replication. This ensures high availability of your database. The following has been tested with Percona XtraDB Cluster.

In order to achieve a desired fault tolerance, we must examine and understand how the cluster responds to node failures. This Percona blog article gives some good insight, but some more clarifications are needed.

There are two different kinds of failures that may occur while the cluster is operating:

Simultaneous failure of two or more nodes.
One-by-one failure. This is the case when a node fails, the cluster notices this and reacts before another node fails, too. The usual time for reaction is 5 seconds which is controlled by the “suspect timeout” setting (evs.suspect_timeout).

When one or more nodes fail, the cluster reacts in a very clever way:

The remaining alive nodes run a quorum vote, in order to determine if their count is >50% of the last cluster size.
If the cluster can continue to operate with a quorum, it re-adjusts its size! This is a crucial feature which lets the cluster lose nodes one-by-one until only two active nodes are online.

❓ Now the question is, if we have a cluster with 3 data nodes, is it possible to lose 2 of them, and still continue to operate? The answer is “yes”, but only if you lose them one-by-one, and only if you run an additional arbitrator node in your cluster. Note that the arbitrator does not store any data but still participates in the whole network replication traffic, in order to be able to vote.

Three data nodes and an arbitrator node make a cluster with a size of four nodes. When one of the nodes fails, 3/4 of the nodes are alive which is >50% quorum and the cluster continues to operate by reducing its size to three. When a second node fails, 2/3 of the nodes are alive which is >50% quorum and the cluster continues to operate with a size of two. You cannot lose a third data node since you have only three initially anyway. 🙂

If you lose two data nodes simultaneously, or if you lose one data node and the arbitrator (total of two nodes) simultaneously, you are out of luck. Only 2/4 of the nodes are alive which is not >50% quorum. The cluster stops to operate, in order to prevent a possible split-brain situation.

Here is a summary on how many nodes you can lose with the two different cluster configurations. Note that the arbitrator counts as a regular node when it comes to losing it:

3 nodes with an arbitrator (initial cluster size is 4):
- You can lose 2 nodes in a one-by-one fashion.
- You can lose only 1 node simultaneously. Losing 2 nodes simultaneously kills your whole cluster.
3 nodes without an arbitrator (initial cluster size is 3):
- You can lose only 1 node even in a one-by-one fashion.
- You can lose only 1 node simultaneously. Losing 2 nodes simultaneously kills your whole cluster.

So running an arbitrator in a three-node MySQL Galera Cluster makes total sense, if you can allocate one more separate machine with the same network capabilities.

✩ Note that regular MySQL node shutdowns are handled differently by the cluster. When a node leaves the cluster via a normal shutdown, it informs all members of the cluster about this. Therefore, it should be safe to shut down even 2 out of your 3 data nodes simultaneously.

February 21, 2017
by Ivan Zahariev Leave a comment

Configure MySQL Galera Cluster to listen on a specific IP address

If you have a separate private network for your MySQL Galera Cluster, it is a good security practice to configure it to listen only on the private IP address. This way you have less firewall settings to set up and rely on. The following has been tested with Percona XtraDB Cluster.

A MySQL Galera Cluster listens all the time on two different ports, in order to provide the following services:

port 4567 – Galera Cluster communication
port 3306 – MySQL client connections and State Snapshot Transfer that use the “mysqldump” method

While those two services could be bound on different IP addresses, they are usually using the same IP address. Each of these services are configured using different MySQL settings in “my.cnf”:

port 4567 – “wsrep_cluster_address=gcomm://%CLUSTER_IP1%,%CLUSTER_IP2%,%CLUSTER_IP3%?gmcast.listen_addr=tcp://%THIS_NODE_LISTEN_IP%:4567”
port 3306 – “bind-address=%THIS_NODE_LISTEN_IP%”

If we had a cluster, and the current node has an IP address of 169.254.50.1, we would have the following in “my.cnf” regarding the networking configuration:

wsrep_provider_options="gmcast.listen_addr=tcp://169.254.50.1:4567"
wsrep_node_address=169.254.50.1
bind-address=169.254.50.1

There are two other ports which are opened on demand: port 4568 for Incremental State Transfer, and port 4444 for all other State Snapshot Transfer operations. Those two ports are controlled by “wsrep_sst_receive_address” and the “ist.recv_addr” option in “wsrep_provider_options”, as explained at this page. The default listening IP address is the same as configured for “wsrep_node_address”, and therefore doesn’t need any additional tweaks.

EDIT: It turns out that regardless of what is specified for the above two options for ports 4444 and 4568, at least the “other” State Snapshot Transfer port 4444 is always listening on the catch-all IP address “0.0.0.0” which accepts connections on any network interface and local address. I’ve observed this while a node was in a “Donor” state because another node was just joining the cluster.

/contrib/famzah

Enthusiasm never stops

Category Archives: Linux