sudo hangs and leaves the executed program as “zombie”

Today I discovered a non-security bug in sudo – the executed program finishes successfully, but sudo hangs forever waiting for something. The executed program is left in a “zombie” process state. Here is how the process list looks like, for example:

root 6368 0.0 0.0 2808 1592 pts/6 Ss 18:39 0:00 | \_ -bash
root 1103 0.0 0.0 2200 1000 pts/6 S+ 21:45 0:00 | \_ ./sudo -u root sleep 5
root 1104 0.0 0.0 0 0 pts/6 Z+ 21:45 0:00 | \_ [sleep] <defunct>

If we try to trace the system calls of the sudo command, here is what we get:

[root@tester2 ~]# strace -fF -p 1103
select(4, [3], [], NULL, NULL

The sudo process waits endlessly in a select() system call which waits for file descriptor #3. So we quickly check what corresponds to file descriptor #3:

[root@tester2 ~]# ls -la /proc/1103/fd
total 0
dr-x—— 2 root root 0 Nov 1 21:45 .
dr-xr-xr-x 5 root root 0 Nov 1 21:45 ..
lrwx—— 1 root root 64 Nov 1 21:45 0 -> /dev/pts/6
lrwx—— 1 root root 64 Nov 1 21:46 1 -> /dev/pts/6
lrwx—— 1 root root 64 Nov 1 21:45 2 -> /dev/pts/6
lrwx—— 1 root root 64 Nov 1 21:46 3 -> socket:[1136261781]

Socket “socket:[1136261781]” is already closed tough (blinking in red on my console). Thus sudo is waiting in a blocked select() for a change on file descriptor #3, which is already closed and will never change its state. Sudo will therefore wait forever.

In order to understand why this happens, we will look at the source code of sudo. Here is snippet from “exec.c” – only the relevant code is left for clarity:

int
sudo_execve(path, argv, envp, uid, cstat, dowait, bgmode)
{
    int sv[2];

    if (socketpair(PF_UNIX, SOCK_DGRAM, 0, sv) != 0)
    error(1, "cannot create sockets");

    zero_bytes(&sa, sizeof(sa));
    sigemptyset(&sa.sa_mask);

    /* Note: HP-UX select() will not be interrupted if SA_RESTART set */
    sa.sa_flags = SA_INTERRUPT; /* do not restart syscalls */
    sa.sa_handler = handler;
    sigaction(SIGCHLD, &sa, NULL);
    sigaction(SIGHUP, &sa, NULL);
    sigaction(SIGINT, &sa, NULL);
    sigaction(SIGPIPE, &sa, NULL);
    sigaction(SIGQUIT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);

    /* Max fd we will be selecting on. */
    maxfd = sv[0];

    child = fork()
    close(sv[1]);

    fdsr = (fd_set *)emalloc2(howmany(maxfd + 1, NFDBITS), sizeof(fd_mask));
    fdsw = (fd_set *)emalloc2(howmany(maxfd + 1, NFDBITS), sizeof(fd_mask));

    for (;;) {

    zero_bytes(fdsw, howmany(maxfd + 1, NFDBITS) * sizeof(fd_mask));
    zero_bytes(fdsr, howmany(maxfd + 1, NFDBITS) * sizeof(fd_mask));

    FD_SET(sv[0], fdsr);

    if (recvsig[SIGCHLD])
        continue;
    nready = select(maxfd + 1, fdsr, fdsw, NULL, NULL);
    if (nready == -1) {
        if (errno == EINTR)
        continue;
        error(1, "select failed");
    }

    }
}

/*
 * Generic handler for signals passed from parent -> child.
 * The recvsig[] array is checked in the main event loop.
 */
void
handler(s)
    int s;
{
    recvsig[s] = TRUE;
}

The race-condition happens right before the select() on line #40, and just after the “if” on lines #38 and #39. If the parent process gets re-scheduled after the “if” was executed, and at this very time the child process finishes and SIGCHLD is sent to the parent process, sudo gets in trouble. The SIGCHLD handler accounts in the variable “recvsig[]” that the signal was received, and then the parent process calls select(). This select will never be interrupted, as the author had it in mind. In 99% of the cases, the parent process will enter in the select() blocking state before the child process ended. The child would then send SIGCHLD, which will be accounted in the handler procedure, and will also interrupt select() which will return -1 in “nready”, and “errno” will be set to EINTR.

Here is an easy way to reproduce the bug. We add a sleep() of 10 seconds between the “if” and select(), thus simulating that the system was very busy and re-scheduled the parent sudo process right between these two operations. Here is the source diff:

--- sudo-orig/sudo-1.7.4p4/exec.c       Sat Sep  4 00:40:19 2010
+++ sudo-1.7.4p4/exec.c Mon Nov  1 21:48:24 2010
@@ -307,6 +307,10 @@
 
        if (recvsig[SIGCHLD])
            continue;
+       printf("debug: Missed the check for SIGCHLD, the child is still running. SIGCHLD status: %d\n", recvsig[SIGCHLD]);
+       sleep(10); // this will be interrupted by SIGCHLD, because the child exists at some time here (we run "sudo sleep 5")
+       printf("debug: We should have got SIGCHLD by now. SIGCHLD status: %d\n", recvsig[SIGCHLD]);
+       printf("debug: Entering the endless select()...\n");
        nready = select(maxfd + 1, fdsr, fdsw, NULL, NULL);
        if (nready == -1) {
            if (errno == EINTR)

After that we execute sudo, and observe the bug every time:

[root@tester2 sudo-1.7.4p4]# make >/dev/null && ./sudo -u root sleep 5
debug: Missed the check for SIGCHLD, the child is still running. SIGCHLD status: 0
debug: We should have got SIGCHLD by now. SIGCHLD status: 1
debug: Entering the endless select()…
…(this never finishes)…

The sudo author actually tried to avoid this potential race condition if SIGCHLD is received immediately
before we call select() – changeset 5334:99adc5ea7f0a. The proper way to fix this is to use a timeout in the select() call:

--- sudo-orig/sudo-1.7.4p4/exec.c       Sat Sep  4 00:40:19 2010
+++ sudo-1.7.4p4/exec.c Mon Nov  1 21:50:26 2010
@@ -307,7 +307,11 @@
 
        if (recvsig[SIGCHLD])
            continue;
-       nready = select(maxfd + 1, fdsr, fdsw, NULL, NULL);
+       struct timeval timeout;
+       timeout.tv_sec = 1; // Linux resets this, so set it everytime
+       timeout.tv_usec = 0;
+       nready = select(maxfd + 1, fdsr, fdsw, NULL, &timeout);
        if (nready == -1) {
            if (errno == EINTR)
                continue;

The select() mechanism and this bug were introduced somewhere between sudo versions 1.7.2 and 1.7.3. At least that is what I managed to see from the Changelog:

2010-06-29  Todd C. Miller  <Todd.Miller@courtesan.com>
	[72fd1f510a08] [SUDO_1_7_3] <1.7>
...
2009-11-15  Todd C. Miller  <Todd.Miller@courtesan.com>
	* script.c, sudo.c, sudo.h, sudoreplay.c, term.c, tgetpass.c:
	Use a socketpair to pass signals from parent to child.
...
2009-07-12  Todd C. Miller  <Todd.Miller@courtesan.com>
	[f5ad45f69f05] [SUDO_1_7_2]

I’ve reported this bug (#447) to the sudo maintainer, and I’m sure he will fix it when time permits. Because we all depend on sudo and love it. :)

About these ads

5 thoughts on “sudo hangs and leaves the executed program as “zombie”

  1. You are truly a lifesaver. Been dealing with an issue with sudo/php/apache on Debian on a dev machine and this turned out to be the issue. Unfortunately Debian don’t seem to be updating to 1.7.5 (which contains the fix) in any of the distros.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s