sudo hangs and leaves the executed program as “zombie”

November 1, 2010 by Ivan Zahariev 5 Comments

Today I discovered a non-security bug in sudo – the executed program finishes successfully, but sudo hangs forever waiting for something. The executed program is left in a “zombie” process state. Here is how the process list looks like, for example:

root 6368 0.0 0.0 2808 1592 pts/6 Ss 18:39 0:00 | \_ -bash
root 1103 0.0 0.0 2200 1000 pts/6 S+ 21:45 0:00 | \_ ./sudo -u root sleep 5
root 1104 0.0 0.0 0 0 pts/6 Z+ 21:45 0:00 | \_ [sleep] <defunct>

If we try to trace the system calls of the sudo command, here is what we get:

[root@tester2 ~]# strace -fF -p 1103
select(4, [3], [], NULL, NULL

The sudo process waits endlessly in a select() system call which waits for file descriptor #3. So we quickly check what corresponds to file descriptor #3:

[root@tester2 ~]# ls -la /proc/1103/fd
total 0
dr-x—— 2 root root 0 Nov 1 21:45 .
dr-xr-xr-x 5 root root 0 Nov 1 21:45 ..
lrwx—— 1 root root 64 Nov 1 21:45 0 -> /dev/pts/6
lrwx—— 1 root root 64 Nov 1 21:46 1 -> /dev/pts/6
lrwx—— 1 root root 64 Nov 1 21:45 2 -> /dev/pts/6
lrwx—— 1 root root 64 Nov 1 21:46 3 -> socket:[1136261781]

Socket “socket:[1136261781]” is already closed tough (blinking in red on my console). Thus sudo is waiting in a blocked select() for a change on file descriptor #3, which is already closed and will never change its state. Sudo will therefore wait forever.

In order to understand why this happens, we will look at the source code of sudo. Here is snippet from “exec.c” – only the relevant code is left for clarity:

int
sudo_execve(path, argv, envp, uid, cstat, dowait, bgmode)
{
    int sv[2];

    if (socketpair(PF_UNIX, SOCK_DGRAM, 0, sv) != 0)
    error(1, "cannot create sockets");

    zero_bytes(&sa, sizeof(sa));
    sigemptyset(&sa.sa_mask);

    /* Note: HP-UX select() will not be interrupted if SA_RESTART set */
    sa.sa_flags = SA_INTERRUPT; /* do not restart syscalls */
    sa.sa_handler = handler;
    sigaction(SIGCHLD, &sa, NULL);
    sigaction(SIGHUP, &sa, NULL);
    sigaction(SIGINT, &sa, NULL);
    sigaction(SIGPIPE, &sa, NULL);
    sigaction(SIGQUIT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);

    /* Max fd we will be selecting on. */
    maxfd = sv[0];

    child = fork()
    close(sv[1]);

    fdsr = (fd_set *)emalloc2(howmany(maxfd + 1, NFDBITS), sizeof(fd_mask));
    fdsw = (fd_set *)emalloc2(howmany(maxfd + 1, NFDBITS), sizeof(fd_mask));

    for (;;) {

    zero_bytes(fdsw, howmany(maxfd + 1, NFDBITS) * sizeof(fd_mask));
    zero_bytes(fdsr, howmany(maxfd + 1, NFDBITS) * sizeof(fd_mask));

    FD_SET(sv[0], fdsr);

    if (recvsig[SIGCHLD])
        continue;
    nready = select(maxfd + 1, fdsr, fdsw, NULL, NULL);
    if (nready == -1) {
        if (errno == EINTR)
        continue;
        error(1, "select failed");
    }

    }
}

/*
 * Generic handler for signals passed from parent -> child.
 * The recvsig[] array is checked in the main event loop.
 */
void
handler(s)
    int s;
{
    recvsig[s] = TRUE;
}

The race-condition happens right before the select() on line #40, and just after the “if” on lines #38 and #39. If the parent process gets re-scheduled after the “if” was executed, and at this very time the child process finishes and SIGCHLD is sent to the parent process, sudo gets in trouble. The SIGCHLD handler accounts in the variable “recvsig[]” that the signal was received, and then the parent process calls select(). This select will never be interrupted, as the author had it in mind. In 99% of the cases, the parent process will enter in the select() blocking state before the child process ended. The child would then send SIGCHLD, which will be accounted in the handler procedure, and will also interrupt select() which will return -1 in “nready”, and “errno” will be set to EINTR.

Here is an easy way to reproduce the bug. We add a sleep() of 10 seconds between the “if” and select(), thus simulating that the system was very busy and re-scheduled the parent sudo process right between these two operations. Here is the source diff:

--- sudo-orig/sudo-1.7.4p4/exec.c       Sat Sep  4 00:40:19 2010
+++ sudo-1.7.4p4/exec.c Mon Nov  1 21:48:24 2010
@@ -307,6 +307,10 @@
 
        if (recvsig[SIGCHLD])
            continue;
+       printf("debug: Missed the check for SIGCHLD, the child is still running. SIGCHLD status: %d\n", recvsig[SIGCHLD]);
+       sleep(10); // this will be interrupted by SIGCHLD, because the child exists at some time here (we run "sudo sleep 5")
+       printf("debug: We should have got SIGCHLD by now. SIGCHLD status: %d\n", recvsig[SIGCHLD]);
+       printf("debug: Entering the endless select()...\n");
        nready = select(maxfd + 1, fdsr, fdsw, NULL, NULL);
        if (nready == -1) {
            if (errno == EINTR)

After that we execute sudo, and observe the bug every time:

[root@tester2 sudo-1.7.4p4]# make >/dev/null && ./sudo -u root sleep 5
debug: Missed the check for SIGCHLD, the child is still running. SIGCHLD status: 0
debug: We should have got SIGCHLD by now. SIGCHLD status: 1
debug: Entering the endless select()…
…(this never finishes)…

The sudo author actually tried to avoid this potential race condition if SIGCHLD is received immediately
before we call select() – changeset 5334:99adc5ea7f0a. The proper way to fix this is to use a timeout in the select() call:

--- sudo-orig/sudo-1.7.4p4/exec.c       Sat Sep  4 00:40:19 2010
+++ sudo-1.7.4p4/exec.c Mon Nov  1 21:50:26 2010
@@ -307,7 +307,11 @@
 
        if (recvsig[SIGCHLD])
            continue;
-       nready = select(maxfd + 1, fdsr, fdsw, NULL, NULL);
+       struct timeval timeout;
+       timeout.tv_sec = 1; // Linux resets this, so set it everytime
+       timeout.tv_usec = 0;
+       nready = select(maxfd + 1, fdsr, fdsw, NULL, &timeout);
        if (nready == -1) {
            if (errno == EINTR)
                continue;

The select() mechanism and this bug were introduced somewhere between sudo versions 1.7.2 and 1.7.3. At least that is what I managed to see from the Changelog:

2010-06-29  Todd C. Miller  <Todd.Miller@courtesan.com>
	[72fd1f510a08] [SUDO_1_7_3] <1.7>
...
2009-11-15  Todd C. Miller  <Todd.Miller@courtesan.com>
	* script.c, sudo.c, sudo.h, sudoreplay.c, term.c, tgetpass.c:
	Use a socketpair to pass signals from parent to child.
...
2009-07-12  Todd C. Miller  <Todd.Miller@courtesan.com>
	[f5ad45f69f05] [SUDO_1_7_2]

I’ve reported this bug (#447) to the sudo maintainer, and I’m sure he will fix it when time permits. Because we all depend on sudo and love it. 🙂

Author: Ivan Zahariev

An experienced Linux & IT enthusiast, Engineer by heart, Systems architect & developer.

5 thoughts on “sudo hangs and leaves the executed program as “zombie””

Kevin
April 9, 2011 at 10:19 am

Great article! Unfortunately Debian stable has this problem, and i was searching what was the problem for hours..

Stephen
July 29, 2011 at 12:41 pm

You are truly a lifesaver. Been dealing with an issue with sudo/php/apache on Debian on a dev machine and this turned out to be the issue. Unfortunately Debian don’t seem to be updating to 1.7.5 (which contains the fix) in any of the distros.

Hristo Malinov (@hmalinov)
August 9, 2013 at 11:39 am

Hey Ivan,

this is a great article that saves me a lot of time.

I was having the same problem, trying to restart a daemon over Apache with REST api and
it was hanging forever.

Thanks a lot
и блaгодаря 😉
Hristo

- Ivan Zahariev
  August 9, 2013 at 12:14 pm
  
  You’re welcome. What “sudo” version are you using?
  
  - Hristo Malinov (@hmalinov)
    August 9, 2013 at 1:25 pm
    
    It was just the same one 1.7.4 on Debian Squeeze.
    
    I did not want to move to Wheezy yet, so I just installed 1.8.5 from backports.