Enthusiasm never stops


An “xargs” alternative to GNU Parallel

I wanted to use GNU Parallel on my Ubuntu system to process some data in parallel, but it turned out that there was no official package for Ubuntu. As of Ubuntu Quantal, released in October 2012, this has been corrected and the package is in the official repository.

Reading a bit more brought me to the astonishing fact that “xargs” can run commands in parallel. The “xargs” utility is something I use every day and this parallelism feature made it even more useful.

Let’s try it by running the following:

famzah@vbox:~$ echo 10 20 30 40 50 60 | xargs -n 1 -P 4 sleep

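The use of “-n 1” is vital if you want to pass only one command-line argument from the list to each parallel process. Without it, “xargs” packs as many arguments as fit into a single command line, so there would be nothing left to run in parallel.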
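To see the effect of “-n 1” in isolation, here is a minimal sketch that swaps “sleep” for “echo” (the variable names are only illustrative). It captures what “xargs” runs with and without the option:

```shell
# Without -n, xargs packs all arguments into a single "echo" invocation.
all_at_once=$(echo 10 20 30 | xargs echo)

# With -n 1, xargs starts one "echo" per argument, so we get one line each.
one_each=$(echo 10 20 30 | xargs -n 1 echo)

printf 'without -n:\n%s\n' "$all_at_once"
printf 'with -n 1:\n%s\n' "$one_each"
```

The first capture is a single line with all three numbers, while the second has one number per line, which is exactly what lets “-P” spread the work across processes.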

Here is the result:

# right after we launched "xargs"
famzah@vbox:~$ ps f -o pid,command
  PID COMMAND
 5068 /bin/bash
 7007  \_ xargs -n 1 -P 4 sleep
 7008      \_ sleep 10
 7009      \_ sleep 20
 7010      \_ sleep 30
 7011      \_ sleep 40

# 10 seconds later (the first "sleep" has just exited)
famzah@vbox:~$ ps f -o pid,command
  PID COMMAND
 5068 /bin/bash
 7007  \_ xargs -n 1 -P 4 sleep
 7009      \_ sleep 20
 7010      \_ sleep 30
 7011      \_ sleep 40
 7017      \_ sleep 50

# 20 seconds later (the second and third "sleep" commands have exited)
# we now have only 3 simultaneous processes (no more arguments to process)
famzah@vbox:~$ ps f -o pid,command
  PID COMMAND
 5068 /bin/bash
 7007  \_ xargs -n 1 -P 4 sleep
 7011      \_ sleep 40
 7017      \_ sleep 50
 7023      \_ sleep 60
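A quick way to confirm the parallelism without watching “ps” is to time the whole run. The sketch below assumes an “xargs” that supports “-P” (GNU findutils does) and uses four one-second sleeps, so a serial run would take about four seconds:

```shell
# Record the wall-clock time before and after the parallel run.
start=$(date +%s)
echo 1 1 1 1 | xargs -n 1 -P 4 sleep
end=$(date +%s)

# With four parallel slots the sleeps overlap, so this stays well under 4.
elapsed=$((end - start))
echo "elapsed: ${elapsed}s"
```

Changing “-P 4” to “-P 1” should bring the elapsed time back to roughly four seconds.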

It’s worth mentioning that if “xargs” fails to execute the binary, it terminates prematurely, which leaves some of the stdin arguments not processed:

famzah@vbox:~$ echo 10 20 30 40 50 60 | xargs -n 1 -P 4 badexec-name
xargs: badexec-namexargs: badexec-name: No such file or directory: No such file or directory

xargs: badexec-namexargs: badexec-name: No such file or directory
: No such file or directory

The output is scrambled because all parallel processes write to the screen with no locking synchronization. This seems to be a known issue. The point is that one might expect “xargs” to try to execute “badexec-name” for every command-line argument (a total of six attempts in our example), yet only four error messages appear, one for each parallel slot. It turns out that “xargs” bails out the same way even if we don’t use the “-P” option:

# standard usage of "xargs"
famzah@vbox:~$ echo 10 20 30 40 50 60 | xargs -n 1 badexec-name
xargs: badexec-name: No such file or directory

Not a very cool behavior. I’ve reported this as a bug to the GNU community. If you review the responses to the bug report, you will find out that this actually is an intended feature. 🙂
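The early exit is also visible in the exit status. GNU “xargs” documents status 127 for a command that cannot be found, so a script can at least detect this case. A minimal sketch (the “status” variable is just for illustration, and the error message is discarded on purpose):

```shell
# Run xargs with a command that does not exist and capture its exit status.
status=0
echo 10 20 30 40 50 60 | xargs -n 1 badexec-name 2>/dev/null || status=$?

# GNU xargs documents 127 for "command not found".
echo "xargs exit status: $status"
```

Checking for 127 right after the “xargs” call is a cheap way to notice that part of the input was never processed.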

If the provided command to “xargs” is a valid one but it fails during the execution, there are no surprises and “xargs” continues with the next command-line argument by executing a new command:

famzah@vbox:~$ echo 10 20 30 40 50 60 | xargs -n 1 -P 4 rm
rm: rm: cannot remove `10'cannot remove `40': No such file or directory
: No such file or directory
rm: cannot remove `20': No such file or directory
rm: cannot remove `30': No such file or directory
rm: cannot remove `60': No such file or directory
rm: cannot remove `50': No such file or directory

The output here is scrambled too because all parallel processes write to the screen with no locking synchronization. We see however that all command-line arguments from “10” to “60” were processed by executing a command for each of them.
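The same can be checked without relying on the scrambled “rm” messages. The sketch below replaces “rm” with a small “sh -c” command that reports its argument and then fails with status 1; the “tried” text and the variable names are made up for this example:

```shell
# Every invocation prints its argument and then fails with exit status 1.
status=0
out=$(echo 10 20 30 40 50 60 \
  | xargs -n 1 -P 4 sh -c 'echo "tried $1"; exit 1' sh) || status=$?

# Count how many arguments were actually attempted.
attempts=$(printf '%s\n' "$out" | grep -c '^tried')

echo "attempts: $attempts, xargs exit status: $status"
```

All six arguments are attempted even though each command fails. The final status is non-zero; GNU “xargs” documents 123 when an invocation exits with a status between 1 and 125, while other implementations may report a different non-zero value.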