I wanted to use GNU Parallel on my Ubuntu system in order to process some data in parallel. It turned out that there was no official package for Ubuntu. As of Ubuntu Quantal (12.10), released in October 2012, this has been corrected and the package is in the official repository.
Reading a bit more brought me to the astonishing fact that “xargs” can run commands in parallel. The “xargs” utility is something I use every day and this parallelism feature made it even more useful.
Let’s try it by running the following:
famzah@vbox:~$ echo 10 20 30 40 50 60 | xargs -n 1 -P 4 sleep
The use of “-n 1” is vital if you want to pass only one command-line argument from the list to each parallel process.
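To see what “-n” does to the argument list without waiting for any “sleep”, here is a small sketch that swaps in “echo” as a harmless stand-in command. The “run:” prefix and the variable names are only there for illustration.

```shell
#!/bin/sh
# With -n 1 each argument gets its own invocation: six separate commands.
one_per_call=$(echo 10 20 30 40 50 60 | xargs -n 1 echo run:)

# With -n 3 the six arguments are split into two invocations of three each.
three_per_call=$(echo 10 20 30 40 50 60 | xargs -n 3 echo run:)

# Without -n all six arguments fit into a single invocation.
all_in_one=$(echo 10 20 30 40 50 60 | xargs echo run:)

printf '%s\n' "-n 1:" "$one_per_call" "-n 3:" "$three_per_call" "no -n:" "$all_in_one"
```

Without “-n 1”, the “sleep” example would start a single “sleep” process with all six numbers as arguments, so there would be nothing to run in parallel.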
Here is the result:
# right after we launched "xargs"
famzah@vbox:~$ ps f -o pid,command
  PID COMMAND
 5068 /bin/bash
 7007  \_ xargs -n 1 -P 4 sleep
 7008      \_ sleep 10
 7009      \_ sleep 20
 7010      \_ sleep 30
 7011      \_ sleep 40

# 10 seconds later (the first "sleep" has just exited)
famzah@vbox:~$ ps f -o pid,command
  PID COMMAND
 5068 /bin/bash
 7007  \_ xargs -n 1 -P 4 sleep
 7009      \_ sleep 20
 7010      \_ sleep 30
 7011      \_ sleep 40
 7017      \_ sleep 50

# 20 seconds later (the second and third "sleep" commands have exited)
# we now have only 3 simultaneous processes (no more arguments to process)
famzah@vbox:~$ ps f -o pid,command
  PID COMMAND
 5068 /bin/bash
 7007  \_ xargs -n 1 -P 4 sleep
 7011      \_ sleep 40
 7017      \_ sleep 50
 7023      \_ sleep 60
It’s worth mentioning that if “xargs” fails to execute the binary, it prematurely terminates the failed parallel processing queue, which leaves some of the stdin arguments not processed:
famzah@vbox:~$ echo 10 20 30 40 50 60 | xargs -n 1 -P 4 badexec-name
xargs: badexec-namexargs: badexec-name: No such file or directory: No such file or directory
xargs: badexec-namexargs: badexec-name: No such file or directory
: No such file or directory
The output is scrambled because all parallel processes write to the screen with no locking synchronization. This seems to be a known issue. The point is that one would expect “xargs” to try to execute “badexec-name” for every command-line argument (six attempts in total in our example). It turns out that “xargs” bails out the same way even if we don’t use the “-P” option:
# standard usage of "xargs"
famzah@vbox:~$ echo 10 20 30 40 50 60 | xargs -n 1 badexec-name
xargs: badexec-name: No such file or directory
Not a very cool behavior. I’ve reported this as a bug to the GNU community. If you review the responses to the bug report, you will find out that this actually is an intended feature. 🙂
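Since “xargs” gives up early in this case, a script can at least detect it. The sketch below assumes the behavior documented in the GNU xargs man page, where an exit status of 127 means the command was not found. “badexec-name” is the deliberately missing command from the example above, and the warning text is my own.

```shell
#!/bin/sh
# Run the failing example quietly and capture the exit status of xargs.
status=0
echo 10 20 30 40 50 60 | xargs -n 1 badexec-name 2>/dev/null || status=$?

# 127 is the status xargs uses when the command cannot be found.
if [ "$status" -eq 127 ]; then
    echo "xargs could not find the command; the remaining arguments were skipped" >&2
fi
```

Checking the status this way tells you that something went wrong, but not which arguments were left out, so the whole list has to be considered unprocessed.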
If the provided command to “xargs” is a valid one but it fails during the execution, there are no surprises and “xargs” continues with the next command-line argument by executing a new command:
famzah@vbox:~$ echo 10 20 30 40 50 60 | xargs -n 1 -P 4 rm
rm: rm: cannot remove `10'cannot remove `40': No such file or directory
: No such file or directory
rm: cannot remove `20': No such file or directory
rm: cannot remove `30': No such file or directory
rm: cannot remove `60': No such file or directory
rm: cannot remove `50': No such file or directory
The output here is scrambled too because all parallel processes write to the screen with no locking synchronization. We see however that all command-line arguments from “10” to “60” were processed by executing a command for each of them.
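One way to keep the messages readable is to give every job its own log file. The following is only a sketch of that idea: it runs in a throwaway directory, and the “rm-<argument>.log” naming is something I made up for the example. The GNU xargs man page documents an exit status of 123 when any invocation exits with a status between 1 and 125; other implementations may report a different non-zero value.

```shell
#!/bin/sh
# Work in a throwaway directory so that nothing real can be removed.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Each job redirects its own output to "rm-<argument>.log", so the
# messages of parallel jobs cannot interleave on the screen.
status=0
echo 10 20 30 40 50 60 |
    xargs -n 1 -P 4 sh -c 'rm "$1" > "rm-$1.log" 2>&1' sh || status=$?

# One log file per argument shows that every argument was processed,
# even though each "rm" failed.
processed=$(ls rm-*.log | wc -l)
echo "processed: $processed, xargs exit status: $status"

# Clean up the throwaway directory.
cd / && rm -r "$workdir"
```

The extra “sh -c” wrapper is needed because the redirection has to happen inside each job; a redirection placed after “xargs” itself would send all jobs to the same file again.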
April 5, 2013 at 12:25 pm
Please consider reading http://www.gnu.org/software/parallel/man.html#differences_between_xargs_and_gnu_parallel and see what problems xargs -P can cause you that you will not see with GNU Parallel.
April 5, 2013 at 12:49 pm
The differences are a bit biased towards how much more flexible GNU Parallel is. I’m not defending “xargs” here, but most of the differences are not in the way “xargs” handles _parallel_ execution but in the design of “xargs”. It’s just that “xargs” lacks those features by design.
If we leave out the fact that “xargs” was never meant to run commands on remote machines (we have SSH for that), as well as the lack of support for context replace (argument placeholders), the other significant difference is that the text output of the _parallel_ commands is out-of-order, which forces you to log the output in a file if you expect any output.
Here is what I saw at the page you quoted:
* xargs deals badly with special characters — true, but the -0 option handles this at 100%; this applies regardless if you use the -P option
* xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel — I guess this means that GNU Parallel has an option to count the CPU cores for you. An easy workaround would be: xargs -P "$(cat /proc/cpuinfo | egrep ^processor | wc -l)"
* xargs has no support for grouping the output — true, but you can always redirect output to a file
* xargs has no support for keeping the order of the output — ditto as above
* xargs has no support for running jobs on remote computers — true; this applies regardless if you use the -P option
* xargs has no support for context replace, so you will have to create the arguments — true; this applies regardless if you use the -P option; there is limited support with the “--replace” option which accepts only one argument
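The first two points can be combined in a short sketch. It assumes a “find” and “xargs” that support “-print0” and “-0” (GNU and BSD both do), and it uses “getconf _NPROCESSORS_ONLN” to count the cores, which does not depend on /proc; “nproc” from coreutils would be another option. The file names and the “.done” marker files are made up for the example.

```shell
#!/bin/sh
# Throwaway directory with awkward file names (a space and a quote).
workdir=$(mktemp -d)
touch "$workdir/plain.txt" "$workdir/with space.txt" "$workdir/it's.txt"

# Number of online CPU cores; fall back to 1 if getconf cannot tell.
jobs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# NUL-separated names reach each job intact, whatever characters they contain.
# Every job creates a "<name>.done" marker next to the file it was given.
find "$workdir" -type f -name '*.txt' -print0 |
    xargs -0 -n 1 -P "$jobs" sh -c 'touch "$1.done"' sh

# Count the markers to confirm that every file was handled exactly once.
done_count=$(find "$workdir" -type f -name '*.done' | wc -l)
echo "jobs: $jobs, files processed: $done_count"

rm -r "$workdir"
```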
April 5, 2013 at 11:46 pm
You write: “””It turned out that there is no package for Ubuntu.”””
This is quite the opposite of what the webpage you link to says: it links to several packages for Ubuntu. Also, https://launchpad.net/ubuntu/quantal/i386/parallel/20120422-1 shows it is officially available in Ubuntu Quantal.
You write: “””The output is scrambled because all parallel processes write to the screen with no locking synchronization. This seems to be fixed in latest versions.”””
That is, however, not the case. What is “fixed” is that the man-page now mentions this problem. You will still need extra work to make sure the output is not mixed.
I am puzzled why you would spend a lot of effort trying to get xargs to do things right, when GNU Parallel is designed to do just that without any extra work. For platforms that still do not have official packages (see the list of official packages on http://www.gnu.org/software/parallel/) GNU Parallel can be installed using 2 simple lines (as mentioned in README):
chmod 755 parallel
and will work on most platforms (unlike your example with /proc/cpuinfo, which is limited to platforms with /proc support).
April 6, 2013 at 9:11 am
I meant that there is no _official_ package. This has been corrected in the blog article. In the meantime, Ubuntu corrected this too and as of the current latest Ubuntu, GNU Parallel is available in the official repository. 🙂
The GNU guys really surprised me with the “fix”! I relied on their comments and didn’t even review the patch. Now I did. It really is just an explanation in the man page. I’ve fixed my initial comment.
In reply to your question why I don’t use the package from source — installing packages from source requires extra work to keep them up to date. I always try to save extra work. Furthermore, “xargs” is a very well tested alternative.
Here is my usage scenario: I have an application which extracts text by OCR. There are a bunch of documents I need to be processed. My application never returns anything to the standard output (unless it fails for some unexpected reason). I can reprocess any document — it will be skipped if it was done already.
“xargs” is a great alternative for me here: I know very well how to pass arguments, including filenames with special characters. I’m not saying that GNU Parallel wouldn’t have done the job too. But it’s not going to do it any better in this case either. 🙂
June 1, 2015 at 11:20 am
Thanks for the reminder about xargs. Unfortunately, my copy of parallel (installed by Ubuntu Trusty) seems broken: it echoes or generates gpu-manager.log several times in the beginning, which leads to hazardous stuff when combined with output redirection. I’m lucky that I didn’t use sudo parallel…