November 12, 2011
by Ivan Zahariev 11 Comments

Boot Linux using Windows 7 boot loader

Windows 7 and Linux live together on the same hard disk in perfect harmony. I had Windows 7 installed first, and a few GBytes of free space at the end of the hard drive which I left unpartitioned. Here is how to install Ubuntu:

Download Ubuntu and burn the ISO on a CD.
Boot from the CD, and install it. Make sure that you choose an empty partition, and also make sure that you select to install the boot loader on the Linux partition (example: on “/dev/sda3”, and not on the main MBR “/dev/sda”).

At this point you have an Ubuntu installation which you cannot boot yet.

Here is how to configure the Windows 7 boot loader to include Ubuntu in the boot choice menu:

Download EasyBCD and install it. EasyBCD is free for non-commercial use and offers a nice GUI to edit the Windows 7 boot loader menu.
Do the following in EasyBCD — Add New Entry -> Operating Systems -> Linux/BSD:
- Type: GRUB 2
- Name: Ubuntu
- Device: (Automatically configured)
Finally, click on “View Settings” in EasyBCD. You should see something similar to the following:

Entry #2
Name: Ubuntu
BCD ID: {1d486d61-64cc-12a5-7d94-af2f5df01535}
Drive: C:\
Bootloader Path: \NST\AutoNeoGrub0.mbr

EasyBCD ships the “stage1” boot loader of GRUB2 (\NST\AutoNeoGrub0.mbr), so you don’t have to do anything else. Just reboot your Windows 7, and the boot menu should present a choice between “Windows 7” and “Ubuntu”.

A note of caution: It is highly recommended that you do a backup of your whole hard disk before you try to install Ubuntu or modify the boot loader options.

P.S. There is no “boot.ini” in Windows 7. You could modify “boot.ini” in Windows XP to achieve the same result, but this does not apply to Windows 7.


September 6, 2011
by Ivan Zahariev 5 Comments

Get default outgoing IP address and interface on Linux

Suppose you have one or more network interfaces, and they have one or more assigned IP addresses, also called aliases. If you need to find out which IP address and interface will be used as a default “source” by your Linux box, you need to execute the following:

ip route get 8.8.8.8

This, of course, assumes that 8.8.8.8 is not directly connected to one of your networks somehow. Since this is one of Google’s public DNS servers, I think it is safe to assume so.

A sample output of the ip command follows:

8.8.8.8 via 10.0.2.2 dev eth0  src 10.0.2.15
    cache

The output is pretty much self-explanatory — the route to “8.8.8.8” will originate from device “eth0”, the used source IP address will be “10.0.2.15”, and the next hop, the (default) gateway, will be “10.0.2.2”.

This method is 100% reliable. The man page of “ip” says that “this command gets a single route to a destination and prints its contents exactly as the kernel sees it”.
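For scripting, the same output can be reduced to just the three values of interest. The following is a minimal sketch, not part of the “ip” tool itself: the function name “parse_route” is made up for this example, and it assumes the output format shown above, where each value follows its “dev”, “src” or “via” keyword.

```shell
# Print "<device> <source-ip> <gateway>" from `ip route get` output on stdin.
# Each value is looked up by the keyword that precedes it, not by column
# number, because the number of fields can differ between routes.
# parse_route is a hypothetical helper name used only in this sketch.
parse_route() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "dev") dev = $(i + 1)
                if ($i == "src") src = $(i + 1)
                if ($i == "via") via = $(i + 1)
            }
        }
        END {
            # Fail if the expected keywords were not found at all.
            if (dev == "" || src == "") exit 1
            # Directly connected destinations have no gateway; print "-".
            print dev, src, (via == "" ? "-" : via)
        }
    '
}
```

It would be used as “ip route get 8.8.8.8 | parse_route”.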

References:

When IP aliasing how does the OS determine which IP address will be used as source for outbound TCP/IP connections?

November 1, 2010
by Ivan Zahariev 5 Comments

sudo hangs and leaves the executed program as “zombie”

Today I discovered a non-security bug in sudo: the executed program finishes successfully, but sudo hangs forever waiting for something. The executed program is left in a “zombie” process state. Here is what the process list looks like, for example:

root 6368 0.0 0.0 2808 1592 pts/6 Ss 18:39 0:00 | \_ -bash
root 1103 0.0 0.0 2200 1000 pts/6 S+ 21:45 0:00 | \_ ./sudo -u root sleep 5
root 1104 0.0 0.0 0 0 pts/6 Z+ 21:45 0:00 | \_ [sleep] <defunct>
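To find such defunct children and the parent that has not reaped them, the process table can be filtered by state. This is only a sketch: “zombies_with_parent” is a name invented here, and it assumes input in the column order produced by “ps -eo pid,ppid,stat,comm” on Linux.

```shell
# Read `ps -eo pid,ppid,stat,comm` output on stdin and print
# "<zombie-pid> <parent-pid>" for every process whose STAT column
# starts with "Z". The header line is skipped automatically because
# its third field ("STAT") does not start with "Z".
# zombies_with_parent is a hypothetical helper name.
zombies_with_parent() {
    awk '$3 ~ /^Z/ { print $1, $2 }'
}
```

It would be used as “ps -eo pid,ppid,stat,comm | zombies_with_parent”; in the situation above it should report the “sleep” process together with the PID of the hanging sudo.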

If we try to trace the system calls of the sudo command, here is what we get:

[root@tester2 ~]# strace -fF -p 1103
select(4, [3], [], NULL, NULL

The sudo process waits endlessly in a select() system call which waits for file descriptor #3. So we quickly check what corresponds to file descriptor #3:

[root@tester2 ~]# ls -la /proc/1103/fd
total 0
dr-x------ 2 root root 0 Nov 1 21:45 .
dr-xr-xr-x 5 root root 0 Nov 1 21:45 ..
lrwx------ 1 root root 64 Nov 1 21:45 0 -> /dev/pts/6
lrwx------ 1 root root 64 Nov 1 21:46 1 -> /dev/pts/6
lrwx------ 1 root root 64 Nov 1 21:45 2 -> /dev/pts/6
lrwx------ 1 root root 64 Nov 1 21:46 3 -> socket:[1136261781]

Socket “socket:[1136261781]” is already closed though (blinking in red on my console). Thus sudo is blocked in a select() waiting for a change on file descriptor #3, which is already closed and will never change its state. Sudo will therefore wait forever.

In order to understand why this happens, we will look at the source code of sudo. Here is a snippet from “exec.c”; only the relevant code is left for clarity:

int
sudo_execve(path, argv, envp, uid, cstat, dowait, bgmode)
{
    int sv[2];

    if (socketpair(PF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        error(1, "cannot create sockets");

    zero_bytes(&sa, sizeof(sa));
    sigemptyset(&sa.sa_mask);

    /* Note: HP-UX select() will not be interrupted if SA_RESTART set */
    sa.sa_flags = SA_INTERRUPT; /* do not restart syscalls */
    sa.sa_handler = handler;
    sigaction(SIGCHLD, &sa, NULL);
    sigaction(SIGHUP, &sa, NULL);
    sigaction(SIGINT, &sa, NULL);
    sigaction(SIGPIPE, &sa, NULL);
    sigaction(SIGQUIT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);

    /* Max fd we will be selecting on. */
    maxfd = sv[0];

    child = fork();
    close(sv[1]);

    fdsr = (fd_set *)emalloc2(howmany(maxfd + 1, NFDBITS), sizeof(fd_mask));
    fdsw = (fd_set *)emalloc2(howmany(maxfd + 1, NFDBITS), sizeof(fd_mask));

    for (;;) {

        zero_bytes(fdsw, howmany(maxfd + 1, NFDBITS) * sizeof(fd_mask));
        zero_bytes(fdsr, howmany(maxfd + 1, NFDBITS) * sizeof(fd_mask));

        FD_SET(sv[0], fdsr);

        if (recvsig[SIGCHLD])
            continue;
        nready = select(maxfd + 1, fdsr, fdsw, NULL, NULL);
        if (nready == -1) {
            if (errno == EINTR)
                continue;
            error(1, "select failed");
        }

    }
}

/*
 * Generic handler for signals passed from parent -> child.
 * The recvsig[] array is checked in the main event loop.
 */
void
handler(s)
    int s;
{
    recvsig[s] = TRUE;
}

The race condition occurs right after the “if” on lines #38 and #39, and just before the select() on line #40. If the parent process gets re-scheduled after the “if” was executed, and at this very moment the child process finishes and SIGCHLD is sent to the parent process, sudo gets in trouble. The SIGCHLD handler records in the variable “recvsig[]” that the signal was received, and only then does the parent process call select(). This select() will never be interrupted the way the author intended, because the signal has already been delivered.

In 99% of the cases, the parent process enters the blocking select() before the child process ends. The child’s exit then raises SIGCHLD, which is recorded by the handler procedure and also interrupts select(); select() returns -1 in “nready”, and “errno” is set to EINTR.

Here is an easy way to reproduce the bug. We add a sleep() of 10 seconds between the “if” and select(), thus simulating that the system was very busy and re-scheduled the parent sudo process right between these two operations. Here is the source diff:

--- sudo-orig/sudo-1.7.4p4/exec.c       Sat Sep  4 00:40:19 2010
+++ sudo-1.7.4p4/exec.c Mon Nov  1 21:48:24 2010
@@ -307,6 +307,10 @@
 
        if (recvsig[SIGCHLD])
            continue;
+       printf("debug: Missed the check for SIGCHLD, the child is still running. SIGCHLD status: %d\n", recvsig[SIGCHLD]);
+       sleep(10); // this will be interrupted by SIGCHLD, because the child exits at some time here (we run "sudo sleep 5")
+       printf("debug: We should have got SIGCHLD by now. SIGCHLD status: %d\n", recvsig[SIGCHLD]);
+       printf("debug: Entering the endless select()...\n");
        nready = select(maxfd + 1, fdsr, fdsw, NULL, NULL);
        if (nready == -1) {
            if (errno == EINTR)

After that we execute sudo, and observe the bug every time:

[root@tester2 sudo-1.7.4p4]# make >/dev/null && ./sudo -u root sleep 5
debug: Missed the check for SIGCHLD, the child is still running. SIGCHLD status: 0
debug: We should have got SIGCHLD by now. SIGCHLD status: 1
debug: Entering the endless select()...
…(this never finishes)…

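The sudo author actually tried to avoid this potential race condition if SIGCHLD is received immediately before we call select() (changeset 5334:99adc5ea7f0a), but a small window between the check and the select() call remains. A simple way to close it is to use a timeout in the select() call, so that the loop re-checks “recvsig[SIGCHLD]” at least once per second: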

--- sudo-orig/sudo-1.7.4p4/exec.c       Sat Sep  4 00:40:19 2010
+++ sudo-1.7.4p4/exec.c Mon Nov  1 21:50:26 2010
@@ -307,7 +307,11 @@
 
        if (recvsig[SIGCHLD])
            continue;
-       nready = select(maxfd + 1, fdsr, fdsw, NULL, NULL);
+       struct timeval timeout;
+       timeout.tv_sec = 1; // Linux resets this, so set it everytime
+       timeout.tv_usec = 0;
+       nready = select(maxfd + 1, fdsr, fdsw, NULL, &timeout);
        if (nready == -1) {
            if (errno == EINTR)
                continue;

The select() mechanism and this bug were introduced somewhere between sudo versions 1.7.2 and 1.7.3. At least that is what I managed to see from the Changelog:

2010-06-29  Todd C. Miller  <Todd.Miller@courtesan.com>
	[72fd1f510a08] [SUDO_1_7_3] <1.7>
...
2009-11-15  Todd C. Miller  <Todd.Miller@courtesan.com>
	* script.c, sudo.c, sudo.h, sudoreplay.c, term.c, tgetpass.c:
	Use a socketpair to pass signals from parent to child.
...
2009-07-12  Todd C. Miller  <Todd.Miller@courtesan.com>
	[f5ad45f69f05] [SUDO_1_7_2]

I’ve reported this bug (#447) to the sudo maintainer, and I’m sure he will fix it when time permits. Because we all depend on sudo and love it. 🙂

September 14, 2010
by Ivan Zahariev 10 Comments

Linux Cached/Buffers memory

I won’t try to explain in detail what Linux Cached/Buffers memory is. In a nutshell, it shows how much of your memory is used for the read cache and for the write cache.

Usually when you look at your system memory usage and see that almost all of the unused memory is allocated for Cached/Buffers, you are happy, because this memory is used for file-system cache, thus your system is running faster.

Today however I observed quite an interesting fact – what I said above is still correct, however you don’t know how often these cache entries are used by the system. After all, it’s not the cache memory usage (or size) which makes the system run faster, but the cache hit ratio. If the file operations get satisfied by the cache (cache hit), then your system is running faster. If the system needs to make a physical disk I/O (cache miss), then you’d need to wait for a good few milliseconds.

What are your options if you want to know whether your file cache is actually being used, or is just sitting there allocated and giving you a false feeling that your system is running faster thanks to the used cache memory? There are two:

(hard) In order to actually know the Cache Hit/Miss ratios for block devices, you’ll need to dig deep into the kernel, as I already explained in the “Is there a way to get Cache Hit/Miss ratios for block devices in Linux” article.
(easy) You can clear the Cached/Buffers memory regularly, see how fast and how much the cache memory grows back, and draw some conclusions about the actual Cache Hit/Miss ratios.

The latter is not a perfect solution, but in all cases gives you a better idea of your file-system cache usage, than just watching the totally used memory in Cached/Buffers, and never actually knowing if it is used/accessed at all.

Therefore, you can run the following every hour in a cron job:

sync ; echo 3 > /proc/sys/vm/drop_caches
sync ; echo 0 > /proc/sys/vm/drop_caches

The commands are safe (see reference for “drop_caches”), and you won’t lose any data; only your caches will be emptied. The disadvantage of this approach is that if the caches were indeed very actively used, Linux will need to read the data back from the disk.
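Since the whole point is to see how much the cache grows back between two drops, it helps to log the Buffers/Cached values right before each drop. The sketch below is one possible way to do that; the function names and the log file argument are made up for this example, and it assumes the usual “/proc/meminfo” layout with “Buffers:” and “Cached:” lines in kB. It writes only the value 3, because the drop happens once at the moment of writing.

```shell
# Print "<buffers_kb> <cached_kb>" from /proc/meminfo-formatted text on stdin.
# cache_usage_kb is a hypothetical helper name.
cache_usage_kb() {
    awk '
        $1 == "Buffers:" { buffers = $2 }
        $1 == "Cached:"  { cached = $2 }
        END { print buffers + 0, cached + 0 }
    '
}

# Hypothetical hourly job: append a timestamp and the cache sizes reached
# since the previous drop to the given log file, then flush dirty pages
# and drop the caches. Must run as root.
log_and_drop_caches() {
    logfile=$1
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(cache_usage_kb < /proc/meminfo)" >> "$logfile"
    sync
    echo 3 > /proc/sys/vm/drop_caches
}
```

A cron job could then call, for example, “log_and_drop_caches /var/log/cache-usage.log” (the path is only an illustration), and the log shows how many kB were cached in each hour.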

A real-world example

Here is what the Cached/Buffers graph of one of my servers looks like over a few days:

[Figure: Linux memory usage]

Pay attention to the points of interest which are marked. Here is the explanation and motivation to write this article:

(Point A) The beginning of the graph shows my system after it just booted, and I did some small administrative tasks on it. After that, one script runs regularly on the machine, and as we see, it doesn’t use much file-system cache, as it doesn’t do many file operations.
Then every day at 06:25 the “/etc/cron.daily” scripts are run and some of them read all files on the file-system. One such script is the updatedb cron job. Because of the heavy disk activity, the Buffers/Cache usage reaches its maximum, as all possible files and metadata are being cached in memory.
(see “Updatedb cron” markers on the graph) After one hour, the scripts finish and there is no significant disk activity on the system any more.
(Point B) But the Cached/Buffers usage never drops down. The file cache doesn’t seem to expire, and is therefore giving us the false feeling that it is being used by our system all the time, thus making it faster. But it isn’t!
On Sat 15:08 the Cached/Buffers cache is cleared manually by the command I provided above, and I installed it as an hourly crontab too.
As you can see, right after the cache was cleared, we see the sad reality – the Cached/Buffers cache was filled with data that nobody needed or accessed, and the high memory usage by Cached/Buffers actually didn’t speed up our system. I grieve for a while and accept the reality, and also understand why so many I/O requests are issued to my EBS storage, even though the cache was so huge.
(Point C) This is what the actual daily usage pattern of this machine looks like. The Cached/Buffers memory cache is heavily underused on my system, as it doesn’t do much I/O work. This wouldn’t be visible if I didn’t clear the cache every hour.

References:

The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads.

August 11, 2010
by Ivan Zahariev 2 Comments

USB: rejected 1 configuration due to insufficient available bus power

If your USB device is not being recognized, execute the command “dmesg” and check if the following output is there:

usb 1-1.4: rejected 1 configuration due to insufficient available bus power

The “1-1.4” ID may be different for your configuration.
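If several devices are attached, the affected IDs can be pulled out of the kernel log automatically. This is a small sketch under the assumption that the message has exactly the wording shown above (optionally with a timestamp prefix and a plural “configurations”); the function name “rejected_usb_ids” is invented for this example.

```shell
# Read dmesg-style text on stdin and print the USB device ID of every
# "insufficient available bus power" rejection, one ID per line.
# rejected_usb_ids is a hypothetical helper name.
rejected_usb_ids() {
    sed -n 's/^.*usb \([0-9][0-9.-]*\): rejected [0-9]* configurations\{0,1\} due to insufficient available bus power.*$/\1/p'
}
```

It would be used as “dmesg | rejected_usb_ids”.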

If, and only if, you are absolutely sure that your USB hub and/or hardware configuration have a safe way to actually supply enough power, you can override this barrier and force the device to be activated despite the error message. A possible situation is where you manually applied 5V external power to your USB device and/or USB hub, like I did on my Bifferboard.

Here is how you can override the power safety mechanism:

echo 1 > /sys/bus/usb/devices/1-1.4/bConfigurationValue

Replace “1-1.4” with your USB device ID. Be careful and have fun!

Resources:

Error insufficient available bus power RT2573.

August 1, 2010
by Ivan Zahariev Leave a comment

Speed up RRDtool database manipulations via RRDs (Perl)

Use case
You are doing a lot of data operations on your RRD files (create, update, fetch, last), and every update is done by a separate Perl process which lives a very short time – the process is launched, it updates or reads the data, does something else, and then exits.

The problem
If you are using RRDtool and Perl as described, you have surely noticed that running many of these processes wastes a lot of CPU resources. The question is: can we optimize this and lessen the performance hit of loading the RRDs library into Perl? We know that launching Perl often is quite expensive in itself, but after all, if we chose to work with Perl, this is a price we should be ready to pay.

The RRDtool shared library is a monolithic piece of code which provides ALL functions of the RRDtool suite: data manipulation, graphics and import/export tools. The last two components bring in huge dependencies on other shared libraries. The library from RRDtool version 1.4.4 depends on 34 other libraries on my Linux box! This must add to the loading time of the RRDtool library into Perl.

Resolution and benchmarks
In order to prove my theory (actually, it was more a theory of zImage, and I just followed, enhanced and tried it), I commented out the implementation of the “graphics” and “import/export tools” modules from the source code of RRDtool. Then I re-compiled the library and did some performance benchmarks. I also re-implemented the RRDs.pm module by replacing the DynaLoader module with the XSLoader one. This made no difference in performance whatsoever. The re-compiled RRD library depends on only 4 other libraries – linux-gate.so.1, libm.so.6, libc.so.6, and /lib/ld-linux.so.2. I think this is the most we can cut down. 🙂

So here are the benchmark results. They show the accumulated time for 1000 invocations of the Perl interpreter with three different configurations:

Only Perl (baseline): 5.454s.
With RRDs, no graphics or import/export functions: 9.744s (+4.290s) +78%.
With standard RRDs: 11.647s (+6.192s) +113%.

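As you can see, stripping the library reduces the start-up overhead of loading RRDs from 113% to 78% of the Perl baseline, a gain of 35 percentage points. Measured on the RRDs loading time alone, the standard library takes 44% longer to load than the stripped-down one (6.192s vs. 4.290s).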
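The overhead figures can be recomputed from the raw timings with a few lines of awk. This is just a worked-arithmetic sketch; the function name “overhead” is made up, and it takes the baseline and the measured time in seconds as its two arguments.

```shell
# Print "<extra seconds> <extra time as percent of baseline>" for a
# measurement compared to a baseline, both given in seconds.
# overhead is a hypothetical helper name.
overhead() {
    awk -v base="$1" -v t="$2" \
        'BEGIN { printf "%.3f %.1f\n", t - base, (t - base) / base * 100 }'
}
```

For example, “overhead 5.454 9.744” prints “4.290 78.7” and “overhead 5.454 11.647” prints “6.193 113.5”; the last digits can differ slightly from the list above because the published timings are already rounded.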

Here are the commands I used for the benchmarks:

Only Perl (baseline): time ( i=1000 ; while [ "$i" -gt 0 ]; do perl -Mwarnings -Mstrict -e '' ; i=$(($i-1)); done )
Perl + RRDs: time ( i=1000 ; while [ "$i" -gt 0 ]; do perl -Mwarnings -Mstrict -MRRDs -e '' ; i=$(($i-1)); done )

February 8, 2010
by Ivan Zahariev 1 Comment

Firefox crashes with “terminate called after throwing an instance of ‘std::bad_alloc'”

If you are here, you probably are as desperate as I was. Though your system has plenty of memory, Firefox keeps crashing with the following error message:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

You can see the above error either by starting “firefox” in your console terminal manually, or by reviewing the file “~/.xsession-errors”, if you are running KDE.

I ran Firefox several times in debug mode via “gdb” and every time the debug output led me in the wrong direction. Here is a sample full backtrace output:

[New Thread 0xadbfeb70 (LWP 3763)]
[Thread 0xadbfeb70 (LWP 3763) exited]
[New Thread 0xadbfeb70 (LWP 3764)]
[Thread 0xadbfeb70 (LWP 3764) exited]
[New Thread 0xadbfeb70 (LWP 3765)]
[New Thread 0xae3ffb70 (LWP 3766)]
[Thread 0xadbfeb70 (LWP 3765) exited]
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Program received signal SIGABRT, Aborted.
0x00227422 in ?? ()
(gdb) bt full
#0  0x00227422 in ?? ()
No symbol table info available.
#1  0x002524d1 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
        resultvar = <value optimized out>
        pid = 3575796
        selftid = 3708
#2  0x00255932 in *__GI_abort () at abort.c:92
        act = {__sigaction_handler = {sa_handler = 0x7a3ff4, sa_sigaction = 0x7a3ff4}, sa_mask = {__val = {3221183748, 3086869600, 3221183704, 7933961,
              3221183688, 1154680, 3221183676, 8013772, 0, 3086866344, 5, 0, 1, 3221183640, 0, 3221183716, 1356543, 3577255, 3221183636, 3035204, 1,
              3086869160, 0, 3221183748, 3221183676, 3221183688, 0, 4294967295, 1359583, 3086869160, 3221183680, 4294967295}}, sa_flags = 8011764,
          sa_restorer = 0x14b2ff}
        sigs = {__val = {32, 0 <repeats 31 times>}}
#3  0x001cc4df in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:93
        terminating = true
        t = <value optimized out>
#4  0x001ca415 in __cxxabiv1::__terminate (handler=0x1cc390 <__gnu_cxx::__verbose_terminate_handler()>)
    at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:38
No locals.
#5  0x001ca452 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
No locals.
#6  0x001ca591 in __cxa_throw (obj=0xad2f9700, tinfo=0x1f97fc, dest=0x1caaf0 <~bad_alloc>) at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:83
        header = <value optimized out>
#7  0x001cac0f in operator new (sz=2) at ../../../../src/libstdc++-v3/libsupc++/new_op.cc:58
        handler = <value optimized out>
        p = <value optimized out>
#8  0x001caced in operator new[] (sz=2) at ../../../../src/libstdc++-v3/libsupc++/new_opv.cc:32
No locals.
#9  0x012ead5c in gfxSkipChars::TakeFrom (this=0xbfff5f1c, aSkipCharsBuilder=0xbfff6f60) at ../../dist/include/thebes/gfxSkipChars.h:152
No locals.
#10 0x012e48fe in BuildTextRunsScanner::BuildTextRunForFrames (this=0xbfff8320, aTextBuffer=0xbfff7280) at nsTextFrameThebes.cpp:1713
        anySmallcapsStyle = 0
        textBreakPoints = {<nsTArray<int>> = {<nsTArray_base> = {static sEmptyHdr = {mLength = 0, mCapacity = 0, mIsAutoArray = 0},
              mHdr = 0xbfff7150}, <No data fields>},
          mAutoBuf = "\001\000\000\000\062\000\000\200\000\000\000\000\220z\377\277\354x\377\277\065\000\000\000\066\000\000\000\000\000\000\000\b\000\000\000\260q\377\277\254q\377\277\220q\377\277\000\202\066\260\030V\241\265\b@q\267\066\000\000\000\240\321\377\263\270\321\377\263\000\000\000\000\b\000\000\000\001\000\000\000\000\000\000\000\b\000\000\000\b\000\000\000\000\000\000\000\304i\005\255\b\000\000\000$\301 \255\364\017\274\001\240\321\377\263\b\000\000\000\254y\377\277\201E\225\001\354x\377\277\240\321\377\263\000\000\000\000\036\352\216\001\000\000\000\000\200g/\255\220z\377\277xr\377\277\256\371\247\001\000\000\000\000\220z\377\277(;\260\001\000\000\000\000\354x\377\277\254r\377\277\364\017\274\001tr\377\277\002\000\000\000<r\377\277"}
        currentTransformedTextOffset = 1
        finalUserData = 0xad2037cc
        userDataToDestroy = 0x0
        nextBreakIndex = 2904569804
        firstFrame = 0xad2037cc
        builder = {mBuffer = {<nsTArray<unsigned char>> = {<nsTArray_base> = {static sEmptyHdr = {mLength = 0, mCapacity = 0, mIsAutoArray = 0},
                mHdr = 0xbfff6f64}, <No data fields>},
            mAutoBuf = "\002\000\000\000\000\001\000\200\001\001\377\277\223\200\223\001\027m\271\000#\000\000\000\031\201\271\000\364\017\274\001<\000\000\000\000\000\000\000\274p\377\277\347\063\225\001\027m\271\000\324\302\355\267\031\201\271\000\256^\005\b@\300\355\267\240\246z\267\004\000\000\000\364\257\005\b\000@\006\255\000\000\000\255\fp\377\277\064u\005\b@\300\355\267\000\000\002\000\320o\377\277\000\000\000\000\062\000\000\200\000\000\000\000[]\005\b\225\351\216\001\240D\006\255\374\301\355\267 \000\000\000\217\350\216\001$p\377\277\002\000\000\000\274p\377\277\225\351\216\001\f\203\377\277\370\202\377\277,p\377\277\000\000\000\000\f\203\377\277\370\202\377\277lp\377\277\000\000\000\000\370\202\377\277\004\000\000\000\002\000\000\000\364\017\274\001,\203\377\277\000\000\000\000lp\377\277_\256.\001,\203\377\277\000\000\000\000\000\000\000\000\225\351\216\001\004", '\000' <repeats 11 times>"\217, \350\216\001\240\201\37---Type <return> to continue, or q <return> to quit---

After many trial-and-error attempts, and after wondering whether my laptop’s memory was faulty or the shared libraries on my disk were somehow corrupted, I was finally able to track down the cause of this abnormal behavior:

The culprit: the Security Device provided by the Siemens HiPath SIcurity Card API. You can read here why I use it.

The problem started somewhere around Firefox version 3.5.5 and affects later versions too. If the security device dongle/card is not plugged into your computer, Firefox crashes at random pages.

The resolution
Create a second Firefox profile and install the Security Device only there, leaving the default Firefox profile with no Security Device capabilities. Thus if you want to use your online banking, you would need to close Firefox and then start it using the second profile. It’s not that bad, if you are a personal user like me who performs bank transactions relatively rarely.

The MozillaZine Knowledge Base has an excellent article about Firefox Profile Manager.

February 5, 2010
by Ivan Zahariev Leave a comment

“dd” sequential write performance tests on a raw block device may be incorrect

…if you use the inappropriate bytes size (bs) option. See the man page of dd for details on this option.

Hard disks have a typical block size of 512 bytes. LVM, on the other hand, creates its block devices with a block size of 4096 bytes. So it’s easy to get confused: even if you know that disks should be tested with blocks of 512 bytes, you should test LVM block devices with a block size of 4096 bytes, not 512 bytes.

What happens if you make a write performance test by writing directly on the raw block device and you use the wrong bytes size (bs) option?

If you look at the “iostat” statistics, they will show lots of read requests too, even though you are only writing.
The problem comes from the fact that when you are not using the proper block size for the raw block device, you are writing partial blocks instead of whole blocks. This is physically not possible, however: the block device can only write one whole block at a time. In order to update only a part of a block, the block needs to be read back first, then modified in memory with the new partial data, and finally written back as a whole block.

The total performance drop is about 3 times on the systems I tested. I’ve tested this on some hard disks and on an Areca RAID-6 volume.

So what’s the lesson here?

When you do sequential write performance tests with “dd” directly on the raw block device, make sure that you use the proper bytes size option, and verify that during the tests you see only write requests in the “iostat” statistics.

Physical hard disk example:

# Here is a bad example for a hard disk device
dd if=/dev/zero of=/dev/sdb1 bs=256 count=5000000

# Here is the proper usage, because the /dev/sdb physical block size is 512 bytes
dd if=/dev/zero of=/dev/sdb1 bs=512 count=5000000

LVM block device example:

# Another bad example, this time for an LVM block device
dd if=/dev/zero of=/dev/sdb-vol/test bs=512 count=1000000

# Here is the proper usage, because the LVM block size is 4096 bytes
dd if=/dev/zero of=/dev/sdb-vol/test bs=4k count=1000000
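Before starting a test, the chosen “bs” value can be checked against the block size of the target device. The helper below is a sketch only: “bs_is_aligned” is an invented name, both arguments must be plain byte counts (so 4096, not “4k”), and it is an assumption of this example that “blockdev --getbsz” reports the block size that matters for the device in question.

```shell
# Succeed (exit status 0) when the dd block size is a whole multiple of
# the device block size, so that dd never writes partial blocks.
# Arguments: <dd bs in bytes> <device block size in bytes>
# bs_is_aligned is a hypothetical helper name.
bs_is_aligned() {
    bs=$1
    dev_bs=$2
    # Guard against a zero or negative divisor before taking the remainder.
    [ "$dev_bs" -gt 0 ] && [ $((bs % dev_bs)) -eq 0 ]
}
```

For example, run as root, “bs_is_aligned 512 "$(blockdev --getbsz /dev/sdb-vol/test)" || echo "bs is too small"” would print the warning for a device that reports 4096 bytes.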

Understanding the “iostat” output during a “dd” test:

Here is what “iostat” displays when you are not using the proper bytes size option (lots of read “r/s” and “rsec/s” requests):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  5867.40 3573.20   46.40 28585.60 47310.40 20.97   110.38   30.61   0.28 100.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00 0.00     0.00    0.00   0.00   0.00
sdb2              0.00  5867.40 3572.80   46.40 28582.40 47310.40 20.97   110.38   30.61   0.28 100.00
dm-2              0.00     0.00 3572.80 5913.80 28582.40 47310.40 8.00 13850.92 1465.43   0.11 100.00

Here is what it should display (no read “r/s” or “rsec/s” requests at all):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00 16510.00    0.00  128.60     0.00 131686.40 1024.00   107.82  840.32   7.78 100.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00 0.00     0.00    0.00   0.00   0.00
sdb2              0.00 16510.00    0.00  128.60     0.00 131686.40 1024.00   107.82  840.32   7.78 100.00
dm-2              0.00     0.00    0.00 16640.00     0.00 133120.00 8.00 13674.86  823.73   0.06 100.00
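The same check can be automated while “dd” is running. The following sketch reads extended “iostat” output and lists the devices that show any read requests; “devices_with_reads” is a name made up for this example, and it assumes that each report starts with a header line beginning with “Device” and containing an “r/s” column, as in the listings above.

```shell
# Read `iostat -x`-style output on stdin and print the name of every
# device whose "r/s" column is greater than zero. The column position
# is taken from the header line, so other column layouts work as well
# as long as the header contains "r/s".
# devices_with_reads is a hypothetical helper name.
devices_with_reads() {
    awk '
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++)
                if ($i == "r/s") col = i
            next
        }
        col > 0 && NF >= col && $col + 0 > 0 { print $1 }
    '
}
```

During a correct write-only test, “iostat -x 5 2 | devices_with_reads” should print nothing for the device under test, assuming no other process is reading from it.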

How to be safe?

Fortunately, file systems are smart enough and pay attention to the block size of the block devices they were mounted on. So if you do a “dd” write performance test and write to a file, you should be fine. Though in this case there are some other complications like journaling, commit intervals, barriers, mount options, etc.

January 25, 2010
by Ivan Zahariev 2 Comments

Why does /sys/block/dm-0/queue/scheduler exist on my Linux system?

The device-mapper (DM) traditionally didn’t have its own I/O scheduler. Then why do my DM devices suddenly have such a scheduler, and what does it control?

A new type of device-mapper was introduced recently in the Linux kernel 2.6.31 – the request-based device-mapper. According to the Linux Kernel Newbies changelog for 2.6.31, there is a commit which does “Prepare for request based option”.

The issue is actually not in the new request-based DM option, which is to be used only for multipath block devices. The problem is that when you create a regular LVM device on kernels 2.6.31+, the DM device itself has I/O scheduler parameters. So does the underlying block device on top of which you created the LVM. Thus we are having two I/O schedulers in the path from the LVM device to the physical storage.

According to the author of the kernel patches for the request-based DM device, Kiyoshi Ueda, for a bio-based DM device, only the underlying device’s scheduler should affect performance. This is what my tests showed too, therefore there is no discrepancy.

Let me summarize this:

If you are *not* using multipath block devices in your DM/LVM setup, then only the underlying device’s scheduler (i.e. “/sys/block/sda/queue/scheduler”) takes effect. This applies to the trivial LVM setup which many of us have used for years.
If you are using a multipath DM/LVM setup, then only the DM device’s scheduler (i.e. “/sys/block/dm-0/queue/scheduler”) takes effect.
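To see which scheduler is currently selected on either device, read the “scheduler” file; the kernel lists the available schedulers and marks the active one with square brackets. The helper below is a minimal sketch with an invented name, “active_scheduler”, and the sample line in the usage note is only a typical example, not output taken from my system.

```shell
# Print the active I/O scheduler from the contents of a
# /sys/block/<dev>/queue/scheduler file read on stdin.
# The active scheduler is the one enclosed in square brackets.
# active_scheduler is a hypothetical helper name.
active_scheduler() {
    sed -n 's/^.*\[\([^]]*\)].*$/\1/p'
}
```

It would be used as “active_scheduler < /sys/block/sda/queue/scheduler”; for a file containing “noop anticipatory deadline [cfq]” it prints “cfq”.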

References:

Request-based Device-mapper multipath and Dynamic load balancing PDF paper by Kiyoshi Ueda, Jun’ichi Nomura, and Mike Christie. You can also download my copy of this PDF.
Using Device-Mapper Multipath by RedHat.

January 9, 2010
by Ivan Zahariev Leave a comment

Changing the ISO image in a virtual CDROM drive while KVM-Qemu is running

If you run KVM with the monitor management console enabled, you can do some pretty powerful internal stuff while the KVM guest is running.

In order to have a KVM-Qemu management console, you should start KVM with something like:

-monitor telnet:127.0.0.1:3010,server,nowait,ipv4

See the official documentation of Qemu for more details and also the man page of qemu-kvm (unofficial mirror).

Once you have it set up, you can then telnet to the management console and review the available commands:

famzah@famzahpc:~$ telnet localhost 3010
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

QEMU 0.11.0 monitor - type 'help' for more information
(qemu) help

Changing the ISO image of a virtual CDROM drive is quite easy:

First review what the current status of the drives is:

(qemu) info block
virtio0: type=hd removable=0 file=/dev/sdb-vol/win7 ro=0 drv=host_device encrypted=0
ide0-cd0: type=cdrom removable=1 locked=0 file=/shared/win7-eval.iso ro=0 drv=raw encrypted=0
ide1-cd0: type=cdrom removable=1 locked=0 [not inserted]

Then change the mounted ISO image in the CDROM drive on the fly:
```
(qemu) change ide1-cd0 /shared/win-virtio-drivers.iso
```

Double-check that the changes took effect. KVM-Qemu will not issue an error message in case something went wrong (duh!):

(qemu) info block
virtio0: type=hd removable=0 file=/dev/sdb-vol/win7 ro=0 drv=host_device encrypted=0
ide0-cd0: type=cdrom removable=1 locked=0 file=/shared/win7-eval.iso ro=0 drv=raw encrypted=0
ide1-cd0: type=cdrom removable=1 locked=0 file=/shared/win-virtio-drivers.iso ro=1 drv=raw encrypted=0
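Because the monitor stays silent on failure, a script that swaps images may want to verify the result itself. The sketch below parses “info block” output in the format shown above; “drive_image” is a name invented for this example, and it assumes that the output has already been captured as text and that image paths contain no spaces.

```shell
# Print the image file attached to the given drive, based on `info block`
# output read on stdin. Exits with a non-zero status and prints nothing
# if the drive is not listed or has no medium inserted.
# Argument: <drive name>, e.g. ide1-cd0
# drive_image is a hypothetical helper name.
drive_image() {
    awk -v drive="$1:" '
        $1 == drive {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^file=/) { print substr($i, 6); found = 1 }
        }
        END { exit !found }
    '
}
```

With the captured output saved in a file, say “info-block.txt” (a made-up name), the check could be written as “[ "$(drive_image ide1-cd0 < info-block.txt)" = /shared/win-virtio-drivers.iso ]”.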

Use the “help” command to review the other powerful commands which you can use to tune and debug your running KVM guest (“info”, “migrate” and “system_reset” seem like interesting candidates).

/contrib/famzah

Enthusiasm never stops

Tag Archives: Linux
