/contrib/famzah

Enthusiasm never stops


10 Comments

Linux Cached/Buffers memory

I won’t try to explain in details what Linux Cached/Buffers memory is. In a nutshell, it shows how much of your memory is used for the read cache and for the write cache.

Usually when you look at your system memory usage and see that almost all of the unused memory is allocated for Cached/Buffers, you are happy, because this memory is used for file-system cache, thus your system is running faster.

Today however I observed quite an interesting fact – what I said above is still correct, however you don’t know how often these cache entries are used by the system. After all, it’s not the cache memory usage (or size) which makes the system run faster, but the cache hit ratio. If the file operations get satisfied by the cache (cache hit), then your system is running faster. If the system needs to make a physical disk I/O (cache miss), then you’d need to wait for a good few milliseconds.

What are your options, in order to know if your file cache is actually being used or is just sitting there allocated, giving you a false feeling that your system is running faster thanks to the used cache memory:

  1. (hard) In order to actually know the Cache Hit/Miss ratios for block devices, you’ll need to dig deep into the kernel, as I already explained in the “Is there a way to get Cache Hit/Miss ratios for block devices in Linux” article.
  2. (easy) You can clear the Cached/Buffers memory regularly, see how fast and how much the cache memory grows back, and draw some conclusions about the actual Cache Hit/Miss ratios.

The latter is not a perfect solution, but in all cases gives you a better idea of your file-system cache usage, than just watching the totally used memory in Cached/Buffers, and never actually knowing if it is used/accessed at all.

Therefore, you can run the following every hour in a cron job:

sync ; echo 3 > /proc/sys/vm/drop_caches
sync ; echo 0 > /proc/sys/vm/drop_caches

The commands are safe (see reference for “drop_caches“), and you won’t lose any data, just your caches will be zeroed. The disadvantage of this approach is that if the caches were very actively used indeed, Linux would need to read the data back from the disk.

A real-world example

 

Here is how the Cached/Buffers graphics of a server of mine looks for the following few days:

Linux memory usage

Pay attention to the points of interest which are marked. Here is the explanation and motivation to write this article:

  1. (Point A) The beginning of the graph shows my system after it just booted, and I did some small administrative tasks on it. After that, one script runs regularly on the machine, and as we see, it doesn’t use much file-system cache, as it doesn’t do many file operations.
  2. Then every day at 06:25 the “/etc/cron.daily” scripts are run and some of them read all files on the file-system. Such a script is the updatedb cron job. Because of the great disk activity, the Buffers/Cache usage gets maximal, as all possible files and meta data are being cached in memory.
  3. (see “Updatedb cron” markers on the graphics) After one hour, the scripts finish and no significant disk activity is done on the system any more.
  4. (Point B) But the Cached/Buffers usage never drops down. The file cache doesn’t seem to expire, and is therefore giving us the false feeling that it is being used by our system all the time, thus making it faster. But it isn’t!
  5. On Sat 15:08 the Cached/Buffers cache is cleared manually by the command I provided above, and I installed it as an hourly crontab too.
  6. As you can see, right after the cache was cleared, we see the sad reality – the Cached/Buffers cache was filled with data that nobody needed or accessed, and the high memory usage by Cached/Buffers actually didn’t speed up our system. I grieve for a while and accept the reality, and also understand why so many I/O requests are issued to my EBS storage, even though the cache was so huge.
  7. (Point C) That’s how the actual daily usage pattern of this machine looks like. The Cached/Buffers memory cache is heavily underused on my system, as it doesn’t do much I/O work. This wouldn’t be visible if I don’t clear the cache every hour.

References:


2 Comments

USB: rejected 1 configuration due to insufficient available bus power

If your USB device is not being recognized, execute the command “dmesg” and check if the following output is there:

usb 1-1.4: rejected 1 configuration due to insufficient available bus power

The “1-1.4” ID may be different for your configuration.

If, and only if, you are absolutely sure that your USB hub and/or hardware configuration have a safe way to actually supply enough power, you can override this barrier and force the device to be activated despite of the error message. A possible situation is where you manually applied 5V external power on your USB device and/or USB hub, like I did on my Bifferboard.

Here is how you can override the power safety mechanism:

echo 1 > /sys/bus/usb/devices/1-1.4/bConfigurationValue

Replace “1-1.4” with your USB device ID. Be careful and have fun!


Resources:


Leave a comment

Speed up RRDtool database manipulations via RRDs (Perl)

Use case
You are doing a lot of data operations on your RRD files (create, update, fetch, last), and every update is done by a separate Perl process which lives a very short time – the process is launched, it updates or reads the data, does something else, and then exits.

The problem
If you are using RRDtool and Perl as described, you surely have noticed that running many of these processes wastes a lot of CPU resources. The question is – can we do some performance optimizations, and lessen the performance hit of loading the RRDs library into Perl? We know that launching often Perl itself is quite expensive, but after all, if we chose to work with Perl, this is a price we should be ready to pay.

The RRDtool shared library is a monolithic piece of code which provides ALL functions of the RRDtool suite – data manipulation, graphics and import/export tools. The last two components bring huge dependencies in regards to other shared libraries. The library from RRDtool version 1.4.4 depends on 34 other libraries on my Linux box! This must add up to the loading time of the RRDtool library into Perl.

Resolution and benchmarks
In order to prove my theory (actually, it was more a theory of zImage, and I just followed, enhanced and tried it), I commented out the implementation of the “graphics” and “import/export tools” modules from the source code of RRDtool. Then I re-compiled the library and did some performance benchmarks. I also re-implemented the RRDs.pm module by replacing the DynaLoader module with the XSLoader one. This made no difference in performance whatsoever. The re-compiled RRD library depends on only 4 other libraries – linux-gate.so.1, libm.so.6, libc.so.6, and /lib/ld-linux.so.2. I think this is the most we can cut down. 🙂

So here are the benchmark results. They show the accumulated time for 1000 invocations of the Perl interpreter with three different configurations:

  • Only Perl (baseline): 5.454s.
  • With RRDs, no graphics or import/export functions: 9.744s (+4.290s) +78%.
  • With standard RRDs: 11.647s (+6.192s) +113%.

As you can see, you can make Perl + RRDs start 35% faster. The speed up for RRDs itself is 44%.


Here are the commands I used for the benchmarks:

  • Only Perl (baseline): time ( i=1000 ; while [ “$i” -gt 0 ]; do perl -Mwarnings -Mstrict -e ” ; i=$(($i-1)); done )
  • Perl + RRDs: time ( i=1000 ; while [ “$i” -gt 0 ]; do perl -Mwarnings -Mstrict -MRRDs -e ” ; i=$(($i-1)); done )


1 Comment

Firefox crashes with “terminate called after throwing an instance of ‘std::bad_alloc'”

If you are here, you probably are as desperate as I was. Though your system has plenty of memory, Firefox keeps crashing with the following error message:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

You can see the above error either by starting “firefox” in your console terminal manually, or by reviewing the file “~/.xsession-errors”, if you are running KDE.

I ran Firefox several times in debug mode via “gdb” and every time the debug output lead me to the wrong direction. Here is a sample full backtrace output:

[New Thread 0xadbfeb70 (LWP 3763)]
[Thread 0xadbfeb70 (LWP 3763) exited]
[New Thread 0xadbfeb70 (LWP 3764)]
[Thread 0xadbfeb70 (LWP 3764) exited]
[New Thread 0xadbfeb70 (LWP 3765)]
[New Thread 0xae3ffb70 (LWP 3766)]
[Thread 0xadbfeb70 (LWP 3765) exited]
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Program received signal SIGABRT, Aborted.
0x00227422 in ?? ()
(gdb) bt full
#0  0x00227422 in ?? ()
No symbol table info available.
#1  0x002524d1 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
        resultvar = <value optimized out>
        pid = 3575796
        selftid = 3708
#2  0x00255932 in *__GI_abort () at abort.c:92
        act = {__sigaction_handler = {sa_handler = 0x7a3ff4, sa_sigaction = 0x7a3ff4}, sa_mask = {__val = {3221183748, 3086869600, 3221183704, 7933961,
              3221183688, 1154680, 3221183676, 8013772, 0, 3086866344, 5, 0, 1, 3221183640, 0, 3221183716, 1356543, 3577255, 3221183636, 3035204, 1,
              3086869160, 0, 3221183748, 3221183676, 3221183688, 0, 4294967295, 1359583, 3086869160, 3221183680, 4294967295}}, sa_flags = 8011764,
          sa_restorer = 0x14b2ff}
        sigs = {__val = {32, 0 <repeats 31 times>}}
#3  0x001cc4df in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:93
        terminating = true
        t = <value optimized out>
#4  0x001ca415 in __cxxabiv1::__terminate (handler=0x1cc390 <__gnu_cxx::__verbose_terminate_handler()>)
    at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:38
No locals.
#5  0x001ca452 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
No locals.
#6  0x001ca591 in __cxa_throw (obj=0xad2f9700, tinfo=0x1f97fc, dest=0x1caaf0 <~bad_alloc>) at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:83
        header = <value optimized out>
#7  0x001cac0f in operator new (sz=2) at ../../../../src/libstdc++-v3/libsupc++/new_op.cc:58
        handler = <value optimized out>
        p = <value optimized out>
#8  0x001caced in operator new[] (sz=2) at ../../../../src/libstdc++-v3/libsupc++/new_opv.cc:32
No locals.
#9  0x012ead5c in gfxSkipChars::TakeFrom (this=0xbfff5f1c, aSkipCharsBuilder=0xbfff6f60) at ../../dist/include/thebes/gfxSkipChars.h:152
No locals.
#10 0x012e48fe in BuildTextRunsScanner::BuildTextRunForFrames (this=0xbfff8320, aTextBuffer=0xbfff7280) at nsTextFrameThebes.cpp:1713
        anySmallcapsStyle = 0
        textBreakPoints = {<nsTArray<int>> = {<nsTArray_base> = {static sEmptyHdr = {mLength = 0, mCapacity = 0, mIsAutoArray = 0},
              mHdr = 0xbfff7150}, <No data fields>},
          mAutoBuf = "\001\000\000\000\062\000\000\200\000\000\000\000\220z\377\277\354x\377\277\065\000\000\000\066\000\000\000\000\000\000\000\b\000\000\000\260q\377\277\254q\377\277\220q\377\277\000\202\066\260\030V\241\265\b@q\267\066\000\000\000\240\321\377\263\270\321\377\263\000\000\000\000\b\000\000\000\001\000\000\000\000\000\000\000\b\000\000\000\b\000\000\000\000\000\000\000\304i\005\255\b\000\000\000$\301 \255\364\017\274\001\240\321\377\263\b\000\000\000\254y\377\277\201E\225\001\354x\377\277\240\321\377\263\000\000\000\000\036\352\216\001\000\000\000\000\200g/\255\220z\377\277xr\377\277\256\371\247\001\000\000\000\000\220z\377\277(;\260\001\000\000\000\000\354x\377\277\254r\377\277\364\017\274\001tr\377\277\002\000\000\000<r\377\277"}
        currentTransformedTextOffset = 1
        finalUserData = 0xad2037cc
        userDataToDestroy = 0x0
        nextBreakIndex = 2904569804
        firstFrame = 0xad2037cc
        builder = {mBuffer = {<nsTArray<unsigned char>> = {<nsTArray_base> = {static sEmptyHdr = {mLength = 0, mCapacity = 0, mIsAutoArray = 0},
                mHdr = 0xbfff6f64}, <No data fields>},
            mAutoBuf = "\002\000\000\000\000\001\000\200\001\001\377\277\223\200\223\001\027m\271\000#\000\000\000\031\201\271\000\364\017\274\001<\000\000\000\000\000\000\000\274p\377\277\347\063\225\001\027m\271\000\324\302\355\267\031\201\271\000\256^\005\b@\300\355\267\240\246z\267\004\000\000\000\364\257\005\b\000@\006\255\000\000\000\255\fp\377\277\064u\005\b@\300\355\267\000\000\002\000\320o\377\277\000\000\000\000\062\000\000\200\000\000\000\000[]\005\b\225\351\216\001\240D\006\255\374\301\355\267 \000\000\000\217\350\216\001$p\377\277\002\000\000\000\274p\377\277\225\351\216\001\f\203\377\277\370\202\377\277,p\377\277\000\000\000\000\f\203\377\277\370\202\377\277lp\377\277\000\000\000\000\370\202\377\277\004\000\000\000\002\000\000\000\364\017\274\001,\203\377\277\000\000\000\000lp\377\277_\256.\001,\203\377\277\000\000\000\000\000\000\000\000\225\351\216\001\004", '\000' <repeats 11 times>"\217, \350\216\001\240\201\37---Type <return> to continue, or q <return> to quit---

After much try-and-error attempts, and also thoughts if my laptop’s memory wasn’t faulty or if the shared libraries on my disk weren’t somehow corrupted, I was finally able to track down the cause of this abnormal behavior:

BUG: The Security Device which the Siemens HiPath SIcurity Card API provided. You can read here why I use it.

The problem started somewhere around Firefox version 3.5.5 and later. If the security device dongle/card is not plugged in your computer, Firefox crashes at random pages.

The resolution
Create a second Firefox profile and install the Security Device only there, leaving the default Firefox profile with no Security Device capabilities. Thus if you want to use your online banking, you would need to close Firefox and then start it using the second profile. It’s not that bad, if you are a personal user like me who performs bank transactions relatively rarely.

The MozillaZine Knowledge Base has an excellent article about Firefox Profile Manager.


Leave a comment

“dd” sequential write performance tests on a raw block device may be incorrect

…if you use the inappropriate bytes size (bs) option. See the man page of dd for details on this option.

Hard disks have a typical block size of 512 bytes. LVM on the other hand creates its block devices with a block size of 4096 bytes. So it’s easy to get confused – even if you know that disks should be tested with blocks of 512 bytes, you shouldn’t test LVM block devices with a 512-bytes but with a 4096-bytes block size.

What happens if you make a write performance test by writing directly on the raw block device and you use the wrong bytes size (bs) option?

If you look at the “iostat” statistics, they will show lots of read requests too, when you are only writing. This is not what is expected when you do only writing.
The problem comes by the fact that when you are not using the proper block size for the raw block device, instead of writing whole blocks, you are writing partial blocks. This is however physically not possible – the block device can only write one whole block at a time. In order to update the data in only a part of a block, this block needs to be read back first, then modified with the new partial data in memory and finally written back as a whole block.

The total performance drop is about 3 times on the systems I tested. I’ve tested this on some hard disks and on an Areca RAID-6 volume.

So what’s the lesson here?

When you do sequential write performance tests with “dd” directly on the raw block device, make sure that you use the proper bytes size option, and verify that during the tests you see only write requests in the “iostat” statistics.

Physical hard disk example:

# Here is a bad example for a hard disk device
dd if=/dev/zero of=/dev/sdb1 bs=256 count=5000000

# Here is the proper usage, because /dev/sda physical block size is 512 bytes
dd if=/dev/zero of=/dev/sdb1 bs=512 count=5000000 

LVM block device example:

# Another bad example, this time for an LVM block device
dd if=/dev/zero of=/dev/sdb-vol/test bs=512 count=1000000

# Here is the proper usage, because the LVM block size is 4096 bytes
dd if=/dev/zero of=/dev/sdb-vol/test bs=4k count=1000000

Understanding the “iostat” output during a “dd” test:

Here is what “iostat” displays when you are not using the proper bytes size option (lots of read “r/s” and “rsec/s” requests):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  5867.40 3573.20   46.40 28585.60 47310.40 20.97   110.38   30.61   0.28 100.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00 0.00     0.00    0.00   0.00   0.00
sdb2              0.00  5867.40 3572.80   46.40 28582.40 47310.40 20.97   110.38   30.61   0.28 100.00
dm-2              0.00     0.00 3572.80 5913.80 28582.40 47310.40 8.00 13850.92 1465.43   0.11 100.00 

Here is what it should display (no read “r/s” or “rsec/s” requests at all):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00 16510.00    0.00  128.60     0.00 131686.40 1024.00   107.82  840.32   7.78 100.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00 0.00     0.00    0.00   0.00   0.00
sdb2              0.00 16510.00    0.00  128.60     0.00 131686.40 1024.00   107.82  840.32   7.78 100.00
dm-2              0.00     0.00    0.00 16640.00     0.00 133120.00 8.00 13674.86  823.73   0.06 100.00 

How to be safe?

Fortunately, file systems are smart enough and pay attention to the block size of the block devices they were mounted on. So if you do a “dd” write performance test and write to a file, you should be fine. Though in this case there are some other complications like journaling, commit intervals, barriers, mount options, etc.


2 Comments

Why /sys/block/dm-0/queue/scheduler exists on my Linux system?

The device-mapper (DM) traditionally didn’t have its own I/O scheduler. Then why suddenly my DM devices have such a scheduler and what does it control?


A new type of device-mapper was introduced recently in the Linux kernel 2.6.31 – the request-based device-mapper. According to the Linux Kernel Newbies changelog for 2.6.31, there is a commit which does “Prepare for request based option”.

The issue is actually not in the new request-based DM option, which is to be used only for multipath block devices. The problem is that when you create a regular LVM device on kernels 2.6.31+, the DM device itself has I/O scheduler parameters. So does the underlying block device on top of which you created the LVM. Thus we are having two I/O schedulers in the path from the LVM device to the physical storage.

According to the author of the kernel patches for the request-based DM device, Kiyoshi Ueda, for a bio-based DM device, only the underlying device’s scheduler should affect performance. This is what my tests shown too, therefore there is no discrepancy.

Let me summarize this:

  • If you are *not* using multipath block devices in your DM/LVM setup, then only the underlying device’s scheduler (i.e. “/sys/block/sda/queue/scheduler”) takes effect. This applies for the trivial LVM setup which many of us used for years.
  • If you are using a multipath DM/LVM setup, then only the DM device’s scheduler (i.e. “/sys/block/dm-0/queue/scheduler”) takes effect.

References:


Leave a comment

Changing the ISO image in a virtual CDROM drive while KVM-Qemu is running

If you run KVM with enabled monitor management console, you can do some pretty powerful internal stuff while the KVM guest is running.

In order to have a KVM-Qemu management console, you should start KVM with something like:

-monitor telnet:127.0.0.1:3010,server,nowait,ipv4

See the official documentation of Qemu for more details and also the man page of qemu-kvm (unofficial mirror).

Once you have it set up, you can then telnet to the management console and review the available commands:

famzah@famzahpc:~$ telnet localhost 3010
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

QEMU 0.11.0 monitor - type 'help' for more information
(qemu) help

Changing the ISO image of a virtual CDROM drive is quite easy:

  • First review what the current status of the drives is:
    (qemu) info block
    virtio0: type=hd removable=0 file=/dev/sdb-vol/win7 ro=0 drv=host_device encrypted=0
    ide0-cd0: type=cdrom removable=1 locked=0 file=/shared/win7-eval.iso ro=0 drv=raw encrypted=0
    ide1-cd0: type=cdrom removable=1 locked=0 [not inserted]
    
  • Then change the mounted ISO image in the CDROM drive on the fly:
    (qemu) change ide1-cd0 /shared/win-virtio-drivers.iso
    
  • Double-check that the changes took effect. KVM-Qemu will not issue an error message in case something went wrong (duh!):
    (qemu) info block
    virtio0: type=hd removable=0 file=/dev/sdb-vol/win7 ro=0 drv=host_device encrypted=0
    ide0-cd0: type=cdrom removable=1 locked=0 file=/shared/win7-eval.iso ro=0 drv=raw encrypted=0
    ide1-cd0: type=cdrom removable=1 locked=0 file=/shared/win-virtio-drivers.iso ro=1 drv=raw encrypted=0
    

Use the “help” command to review the other powerful commands which you can use to tune and debug your running KVM guest (“info”, “migrate” and “system_reset” seem like interesting candidates).


11 Comments

KVM-Qemu Virtio storage and network drivers for 32-bit/64-bit Windows 7, Windows Vista, Windows XP and Windows 2000

…bundled as ISO images, so that you can easily mount and use them in a KVM guest.

UPDATE: It seems that Fedora started to provide the latest drivers bundled as an ISO. Check the official Windows VirtIO Drivers page for links.


Download locations follow:

These are static ISO images, and I’ve built them by downloading the ZIP sources dated 24.09.2009 from the official WindowsGuestDrivers KVM page and then converting them to ISO image files by using K3b.

Note that Virtio provides noticeably faster disk and network access.

Please review the official page of Virtio for sample KVM command line arguments which set up Virtio storage and network devices. You may notice that there is an (undocumented) parameter “boot=on” specified for the “-drive” option. This “boot=on” parameter is vital for the “-drive” option, or else Windows 7 won’t like your drive and won’t install on it.

Note about Virtio storage drives and the Windows 7 installer
I was able to install Windows 7 right from the start by using a Virtio storage drive within the KVM guest. At first the Windows installer didn’t see the Virtio disk at all but there is an option to install additional storage drivers. I installed the Virtio Windows drivers from the above ISO images, the Windows installer detected the Virtio storage disk properly and everything went quite smooth afterwards.


Resources:


Leave a comment

Linux non-root user processes which run with group=root cannot change their process group to an arbitrary one

Don’t be fooled like me, here is what the Linux kernel experts say regarding this matter:

There is no such thing as a “super-user group”. No group has any special privileges.

And also:

An effective group ID of zero does not accord any special privileges to change groups. This is a potential source of confusion: it is tempting to assume incorrectly that since appropriate privileges are carried by the euid in the setuid-like calls, they will be carried by the egid in the setgid-like calls, but this is not how it actually works.

You can review the whole thread with subject “EUID != root + EGID = root, and CAP_SETGID” at the Linux Kernel Mailing List.


If you run a Linux process with “user” privileges which are non-root and “group” privileges which are root, you will not be able to change the “group” of this process run-time by setgid() functions to an arbitrary group of the system.

I expected that if a process runs with “group” privileges set to “root”, then this process has the CAP_SETGID capability and thus is able to change its “group” to any group of the system. This turns out not to be the case. Such a process can only work with files which are accessible by the “root” group, just as it would have been if the “group” was not “root” but any other arbitrary group.

Why would I want to change the group to an arbitrary one? Because the process may start with “group” root, do some privileged work and then completely drop “group” root privileges to some non-root “group”, for security reasons.


I tested this on Linux kernel 2.6.31 but the situation for some previous versions was the same too:

famzah@famzahpc:~$ cat /proc/version
Linux version 2.6.31-14-generic (buildd@rothera) (gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu8) ) #48-Ubuntu SMP Fri Oct 16 14:04:26 UTC 2009

Note that a process can run with non-root “user” and a root “group” for one of the following reasons:

  • The process was started by the “root” super-user. This case is of no interest for this article.
  • The process was started by a “user” which has “root” defined for group in /etc/passwd, for example:

    testor:x:1003:0:,,,:/home/testor:/bin/bash

  • The process was started by a non-root “user” but the “group” owner of the file is root and this file has the set-group-ID on execution by “chmod g+s” permission. For example:

    -rwxr-sr-x 1 famzah root 10616 2009-12-11 11:17 a.out

Here is (one of) the corresponding code in the kernel which checks if a process can switch its running “group”:

setregid() in kernel/sys.c:
        if (egid != (gid_t) -1) {
                if ((old_rgid == egid) ||
                    (current->egid == egid) ||
                    (current->sgid == egid) ||
                    capable(CAP_SETGID))
                        new_egid = egid;
                else
                        return -EPERM;
        }

As you can see, even though a process may have root for “group”, it will not be able to change it to an arbitrary one if the process doesn’t have the CAP_SETGID capability.

I think that once a process gets an effective group ID which is root, this process must be granted the CAP_SETGID capability. I’ve sent a request for comment to the Linux Kernel Mailing List and I’m awaiting their reply on this matter.


You can easily test it yourself that a process running with group “root” doesn’t get the CAP_SETGID regardless if it was run with the set-group-ID file permission and “group” owner root, or by a non-root user which has root set for “group” in /etc/passwd.

Here are my results:

famzah@famzahpc:~$ ls -la a.out && ./a.out
-rwxr-xr-x 1 famzah famzah 8650 2009-12-11 12:06 a.out
RUID=1000, EUID=1000, SUID=1000
RGID=1000, EGID=1000, SGID=1000

Capabilities list returned by cap_to_text(): =
CAP_SETUID: CAP_EFFECTIVE=n CAP_INHERITABLE=n CAP_PERMITTED=n
CAP_SETGID: CAP_EFFECTIVE=n CAP_INHERITABLE=n CAP_PERMITTED=n


famzah@famzahpc:~$ ls -la a.out && ./a.out
-rwxr-sr-x 1 famzah root 8650 2009-12-11 12:06 a.out
RUID=1000, EUID=1000, SUID=1000
RGID=1000, EGID=0, SGID=0

Capabilities list returned by cap_to_text(): =
CAP_SETUID: CAP_EFFECTIVE=n CAP_INHERITABLE=n CAP_PERMITTED=n
CAP_SETGID: CAP_EFFECTIVE=n CAP_INHERITABLE=n CAP_PERMITTED=n


famzah@famzahpc:~$ cat /etc/passwd|grep testor
testor:x:1003:0:,,,:/home/testor:/bin/bash
famzah@famzahpc:~$ cp a.out /tmp/
famzah@famzahpc:~$ sudo -u testor /tmp/a.out
[sudo] password for famzah:
RUID=1003, EUID=1003, SUID=1003
RGID=0, EGID=0, SGID=0

Capabilities list returned by cap_to_text(): =
CAP_SETUID: CAP_EFFECTIVE=n CAP_INHERITABLE=n CAP_PERMITTED=n
CAP_SETGID: CAP_EFFECTIVE=n CAP_INHERITABLE=n CAP_PERMITTED=n


# Only if you are run by "user" root (or set-user-ID root),
# you can change your "group", because you gain CAP_SETGID.

famzah@famzahpc:~$ sudo -u root /tmp/a.out
RUID=0, EUID=0, SUID=0
RGID=0, EGID=0, SGID=0

Capabilities list returned by cap_to_text(): =ep
CAP_SETUID: CAP_EFFECTIVE=y CAP_INHERITABLE=n CAP_PERMITTED=y
CAP_SETGID: CAP_EFFECTIVE=y CAP_INHERITABLE=n CAP_PERMITTED=y

The source code of the capabilities dumper follows:

/* Compile with: gcc -Wall -lcap capdump.c */

#define _GNU_SOURCE
#include <sys/capability.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

#define NUM_CAPS_TESTED 2

void manually_dump_caps(cap_t caps) {
        const cap_value_t cap_value_codes[NUM_CAPS_TESTED] = {CAP_SETUID, CAP_SETGID};
        const char *cap_value_names[NUM_CAPS_TESTED] = {"CAP_SETUID", "CAP_SETGID"};
        const cap_flag_t cap_flag_codes[3] = {CAP_EFFECTIVE, CAP_INHERITABLE, CAP_PERMITTED};
        const char *cap_flag_names[3] = {"CAP_EFFECTIVE", "CAP_INHERITABLE", "CAP_PERMITTED"};
        // -- //
        cap_flag_value_t flag_got_value;
        int i, j;

        for (i = 0; i < NUM_CAPS_TESTED; ++i) {
                printf("%s: ", cap_value_names[i]);
                for (j = 0; j < 3; ++j) {
                        if (cap_get_flag(caps, cap_value_codes[i], cap_flag_codes[j], &flag_got_value) != 0) {
                                err(EXIT_FAILURE, "cap_get_flag()");
                        }
                        printf("%s=%s ", cap_flag_names[j], (flag_got_value ? "y" : "n"));
                }
                printf("\n");
        }
}

void safe_cap_free(void *obj_d) {
        if (cap_free(obj_d) != 0) {
                err(EXIT_FAILURE, "cap_free()");
        }
}

int main() {
        cap_t caps;
        char *human_readable_s;
        uid_t ruid, euid, suid;
        gid_t rgid, egid, sgid;

        if (getresuid(&ruid, &euid, &suid) != 0) {
                err(EXIT_FAILURE, "getresuid()");
        }
        if (getresgid(&rgid, &egid, &sgid) != 0) {
                err(EXIT_FAILURE, "getresgid()");
        }

        printf("RUID=%d, EUID=%d, SUID=%d\n", ruid, euid, suid);
        printf("RGID=%d, EGID=%d, SGID=%d\n\n", rgid, egid, sgid);

        // -- //

        caps = cap_get_proc();
        if (caps == NULL) {
                err(EXIT_FAILURE, "cap_get_proc()");
        }

        human_readable_s = cap_to_text(caps, NULL /* do not store length */);
        if (human_readable_s == NULL) {
                err(EXIT_FAILURE, "cap_to_text()");
        }

        printf("Capabilities list returned by cap_to_text(): %s\n", human_readable_s);
        manually_dump_caps(caps);

        // -- //

        safe_cap_free(human_readable_s);
        safe_cap_free(caps);

        return 0;
}


1 Comment

Bifferboard performance benchmarks

The benchmarks were done while Bifferboard was running Linux kernel 2.6.30.5 and Debian Lenny.

Boot time
Total boot time: 1 minute 11 seconds (standard Debian Lenny base installation)

The boot process goes like this:

  • Initial boot. Mounted root device (5 seconds wasted on waiting for the USB mass storage to be initialized). Executing INIT. [+21 seconds elapsed]
  • Waiting for udev to be initialized (most of the time spent here), configuring some misc settings, no dhcp network, entering Runlevel 2. [+41 seconds elapsed]
  • Started “rsyslogd”, “sshd”, “crond”. Got prompt on the serial console. [+9 seconds elapsed]

Therefore, if a very limited and custom Linux installation is used, the total boot time could be reduced almost twice.

CPU speed
Calculated BogoMips: 56.32
According to a quite complete BogoMips list table, this is an equivalent of Pentium@133MHz or 486DX4@100MHz.

According to another CPU benchmarks comparison table for Linux, Bifferboard falls into the category of Pentium@100Mhz.

Memory speed
A “dd” write to “/dev/shm” performs with a speed of 6.3 MB/s.
The MBW memory bandwidth benchmark shows the following results:

bifferboard:/tmp# wget
http://de.archive.ubuntu.com/ubuntu/pool/universe/m/mbw/mbw_1.1.1-1_i386.deb
bifferboard:/tmp# dpkg -i mbw_1.1.1-1_i386.deb
bifferboard:/tmp# mbw 4 -n 20|egrep ^AVG
AVG Method: MEMCPY Elapsed: 0.15602 MiB: 4.00000 Copy:
25.637 MiB/s
AVG Method: DUMB Elapsed: 0.06611 MiB: 4.00000 Copy:
60.502 MiB/s
AVG Method: MCBLOCK Elapsed: 0.06619 MiB: 4.00000 Copy:
60.431 MiB/s

For comparison, my Pentium Dual-Core @ 2.50GHz with DDR2 @ 800 MHz (1.2 ns) shows a “dd” copy speed to “/dev/shm” of 271 MB/s, while the MBW test shows a maximum average speed of 7670.954 MiB/s. Bifferboard is an embedded device after all… 🙂

Available memory for applications
My base Debian installation with “udevd”, “dhclient3”, “rsyslogd”, “sshd”, “getty” and one tty session running shows 24908 kbytes free memory. You surely cannot put CNN.com on this little machine, but compared to the PIC16F877A which has 368 bytes (yes, bytes) total RAM memory, Bifferboard is a monster.

Disk system
All tests are done on an Ext3 file-system and a very fast USB Flash 8GB A-Data Xupreme 200X.
A “dd” copy to a file completes with a write speed of 6.1 MB/s.
The Bonnie++ benchmark test shows the following results:

Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
bifferboard.lo 300M   822  96  5524  61  4305  42   855  99 16576  67 143.2  12
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP
                 16  1274  94 14830 100  1965  85  1235  90 22519 100 2015  87

bifferboard.localdomain,300M,822,96,5524,61,4305,42,855,99,16576,67,143.2,12,16,1274,94,14830,100,1965,85,1235,90,22519,100,2015,87

Therefore, the sequential write speed is about 5.5 MB/s, while the sequential read speed is about 16.5 MB/s.

It’s worth mentioning that while the write tests were running, there was a very high CPU System load (not CPU I/O waiting) which indicates that the write throughput of Bifferboard may be a bit better if the file-system is not a journaling one. However, the tests for “Memory speed” above show that writing to “/dev/shm” (a memory-based file-system) completes with a rate of 6.3 MB/s. Therefore, this is probably the limit with this configuration.

Network
Both Netperf and Wget show a throughput of 6.5 MB/s.
The packets-per-second tests complete at a rate of 8000 packets/second.
Modern systems can handle several hundred thousand packets-per-second without an issue. However, the measured network performance of Bifferboard is more than enough for trivial network communication with the device. During the network benchmark tests, there was very high CPU System usage, but that was expected.

Encryption and SSH transfers
The maximum encryption rate for an eCryptfs mount with AES cipher and 16 bytes key length is 536 KB/s. The standard SSH Protocol 2 transfer rate using the OpenSSH server is about the same – 587.1 KB/s. If you try to transfer a file over SSH and store it on an eCryptfs mounted volume, the transfer rate is 272.2 KB/s, which is logical, as the processing power is split between the SSH transfer and the eCryptfs encryption.
You can try to tweak your OpenSSH ciphers, in order to get much better performance. The OpenSSH ciphers performance benchmark page will give you a starting point.

Conclusion
Bifferboard performs pretty well for its price. It’s my personal choice over the 8-bit 16F877A and the other 16-bit Microchip / ARM microcontrollers, when a project does not require very fast I/O.