Kernel Planet

November 06, 2009

Kernel Podcast: 2009/11/05 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091105.mp3

For Thursday, November 5th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: CVE-2009-2584, Generic per-cpu counter arrays, MM locking, page types, performance events, and the scheduler.

CVE-2009-2584. A security issue was recently found in a procfs function contained within the sgi-gru driver. It involved unsafe use of strncpy_from_user. Various people posted fix suggestions for it, while Linus noted that most of the logic in the offending function (options_write) was “utter sh*t as far as I can tell”. He posted a couple of entirely untested patches (Linus style) for others to take a look at. Meanwhile, it was also noted that few people had the hardware, which helped to mitigate the issue.

Generic per-cpu counter arrays. Kamezawa Hiroyuki, noting that the patch had been “on my queue for a month”, posted an RFC patch intended to add support for generic percpu counter arrays. His patch uses the recent dynamic percpu support to create arrays of per-cpu data on the fly, using macros such as DEFINE_COUNTER_ARRAY and functions such as counter_array_init and counter_array_add to manage entries being added to an existing array.

MM locking. Christoph Lameter posted an RFC MM patch implementing a variety of “accessors for mm locking”. Essentially, the idea is to abstract and wrap up use of mmap_sem such that it could eventually be ripped out and replaced without having to touch a lot of MM code once again. Christoph notes that the patch is “currently incomplete” but it does at least build.

Page Types. Fengguang Wu posted a followup to his previous patch enabling one to specify new page type information on the command line of the “page-types” utility (used to decode various VM data) with an example of how one could educate page-types about new types of page flags on the command line.

Performance Events. Hitoshi Mitake posted version 5 of a 7 part patch series implementing the “perf bench” command, and incorporating Rusty Russell’s original “hackbench” scheduler benchmark code.

Scheduler. Lai Jiangshan noted that a previous patch from Mike Galbraith didn’t seem to be mitigating the problems with the scheduler running tasks on the wrong CPU. In his case, the built-in kernel thread named “events” for CPU 1 was in fact shown (by using Ftrace) to be running on CPU0. Mike noted that the problem was likely to be in the migration code not holding the runqueue lock and thus not being safe against pre-emption and subsequent chaos.

In today’s announcements: AlacrityVM version 0.2. Gregory Haskins announced the 0.2 release of his AlacrityVM project. This is a modified KVM that uses a replacement virtualized IO bus for improved performance of, for example, network packet transfer between host and guest. The latest version includes some nice features, such as zero-copy transmits in the VENET driver. For further information, visit:
http://developer.novell.com/wiki/index.php/AlacrityVM.

The latest kernel release is 2.6.32-rc6.

Stephen Rothwell posted a linux-next tree for November 5th. Since Wednesday, the PowerPC KVM fix was still around, while the pcmcia, drbd, and catalin trees lost their issues, and the sparc tree gained a build failure for which Stephen applied a patch. The total sub-tree count remained at 146 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 06, 2009 01:30 PM

Evgeniy Polyakov: Elliptics network goes production

Kind of goes - there is a perfect task for this solution, which I can try to hook into this year. The New Year deadlines all deadlines, so there is about a month and a half for the task.

The task is quite simple actually - there is a huge library of files, which does not fit on a single storage machine. And although it is not that large, about 5-10 Tb of data for starters, the next step is to suck in close to 200 Tb of data. The task is to allow on-demand reading without updates of the existing files; only new ones will be added with time. I expect millions of reads per day.

Files should be spread over multiple machines for read balancing, and there should be multiple copies of each for redundancy. The system should transparently handle failures (storage machines will be spread over multiple data centers). And the main request is to allow fetching files over direct links, i.e. elliptics network provides the data location and some usual HTTP server will give the files away.
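The placement requirement above (several copies per file, spread over machines, reachable by direct HTTP links) can be illustrated with a small sketch. This is not how elliptics network locates data; it uses rendezvous (highest-random-weight) hashing as a stand-in, and the host names, URL layout, and function names are all hypothetical.

```python
import hashlib


def replica_nodes(name, nodes, copies=2):
    """Pick `copies` distinct nodes for `name` using rendezvous hashing.

    Assumption: this is an illustrative placement rule, not the elliptics
    network algorithm.
    """
    def weight(node):
        # Every (node, file) pair gets a stable pseudo-random weight.
        return hashlib.sha1((node + "/" + name).encode()).hexdigest()

    # The highest-weighted nodes hold the copies.
    return sorted(nodes, key=weight, reverse=True)[:copies]


def direct_links(name, nodes, copies=2):
    """Build hypothetical direct HTTP links to every copy of `name`."""
    return ["http://%s/%s" % (node, name)
            for node in replica_nodes(name, nodes, copies)]


# Hypothetical storage hosts spread over two data centers.
nodes = ["dc1-a.example", "dc1-b.example", "dc2-a.example", "dc2-b.example"]
chosen = replica_nodes("library/book-0001.pdf", nodes)
links = direct_links("library/book-0001.pdf", nodes)
```

A property of this stand-in scheme is that removing a machine that does not hold a copy of a file leaves that file's placement unchanged, so a failure only affects the files actually stored on the failed node.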

While I wrote this entry another cool task (re)appeared: clusterize some very popular monitoring system, which to date does not scale very well to existing amount of notification writers (about 200k small writes per second per small cluster). I need to provide fault-tolerant storage which will be able to suffer this load and allow simple horizontal scaling on demand.

Existing performance numbers show that elliptics network can easily handle all those tasks, but some obscure numbers created by the project author are usually not enough for those who deploy a new system. As in any other business, people are not eager to try something new. New, shiny and likely buggy...

Well, let's show what we can do. I will post results and setup systems here.

November 06, 2009 01:21 PM

Pete Zaitcev: The litl thing

Apparently, it uses S3. My plan to take over the world is proceeding as I have foreseen.

November 06, 2009 12:08 AM

November 05, 2009

Kernel Podcast: 2009/11/04 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091104.mp3

For Wednesday, November 4th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Cgroups, FatELF, PerCPU MM counters, and Swap.

Cgroups. Balbir Singh posted to let everyone know that discussion is happening concerning the most appropriate place to mount the cgroup filesystem. Since the Linux Filesystem Hierarchy Standard (FHS) was written prior to the existence of cgroups, it has no specific advice, which leads to three alternatives. These are /dev/cgroup, /cgroup, or some place under /sys. Balbir prefers the first option, but that will require some co-operation with udev. He asks for advice from others as to the best place for this to live. Several people seem to be quite happy with /sys/kernel/cgroup (which is not the only filesystem that gets mounted there).

FatELF. Continuing the discussion on the relative merits of “FAT” image files containing multiple ELF objects, Mikulas Patocka made some interesting comments on Linux package managers, describing them as “evil”. In his opinion, FatELF might provide a means to ship single image files containing all of the files an application needs to execute in one object, similar to how Apple and other operating systems already do today. Mikulas is concerned about the relative difficulty Linux users face in installing software not provided by their distribution using package management software. He makes a good point, although FatELF may not be the solution to that particular problem.

PerCPU MM counters. Christoph Lameter, noting that support for generic per-cpu operations is now in the “percpu” and linux-next trees, posted a patch implementing per-cpu mm counters for tasks rather than single entries in mm_struct. This obviates the need for larger SMP systems to perform atomic updates to mm counters and (intuitively) implies a performance improvement. The only downside is occasionally having to iterate over each of these per-cpu values when the actual count values are being requested.
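The trade-off described here (cheap local updates, more expensive reads) can be shown with a toy model. This is only a sketch of the general per-CPU counter idea, not Christoph's patch; the class and method names are invented for illustration.

```python
class PerCpuCounter:
    """Toy model of a per-CPU counter: one slot per CPU, summed on read."""

    def __init__(self, nr_cpus):
        self.slots = [0] * nr_cpus

    def add(self, cpu, delta):
        # Fast path: only the slot owned by the updating CPU is touched,
        # so no shared counter has to be updated atomically.
        self.slots[cpu] += delta

    def read(self):
        # Slow path: the true value requires walking every CPU's slot.
        return sum(self.slots)


rss = PerCpuCounter(nr_cpus=4)
rss.add(0, 3)    # CPU 0 maps three pages
rss.add(2, 5)    # CPU 2 maps five pages
rss.add(0, -1)   # CPU 0 unmaps one page
total = rss.read()
```

The read cost grows with the number of CPUs, which is the downside the summary mentions.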

Swap. Following on from the recent discussion about OOM killer behavior and the various metrics that might be used in the future, Kamezawa Hiroyuki posted a patch that exports per-process (task) swap usage statistics via procfs. This happens through the addition of a new “VmSwap” entry in /proc/pid/status.
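As a minimal sketch of how a tool might consume the new entry, the following parses a status-style text for “VmSwap”. It assumes the new line follows the same “Name: value kB” layout as the existing Vm* lines in /proc/pid/status; the sample text and the function name are made up for illustration.

```python
def parse_vmswap_kb(status_text):
    """Return the VmSwap value in kB, or None if the entry is absent."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            # Assumed layout: "VmSwap:   <number> kB"
            return int(line.split()[1])
    return None


# Invented sample in the style of /proc/pid/status.
sample = "Name:\tbash\nVmRSS:\t    1234 kB\nVmSwap:\t      56 kB\n"
swap_kb = parse_vmswap_kb(sample)
```

On a kernel carrying the patch, the same function could be fed the contents of /proc/self/status; on older kernels it simply returns None.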

The latest kernel release is 2.6.32-rc6.

Stephen Rothwell posted a linux-next tree for November 4th. There had been no tree the previous day due to a national holiday in Australia, where he is based (and one trusts the horse race went well, too). Since Monday, there was a new “msm” tree (which is an ARM platform), the PowerPC KVM fix was still required, and a couple of other conflicts went away. The total sub-tree count increased today to 146 trees with the addition of the “msm” tree.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 05, 2009 12:16 PM

Kernel Podcast: 2009/11/03 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091103.mp3

For Tuesday, November 3rd, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Block IO controller, FatELF, Ftrace, Performance, and Sysctls.

Block IO controller. The ever patient Vivek Goyal, fresh from the IO minisummit in Tokyo, posted the first version of a new IO bandwidth control patchset entitled the “Block IO Controller”. This RFC patch series aims to address the problem of there being no “one size fits all” IO control policy, and the need for different policies to be implemented for different uses. The patch introduces what Vivek calls the blkio cgroup controller, through which a management interface is provided that can be used to switch policies.

FatELF. Eric Windisch posted some example use cases for FatELF that he felt others should know about, in an attempt to counter some of the points made by Alan Cox previously. In particular, it would seem that Eric is into Cloud Computing in a big way and looks forward to having virtual machine images that can simultaneously run on a variety of different hardware. Although there is certainly some benefit provided by FatELF, it wasn’t clear how these problems couldn’t be solved as Alan had suggested – with different directories containing versions of the same binaries for the different arches.

Ftrace. Michal Simek posted to let everyone know that he is currently working on Ftrace support for the Microblaze CPU architecture (an FPGA-based soft core from the folks at Xilinx). In particular, he is looking at function trace support at the moment and how the mcount function is used to record entry into each individual function. He has a number of questions, and Steven Rostedt (the Ftrace author) was happy to help answer a number of them.

Performance. Alex Shi posted with an observation that performance testing had yielded results with a 20-30% drop off in the 2.6.32-rc5 timeframe. This seemed to be due to a cfq-iosched patch from Jens Axboe. Alex attached an example run of perf stat both with and without the patch, showing a clear difference between the two sets of data.

Sysctl. Eric Dumazet recently observed that sysctl table entries were quite expensive, due to a sentinel value added after each one in order to detect and avoid corruption of table entries. Eric noted that the sentinel need actually only contain a couple of pieces of data, and so he created a special sentinel entry struct called ctl_table_sentinel that was smaller in size. This would apparently reduce RAM utilization of such entries by 40%.

In today’s announcements: Userspace RCU. Mathieu Desnoyers posted to let everyone know that version 0.3.0 of his Userspace RCU patches is now available. This is an RCU implementation using the POSIX pthread functions that applications can use to take advantage of the same features as the kernel has done for some time. The latest version removes a function (call_rcu) for which he had provided differing arguments and semantics than the kernel.

The latest kernel release is 2.6.32-rc6. Linus Torvalds announced version 2.6.32-rc6 of the Linux kernel at 12:05pm US Best Coast Time (PDT). In his announcement, Linus noted that there had been a longer gap since rc5, due in large part to the number of kernel developers who have been away at the kernel summit in Japan or traveling to and fro. There was also an ext4 filesystem corruption problem that required additional time, and that had turned out to be due to enabling checksum testing of journal transactions during recovery. Linus thanked Eric Sandeen for tracking down that particular problem. He also seemed pleased at the number of regressions addressed since 2.6.31.

Stephen Rothwell announced that there would be no linux-next tree for November 3rd due to a public holiday in Australia where he is based, which apparently also has “nothing to do with a horse race in Melbourne”.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 05, 2009 11:49 AM

Kernel Podcast: 2009/11/02 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091102.mp3

For Monday, November 2nd, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: BKL, FatELF, Fast symbol resolution, OOM, and Performance benchmarks.

BKL. There is an ongoing effort to remove the BKL (Big Kernel Lock), which is the last stayover from early Linux support for SMP. Discussion of BKL removal was revived during the recent Real Time pre-emption mini-summit, and Jan Blunk is amongst those who have been looking at this from the filesystem level. He posted a series of patches intended to push BKL use down into individual filesystems from the generic kernel code (for example do_new_mount()) that it lives in today. He requests comments.

FatELF. There was some ongoing (and quite considerable) push back against the notion of supporting FatELF binaries. Chris Adams wondered aloud just what the target audience really was? As he sees it, embedded users don’t want the bloat, Enterprise distributions already have specific support processes in place for different architectures, and community distributions aren’t likely to want to deal with the increased build complexity and space requirements. Meanwhile, Alan Cox congratulated Ryan C. Gordon on re-inventing the concept of a directory – since directories already allow one to have multiple versions of a binary installed on a given system and to pick and choose between them. Sure that’s not as shiny as an Applesque approach, but it has worked for many decades at this point, and most of the distributions implement multi-arch (sometimes called multi-lib) using some kind of similar approach.

Fast symbol resolution. Alan Jenkins posted the latest version of his fast LKM symbol resolution patches. These take advantage of a binary search for symbol resolution at module load time, using a pre-generated (at build time) sorted table of exported kernel symbols. Using this approach, Alan has once again succeeded in reducing overall system boot time slightly on his netbook. The latest version of the patches has seen some limited testing on ARM and has also been built for Blackfin, so it’s not just x86 at this point.
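The idea of sorting once and then using binary search per lookup can be sketched as follows. This is not Alan's kernel code; the symbol addresses are invented and the function names are hypothetical.

```python
import bisect


def build_symbol_table(exports):
    """Sort (name, address) pairs once, standing in for the build-time step."""
    pairs = sorted(exports.items())
    names = [name for name, _ in pairs]
    addresses = [address for _, address in pairs]
    return names, addresses


def resolve(names, addresses, symbol):
    """Binary-search the sorted name list; return the address or None."""
    index = bisect.bisect_left(names, symbol)
    if index < len(names) and names[index] == symbol:
        return addresses[index]
    return None


# Made-up addresses for three well-known exported symbols.
names, addresses = build_symbol_table(
    {"printk": 0x1000, "kmalloc": 0x2000, "schedule": 0x3000}
)
kmalloc_address = resolve(names, addresses, "kmalloc")
missing = resolve(names, addresses, "no_such_symbol")
```

Each lookup takes a number of comparisons proportional to the logarithm of the table size rather than to the table size itself, which is where a saving at module load time would come from.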

OOM. Kamezawa Hiroyuki posted to let everyone know that he was putting code where his mouth was with a “total renewal” of the OOM killer code. This isn’t complete at this stage, but it is intended to keep the conversation moving. The first patch lays groundwork (including new OOM type classifications), while the second and subsequent patches add the ability to count swap use per process and implement a newly updated badness calculation that uses rss+swap as the base value but also factors in cpusets, and gives tasks a bonus for how far in the past their last allocation occurred, and their runtime.

Performance benchmarks. Hitoshi Mitake posted to let everyone know that he has been working on integrating a benchmark subsystem into the existing – and already fairly extensive – “perf” (or performance events) utility. He asked Rusty Russell for permission to pull Rusty’s hackbench code directly into the kernel tree as part of this effort, which can be used by calling “perf bench sched” with whatever parameters one might wish to specify.

Finally today, Tilman Schmidt requests that we draw attention to the Kernel Cleanup wiki that Robert P J Day has been working on. The page at www.crashcourse.ca/wiki/index.php/Kernel_cleanup includes information about unused Kconfig variables, badly referenced ones, and general problems with kernel code that need further investigation in general.

In today’s announcements: LTP. Subrata Modak posted announcing that the Linux Test Project for October 2009 has been released. The latest version includes fixes, 119 test scenarios for EXT4 testing, new GETUID16/GETUID64/GETEUID16 and PTRACE system call tests, and much more. As usual, it is available at http://ltp.sourceforge.net/.

Sysprof. Soeren Sandmann announced version 1.1.4 of the sysprof CPU profiler. This is the latest version to be based upon the rewrite to make use of the new performance counters interface for exposing the low-level hardware counters. Since the previous 1.1.2 release, there have been a number of fixes. A download is available at http://www.daimi.au.dk/~sandmann/sysprof/.

The latest kernel release was 2.6.32-rc5.

Stephen Rothwell posted a linux-next tree for November 2nd. Since Friday, his fixes tree still has that PowerPC KVM fix, while there were a number of arch issues affecting ARM and OMAP in particular. The sub-tree count remains steady today at 145 trees in linux-next.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 05, 2009 03:57 AM

Kernel Podcast: 2009/11/01 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091101.mp3

If at first you don’t succeed. Welcome to version 2.0 of the LKML summary podcast. In this revamped version I will concentrate on the major issues under discussion on a given day, rather than commenting on every single patch, which had become an unsustainable load. I am still interested to hear from volunteers who might help to make the podcast workload less challenging on a daily basis.

For the weekend of November 1st 2009, I’m Jon Masters with a summary of the weekend’s LKML traffic.

In today’s issue: Fanotify, FatELF, Futexes, KVM, Memory Overcommit, Regressions, and Thread Naming.

Fanotify. Eric Paris posted a patch series implementing a new file mode entitled FMODE_NONOTIFY, which can only be set by the kernel itself. Its job is to indicate that an fd was opened by fanotify itself and should not cause future fanotify events. This allows one to obviate such livelock scenarios as would otherwise occur from fanotify close events resulting in repeated opens on a file that would then be closed and cause another event to be emitted.

FatELF. Ryan C. Gordon posted what he hoped would be his final round of FatELF patches. These extend the Linux kernel’s ELF binary format handler loader code to accept “FAT” images containing multiple ELF binaries, allowing for such features as multi-arch code encapsulated within a single binary. In some respects, the feature behaves similar to Apple’s Universal Binary format, which it was noted is covered by several patents. More information on FatELF can be found at http://icculus.org/fatelf/.

Futexes. Darren Hart, known for his involvement in the RT kernel community, recently posted an RFC patch series intended to make futex_lock_pi into a fully interruptible syscall. This would allow for canceling of locking requests, while preserving FIFO ordered wakeup and Priority Inheritance requirements, and without having to try to emulate this behavior in userspace. He included a test case demonstrator, which used an RT signal handler to abort the futex locking attempt. Arnd Bergmann responded that it should be possible to simply longjmp out of the test application signal handler and avoid modifying the kernel, something that Darren confirmed did work, but he was apprehensive as to whether there might be unintended issues in doing this.

KVM. Gleb Natapov posted a patch series implementing asynchronous page faults for paravirtualized KVM guests. Typically, a guest encountering a page fault becomes blocked until the faulting page is made available by KVM and the guest can be resumed. But paravirtualized guests are aware of the hypervisor and can interact with it, in this case by blocking only the faulting task within the guest and not the entire guest VM. The faulting page can then be swapped in while the guest is still running, using the assistance of a parallel thread within the hypervisor.

Memory Overcommit. Here comes the annual OOM killer discussion. Back in the middle of October, Vedran Furac sent a message entitled “Memory overcommit”, in which he posited how still today a trivial C program run by an ordinary user that attempts to perform large memory allocations can trigger the OOM killer and really take down a system (by killing many essential system services other than the guilty task) once overcommit_memory is disabled. In the example, Vedran had cited how 8 processes were killed, including the X server and some long running system daemons. He felt that the OOM killer really only served to give Linux a bad reputation amongst some users and that it was better to simply disable it by default – enforcing strict allocation only of the available free pages. Others disagreed, although Vedran had a point in saying the OOM killer might as well be renamed to TRIPK – Totally Random Innocent Process Killer.

Kamezawa Hiroyuki had made several mitigation suggestions against overcommit problems, including the use of oom_adj and explicit cgroups. But Vedran was more concerned with how the OOM killer algorithm seemed to be making the wrong choices in the first place as to which tasks should die. This is an issue that comes up every once in a while. Vedran and Kamezawa had previously taken the discussion off-list (to the mm list instead) but it now returned to LKML, Kamezawa having written a script to analyze the oom_score of existing processes on his own system and discovering (for example) that his GNOME desktop processes were being considered more bad by the OOM killer than the sample “allocate one 1GB of memory” task that had taken down Vedran’s box.

Kosaki Motohiro suggested that the problem was the number of libraries the average desktop application is linked against, and also suggested that the OOM killer should not account for evictable file-backed mappings (such as libraries) in calculating the oom_score. This led to a discussion as to the best metric to consider in making OOM kill decisions. It was deemed necessary to consider the VM size in order to catch swapped-out fork bomb process attacks, but Kosaki noted that basing oom_score on RSS + swap-entries figures would be acceptable to him as an alternative. This led on to a lengthy discussion thread (and a number of patch iterations – including a nice analysis from Hugh Dickins), concerning the best ways to overhaul the OOM killer for modern systems and what exactly the criteria should be. Should it be that the biggest resident memory eater is always killed (which is hard to predict)? Or should the total vm size (including resident and non-resident pages) factor into the decision?

Regressions. Caleb Cushing posted to let everyone know that his network performance has dropped off considerably since moving to 2.6.31.x. But the problem seems elusive, having bitten in 2.6.30.x previously, then seeming to vanish before apparently re-appearing in 2.6.31.x. Having never performed a bisection before, Caleb wasn’t entirely sure of the process, but did post the log from a bisection hoping that others might chime in with some input.

Thread naming. John Stultz posted another iteration of a patch he has been working on that allows threads to rename their siblings by writing into /proc/pid/task/tid/comm. This will allow thread managers to nicely set the task name of their children, for logging as well as for appearance.

In today’s announcements: The kerneloops.org report for the week of October 31 2009. Arjan van de Ven posted this week’s summary of recorded kernel oops logs from his kerneloops.org online service. A total of 18,023 oopses and warnings were logged over the past week, more than a 200% increase over the previous week, though this week’s report co-incides with the latest Ubuntu release (which includes the ability to file such reports for the first time). The top warnings were in suspend_test_finish, acpi_idle_enter_bm and dev_watchdog.

The latest kernel release was 2.6.32-rc5.

Andrew Morton posted an mm-of-the-moment for 2009-11-01-10-01. It contains a fair number of patches against the 2.6.32-rc5 kernel.

Stephen Rothwell posted a linux-next tree for Friday. Since Thursday, he had a PowerPC KVM fix, some architectural fixes, and network and percpu conflicts that needed to be resolved. There are currently 145 sub-trees in linux-next.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 05, 2009 03:13 AM

November 04, 2009

Harald Welte: German news site Spiegel Online has video of my torched car

Some 9 months after some idiots set my car on fire, the German news site Spiegel Online reports on a court trial unrelated to my car, but shows a video of my car.

Quite funny how they always dig out that footage. The court case was about an alleged failed attempt to torch a car, so showing two completely burnt cars in that article is not really sensible anyway.

As you can see from the article, there are already more than 250 burnt vehicles this year in Berlin.

November 04, 2009 01:00 AM

Harald Welte: Android Mythbusters (Matt Porter)

Some weeks ago I was attending Embedded Linux Conference Europe. My personal highlight at this event was the excellent Android Mythbusters presentation given by Matt Porter.

As you may know, Matt Porter was heavily involved in the MIPS and PPC ports of Android, so he and his team have seen the lowest levels of Android, more and deeper than even cellphone manufacturers ever have to look into it.

The slides of his presentation are now available for download. I would personally recommend this as mandatory reading material for everyone who has some interest in Android.

The presentation explains in detail why Android is not what most people refer to when they say Linux. What most people mean when they say Linux is the GNU/Linux system with its standard userspace tools, not only the kernel.

The presentation shows how Google has simply thrown 5-10 years of Linux userspace evolution into the trashcan and re-implemented it partially for no reason. Things like hard-coded device lists/permissions in object code rather than config files, the lack of support for hot-plugging devices (udev), the lack of kernel headers. A libc that throws away System V IPC that every unix/Linux software developer takes for granted. The lack of complete POSIX threads. I could continue this list, but hey, you should read those slides. now!

Just one more practical example: You cannot even plug a USB drive to an android system, since /dev/sd* is not an expected device name in their hardcoded hotplug management.

Executive summary: Android is a screwed, hard-coded, non-portable abomination.

I can't wait until somebody rips it apart and replaces the system layer with a standard GNU/Linux distribution with Dalvik and some Android API simulation layer on top. To me, that seems the only way to thoroughly fix the problem...

November 04, 2009 01:00 AM

November 03, 2009

Valerie Aurora: ZFS gets deduplication - the right way

ZFS now has data deduplication - with the right configuration options for safety and performance in a compare-by-hash based storage system. From Jeff Bonwick's ZFS deduplication blog entry:

Given the ability to detect hash collisions as described above, it is possible to use much weaker (but faster) hash functions in combination with the 'verify' option to provide faster dedup. ZFS offers this option for the fletcher4 checksum, which is quite fast:

zfs set dedup=fletcher4,verify tank

The tradeoff is that unlike SHA256, fletcher4 is not a pseudo-random hash function, and therefore cannot be trusted not to collide. It is therefore only suitable for dedup when combined with the 'verify' option, which detects and resolves hash collisions. On systems with a very high data ingest rate of largely duplicate data, this may provide better overall performance than a secure hash without collision verification.

What I like is (1) the user chooses the hash function based on their security and performance needs, (2) the system can optionally check for hash collisions, and (3) the ZFS storage pool design makes it easy to migrate data to a new hash function if necessary. ZFS is the first deduplicating storage system I know of with these features. (Do let me know if there are others out there!)
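The “weak hash plus verify” idea from the quoted passage can be sketched in a few lines. This is not ZFS code: zlib.adler32 stands in for fletcher4 purely because it is a fast, non-cryptographic checksum available in the Python standard library, and the class and method names are invented.

```python
import zlib


class DedupStore:
    """Toy block store: dedup by checksum, with byte-for-byte verify."""

    def __init__(self, checksum=zlib.adler32):
        # Assumption: adler32 is only a stand-in for a fast, weak checksum.
        self.checksum = checksum
        self.buckets = {}        # checksum value -> list of distinct blocks
        self.stored_blocks = 0   # blocks that actually consumed space

    def write(self, block):
        key = self.checksum(block)
        candidates = self.buckets.setdefault(key, [])
        for index, existing in enumerate(candidates):
            if existing == block:        # the 'verify' step
                return (key, index)      # true duplicate: store nothing new
        candidates.append(block)         # new data, or a checksum collision
        self.stored_blocks += 1
        return (key, len(candidates) - 1)

    def read(self, ref):
        key, index = ref
        return self.buckets[key][index]


store = DedupStore()
first = store.write(b"same block")
second = store.write(b"same block")
store.write(b"other block")

# Force every block onto one checksum to show verify resolving collisions.
colliding = DedupStore(checksum=lambda block: 0)
ref_a = colliding.write(b"aaaa")
ref_b = colliding.write(b"bbbb")
```

Without the comparison in write(), the second store would hand back the first block for both writes, which is exactly the failure the 'verify' option guards against when the checksum cannot be trusted not to collide.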

November 03, 2009 01:23 AM

November 02, 2009

Evgeniy Polyakov: Data de-duplication in ZFS and elliptics network (POHMELFS)

Jon Smirl sent me a link describing a new ZFS feature - data deduplication.

This is a technique which allows multiple data objects to be stored in the same place when their content is the same, thus effectively saving space. There are three levels of data deduplication - files (objects actually), blocks and bytes. Every level allows storing a single entity for multiple identical objects, like a single block for several equal data blocks or byte ranges and so on. ZFS supports block deduplication.

This feature existed effectively from the beginning in the elliptics network distributed hash table storage, but it has two levels of data deduplication: object and transaction. Well, actually we have transaction only, but maximum transaction size can be limited to some large enough block (like megabytes or more, or can be unlimited if needed), so if object is smaller than that, it will be deduplicated automatically.

Which basically means that if multiple users write the same content into the storage and use the same ID, no new storage space will be used, instead transaction log for the selected object will be updated to show that two external objects refer to given transaction.

Depending on transaction size it may have a negative impact; in particular, when the transaction size is smaller than a log entry, it will actually be a waste of space, but transactions are required for the log-structured filesystem and to implement things like snapshots and update history. By default the log entry size is 56 bytes, so it should not be a problem in the common case.

POHMELFS as elliptics network frontend will support this feature without actually any steps out of the box.

November 02, 2009 03:34 PM

Pavel Machek: Debugging MMC is easier...

...if you have MMC card inserted. Oops. I added enough of registration infrastructure and GPIO support to dream that mmc controller is detected, but it is still not enough to get card recognized.

November 02, 2009 01:23 PM

Pavel Machek: Dream booting

With Brian's help, I got recent kernel to boot on HTC Dream. Patches will follow.

November 02, 2009 10:38 AM

November 01, 2009

Paul E. McKenney: Hunting Heisenbugs

My children's opinions notwithstanding, I recently found myself pursuing some nasty concurrency bugs in Linux's TREE_RCU implementation. This was not particularly surprising, given that I recently added preemption support, and the code really hadn't put up that much of a fight. In fact, I was getting the feeling that the bugs had gotten together and decided to hide out, the better to ambush me. This feeling wasn't far from wrong.

My first hint of trouble appeared when I began running longer sequences of rcutorture runs, seeing an occasional failure on one-hour runs. My first reaction was to increase the duration to ten hours and attempt to bisect the problem. Of course, even with bisection, finding the bug takes quite some time given ten hours for each probe, so rather than use “git bisect”, I manually ran parallel runs and (for example) quadrisected. I also ran multiple configurations. The results initially hinted that CONFIG_NO_HZ might have something to do with it, but later runs showed no shortage of failures in !CONFIG_NO_HZ runs as well.

The initial results of the bisection were quite puzzling, converging on a commit that could not possibly change RCU's behavior. Then I noticed that one of the machines seemed to be generating more failures than others, and, sure enough, this difference in failure rate was responsible for the false convergence. I therefore started keeping more careful records, including the machine name, test duration, configuration parameters, commit number, and number of errors for each run. These records proved extremely helpful later on.

Further testing showed that 2.6.32-rc1 (AKA 2.6.32-rc2) was reliable, even for the error-prone machine, and that 2.6.32-rc3 was buggy. Unfortunately, there are no RCU-related commits between 2.6.32-rc1 and 2.6.32-rc3. Unless you count commit #828c0950, which simply applies const to a few data structures involved in RCU tracing, which I don't and you shouldn't. So I ran a few more runs on 2.6.32-rc1, and eventually did trigger a failure. In contrast, 2.6.31 was rock solid.

Now there are quite a few RCU-related patches between 2.6.31 and 2.6.32-rc1, so I started searching for the offending commit. However, by this time I had written scripts to analyze rcutorture output, which I used to check the status of the test runs, stopping runs as soon as they generated an error. This sped things up considerably, because failed runs now took on average only a few hours rather than the 10 hours I was using as a (rough) success criterion.

Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?

Testing eventually converged on commit #b8d57a76. By this time, I was getting a bit paranoid, so I ran no fewer than three ten-hour runs at the preceding commit on the most error-prone machine, none of which failed. However, this commit does nothing to RCU itself; it instead makes rcutorture testing more severe, inserting delays of up to 50 milliseconds in RCU read-side critical sections. I therefore cherry-picked this commit back onto 2.6.31 and 2.6.30, and, sure enough, got failures in both cases. As it turned out, I was dealing with a day-one bug in TREE_RCU.

This did simplify matters, permitting me to focus my testing efforts on the most recent version of RCU rather than spreading them across every change since 2.6.31. In addition, the fact that long-running RCU read-side critical sections triggered the bug told me roughly where the bug had to be: force_quiescent_state() or one of the functions it calls. This function runs more often in the face of long-running RCU read-side critical sections. This also explained the earlier CONFIG_NO_HZ results, because one of the force_quiescent_state() function's responsibilities is detecting dyntick-idle CPUs. Finally, it raised the possibility that the bug was unrelated to memory ordering, which motivated me to try a few runs on x86, which, to my surprise, resulted in much higher failure rates than did the earlier tests on the Power machines.

I stubbed out force_quiescent_state() to check my assumption that it was to blame (but please, please do not do this on production systems!!!). Stubbing out force_quiescent_state() resulted in a statistically significant 3x decrease in failures on the x86 machine, confirming my assumption, at least for some subset of the bugs. Now that there was a much smaller section of code to inspect, I was able to locate one race involving mishandling of the ->completed values. This reduced the error rate on the x86 machine by roughly the same amount as did stubbing out force_quiescent_state(). One bug down, but more bugs still hiding.

I was also now in a position to take some good advice from Ingo Molnar: when you see a failure, work to increase the failure rate. This might seem counter-intuitive, but the more frequent the failures, the shorter the test runs, and the faster you can find the bug. I therefore changed the value of RCU_JIFFIES_TILL_FORCE_QS from three to one, which increased the failure rate by well over an order of magnitude on the x86 machine.

Quick Quiz 2: How could increasing the frequency of force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?

Given that the race I found involved unsynchronized access to the ->completed values, it made sense to look at other unsynchronized accesses. I found three other such issues, and testing of the resulting patches has thus far turned up zero rcutorture failures.

And it only took 882.5 hours of machine time to track down these bugs. :–)

This raises the question of why these bugs happened in the first place. After all, I do try to be quite careful with RCU-infrastructure code. In this case, it appears that these bugs were inserted during a bug-fixing session fairly late in the TREE_RCU effort. Bug-fixing is often done under considerably more time pressure than is production of new code, and the mistake in this case was failing to follow up with more careful analysis.

Another question is the number of bugs remaining. This is of course hard to say at present, but Murphy would assert that, no matter what you do, there will always be at least a few more bugs.

Answer to Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?

Answer to Quick Quiz 2: How could increasing the frequency of force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?

November 01, 2009 11:35 PM

Evgeniy Polyakov: Comparing Key/Value Stores

From pl.atyp.us.

The following storage systems were checked:
* tabled (git clone on 10/27) using boto
* Cassandra 0.4.1 using thrift
* Riak (hg clone on 10/27) using jiak
* Voldemort 0.56
* Tokyo Tyrant 1.1.37 (Cabinet 1.4.36) using pytyrant
* chunkd (git clone on 10/27) using own chunkd.py based on Python’s ctypes module
* Keyspace 1.2 using the built-in Python interface

Results can be found in a spreadsheet, but for the lazy ones I want to note that Tokyo Tyrant was far ahead of every other competitor (on the order of 4-20 times). But since it is a single-server storage, it would not be fair to compare it against the others, which can scale.

Actually I need to say 'could scale', since I did not find any numbers even remotely resembling fair scaling: most of the applications behave worse when running on a 2-3 node cluster.

One can compare them against the elliptics network numbers, but given that those are my own results, one can assume it is an unfair comparison. I'm pretty sure the authors of all the above storage systems had their 'nice' results too.

November 01, 2009 03:34 PM

October 31, 2009

Harald Welte: Enabling jabber in WebOS on the Palm Pre using a binary patch

One of my main complaints about the Palm Pre is that there is no support for the major IM protocols such as jabber, icq, aim, msn, ...

As I discovered, they're actually using a library (libpurple) that supports all those protocols. It's just the UI and the intermediate LibpurpleAdapter program which artificially restrict the features that this library offers.

So it sounds to me like Palm is getting money or other favors from Google to artificially restrict the capabilities of the webOS messenger.

As I have described in this mail to the webos-internals mailing list, you can actually use a very simple one-byte binary patch to LibpurpleAdapter to enable jabber support.

After that binary patch, you can add jabber contacts with the regular user@jabber-server.doma.in address and use the regular messenger application for keeping in touch with your jabber contacts. Just like how it is supposed to be.

Legal notice: Making this binary patch is legal, since LibpurpleAdapter is actually released under LGPL. If you have a working build environment for the Pre with all the libpurple headers, you can of course modify the source code and recompile it (as explained in the mail).

Side note: The libpurple-adapter source code that Palm has published on opensource.palm.com does not correspond to the actual binary code. This is an LGPL violation. However, since Palm is the copyright holder, nobody can really do anything about it. But it once again shows that the software build/release process does not automatically generate the source code packages and that there is an error-prone manual process involved :(

October 31, 2009 01:00 AM

October 30, 2009

Pete Zaitcev: Blog-resident development in the clouds

In case folks don't know, I'm a massive blogger, but it's not blogging about programming and most especially not blogging while programming. I dabbled in it, but it became very obvious to me that it was a province of douchebags and Rusty Russell (who blogged good things about lguest and other projects). The end of my dabbling occurred when jbj declared that he's "taking development of RPM 5 to the blog". Seeing that put a capstone on my communication philosophy. We kernel programmers do the business on mailing lists, Jon Corbet summarises the results.

But it looks like outside of the kernel, a different way of life arose, congealed, or whatever. I cooperate on Hail with Jeff Darcy, and I learned today that he has what is a programming blog. Darcy is not as exceptional as Rusty among kernel hackers. Cloud-y folks, they all blog. But I never knew what to make of that, if it was Sturgeon's law. However, Jeff is not a random wanker, he codes good things. He's also fully versed in good e-mail: no top-posting or HTML from him.

Not sure if this blog is going to explode with programming detail, but even if I'm not as cool as Jeff Darcy or Rusty Russell, why the heck not. It may be worth documenting the thinking missing from commit logs.

In case anyone asks, I still hate Twitter.

October 30, 2009 07:37 PM

October 29, 2009

Valerie Aurora: Bay bridge workaround

For my money, the Bay bridge can stay closed. I couldn't believe what a difference it made when the Bay bridge was closed over Labor Day weekend. My crappy, noisy, stressful SOMA neighborhood became quiet and pedestrian-friendly. Birds sang. Property values would skyrocket. Even just closing half the lanes would make a huge difference.

Anyway, to do my teensy-tiny part in making this a possibility, I just want to remind people that you can work around the Bay bridge closure even if your ultimate destination isn't on public transit. Just take BART across and get a Zipcar the rest of the way.

October 29, 2009 05:06 PM

Harald Welte: India prohibits import of GSM handsets without IMEI

As has been reported at telecomtiger.com, the Commerce Ministry of India has banned the import of mobile phones with no IMEI.

This is somewhat funny, as the IMEI is stored in flash memory in all the phones that I have seen in recent years. Tools to erase or change the IMEI can be found for many popular phones, including (but not limited to) the many MTK based inexpensive phones from China.

So sure, you can now no longer import a device legally with no IMEI, but well, any self-respecting organized criminal will find a way to erase or alter the IMEI anyway ;)

October 29, 2009 01:00 AM

October 28, 2009

Matthew Garrett: More GMA500

But is Intel really the party at fault, here?

For shipping a gpu without open drivers? Given that the alternatives involve someone else designing, fabbing and releasing a piece of hardware under Intel's name without being sued in the process, I'm going to have to say "Yes".

(Note that while Moblinzone.com is a website owned by Intel, the writers don't appear to be Intel employees)

October 28, 2009 06:05 PM

October 27, 2009

Dave Jones: An update on the state of my head.

First off, thanks to everyone who commented on my last post, or sent email expressing concern etc. Much appreciated. Though it did make me feel like I was in an episode of House, with the number of diagnoses I got from everyone who had had something similar, or known someone, or known a doctor etc.

So I had my head scanned last Friday, and got the results today. It showed up nothing of concern. (Which shot down the majority of the suggestions I got from people, Dr House would not be impressed with you). While a clear report in some ways was a relief as it ruled out so many things, in other ways it was annoying because I still didn’t know for sure what has been going on with the headaches over the last month.

The current theory is that I’m suffering from cluster headaches. The symptoms sure do sound familiar. (Right down to the cute graphic, though mine is the right eyeball mostly). So I got a prescription today for some naproxen and imitrex. The latter reminded me why high-deductible insurance is a bad idea. $149 for a month’s worth. Suck.

Hopefully they will at least make the pain manageable. How long I’ll have to take them for is currently unknown.

October 27, 2009 10:45 PM

Pavel Machek: umount: /mnt2: device is busy.

I hate this part of unix behaviour. I'm root, yet some forgotten bash in some xterm somewhere prevents me from unmounting a device. Yes, lsof exists, and it often works, but... I hope we can get revoke support soon and introduce a working umount -f...

October 27, 2009 10:43 PM

Stephen Hemminger: Ubuntu 9.10 hates kernel developers?

Ubuntu has never been the easiest distribution to do kernel development, but it looks like with 9.10 it has made things too painful. I need to build and install kernels all the time, and usually just update grub menu manually. But now with grub 2 in Ubuntu 9.10 they have wrapped the grub menu in grub-mkconfig. Why?

It would be great if the system was set up so that just doing 'make install' in the kernel source put in the kernel and updated the grub.cfg, but no, that would make too much sense.

P.S.: I first thought they had managed to break the sky2 driver somehow: the connection won't come up and negotiates the wrong speed. It turned out not to be a kernel problem, but a wiring issue (speed) combined with some Network Manager changes.

October 27, 2009 10:02 PM

Rusty Russell: Not Always Lovely Blooms…

So, with my recent evangelizing of Bloom Filters, Tridge decided to try to apply them on a problem he was having.  An array of several thousand unsorted strings, each maybe 100 bytes, which needed checking for duplicates.  In the normal case, we’d expect few or no duplicates.

A Bloom Filter for this is quite simple: Wikipedia tells you how to calculate the optimal number of hashes to use and the optimal number of bits given (say) a 1 in a million chance of a false positive.
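
For reference, the standard sizing formulas are m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n items and false-positive probability p. A small sketch follows; the function name is made up for this example and n = 5000 is just an assumed stand-in for "several thousand".

```python
import math


def bloom_parameters(n_items: int, false_positive_rate: float):
    """Return (bits, hashes) for an optimally sized Bloom filter.

    Uses the textbook formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
    """
    if n_items <= 0 or not 0.0 < false_positive_rate < 1.0:
        raise ValueError("need n_items > 0 and 0 < false_positive_rate < 1")
    bits = math.ceil(-n_items * math.log(false_positive_rate) / math.log(2) ** 2)
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


if __name__ == "__main__":
    bits, hashes = bloom_parameters(5000, 1e-6)  # assumed workload size
    print(f"{bits} bits (~{bits // 8 // 1024} KiB), {hashes} hash functions")
```

Note how many hash functions a one-in-a-million false-positive rate implies: each of them has to be computed for every string inserted or tested.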

I handed Tridge some example code and he put it in alongside a naive qsort implementation.  It’s in his junkcode dir.  The result?  qsort scales better, and is about 100x faster.  The reason?  Sorting usually only has to examine the first few characters, but creating N hashes means (in my implementation using the always-awesome Jenkins lookup3 hash) passing over the whole string N/2 times.  That’s always going to lose: even if I coded a single-pass multihash, it’s still having to look at the whole string.
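
The sort-based approach is easy to sketch. The following is a Python illustration of the idea, not Tridge's C code: sort the strings, then compare each one with its neighbour.

```python
def find_duplicates(strings):
    """Return the set of values that occur more than once.

    Sorting places equal strings next to each other, so one pass comparing
    neighbours finds every duplicate.  String comparison stops at the first
    differing character, which is why this rarely reads whole strings.
    """
    ordered = sorted(strings)
    return {a for a, b in zip(ordered, ordered[1:]) if a == b}


if __name__ == "__main__":
    sample = ["delta", "alpha", "charlie", "alpha", "bravo"]  # made-up data
    print(find_duplicates(sample))
```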

Sometimes, simplicity and standard routines are not just clearer, but faster.

October 27, 2009 04:46 AM

Rusty Russell: A Week With Android (HTC Magic)

I haven’t used an iPhone in anger so I can’t compare, but I got this so I could use Google Maps to navigate public transport: Adelaide’s integration is excellent, and as I have no car it’s important for Arabella and me.

The Good

The Bad

I got it from Portagadgets.com, who were efficient (A$487 + $36 shipping, done via local bank transfer).  Getting an account and new SIM from Exetel took longer.

Conclusion: it’s definitely usable by non-geeks, and has raised my expectations of future phones.  There are some things (such as writing this post) which are much easier on my laptop.  But for reading Facebook or Wikipedia, finding your way on Google Maps, or having SMS conversations it’s excellent.

October 27, 2009 04:00 AM

Harald Welte: Implementing the GPRS protocol stack for OpenBSC

During the last week or so, I've been spending way too much time implementing the network-side GPRS protocol stack as part of an effort to not only provide GSM voice + SMS but also GPRS+EDGE data services with OpenBSC.

GPRS is fundamentally very different from the classic circuit-switched domain of voice calls and CSD (circuit switched data). Not only conceptually and on the protocol level, but also in the actual system architecture. The way it was added on top of the existing GSM spec is by making no modification to the BSC and MSC, and only the minimal necessary modifications to the BTS. They then added a new Gb interface to the BTS, and the SGSN and GGSN core network components, which in turn talk to HLR/VLR/AUC.

So in the most primitive GPRS network, you can have the GSM and GPRS domains completely independent, only using the same databases for subscriber records and authentication keys. This goes to the extreme end that your phone would actually independently register with the GSM network (IMSI ATTACH / LOCATION UPDATING) and with the GPRS network (GPRS ATTACH / ROUTING AREA UPDATE). While both of the requests get sent to the same BTS, the BTS will send the GSM part to the BSC (and subsequently the MSC), and the GPRS part to the SGSN.

Also, the actual software architecture looks completely different. In the GSM circuit-switched domain you always have a dedicated channel when you talk to a phone. The number of dedicated channels is limited by the transceiver capacity and the channel configuration. In OpenBSC I chose to simply attach a lot of state to the data structure representing such a dedicated channel. In the packet-switched domain this obviously no longer works. Many phones can and will use the same on-air timeslot and there is no fixed limit on how many phones can share a radio resource.

What's further important to note: The protocol stack is very deep. If you look at the GPRS related output on an ip.access nanoBTS while your mobile phone makes a HTTP request, the stack is something like HTTP-TCP-IP-PPP-SNDCP-LLC-BSSGP-NS-UDP-IP-Ethernet. While the first HTTP-TCP-IP-PPP is obvious, I would not have expected that many layers on the underlying network. Especially if you look at the almost zero functionality that NS (GSM TS 08.16) seems to add to this stack. Also, the headers within the protocol can actually be quite big. If we only count the number of bytes between the two IP layers in this stack: 8 bytes UDP, 4 bytes NS, 20 bytes BSSGP, 6 bytes LLC and 4 bytes SNDCP. That's a total of 42 extra bytes. And that for every small packet like TCP SYN, SYN/ACK or the like! No wonder that mobile data plans have been prohibitively expensive all those years ;)
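
The arithmetic can be checked with a few lines. The header sizes are the ones quoted above; the 40-byte inner packet (a minimal IPv4 plus TCP header with no options) is an assumption for illustration, and real SYN packets usually carry options.

```python
# Header sizes in bytes between the two IP layers, as quoted in the post.
ENCAPSULATION_HEADERS = {"UDP": 8, "NS": 4, "BSSGP": 20, "LLC": 6, "SNDCP": 4}


def encapsulation_overhead(inner_packet_size: int) -> float:
    """Fraction of the encapsulated payload that is GPRS-side header bytes."""
    extra = sum(ENCAPSULATION_HEADERS.values())
    return extra / (extra + inner_packet_size)


if __name__ == "__main__":
    extra = sum(ENCAPSULATION_HEADERS.values())
    print(f"extra header bytes per packet: {extra}")
    # Assumed inner packet: 20-byte IPv4 header + 20-byte TCP header, no options.
    print(f"overhead for a 40-byte packet: {encapsulation_overhead(40):.1%}")
```

For such a small packet the encapsulation headers alone are roughly as large as the packet being carried.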

So with regard to the actual GPRS implementation in OpenBSC, the following things had (or still have) to be done

Once all that full stack has reached a level where it works to a minimal extent, issues like BSSGP flow-control as well as LLC re-transmission, fragmentation and [selective] acknowledgement have to be dealt with.

Finally, if somebody is bored enough, he could also work on things like combined GSM/GPRS attach, or SMS over GPRS.

As you can see, it's quite a large task. But we need to start somewhere, and a lot of this will still be needed when moving into the 3G and 3.5G domain. Even if not at the lower level protocols, but from the software architecture point.

If you're into communications protocol development and don't mind our ascetic 'plain old C language' approach and are interested to contribute, feel free to introduce yourself on the OpenBSC mailing list.

October 27, 2009 01:00 AM

Harald Welte: A common misconception: GPRS encryption differs from GSM encryption

In the last couple of months, I've met numerous people with varying backgrounds all sharing one misconception about cellular networks. Even I was not very clear on this until recently: GPRS encryption is very different from GSM encryption. Most people know it uses different algorithms, sure. But it also operates on a completely different layer in the protocol, and is between two different entities.

Encryption in GSM networks happens on the Layer 1 of the Um interface between the MS and the BTS. It is a simple point-to-point encryption of only one particular network interface. There is no more encryption as soon as the signalling, voice and SMS data leaves the BTS (on a microwave link or actual land line) to the BSC, MSC, SMSC and other network elements.

In GPRS, the encryption is not on the Layer 1, but on the Layer 2 (LLC) of the Um interface. As the LLC layer is not terminated at the BTS but at the SGSN, the data is still encrypted when it leaves the BTS.

This means, among other things, that attacks like eavesdropping on unencrypted microwave links do not work for GPRS anymore.

October 27, 2009 01:00 AM

Harald Welte: German constitutional court hearing on data retention

On December 15, there will be a court hearing by the German Constitutional Court (Bundesverfassungsgericht) on the law on data retention which was enacted in 2007 and has been valid since January 1st, 2008.

This law requires any communications network operator to keep digital records of every voice call and e-mail, including sender and all recipient addresses.

This law was required by the European Union Directive 2006/24/EC, one of those paranoid reactions against the perceived threat of terrorism. Laws implementing this directive in the EU members Romania and Bulgaria have already been invalidated by their respective constitutional courts.

In Germany, more than 34,000 (I'm not kidding) people have filed constitutional complaints against this law. This is the first time that such a significant number of individual citizens has ever filed a constitutional complaint. The documents about power of attorney alone have filled 12 large boxes, each with many folders. As you could probably guess by now, I'm one of those plaintiffs.

As an interim solution, the constitutional court has already decided on March 19, 2008 that such data can only be used under special circumstances, such as only certain criminal offenses, and only if there is already a very strong initial suspicion, and if there is close to no other way to prove or deny the allegations brought forward by the prosecutor.

I hope the court hearing on December 15 will bring the court closer to actually ruling on this case. This has been dragging on for a long time now.

Just like when the constitutional court had a hearing on voting computers, I am planning to be in the audience and want to see live what the constitutional court does with regard to matters that I strongly care about. I hope my registration will make it in time... given the number of plaintiffs I suppose there will be many more people interested in attending the hearing than they have space for. Which raises another interesting issue: I suppose if you are an actual plaintiff, it would be weird if a court refused to let you attend the actual hearing. But which court would hold > 34,000 plaintiffs? ;)

October 27, 2009 01:00 AM

October 26, 2009

Matt Domsch: Upcoming Fedora Elections

Yes, it’s that time of year again.  Rain is falling, another Fedora release is about to conquer the known world, and volunteers everywhere are busy preparing their ideal Fedora Mission Statements to captivate the electorate.  Fedora’s Winter Election is upon us.

The first order of business is to find an Election Coordinator.  For the last 2 election cycles I have volunteered for this role, with the able assistance of John Rose (inode0), and Thorsten Leemhuis (thl) and others.  This cycle, I would like someone besides myself from the Fedora community to volunteer as Election Coordinator.  Raise your hand, don’t be shy!  If you have been harboring a secret (or public) list of all my mistakes, here’s your chance to set things right!

As Election Coordinator, you will have the opportunity to:

Second, a schedule will need to be set.  At the Board meeting this week, we agreed that it would be nice to hold in-person forums at FUDCon Toronto, December 5-7, for those who can attend.  Our election rules require us to complete the election within 30 days of the Fedora 12 release, so it must end by December 17.  Per Nigel Jones, author of our voting system, most of the votes cast were within the first 2-3 days, so running it Dec 8-17 would be sufficient.

Before these, we typically hold nominations for 2 weeks, and a week for IRC Town Halls to be scheduled.  Thorsten also requested after the last election that we have a few days between end of nominations and beginning of the town halls, to allow time for candidates to be given a set of questions, and sufficient time to answer.

Third, we need to be sure of all the committees who are holding an election.  The committee chairs can assist here.  I believe that the Board, FESCo, and Ambassadors are electing members, and that the Fedora 13 naming election will happen too.  Are there any I missed?

Feedback on prior elections, ideas for how to improve this cycle, and volunteers for Election Coordinator all welcome on the fedora-advisory-board list.

October 26, 2009 10:07 PM

Jesse Barnes: So I followed Paul Mundt into this narrow alley...

Back from Japan at last (I think United lost my sleep schedule on the way home though, trying to retrieve it this weekend has been a challenge).

Both KS and JLS went well I thought. It was really good to connect with some of the Japanese developers that until now I’ve only interacted with through email.

The summit went well this year I thought. We didn’t have a big set of controversial issues to discuss, but we did sort out some development process issues. The highlight for me was the two customer panels. On the first day we had some people from TV and other vendors talk about how they’re using the kernel and other open source software. It’s interesting that some of them are stuck way back on 2.4 and very early 2.6 kernels. Part of the reason is long product development cycles, but mostly it’s because the SoCs used in many products only have support in a limited set of kernels (usually custom patches for specific kernels provided by companies like MontaVista). The “platformization” work done by tglx and the x86 team recently (partly motivated by Intel “Moorestown” support, but also in preparation for more x86-based SoCs in the future) should help with this for x86 stuff. We definitely want to avoid an ARM-like situation where each SoC requires a specific kernel with incompatible firmware and hardware support. I had some good discussions with Linus and Paul on that topic; the tricky part will be ensuring that vendors adhere to some level of standardization in their platform and firmware support. Doing so will have big benefits: upstream kernel support should be better and much more flexible (good for the SoC vendors and their customers), and the platform maintainers should have a much easier job integrating support for new platforms without a huge set of ifdefs and incompatible firmware interfaces. Managed to get a few bugs fixed at KS as well; Ted & Dirk didn’t have anywhere to run when I wanted them to test some patches for problems they’d reported!

The JLS conference was interesting too, with a few good talks on things like barcode delivery of oops info and btrfs.

Tokyo is a pretty amazing city. This was my first trip to Japan and a few of us were fortunate enough to have Paul Mundt guide us for a couple of evenings to explore the city. The narrow alleyways and tiny bars in Shinjuku (at least I think that’s where we ended up) were really fun. We even checked out a Mexican bar called Bonita; Mexican stuff outside the southwest US and Mexico is always interesting, but the Japanese mix made things even more so. Overall a fun night including Japanese Denny’s food, passed out salarymen, and an everything store with some bizarre costumes, including some furry outfits we were tempted to buy… A bit later in the week we had a contrasting experience by going to Seamon (one of the dozens of one star Michelin sushi restaurants in Tokyo) and a high-end scotch and cigar bar afterwards.

Ok now back to catching up on the huge backlog of patches that have accrued due to travel neglect.

October 26, 2009 05:15 PM

Evgeniy Polyakov: Hashes and theory of their cracking

Of course there is no such theory, but the practice of breaking a hash is fascinating for the researcher.

Currently people on netdev@ have started a lengthy discussion about a new hash for the interface name and (optionally?) for the dentry hash, unless I misunderstood the latter.

Anyway, more than a dozen different algorithms have been tested for deviation and speed. It is very interesting to find out which one will be selected.

Actually it is only interesting from one side: how to break it. By breaking I mean creating an application which can generate input data that produces the same hash value after being processed by the selected algorithm.

That's what I did for the Jenkins and Bernstein/Torek (hash * 33) hashes quite a while ago already.

Looking forward to the new hash :)
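
To show what "breaking" means for the simplest of these, here is a small sketch for the multiply-by-33 family. It is an illustration of one well-known weakness, not the tool mentioned in the post; the function names, the 32-bit width and the 5381 seed are assumptions for this example. Since 33*'a' + 'z' equals 33*'b' + 'Y', the blocks "az" and "bY" are interchangeable anywhere in the input.

```python
from itertools import product


def hash33(data: bytes, seed: int = 5381) -> int:
    """Bernstein-style hash: h = h * 33 + byte, truncated to 32 bits."""
    h = seed
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


# Two 2-byte blocks with identical effect on the hash state:
# 33 * ord('a') + ord('z') == 33 * ord('b') + ord('Y') == 3323
BLOCKS = (b"az", b"bY")


def colliding_inputs(n_blocks: int) -> list:
    """Return 2**n_blocks distinct inputs that all hash to the same value."""
    return [b"".join(combo) for combo in product(BLOCKS, repeat=n_blocks)]


if __name__ == "__main__":
    inputs = colliding_inputs(3)
    for item in inputs:
        print(item, hex(hash33(item)))
```

Every additional block doubles the number of colliding inputs, and the collisions hold for any seed, because both blocks transform a given state identically.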

October 26, 2009 05:09 PM

Rusty Russell: Google Analytics For WordPress Upgrade Fail

Had an old copy of the “Google Analytics For WordPress” plugin lying around (which didn’t seem to put anything in my blog source), but after upgrading it, it kept saying “Google Analytics settings reset to default” whenever I tried to change anything.

See this thread which talks about the problem and waves at the solution.  Here’s what you need to do, if like me you’re not a WordPress/MySQL junkie and want simple instructions:

  1. Find your wordpress config file.  Mine, on a Debian system, was in /etc/wordpress/config-rusty.ozlabs.org.php
  2. It will contain lines like this:
    define('DB_NAME', 'rustyozlabsorg');
    define('DB_USER', 'rustyozlabsorg');
    define('DB_PASSWORD', 'g1812fbsa');
  3. You need to open the mysql database, we’re going to do some brain surgery to remove the old cruft:
    $ mysql --user=rustyozlabsorg --password=g1812fbsa rustyozlabsorg
  4. This should give you a “mysql>” prompt, where you type the following:
    DELETE FROM wp_options where option_name='GoogleAnalyticsPP';
  5. It should say something like “Query OK, 1 row affected (0.00 sec)”. Then type
    quit;
  6. You’re done.  Reload your setting page and try again.

Hope that gets into Google and helps someone else who can’t figure out what’s going on!

October 26, 2009 02:20 PM

Evgeniy Polyakov: Lua in NetBSD kernel

Lua in NetBSD kernel.

As you might know, NetBSD already has an XML parser in its kernel for some obscure Apple protocol. Now they want a scripting language as well.

Btw, there was some work to add Lua bindings for the elliptics network by Daniel Poelzleithner, but it looks like it is not very active at the moment. There are Perl bindings for this distributed hash table storage.

October 26, 2009 12:24 PM

Pavel Machek: fastest clock

Gaining ten minutes in two hours tipped me off a bit -- that's too bad even for a $3 clock.

It turned out that the clock really can keep time -- given a "good enough" power supply. It was clearly designed with its small batteries in mind. Give it 2.4V (2x NiMH), and it gains ten minutes per two hours (with other glitches, like sometimes going into "time not advancing" mode, where seconds change but get reset to 0 when you stop watching them). Give it 3V (2x primary cell), and it fails to work, producing a distorted, blinking display. Give it 1x primary cell and 1x NiMH, and it fails to work, too. But give it 1x new primary cell and 1x almost empty primary cell, for 2.7V total, and voila, it actually starts keeping time...

October 26, 2009 08:15 AM

Rusty Russell: Rusty Finally Enters Web 1.1

Jeff Waugh long ago suggested I switch to Wordpress.  I had a few toy blogs with WP, and it worked well, but the final motivation to stop banging out raw HTML and feeding it to blosxom was that I have a new Android phone (I lost my second-hand one sometime at the last farm visit, so it was time to ask the Ozlabbians who know this stuff what to get: the answer was the HTC Magic).  And being able to blog on the train increases the chance that I’ll actually blog regularly.

October 26, 2009 06:49 AM

Harald Welte: Qualcomm launches Open Source subsidiary

As several news sites have been reporting (here a report from LinuxDevices.com), Qualcomm has announced the launch of an Open Source Subsidiary.

As usual, I very much welcome such a move. Qualcomm is one of those companies that have a very bad reputation in the Open Source and particularly Linux community. They have so far failed to provide user manuals or other reference documentation for any of their parts. They haven't even managed to publish reference documentation on the external interfaces such as the AT command dialect or the binary shared memory protocols that are used to interface the GSM/CDMA/WCDMA baseband in their products.

So when it comes to an Open Source project that wants to interoperate with Qualcomm's hardware, they have so far been doing everything to make that as hard as possible. Neither does the community at large have access to the information that it needs, nor do the Qualcomm customers get the respective documents under a license that allows them to actually contribute to Open Source projects.

If that documentation were available, or if Qualcomm were actually working on FOSS licensed drivers and contributing those to mainline, the support for Qualcomm's hardware in Linux would be much better - resulting in less time to market for companies interested in using Qualcomm's parts in their products.

The actual press release does not indicate that this newly-founded subsidiary truly understands this. It speaks of hardware-optimizing the performance of mobile operating systems. That sounds like "we'll take the existing code, make a fork, do non-portable micro-optimizations and ship that to our customers". It does not mention actually contributing to the community or understanding the benefit of the Open Source development model.

I remain to be convinced. Let's hope Qualcomm has scored somebody with a lot of actual hands-on Open Source community experience to advise them properly.

October 26, 2009 01:00 AM

Harald Welte: Palm Pre: Nice UI, severe lack of functionality

Using the Palm Pre: Anything but an exciting experience :(

During the last week I've started to use my new Palm Pre. (For those of you who've been living under a rock: the Palm Pre is a smartphone powered by an operating system called WebOS, which is in turn powered by the Linux kernel and lots of other "standard" Linux programs like glibc, alsa, udev, ...)

This adherence to a more standard Linux userland makes the Pre much more attractive than the Android based products out there. Android is reinventing the wheel everywhere, and things that Linux users and developers have been taking for granted during the last five to ten years simply don't exist on Android.

To be honest, the experience was anything but exciting. More about that later. Let's start with the positive side of things. Yes, I like the device for the following facts:

Which is what got me excited and made me buy one of those expensive devices.

However, looking at it from a strict user point of view, I am not very happy with it. It simply lacks so much in functionality that it is not even funny.

That is simply the user point of view. I also have many more technical points from a developer perspective, but that is probably better kept for another post. Meanwhile I'm not sure if the Pre was all that much of a good idea. The N900 is coming up next, and will be much closer to the standard Linux userland stack (including X11, GTK, Qt, ...) than the Palm Pre is.

October 26, 2009 01:00 AM

October 25, 2009

Harald Welte: Symbian kernel Open Source release and Tanenbaum

As most people have noticed by now, the Symbian Foundation has released the source code of their microkernel under an open source license. While any open source release of formerly proprietary software is something I warmly welcome, I doubt that it will take off as an actual open source project.

There's a difference between releasing software under a FOSS license and running a successful FOSS project. The latter involves a sufficiently large community of developers, ways for them to contribute, ...

Especially with special purpose code such as an operating system (kernel) for mobile devices, very few people will show interest as long as there is no actual hardware where they can run the software, without or with custom modifications. Sure, there will be academic interest and people who will look at the source code to find ways to exploit potentially existing security weaknesses, but no community of people who work on it since they will practically use it on their own device.

So what I'd do if I were the Symbian Foundation: I would release an actual mobile phone which is open enough for people to run (modified or unmodified) recompiled parts of the Symbian codebase which are now available as open source. This way it will be much more appealing. However, even at that point, many other parts of the system are (or even will forever be?) closed, limiting the amount of impact. Furthermore, since modified versions cannot be installed on any other regular non-developer phones, a contribution to the codebase cannot benefit many people. Just compare that with contributing to the mainline Linux kernel, where a contribution will be used on almost every server/workstation/laptop after the next distribution (and thus kernel) update.

Another thing that really shocked me is the following quote by Andrew S. Tanenbaum: 'I would like to congratulate Symbian for not only making the source code of its kernel open source, but also the compiler and simulation environment,' said Andrew S. Tanenbaum.

However, the compiler was not made open source. It is released as proprietary binary code, and is only "free as in beer" for organizations up to 20 employees. So either Tanenbaum did not really look at the hard facts of what was being released, or he was misquoted in a really bad way! That should not have made it into the final release, as it's now a damaging statement for both the Symbian Foundation and Mr. Tanenbaum.

By the way, according to a lwn.net comment thread, they're working on making it able to compile under gcc, and they're actually accepting patches, which is of course great.

Despite my negative comments: I wish them as much luck and success as possible with their new open source Symbian kernel. I personally just am not seeing it turning into a vibrant, community-maintained project - and I hope the founders of the Symbian Foundation did not start the project based on that assumption and will in the end perceive it as a negative experience when evaluating the open source move some years down the road.

One final note: The fact that they chose the EPL as license is really strange, as it prevents exchange of code with the major existing FOSS kernel projects (Linux, *BSD). Not that I think there is much to be exchanged, given the microkernel approach...

October 25, 2009 02:00 AM

October 24, 2009

James Morris: SELinux Sandbox slides available, et cetera

I’ve just given a presentation on SELinux Sandboxing at FOSS.my 2009 in Kuala Lumpur — the slides are available for download as a PDF file here.

The presentation was an overview of sandboxing as a concept; how we can enhance it with MAC security; and how it’s being implemented in Fedora 12 with SELinux. I also discussed the need for a standard security API for Linux, so that developers will be more inclined to incorporate enhanced security support in their software, and to generally increase security adoption via standardization. We’ve seen this work well thus far with sVirt, so it should be feasible.

The SELinux Sandbox stuff will be familiar if you’ve seen Dan Walsh’s recent talks on the topic, although in this case, I included his cell phone number in the presentation if people have detailed questions, seeing as he’s not here in person.

It’s been yet another busy conference trip, with KS and JLS last week — I attended some of the JLS security talks and a Japanese Secure OS user group dinner. It was a very interesting and productive time.

I dented this a few days ago, but got no answer (and also dragged DaveM to see it & he couldn’t figure it out, either): does anyone know what this mystery object is?

Mystery object

It’s a spinning, blue and white striped cone near the ceiling of an underground Tokyo subway entrance.

October 24, 2009 08:00 AM

October 23, 2009

Evgeniy Polyakov: Week from lytdybr point of view

I suppose I will start a new format for small events unrelated to hacking: I will collect them into a once-per-week post with some descriptions.

First, sport. I climb twice a week and see fair progress in power endurance. But I am still trying to break in my new shoes (well, not that new, I bought them a month or more ago, but they are still very tight), so the results are biased: I cannot control my left foot, since it suffers quite a lot after some time in the shoe. Still, I got a noticeable jump in training level. For example, yesterday I managed to climb three 6a+ routes on the negative slope without rest in between. And while this is not some complex route I could be proud of, it is complex enough for 3 times in a row.
I check my power endurance level by measuring how tired I am after the training, exercise, or route. In the above case it was hard to breathe, but not hard to secure the holds, so it looks like I had some endurance left after this run, but muscle breathing was close to its limit.
If the training load continues to increase for another couple of weeks, I will reach a very good level, and my shape will allow me to start working on really complex routes. Namely, I want to do 6c and higher on the negative slope, and the local 6c is actually at least 7a in the common sense (yes, there is a special table printed to match local and usual route grades :)
So, no matter what, I like how things go right now.

Second, music. I did not play with the teacher last weekend, but I am waiting for Sunday for a lesson. It is rather hard to see any difference in playing over such a short time interval, but I can confirm that my previously reached level is still there.
Today I played in the office for some time, maybe an hour or so, after most people had gone home. And while nothing incredible was played, I found that I can play and improvise some interesting things from virtually nothing. It happens quite rarely; actually I can remember only a couple of interesting and very small things, maybe 2-3 measures long, but they were real. So, I slowly move forward. For example, I can rather easily start playing from the sheet, although quite slowly, with a clean attack and sharp sound up to natural 2F or so. It requires some warming up first, though. I also learned a fair number of scales, and while they are mostly major ones, I know how to build one (tone, tone, half-tone and so on) from any note, as well as the magic shift to get the minor scale (the note a tone and a half below the major tonic becomes the tonic of the minor scale, while all the notes stay the same). I also played some bits in pentatonics, where it is easier to improvise, and some blues scales, although I do not actually remember them; I played from the sheet.
Sometimes I play some very simple bits on the piano. Nothing really interesting, but I still plan to start doing it more seriously, namely to find a teacher locally.
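
The scale-building recipe mentioned above (tone, tone, half-tone and so on) is mechanical enough to write down. The sketch below is only an illustration: it spells every note with sharps (an assumption, so flat keys such as F major come out with A# instead of B flat), and it takes the relative minor to start on the sixth degree of the major scale, which is a tone and a half below the major tonic.

```python
# Twelve semitones, spelled with sharps only (a simplifying assumption).
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Major scale pattern in semitones: tone, tone, half-tone, tone, tone, tone, half-tone.
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]

def major_scale(tonic):
    """Return the seven notes of the major scale starting on tonic."""
    index = NOTES.index(tonic)
    scale = [tonic]
    for step in MAJOR_STEPS[:-1]:  # the last step only leads back to the tonic
        index = (index + step) % 12
        scale.append(NOTES[index])
    return scale

def relative_minor(tonic):
    """Natural minor scale that uses the same notes as the major scale on tonic."""
    major = major_scale(tonic)
    return major[5:] + major[:5]   # rotate so the sixth degree comes first

print(major_scale("C"))
print(relative_minor("C"))
```

For C major this yields the A natural minor scale: the same seven notes, just started from A.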

And while my musical ear as well as my playing technique are quite far from where I want them to be as the nearest goal, I see that there is a way to reach that level, although not a simple or quick one.

Third, car and apartments. Actually nothing major happened here. I do enjoy driving my car, but it is a major lifestyle change, which does not allow me to drink whenever I want anymore. And that kind of disappoints me, since I actually like this.

   
Bottles of Ballantine's and Jameson

I can not start this one, for example... But things are not that bad - I will leave my car on the office parking and enjoy the taste :)

There were some other good and not so good things that happened, but they are unlikely to be interesting to anyone, so let's draw the line: I expect things to be just fucking cool, and that's what I like.

Stay tuned, there will be some interesting notes from the technical part of the brain: elliptics network pre-production testing, POHMELFS status and its details and maybe also something new.

October 23, 2009 09:28 PM

October 22, 2009

Matt Domsch: Fedora is Self-Hosting

Fedora 12 (Beta available now) is self-hosting.

What does this mean? Simply put, it means that you can use a copy of Fedora 12 to rebuild, from source, all* of Fedora 12 again.

Why is this important? One of the key tenets of Free and Open Source software is that anyone can get a copy of the source code, make modifications to it, build it, and use the modified version. Simply publishing the source code, without also allowing people a way to rebuild and use that code, doesn’t accomplish this goal.

Source code tends to bitrot over time. Libraries that your code uses will change, get updated, add features and bugfixes. Compilers improve and update to later standards. Your code needs to keep up. So, for each Fedora release, we run a “Fails To Build From Source” pass, which rebuilds every package in the distribution, using the packages in the distribution. We started the Fedora 12 development cycle with about 400 packages which couldn’t build (still, less than 5% of the total packages) for various reasons. Over the last few months, members of the Fedora Packager community have been whittling away at these, fixing their packages, sending patches to their respective upstream projects, and therefore improving the quality of the open source ecosystem as a whole.

The result?  You see immediate improvements (smaller package sizes due to new compression methods being used, future-proof security through the use of stronger hashes to guarantee package integrity), and increased flexibility should you wish to remix Fedora for your own purposes.

Thank you packagers!

* Truth in advertising: All in this case means 8448 of the 8485 packages in the Fedora 12 tree. There are 37 problematic packages (0.4%), none critical to a vast majority of users, which still need some love.
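
The percentages quoted in this post are easy to check. The few lines below redo the arithmetic; note that the "about 400" figure is compared against the final package count of 8485, which is an assumption, since the size of the tree at the start of the cycle is not given here.

```python
total = 8485            # packages in the Fedora 12 tree
built = 8448            # packages that rebuild from source
failing = total - built

print(failing, f"{failing / total:.1%}")    # remaining problematic packages

# Assumption: the start-of-cycle failures are measured against the same total.
initial_failures = 400
print(f"{initial_failures / total:.1%}")    # share failing at the start of the cycle
```

Both results agree with the post: 37 packages is roughly 0.4% of the tree, and 400 packages is a bit under 5%.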

October 22, 2009 09:03 PM

Pavel Machek: seriously crappy clock

...the clock just gained 10 minutes in a little under 2 hours. That's way too much, even for cheap Chinese stuff. Perhaps it does not like rechargeable AA batteries? What is going on?

October 22, 2009 08:59 PM

Pavel Machek: crappy spitz, crappy clock

Well, I'm pretty sure it is not just emacs acting funny. I very probably have hardware problems on my zaurus -- because other zauruses do not behave like this a few times a day:

  CC      arch/arm/mach-pxa/spitz.o
arch/arm/mach-pxa/spitz.c: In function 'spitz_wait_for_hsync':
arch/arm/mach-pxa/spitz.c:413: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.3/readme.bugs> for instructions.


And now I have some crappy clock, too. I got myself a keychain LED-projection clock -- well, for $3, so I probably should not complain -- with the intention of making it run 24/7. (Many people are assembling them from parts, like this, but I'm not enough of a hardware hacker to do that.)

So I replaced the tiny button batteries with rechargeable AAs, thinking that hopefully it would work for a few days... Well, I should have done the maths. The batteries were pretty much empty after a few hours :-(... but I got a nastier surprise: my $3 clock is gaining like 10 minutes a day :-(... which is even worse than the clock in my notebook, losing like 2 seconds a day.

2 seconds per day is bad, 10 minutes a day is unusable. At least I will not have to build a proper power supply for the projection clock...
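
To put the two drift figures on the same scale, clock error is often quoted in parts per million (ppm) of elapsed time. A quick sketch of that conversion, using only the numbers from this post (the function name drift_ppm is just for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60

def drift_ppm(error_seconds, interval_seconds):
    """Clock error expressed in parts per million of elapsed time."""
    return error_seconds / interval_seconds * 1_000_000

notebook = drift_ppm(2, SECONDS_PER_DAY)        # losing about 2 seconds a day
keychain = drift_ppm(10 * 60, SECONDS_PER_DAY)  # gaining about 10 minutes a day

print(f"notebook: {notebook:.0f} ppm, keychain: {keychain:.0f} ppm")
print(f"ratio: {keychain / notebook:.0f}x")
```

That works out to roughly 23 ppm for the notebook against nearly 7000 ppm for the keychain clock, a factor of 300.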

October 22, 2009 08:41 PM

Harald Welte: FOSS.in CfP running for quite some time

In case you have been sleeping throughout last week: On October 16, the FOSS.in Call for Participation was released.

FOSS.in is one of my regular conferences, and probably the only event aside from the Chaos Communication Congress that I managed to visit in five consecutive years. I'm looking forward to this year's incarnation, and I'll definitely do my part to make the event more interesting :)

I hope everyone will now hurry to submit their proposals for talks, workshops and work-outs! It's a collaborative event, and it lives by your contribution.

October 22, 2009 02:00 AM

October 21, 2009

Matt Domsch: Installing Fedora 12 and saving the environment

If you’re like me, chances are you have a system or three with DVD / CD burners in them.  Aside from their use for backups, I have tended to use my burners to create a Linux install DVD, do my install, and then give the disc to someone else, or (ashamedly) throw it away.  What a waste.

I also prefer to do network-based installs, where I don’t have to download a whole 4GB DVD image, or even 700MB CD image, and burn it.  Instead, I download the 160MB “netinst” network install ISO, burn that to a CD, boot that CD, and point the installer at a Fedora mirror to grab all the packages.  This works great, but still, I’m left with a netinst CD when I’m done that I may no longer need.

Enter isohybrid, new in Fedora 12 (Beta).  I’ve got a few USB keys of various sizes, most larger than 160MB.  Instead of burning a CD (which I can still do, the process is unchanged), I can write the netinst ISO file directly to a USB key, and boot it.  Amazing!

Give it a try when you install Fedora 12 Beta, and save one more CD from becoming landfill.

$ wget http://download.fedoraproject.org/pub/fedora/linux/releases/test/12-Beta/Fedora/x86_64/iso/Fedora-12-Beta-x86_64-netinst.iso
$ sudo dd if=Fedora-12-Beta-x86_64-netinst.iso of=/dev/sdc bs=1M
$ eject /dev/sdc

Replace /dev/sdc with the actual device name of your USB key. You will want to unmount any file systems that are mounted on that key before writing to it.
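
Since dd will happily overwrite whatever device it is given, it can be worth checking for mounted file systems first. The following is a rough sketch, not part of Fedora or syslinux: it scans text in the format of /proc/mounts for entries on the target device. The helper name and the sample data are made up for illustration, and the simple prefix match is deliberately crude.

```python
def mounted_partitions(device, mounts_text):
    """Return (source, mount_point) pairs whose source starts with device.

    mounts_text is expected to look like the contents of /proc/mounts.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith(device):
            found.append((fields[0], fields[1]))
    return found

# Hypothetical sample; on a real system use open("/proc/mounts").read().
sample = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdc1 /media/usbkey vfat rw,nosuid,nodev 0 0
proc /proc proc rw 0 0
"""

print(mounted_partitions("/dev/sdc", sample))
```

Anything this reports should be unmounted before running the dd command above.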

Then boot that USB key, and you’re off to the races. When prompted for which local file system contains your install image, simply click “Back”, select the “URL” install method, and use a URL of your favorite mirror.

Special thanks to H. Peter Anvin for writing isohybrid and including it in syslinux.

October 21, 2009 05:04 PM

Evgeniy Polyakov: NTT Cyber Space Labs presents Sheepdog - distributed storage system for KVM

MORITA Kazutaka wrote:

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.

The following list describes the features of Sheepdog.

* Linear scalability in performance and capacity
* No single point of failure
* Redundant architecture (data is written to multiple nodes)
- Tolerance against network failure
* Zero configuration (newly added machines will join the cluster automatically)
- Autonomous load balancing
* Snapshot
- Online snapshot from qemu-monitor
* Clone from a snapshot volume
* Thin provisioning
- Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, - = on our todo list)

More details and download links are here:
http://www.osrg.net/sheepdog/

Note that the code is still in an early stage.
There are some critical TODO items:

- VM image deletion support
- Support architectures other than X86_64
- Data recovery
- Free space management
- Guarantee reliability and availability under heavy load
- Performance improvement
- Reclaim unused blocks
- More documentation

IMHO, block-level distributed systems are dead overall, although they have their niche.

October 21, 2009 07:00 AM

Harald Welte: Differential Power Analysis on mobile phone?

cnet.com reports some researchers succeeding in performing a differential power analysis (DPA) on a mobile phone in order to "steal cryptographic keys that are used to encrypt communications and authenticate users on mobile devices".

This sounds fishy. At least on GSM phones, the keys for authentication are stored inside the SIM card. And somebody claiming that, within a mobile phone with its many analog RF and digital circuits (causing interference and noise), he can still perform a DPA on the SIM card simply sounds unreasonable.

I would like to see those results being fully disclosed and independently reproduced before giving them much credibility.

The current encryption session key is not used for authentication, it is very short lived (typically 1 to 5 calls before a new key is negotiated), and it is not considered very safe anyway. The phone writes it to the SIM card, and malware programs installed on the phone are likely to get access to that key anyway. So no need for a DPA here...

October 21, 2009 02:00 AM

October 20, 2009

Pete Zaitcev: Rumor-mongering, tribal knowledge

In a comment to mdomsch's entry about the newly introduced support for PRNG of TPM in rngd, Arjan asks why keep this in userspace. Why, indeed? Please pardon me for engaging in rumor-mongering, but I heard that the problem is maintaining the quality of the random stream. In other words, lots of entropy sources may go bad (usually by getting stuck on a certain value, but not only that). Detecting that in the kernel would be too difficult.

Now, is this credible? It is to me, but I did not look at the source.
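
To make the "stuck on a certain value" failure concrete, here is about the simplest check one could imagine: flag a source when the same sample repeats too many times in a row. This is purely an illustration of the idea, not a description of what rngd does, and the threshold of 8 is an arbitrary assumption.

```python
def longest_run(samples):
    """Length of the longest run of identical consecutive samples."""
    longest = current = 0
    previous = object()  # sentinel that compares unequal to every sample
    for value in samples:
        current = current + 1 if value == previous else 1
        longest = max(longest, current)
        previous = value
    return longest

def looks_stuck(samples, max_run=8):
    """Crude health check: True if any value repeats max_run times in a row."""
    return longest_run(samples) >= max_run

print(looks_stuck([0x5A] * 16))                    # a source frozen on one byte
print(looks_stuck([3, 141, 59, 26, 53, 58, 97]))   # varied samples
```

A check like this only catches the most obvious failure; a source can go bad in subtler ways, which is the "but not only that" part of the rumor.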

I'm often on the receiving end of it too. For example, yesterday I wanted to create a wildcard A/AAAA record (it's used for S3 bucket selection in tabled). All examples, without fail, used the fully qualified syntax "*.sub.dom.com.", unlike all other entries in the SOA zone. And why? Nobody knows, they just do it.
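
For what it is worth, the usual zone-file rule is that a name ending in a dot is taken as absolute, while any other name has the zone origin appended. The sketch below mimics that rule; the origin "sub.dom.com." is an assumption made for the example, and qualify is a made-up helper, not part of any DNS server.

```python
def qualify(name, origin):
    """Expand a zone-file owner name to a fully qualified name."""
    if name == "@":
        return origin                # "@" stands for the origin itself
    if name.endswith("."):
        return name                  # trailing dot: already absolute
    return f"{name}.{origin}"        # relative name: origin is appended

origin = "sub.dom.com."              # assumed $ORIGIN of the zone

print(qualify("*", origin))                 # relative wildcard
print(qualify("*.sub.dom.com.", origin))    # the form used in the examples
print(qualify("*.sub.dom.com", origin))     # same text without the final dot
```

Under that assumption the first two forms name the same record, while forgetting the final dot produces "*.sub.dom.com.sub.dom.com.", which may be one reason examples spell the name out in full.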

October 20, 2009 01:36 PM

Kernel Podcast: 2009/10/18 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091018.mp3

From London, England, for the weekend of October 18th, 2009, I’m Jon Masters with a summary of the weekend’s LKML traffic.

NOTICE: We got quite behind for a while. Rather than keep being two weeks behind, and now that the merge window is closed for 2.6.32, I am going to jump forward to the present. I will fill in the two week gap through additional back episodes – and if that fails, I’ll do a “summary” show of that period (and mention all the cool things from URCU to the latest Git, stable kernels, 2.6.32-rc3, 2.6.32-rc4, a clone3(!) system call proposal, a trace types registry proposal, and Grant Likely’s awesome work on flattened device trees finally getting properly recognized in the MAINTAINERS file).

Remember, I do this in my spare time and without any help from others. I wanted to make sure the merge window was covered, which is why we lagged, and as Linus says, things have been calm enough over the past two weeks. You could always drop me a line and help me form a group of podcasters. I am interested in hacking up a TurboGears front end to a special LKML site that fellow podcasters could use to easily prep the show – maybe when I’m traveling over the holidays I will spend some time poking at that.

In today’s issue: EDF, ext4, fast symbol resolution, M68K, and the staging tree.

EDF. Raistlin posted the latest RFC version of the EDF (Earliest Deadline First) scheduler patches for wider kernel community consideration, including links to various papers, talks, and news coverage, and also thanking the community for feedback at the recent RTLWS (Real Time Workshop) in Dresden. The patches are available via various git repositories covering users of mainline, sched-devel, and also the preempt-rt patches.
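
For anyone unfamiliar with the policy itself: Earliest Deadline First always runs the ready task whose deadline is nearest. The toy sketch below shows only that selection rule, using a heap keyed on absolute deadline. It has nothing to do with the code in the posted patches, and the task names and deadlines are invented.

```python
import heapq

def edf_order(tasks):
    """Yield task names in earliest-deadline-first order.

    tasks is an iterable of (absolute_deadline, name) pairs.
    """
    heap = list(tasks)
    heapq.heapify(heap)              # smallest deadline ends up at the top
    while heap:
        deadline, name = heapq.heappop(heap)
        yield name

# Hypothetical ready queue: (deadline in ms, task name).
ready = [(30, "logger"), (10, "audio"), (20, "video")]
print(list(edf_order(ready)))
```

A real scheduler also has to deal with tasks arriving over time, preemption, and tasks that overrun their budget, none of which this sketch attempts.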

ext4. Parag Warudkar posted a story involving various attempts to use ext3, XFS, and ext4 on his laptop as a root filesystem, and in particular the handling after an unclean forced shutdown due to a failed resume from sleep. His experience anecdotally suggests that ext4 has become more intolerant to unclean shutdowns and so he asks, “is this to be expected or it’s just sheer coincidence?”. Ted Ts’o followed up, referencing a longstanding bug on kernel.org that he has mentioned before. He says “it’s been frustrating because I have not been able to replicate it myself; I’ve been very much looking for someone who is (a) willing to work with me on this… and (b) who can reliably reproduce this problem”. Maybe Parag can help.

Fast symbol resolution. Carmelo Amoroso posted to let everyone know about his “Fast LKM symbol resolution” patches. These add a SysV ELF hash table to speed up module symbol resolution at load time. I was at the Embedded Linux Conference as this year’s keynote speaker. As I expected, Alan Jenkins was also interested in taking a look at this as he has also been looking at ways to speed up symbol resolution through using a binary search. Clearly, as Alan notes, only one of the two solutions is going to work out – so the two of them can now help to figure out which one that is going to be :) Greg Kroah-Hartman added that he is happy to see the work being done, as obviously most distributions are “forced” to ship very modular kernels.
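
To illustrate the two competing approaches, here is a small sketch of both lookups: one using the classic SysV ELF hash function with buckets, the other a binary search over sorted names. It is an illustration only and not taken from either patch set; the bucket layout is simplified (real ELF hash sections use bucket and chain arrays), and the symbol addresses are invented.

```python
import bisect

def elf_hash(name):
    """SysV ELF hash of a symbol name, kept to 32 bits."""
    h = 0
    for byte in name.encode("ascii"):
        h = ((h << 4) + byte) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xFFFFFFFF         # clear the top four bits
    return h

def build_buckets(symbols, nbuckets):
    """Group (name, address) pairs by hash bucket."""
    buckets = [[] for _ in range(nbuckets)]
    for name, address in symbols.items():
        buckets[elf_hash(name) % nbuckets].append((name, address))
    return buckets

def hash_lookup(buckets, name):
    """Hash the name once, then compare only within one bucket."""
    for candidate, address in buckets[elf_hash(name) % len(buckets)]:
        if candidate == name:
            return address
    return None

def bsearch_lookup(sorted_names, addresses, name):
    """Binary search over names sorted ahead of time."""
    i = bisect.bisect_left(sorted_names, name)
    if i < len(sorted_names) and sorted_names[i] == name:
        return addresses[i]
    return None

# Hypothetical exported symbols; the addresses are made up.
symbols = {"printk": 0x1000, "kmalloc": 0x2000, "kfree": 0x3000}
buckets = build_buckets(symbols, nbuckets=4)
names = sorted(symbols)
addrs = [symbols[n] for n in names]

print(hex(hash_lookup(buckets, "kfree")), hex(bsearch_lookup(names, addrs, "kfree")))
```

The trade-off in this toy version: the hash table needs extra bucket data but touches only a few candidates per lookup, while the binary search needs only sorted names but does on the order of log2(n) string comparisons.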

M68K. Steven King posted a script and a patch that enables merging m68knommu and regular m68k into a single tree, at the inspiration of Sam Ravnborg’s recent efforts to merge the include files. This is a big win because it reduces the amount of code duplication in having two “architecture”s.

Staging. Various discussion has been taking place concerning the impact of effectively removing a driver via the staging tree. This is the case of what to do when an improved or next-generation driver is being worked on via the staging tree and will replace a driver that has been removed from mainline. Questions included how should users be made aware of this (given that they are likely using a distribution kernel and thus will only notice many months after the removal occurs), and what onus should be placed upon vendors.

In today’s pull requests: some libata fixes from Jeff Garzik, some vbus-enet and vbus fixes from Gregory Haskins (fixing an “illegal” use of a GFP_KERNEL kmalloc within a DEVADD, detected via lockdep and not really seen in the wild), some AMD64 EDAC fixes for 2.6.32-rc6 from Borislav Petkov, some device mapper updates for 2.6.32-rc6 from Alasdair Kergon, some KVM updates against 2.6.32-rc5 from Marcelo Tosatti, some input updates from Dmitry Torokhov, and some inotify/dnotify/fsnotify updates from Eric Paris.

In today’s miscellaneous items: ongoing debate as to the best way to do TSC emulation within Xen (and other virtualized guests in general), a question as to why a software RAID device undergoing reconstruction would cause large numbers of processes to get stuck in a “D” state from Holger Kiehl, a patch adding const qualifiers to various users of quota_format_ops from Alexey Dobriyan, an x86 patch from Andreas Herrmann making use of a new MSR that conveniently includes NodeID and number of nodes per processor meta-data, some thermal patches from Roel Kluin, some Kconfig cleanup patches for powerpc from Kumar Gala, version v0.30 of checkpatch (including a fix for the perl warnings that Andrew Morton had managed to trigger previously), version 3 of some ACPI docking support cleanup patches from Alex Chiang, concerns about a hang on boot when using kgdb from Peter Teoh, a note that the rt2x00 wireless project’s mailing list is actually moderated (although the MAINTAINERS file did not list this fact previously) from Bartlomiej Zolnierkiewicz, a rant about rfkill userspace visible interface changes between 2.6.30.2 and 2.6.31.4 from Olivier Galibert, some miscellaneous MAINTAINERS file cleanups from Joe Perches, and a couple of BKL removal patches from John Kacur (thanks for that, John!).

In today’s announcements: BFS v0.304 stable release. Con Kolivas announced the first officially stable release of his “Brain Fuck Scheduler”. Since the patch is quite large, he posted a URL to download it. Citing the usual warnings about development code, he says it is “known to be quite stable”, though it is apparently relatively easy to trigger a well known keyboard+Xorg failure that has recently been discussed and deemed not to be a BFS issue specifically. He also includes a link to the latest version of the BFS FAQ.

Git version 1.6.5.1. Junio C Hamano announced version 1.6.5.1 of the Git SCM (Software Configuration Management) tool as used in development and maintainership of the Linux kernel. The latest release fixes an infinite loop bug when processing corrupted packs, adds a MiB/s download speed listing for fast links, and includes various other fixes.

Sparse 0.4.2. Christopher Li announced version 0.4.2 of the sparse kernel source code checker tool as originally written by Linus Torvalds. He is the new maintainer, as previously mentioned on the sparse mailing list, and he thanks Josh Triplett for previously maintaining the project. He also took the opportunity to announce a new kernel.org wiki for the sparse project.

The latest kernel release is 2.6.32-rc5, which was released by Linus on Thursday evening at 18:11:49 Best Coast Time (PDT). As Linus has said several times, this is a “short week” release since he will be at the annual Kernel Summit in Japan and doesn’t want to be doing horribly jetlagged releases. By far most of the changes (90%) since -rc4 are in drivers, and Linus includes a handy git command that you can use to visualize the size of them. Linus hopes that no new regressions were added, noting “like that ever happens”.

Greg Kroah-Hartman announced review patches for the 2.6.31.5 stable kernel.

Stephen Rothwell posted a linux-next tree for September 16th. Since Thursday, there was a new “devicetree” tree (thanks to the awesomeness of that work), the linux-next “fixes” tree still contained a build fix for powerpc/kvm, the kbuild tree still had a build failure that required Stephen to remove include/asm/asm-offsets.h from his object tree, and the tty tree lost its build failure. The total tree count increased to 144 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

October 20, 2009 10:58 AM

Kernel Podcast: 2009/10/01 Linux Kernel Podcast

Audio: COMING SOON

For Thursday, October 1st, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: concurrent workqueues, DRBD, VFAT, and writeable overlays.

Concurrent workqueues. Tejun Heo posted an RFC patch series implementing concurrency managed workqueues. The basic premise is that having such an implementation in the kernel keeps the individual workers from having to do it for themselves. The implementation adds a single shared pool of workers per cpu and will attempt to keep the CPU loaded up with as much deadlock-free work as is possible. The code is quite intrusive, eliminating RT support from workqueues and touching a lot of code (including the scheduler), but Tejun thinks that overall complexity will decrease and other code could be removed. David Howells was interested in how this might replace slow-work, and he posted some followup questions for Tejun.

DRBD. “Roland” (devzero) mailed concerning recent comments surrounding DRBD, the distributed replicating block device. He was concerned because a number of people have expressed an interest in these patches not being merged, while DRBD has already been out of tree for around 8 years, and isn’t in staging. He would like to see some more satisfactory resolution for “brilliant things like these” than have them perennially sit out of tree while folks figure out the best way to effect a dm/md merge and a timeframe thereof.

VFAT. Philippe De Muter followed up surrounding his “Simon and Garfunkel” issues (that mp3 files with two trailing dots before the extension were not being properly handled) on VFAT filesystems to say that a recent Windows box wasn’t handling this all too well either. He considers this a bug and isn’t sure that Linux should remain compatible with it, so withdraws his request.

Writeable overlays. Val Aurora posted the latest version of her union mounts and writeable overlays design document, complete with a bunch of patches that she has rebased to kernel 2.6.31, and accompanying tools patches to e2fsprogs and util-linux-ng. Apparently, there will be some review patches soon, though that isn’t an excuse not to start poking. The patches are up at http://valerieaurora.org/union/.

In today’s pull requests: some scheduler fixes from Ingo Molnar (freshly back from a trip, and now believing that it’s “good in all tests”), some networking updates from David Miller, some m68knommu updates from Greg Ungerer, some wireless updates from John Linville, and some btrfs updates from Chris Mason.

In today’s miscellaneous items: a question as to whether IA64 should use a global register for storing per-cpu pointers from Tony Luck, some netfilter patches from Joe Perches, ongoing discussion of alternatives support for cmpxchg64 (silent failure on a cmpxchg of unsupported size annoys Linus, who also provides a commentary on Windows NT’s cmpxchg implementation), some autofs4 patches from Ian Kent, version 5 of a fix for too big f_pos handling from Kamezawa Hiroyuki, a suggestion that there might be a buggy implementation in ftrace_profile_enable_event from Paul Mackerras, a series of patches intending to correct usage of __exit_p and __devexit_p from Uwe Kleine-König, version 20 of the swap over NFS patches originally worked on by Peter Zijlstra (who is short on time) and now being pursued by Suresh Jayaraman, a small update to the optimization flags for the AMD Geode from Matteo Croce, a question about connector and PROC_EVENTS behavior from Kevin Fox, some Kconfig comments cleanups from Michael Roth, and some wonderings from David Miller about the status of vmalloc_user and “perf” mmap patches needed for SPARC to make use of performance events utilities properly.

Finally today, Arjan van de Ven and Andrew Morton continued to discuss the state of the Linux floppy driver, in particular the fact that GCC complains that floppy.c’s ioctl has insufficient bounds checks. In response, Andrew stated: “gad. You said ‘floppy’ and ‘ioctl’ in the same sentence. Where angels fear to tread.” Separately, Andrew sent an additional error handling patch.

In today’s announcements: The Linux Foundation Technical Advisory Board. James Bottomley posted to let everyone know that there will be elections for the board of the Linux Foundation Technical Advisory Board (TAB) immediately following the forthcoming events in Japan (2009 Kernel Summit and Japan Linux Symposium). Anyone can stand for election by emailing as advised.

Clownix-spy. Vincent Perrier posted to announce a utility at clownix.net that can be used to plot any kernel variable changing over time through the use of a periodic kernel thread that wakes up to sample it and deliver the results to a userspace gtk-based plotting tool. The initial example is for plotting qdisc enqueues, dequeues, and drops.

URCU version 0.2. Mathieu Desnoyers posted version 0.2 of the userspace RCU library he has been working on. It contains some clarifications for three function usages.

The latest kernel release was 2.6.32-rc1|rc2 (both the same).

Greg Kroah-Hartman posted a series of review patches for the 2.6.27.36 stable kernel, and 136 review patches for the 2.6.31.2 stable kernel.

Rafael J. Wysocki posted a summary of regressions since the 2.6.31 kernel, based upon bug filings on the kernel.org bugzilla. As he notes, there aren’t too many new regressions since 2.6.31, but there are still “quite a number” since 2.6.30 and it’s been that way for quite some time.

Stephen Rothwell posted a linux-next tree for October 1st. Since Wednesday, the linux-next fixes tree still has a fix for powerpc/kvm, there is still a reverted SCSI commit, the sound tree gained a build failure (so the previous day’s version of that tree was used), the block tree lost its conflicts but gained a failure for which a commit was reverted, and the drm tree lost its conflict. The total subtree count remained steady at 139 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

October 20, 2009 10:54 AM

Pavel Machek: So I survived that peach

...but my zaurus is unwell.

pavel@toy:/usr/src/linux-rc$ emacs arch/arm/mach-pxa/spitz_pm.c
*** glibc detected *** emacs: corrupted double-linked list: 0x00482320 ***
Fatal error (6)Aborted (core dumped)
pavel@toy:/usr/src/linux-rc$ emacs arch/arm/mach-pxa/spitz_pm.c
Fatal error (11)Segmentation fault (core dumped)
pavel@toy:/usr/src/linux-rc$ emacs arch/arm/mach-pxa/spitz_pm.c


(This was with 2.6.31.2).

Emacs actually started on the next try. And yes, I get various weirdness from gcc, too. In fact, I learned to run overnight compilations as "time make; time make; time make", so that it finishes...

Are other zauruses broken, too? Is there some good memory test for arm -- besides gcc?

October 20, 2009 08:19 AM

Kernel Podcast: 2009/09/30 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090930.mp3

From the Embedded Linux Conference in Grenoble, France, for September 30th, 2009, I’m Jon Masters with a summary of the day’s LKML traffic.

In today’s issue: dm-ioband, robust lists, and page writeback.

dm-ioband. Vivek Goyal posted another round of benchmarks of the dm-ioband patches, again noting some problems with the implementation. In the latest tests, he created two ioband devices (ioband1 and ioband2) of weight 100 each on two disk partitions. On one (ioband1), he had a buffered writer do writeout and on the other he had one priority 0 reader and an increasing number of priority 4 readers, to see how bandwidth distribution worked. With vanilla CFQ the results were roughly as expected, but the results for the dm-ioband patches had violently wild swings in bandwidth and were quite clearly not correctly preserving any kind of fairness whatsoever. Ryo Tsuruta promised to look into it some more.
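
As a rough illustration of what “fairness” means in a test like this, the share each group would ideally receive is simply its weight divided by the sum of all weights. The sketch below is not dm-ioband’s algorithm; the function name expected_shares is made up for illustration, and only the two device names and their weights come from the description above.

```python
def expected_shares(weights):
    """Return the ideal fraction of bandwidth for each group, given its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one group needs a positive weight")
    # Each group's ideal share is proportional to its weight.
    return {name: weight / total for name, weight in weights.items()}


# The setup described above: two ioband devices of weight 100 each.
shares = expected_shares({"ioband1": 100, "ioband2": 100})
```

With equal weights each device should see about half of the disk bandwidth, which is the baseline against which the reported swings would be judged.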

Robust Lists. The Linux kernel uses a “robust list” pointer in the task_struct task representation structure in order to keep track of userspace futex locks – providing a flexible (and also extensible) way to keep track of locks that userspace might want to play with, and also an atomic means for it to do so through system calls that internally result in list ops. The kernel needs to specially handle the case of a new address space through execve, and Anirban Sinha was concerned that the code within exit_robust_list was inefficient.

Writeback. Fengguang Wu noted that WRITE_SYNC_PLUG and prioritization of bio writeback had been implemented about 5 months ago due to complaints from Linus (so it’s no longer true that all requests get ultimately treated equally no matter their sync/async status). Fengguang also posted a patch increasing the MAX_WRITEBACK_PAGES size and adjusting the writeback call stack to support larger writeback chunks. The reason for a limit is to prevent holding I_SYNC against an inode for “enormous amounts of time”.

In today’s pull requests: some ext4 patches for 2.6.32 from Ted Ts’o, some nilfs2 fixes from Ryusuke Konishi, and some block updates for 2.6.32-rc from Jens Axboe (mostly driver fixes, and especially to cciss) which included DRBD. Christoph Hellwig considered including DRBD to be ill advised at this stage.

In today’s miscellaneous items: a patch adjusting percpu initialization on IA64 such that the head.S provided __cpu0_per_cpu special CPU percpu area is copied over to a generic location in the linear mapping during memory initialization from Tejun Heo, a request from Amerigo Wang that Barry Song add a signed-off-by to his Y2K38 time patch, a connector bugfix from Christian Borntraeger, ongoing discussion of Intel’s TXT (Trusted eXecution Technology) and in particular Pavel Machek’s views on removal/modification of RAM chips at runtime to usurp any protections, a note from Frederic Weisbecker that patches against 2.6.29 (in this case against the at-the-time experimental “perf” patches for ARM) are useless at this point as too much has changed and patches need to be against 2.6.32, a note from Jens Axboe that find_busiest_group uses a lot of CPU (multiple SSD testcases), a note from Florian Weimer that the new O_NODE open flag implementation does allow one to bypass permission checks on open files within directories whose permissions change while the file descriptor is open (and so “the whole thing is a bit worrisome because it may turn file descriptor information leaks into something worse”), a virtio_ids patch from Christian Borntraeger that makes Rusty’s previous cleanups (moving all device IDs into a single file) once again compatible with userspace users of the header files by moving some includes around, a note from Berthold Gunreben (who had previously posted about ATA bus errors on resume that Tejun Heo thought were due to the PSU briefly dropping power to the disk – for which Tejun provided some detailed advice on burning aforementioned PSU) that he had moved to another filesystem (JFS) and could no longer reliably reproduce what might still exist as an underlying error, an RFC patch from Kamezawa Hiroyuki adding percpu array counter support (as used e.g. in vmstat) and new array_counter_add and array_counter_read functions, a patch from Arjan van de Ven taking advantage of GCC’s ability to determine at compile time whether certain copy_from_user buffers are correctly sized to produce a compile-time (and not runtime) warning in the case that they are not, version 2 of some CFS hard limit patches from Bharata B Rao, a note that bluetooth “is very ill in -next” from Alan Cox, an update to Linus Torvalds’ alternatives based cmpxchg64 with some fixes based on some actual testing from Arjan van de Ven, a patch ensuring we always return from cpu_idle with interrupts enabled from Kevin Hilman, and a note from Russell King that Linus imposes a “one pull request per week” limit on arch maintainers like himself and so this can explain why the ARM tree has been broken recently.

In today’s security items: An x86_64 patch from Jan Beulich removing a register leak situation in which a 32-bit process could temporarily switch itself into 64-bit mode in order to get access to additional 64-bit register entries that are not normally cleared on return to 32-bit userspace.

Finally today, Pavel Machek complained about Daniel Walker’s ongoing round of checkpatch warning emails, suggesting that they were “unwelcome” in the case that patches already had many other known problems to resolve.

The latest kernel release was 2.6.32-rc1/rc2 (both the same).

Stephen Rothwell posted a linux-next tree for September 30th. Since Tuesday, his “fixes” tree contains a build fix for powerpc/kvm, the usb.current tree lost all of its conflicts, the scsi tree commit that was causing boot failures was still reverted, the drm tree gained a conflict against Linus’ tree and the usb tree lost all of its conflicts. The total subtree count remained steady at 139 trees in the latest linux-next compose.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

October 20, 2009 07:32 AM

Matt Domsch: TPMs are good for something

TPMs (Trusted Platform Modules) have long been avoided on Linux, given that their primary use cases have historically been around licensing and Digital Rights Management, concepts which are mostly foreign to Free and Open Source software.  However, as new use cases, such as “trusted boot” have emerged, developers have added TPM device drivers to the Linux kernel to enable these uses.  One often-overlooked feature of the TPM is that it has a hardware pseudo-random number generator.

A while back, Jeff Garzik and others were discussing this on the linux-kernel mailing list (summarized on LWN.net), where it was suggested that the TPM could be used to feed the rngd (random number gathering daemon) tool, just as it reads from other hardware random number generators.  The rngd program reads from hardware-based random number generators and feeds entropy into the kernel’s entropy pool.  Easy in concept, but lacking in TPM implementation.

As it happens, quite a few Dell systems include a TPM chip, including the PowerEdge 11G servers such as the R610 and R710.  So, I asked Dell’s crack team of Linux developers to see what they could do.  The result: a patch to rngd which adds the TPM as another source of random numbers for feeding the kernel’s entropy pool.

We’re working with Jeff to get this patch applied to the rng-tools upstream sources, and from there into the various distributions as their schedules permit.

So, should you find yourself running out of entropy on your servers, and not having a keyboard or mouse attached as ways to feed the entropy pool, you can enable the TPM in BIOS SETUP, run rngd, and never lack for randomness again.
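
For the curious, the “feed the entropy pool” step can be sketched in a few lines of Python. This is only an illustration of the general mechanism (the RNDADDENTROPY ioctl on /dev/random), not the rngd patch itself. The function names and the source_path argument are made up for the example, and the ioctl number below is the value for x86 and ARM; a few other architectures encode ioctl numbers differently.

```python
import fcntl
import struct

# _IOW('R', 0x03, int[2]) from <linux/random.h>, as encoded on x86 and ARM.
RNDADDENTROPY = 0x40085203


def build_rand_pool_info(data, entropy_bits):
    """Pack bytes as struct rand_pool_info: int entropy_count, int buf_size, buf."""
    if entropy_bits < 0 or entropy_bits > 8 * len(data):
        raise ValueError("entropy_bits must be between 0 and 8 * len(data)")
    return struct.pack("ii", entropy_bits, len(data)) + data


def feed_entropy(source_path, nbytes=64, bits_per_byte=8):
    """Read nbytes from a hardware RNG device and credit them to the kernel pool.

    Needs root. source_path is whatever device exposes the hardware RNG
    (an assumption for this sketch, for example /dev/hwrng).
    """
    with open(source_path, "rb", buffering=0) as src:
        data = src.read(nbytes)
    payload = build_rand_pool_info(data, bits_per_byte * len(data))
    with open("/dev/random", "wb", buffering=0) as pool:
        fcntl.ioctl(pool, RNDADDENTROPY, payload)
```

How many bits of entropy to credit per byte is a policy decision; crediting fewer bits than were read is the conservative choice when the quality of the source is uncertain.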

October 20, 2009 04:45 AM

Harald Welte: Letter to the European Commission opposing Oracle's acquisition of MySQL

As can be found here, Knowledge Ecology International, the Open Rights Group and Richard Stallman have issued a joint letter to the European Commission asking it to disapprove the acquisition of MySQL by Oracle.

I very much welcome this move. There clearly is a conflict of interest between Oracle's own proprietary database software offerings and MySQL. Sure, the community could always fork MySQL, but at what cost? Potential disputes about the trademark, being forced to rename itself, and confusion among the millions of users world wide (well, might just be hundreds of thousands).

October 20, 2009 02:00 AM

Stephen Hemminger: Japan Linux Symposium

I am giving three talks: 1) routing performance, 2) staging drivers, 3) Vyatta CLI.
So if you are attending JLS please stop by and give me support.

October 20, 2009 02:27 AM

October 19, 2009

Evgeniy Polyakov: POHMELFS transactions

It turned out that my previous idea of using socket buffers and VFS pages was very wrong, mainly because of the nature of POHMELFS transactions. A transaction must stay in memory until the remote server acknowledges its data.
But what will happen when a second write is about to update the same area? We cannot overwrite the data, since then we would lose the previous transaction and there would be no way to resend it or store it elsewhere on timeout or other error. Instead we should allocate a new buffer and copy the data there. But this is not that simple, since we have to update the VFS page cache, and thus evict the previous page first. Also, all pages have to be somehow linked, so that when a transaction is committed, the appropriate pages can be freed.

Other filesystems, namely btrfs, wait until writeback is over on the page about to be overwritten, which may or may not be a good idea for the overwrite workload, and I expect it actually to be a bad idea, especially for high-latency storage, but it is noticeably simpler to implement. Buffer heads used to track partial page updates are quite heavy and not really needed for my case, so I will implement trivial tags attached to pages, and when an overwrite is going to happen, the system will wait for the pages in question to be flushed to the remote server, and then overwrite them in place, creating a new transaction.

The above tags are needed for the usual writeback - we will not really write data at writeback time, instead we will find the transactions which refer to the given page and resend them. In the perfect case, which I expect to happen most of the time, there should be no such stalled transactions at all, since they will be acked soon after write time when we send the data to the server, but it is still possible that there are no quick acks, so writeback can fire for the inode.
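
The scheme above can be modelled in a few lines of Python. This is only a toy model of the idea, not POHMELFS code: all names (Transaction, TransactionTracker, PageBusy) are made up for illustration, and where the real filesystem would sleep until the page is flushed, the model simply raises an exception.

```python
from dataclasses import dataclass, field


class PageBusy(Exception):
    """Raised where a real filesystem would wait for the page to be flushed."""


@dataclass
class Transaction:
    tid: int
    page: int
    data: bytes


@dataclass
class TransactionTracker:
    """Keeps each transaction in memory until the server acknowledges it."""
    next_tid: int = 0
    pending: dict = field(default_factory=dict)    # tid -> Transaction
    page_tags: dict = field(default_factory=dict)  # page index -> tid of unacked transaction

    def can_overwrite(self, page):
        # A page is safe to overwrite only if no unacked transaction refers to it.
        return page not in self.page_tags

    def write(self, page, data):
        if not self.can_overwrite(page):
            raise PageBusy("page %d still has an unacked transaction" % page)
        trans = Transaction(self.next_tid, page, data)
        self.next_tid += 1
        self.pending[trans.tid] = trans
        self.page_tags[page] = trans.tid  # tag the page with its transaction
        return trans

    def ack(self, tid):
        # The server has the data: drop the transaction and untag the page.
        trans = self.pending.pop(tid)
        del self.page_tags[trans.page]

    def writeback(self, page):
        # Writeback does not write the page; it finds the transaction to resend.
        tid = self.page_tags.get(page)
        return None if tid is None else self.pending[tid]
```

In this model a second write to a tagged page fails until ack() is called for the first transaction, and writeback() returns the transaction that would be resent, or None once it has been acknowledged.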

That's the plan, now back to drawing board to actually find out how pages should be attached to transactions... Stay tuned!

October 19, 2009 07:08 PM

David Woodhouse: 19 Oct 2009

Eww, this country is uncivilised. Just got back to my hotel room and my clothing reeks of smoke. I'd almost forgotten how horrid that was.

October 19, 2009 04:38 PM

October 18, 2009

Pavel Machek: Peach

I ate a peach. A single peach. In about 10 minutes, my lips shrank, and I started to feel pretty bad. In about 10 more minutes, they grew. Swelling? I wonder what happens next.

October 18, 2009 10:44 PM

October 17, 2009

Evgeniy Polyakov: Hacking jabber chats

Actually I wanted to allow gajim to work with Yandex Online. By default it does not connect, and neither does empathy from the latest Ubuntu. The latter shows a network error window and that's all.

So, what is the geek-way to fix it? Of course not to bother support or whatever else (although some internet trolling brings some lulz too). There is a special client (open source of course) which works with the service, so let's compare the network protocols used by the working and failing clients.

After some debug and tcpdumps I got this:

gajim->server: <?xml version='1.0'?><stream:stream xmlns="jabber:client" to="xmpp.yandex.ru" version="1.0" xmlns:stream="http://etherx.jabber.org/streams" >

server->gajim: <?xml version='1.0'?><stream:stream xmlns='jabber:client' xmlns:stream='http://etherx.jabber.org/streams' id='3512116648' from='ya.ru' xml:lang='en'><stream:error><host-unknown xmlns='urn:ietf:params:xml:ns:xmpp-streams'/></stream:error></stream:stream>

And the working client:

yachat->server: <?xml version="1.0"?><stream:stream xmlns:stream="http://etherx.jabber.org/streams" version="1.0" xmlns="jabber:client" to="ya.ru" xml:lang="ru" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:yandex="ns:yandex:let:me:in" >

server->yachat: <?xml version='1.0'?><stream:stream xmlns='jabber:client' xmlns:stream='http://etherx.jabber.org/streams' id='4193006145' from='ya.ru' version='1.0' xml:lang='en'> <stream:features><starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'/><compression xmlns='http://jabber.org/features/compress'><method>zlib</method></compression><mechanisms xmlns='urn:ietf:params:xml:ns:xmpp-sasl'><mechanism>PLAIN</mechanism></mechanisms></stream:features>

As we can see, server replied and it wants to sex, drugs and rock-n-roll with cookies.
Initial string sent by the client differs by two additional tags in the working client, so let's brutally hack gajim to have them also (my first python hack, I know it still has its buggy shiny 2d grammatics):

--- /usr/share/gajim/src/common/xmpp/dispatcher_nb.py	2009-10-17 19:32:20.000000000 +0400
+++ /usr/share/gajim/src/common/xmpp/dispatcher_nb.py	2009-10-17 19:46:27.000000000 +0400
@@ -110,6 +110,8 @@
 		self._metastream.setNamespace(self._owner.Namespace)
 		self._metastream.setAttr('version', '1.0')
 		self._metastream.setAttr('xmlns:stream', NS_STREAMS)
+		self._metastream.setAttr('xmlns:xml', 'http://www.w3.org/XML/1998/namespace')
+		self._metastream.setAttr('xmlns:yandex', 'ns:yandex:let:me:in')
 		self._metastream.setAttr('to', self._owner.Server)
 		self._owner.send("<?xml version='1.0'?>%s>" % str(self._metastream)[:-2])

Tcpdump, connect and:

gajim->server: <?xml version='1.0'?><stream:stream xmlns="jabber:client" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:yandex="ns:yandex:let:me:in" to="xmpp.yandex.ru" version="1.0" xmlns:stream="http://etherx.jabber.org/streams" >

server->gajim: <?xml version='1.0'?><stream:stream xmlns='jabber:client' xmlns:stream='http://etherx.jabber.org/streams' id='3512116648' from='ya.ru' xml:lang='en'><stream:error><host-unknown xmlns='urn:ietf:params:xml:ns:xmpp-streams'/></stream:error></stream:stream>

Fuck my brain, but nothing changed. Ok, let's compare the tags symbol-by-symbol. Actually it is enough just to compare the "to" attribute:

gajim: to="xmpp.yandex.ru"
yachat: to="ya.ru"

No need to look into the sources to determine that it is taken not from the server string but from the jabber ID, i.e. the string after the @ symbol. Set the ID to login@ya.ru instead of the default yandex.ru or xmpp.yandex.ru, and things magically start to work.
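
To make the point concrete, here is a small sketch of deriving the "to" attribute from the jabber ID. It is not gajim's code; the function names are made up, and only the namespaces and attribute names are taken from the dumps above.

```python
from xml.sax.saxutils import quoteattr

NS_CLIENT = "jabber:client"
NS_STREAMS = "http://etherx.jabber.org/streams"


def domain_from_jid(jid):
    """Return the domain part of a JID of the form user@domain[/resource]."""
    bare = jid.split("/", 1)[0]      # drop an optional /resource suffix
    if "@" not in bare:
        return bare                  # a bare domain such as "ya.ru"
    return bare.split("@", 1)[1]     # everything after the @ symbol


def stream_header(jid):
    """Build the opening stream tag, addressed to the domain of the JID."""
    to = domain_from_jid(jid)
    return ("<?xml version='1.0'?>"
            "<stream:stream xmlns=%s to=%s version=\"1.0\" xmlns:stream=%s >"
            % (quoteattr(NS_CLIENT), quoteattr(to), quoteattr(NS_STREAMS)))
```

With login@ya.ru the header is addressed to ya.ru, which is what the working client sent; with login@xmpp.yandex.ru it is addressed to xmpp.yandex.ru, which the server rejected with host-unknown.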

A good shake after the music hours.

October 17, 2009 04:19 PM

October 16, 2009

Jaya Kumar: Deepavali

நண்பர்களுக்கு இனிய தீபாவளி வாழ்த்துக்கள். Happy Deepavali to everyone, enjoy in moderation. FOSS.IN CFP is also out, I highly recommend participating.

October 16, 2009 06:47 PM