Kernel Planet

November 20, 2009

Roland Dreier: First they laugh at you…

I found this article in “Network Computing” pretty interesting, although not exactly for the content. Just the framing of the whole article, with Microsoft touting the fact that they’ve managed to achieve performance parity with Linux on some HPC benchmarks as an achievement (and putting up a graph that shows they are still at least a few percent behind), shows how dominant Linux is in HPC. Also, the article says:

The beta also reportedly includes optimizations for new processors and can deploy and manage up to 1,000 nodes.

So in other words Microsoft is stuck at the low end of the HPC market, only usable on small clusters.

November 20, 2009 08:44 PM

Matthew Garrett: Why SHMConfig is off by default

Bastien mentioned the Chromium OS xorg.conf file, which includes an irritating wart - namely, Option "SHMConfig" "on". This tells the Synaptics touchpad driver to export its configuration data to a shared memory region which is accessible to any user on the system. The reason for this is that in the past, there was no good way for configuration information to be passed to input drivers through the X server at runtime. This got fixed with the advent of X input properties, and synaptics can now be configured sensibly over the X protocol.

But why was it off by default? Because, as I said, the configuration data is exported to a shared memory region which is accessible to any user on the system. And while it contains a bunch of information that's not terribly interesting (an attacker being able to disable my touchpad or turn on two finger emulation may be a DoS of sorts, but...), it also contains some values that are used to scale the input coordinates. Which means that anyone with access to the SHM region can effectively take control of your mouse. The current position is exported too, so they can also track all of your mouse input.

Now, this isn't stunningly bad. The attacker can only do this while you're touching the pad. You'll see everything that happens as a result. There's no way to fake keyboard input. They need to be running code as another user on the system - if they're running as the logged in user then they can already do all of this. And for a device as single-user as Google seem to be looking at, it's obviously not a concern at all.

But there's still plenty of places on the web suggesting that you enable SHMConfig, and various distributions that ship with it turned on (Ubuntu on the Dell mini used to, but got turned off after I contacted them about it). It's absolutely fine to do this as long as you're aware of the security implications of it, but otherwise please use X input properties instead.

November 20, 2009 04:56 PM

November 19, 2009

Matthew Garrett: Sigh.

If only eeepc-laptop sent standard keycodes, or something.

Oh, wait.

Writing a Linux distribution is hard. There's a huge range of interconnected dependencies. It takes a long time to learn how everything fits together, and fixing things properly rather than adding device-specific hacks often requires rewriting a lot of code. I'm sure Google will figure it out in time[1], and I'm also sure that the majority of their work is going into their UI rather than the underlying infrastructure. But even so, don't expect that you'll be able to install Chromium OS on a random piece of hardware and have it work as well as, say, Fedora in the near future.

[1] Based on that script, I'd say they're about equal to Xandros at the moment

November 19, 2009 08:32 PM

November 18, 2009

Evgeniy Polyakov: Elliptics network "make things easy" release: 2.6.2

Name says it all: it is now dumb simple to create distributed hash table redundant storage over multiple nodes with HTTP data access.

Data is uploaded using POST method through special FastCGI application, which is linked with elliptics network library and writes data into the storage according to its config file (one can specify data redundancy there for example).

Data retrieval works differently: the FastCGI application described above only looks up the requested object in the network and returns a direct URL to the appropriate storage server. That server has to be configured according to some conventions: data must be placed in subdirs indexed by a parameter equal to the appropriate elliptics network port minus the FCGI_DNET_BASE_PORT config option. For example, if there are two elliptics nodes (for two disks, say) running on ports 1025 and 1026, the FCGI_DNET_BASE_PORT config parameter set to "1024", and a single web server, its document root should contain subdirs 1 and 2 (1025-1024 and 1026-1024).
There is a bunch of other useful config parameters, although there is no authentication or any kind of permission checks yet.
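
The port-to-subdirectory rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the helper name subdir_for_port is hypothetical and not part of elliptics; only the FCGI_DNET_BASE_PORT option name and the port numbers come from the text.

```python
# Illustrative only: subdir_for_port is a hypothetical helper, not elliptics code.
FCGI_DNET_BASE_PORT = 1024  # value used in the example in the text


def subdir_for_port(node_port, base_port=FCGI_DNET_BASE_PORT):
    """Return the document-root subdirectory name for an elliptics node port."""
    return str(node_port - base_port)


if __name__ == "__main__":
    # Two nodes, one per disk, as in the example in the text.
    for port in (1025, 1026):
        print(port, "->", subdir_for_port(port))
```

With the two ports from the example, the helper yields subdirs "1" and "2".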

Anyway, here is a changelog:

I deployed a small storage with a total of 1 TB spread over 4 physical machines with 2 elliptics nodes on each (one per storage disk) and uploaded 60 GB of data there (about 3-4 thousand files, each with an additional copy).

Upload is rather trivial:

wget --post-file=$file http://base_fast_cgi_host.net/name.mp3?name=$some_file_name

You might expect that file downloading will be as simple as

wget http://base_fast_cgi_host.net/name.mp3?name=$some_file_name

and you will be absolutely right: it will redirect you to some server inside the storage cloud via a direct link to the requested object.

Also added files to make debian packages. This even works.

Full changelog is available in git tree.

That's it, enjoy!

November 18, 2009 05:36 PM

November 17, 2009

Evgeniy Polyakov: HTTP fastcgi daemon has been imported into elliptics network tree

and uploaded into git tree.

Now it's time to set up a small 8-node elliptics network cluster (one node per fast SCSI disk) on 4 physical machines (total of about 1 TB) and run some tests, namely massive data upload and download. There will be two external nodes serving as upload proxies (I plan to write at least one additional object copy for redundancy) and data fetching URL generators (objects themselves will be downloaded via direct HTTP links from the storage nodes).

Upload as well as download works with wget pretty smoothly, but there was no load yet.
I will go climbing quite soon but would like to start some massive uploads before that to get some results later today.

If things go smoothly as well, this will be the next elliptics network release. The next step will be to add some authentication bits; currently neither application checks permissions, simply because there are no restrictions at all. One can configure the web server to do that instead, though...

Stay tuned!

UPDATE: imported my music collection (3500 files), total of just about 30 Gb, not that much actually, will try to find out what else one can find here.

November 17, 2009 02:17 PM

Evgeniy Polyakov: Full cycle elliptics network access over HTTP completed

Elliptics network - a distributed hash table storage with zillions of tasty things got another gem.

It is possible to upload and download content over HTTP (GET and POST methods are supported) via direct links (download only, upload uses FastCGI proxy), system scales horizontally, allows to implement redundant object storage with multiple copies, automatic data relocation when nodes fail and so forth.
POST processor was originally written in Lisp, but I decided to switch over to plain C because of the simplicity the elliptics network library provides with its API. Given that I stopped using Google in the office (sigh, if you only knew how ugly Bing is, but there is no alternative I'm afraid), it now takes really long to find something useful on the internet about complex tech tasks, so continuing to work with Lisp in that environment did not look like a good idea. I will use it for AI tasks though.

So, modulo unknown bugs, it should completely solve your download HTTP server scalability issues. Now it's time to clean up the debug prints and the code a little bit, extend configs to add some latest words, and then add an authentication protocol, namely some cookie generation to forbid unauthorized access. So far I have not committed the latest changes; they will be there tomorrow.

It happened that using elliptics network is really trivial, even when one has forgotten quite a lot about its internal state. Code extension, albeit quite visible, was not that invasive actually, and I was able to catch up with it very quickly.

I added another flag into the protocol, which allows uploading data without transaction machinery on the disk (transactions are still present in memory on the client, so they will be resent if not acked and so on). That is, it is now possible to put data into the storage without history (this was possible before) and to eliminate on-disk format changes for the data, so that placed files differ only in name from what was originally posted, and names may depend not on content but on a provided ID (like a hash of the name, or a precise ID the user may provide). Previously there was a history for the object, which in turn contained a transaction ID generated from the uploaded content, so two steps were needed to fetch the data: get the history and then get the appropriate transactions from it.

So, while you are thinking about how to solve scalability problems of the new project, consider checking elliptics homepage to get in touch with its features and capabilities. There is also benchmark section there, which I plan to extend quite soon with the new data.

Also there is pending POSIX filesystem on top of this storage, which should show up not that far away either.

Stay tuned, there will be some news quite soon!

November 17, 2009 01:26 AM

November 16, 2009

Pete Zaitcev: Raising in rage against Eucalyptus in Fedora

At least at the present juncture, it's a "no", although of course I cannot stop confused people from packaging it. Things that concern me about Eucalyptus:

* Lack of any kind of willingness to work with outsiders, unless they are of course Ubuntu. It's even documented. You want to eat crumbs off someone else's table? I sure don't.

* Crazy Xensource-like premature commercialization with Eucalyptus.com. I expect trouble getting understanding (in addition or in explanation of the above).

* We already have all sorts of management stuff with oVirt, sVirt, libvirt, etc. Eucalyptus is a direct competitor for all that, not a complement to it!

* Maybe their compute cloud is the second coming of Christ, but on the storage cloud side, their S3 is worse than what I have today (except in bugginess, perhaps). And I'm sure Jeff Garzik is happy to accept patches, which is the key, goddamit.

All these points are subject to reconsideration in the future, but for now it's pretty obvious to me. I am surprised Greg doesn't think so.

UPDATE: Greg comments further. His is a different view, he wants a coherent cloud story now. But I remember too well how much of steamroller Xen seemed a few short years ago, and where did it go? So, I tend to think we need good code more than coherent story. Perhaps I'm unduly idealistic. And yes, our patchwork of projects and packages is frustrating.

November 16, 2009 08:26 PM

Harald Welte: David Burgess (OpenBTS) visiting me for a couple of days in Berlin

On Friday, David Burgess of the OpenBTS project has come to visit me in Berlin. We're working on the final preparation of the two-day Deepsec 2009 GSM Security Workshop which will happen in Vienna next week.

David has more than 10 years experience in implementing GSM Layer 1 as well as the higher-layer protocols, so it's always great to talk with him and tap into his experience. Unfortunately the preparations for the workshop kept us too busy to work on some actual code.

The more than 200 slides for the workshop will be published after the workshop is over.

November 16, 2009 01:00 AM

November 15, 2009

Pete Zaitcev: Is there A20 Gate in Litl?

Havoc's post explains his vision for the way a consumer would interact with Litl in the same condescending way Steve Jobs wants us to enjoy iPhone. This should certainly sound familiar to anyone who heard of Mac:

The software is finely-tuned to the hardware, and the flippable hardware inspires one of litl OS's core features [...]

But what is in it for me? The main issue is whether Litl is hackable. Is there hardware documentation? What about software and firmware, is it GPL?

100% legacy-free. No caps lock. HDMI, not VGA. etc.

Is there no BIOS? No A20?

Note: Comments about A20 will be deleted. This post is about Havoc's indifference to hackerdom, not A20. Go to mjg59's post to discuss A20.

UPDATE: Havoc commented with:

There is no TPM chip or "tivoization" since there's no subscription contract. If you want to replace with Linux it's fine with us.

There's nothing too unusual about the hardware from a driver perspective. The custom wheel and button are hooked up as keyboard keys. A netbook distribution of your choice probably runs, or runs with trivial fixes, though I haven't tried on the final production hardware.

If you have questions then ask one of the litl developers.

November 15, 2009 06:38 PM

November 14, 2009

Pete Zaitcev: Hail tabled with data replication

It's almost there, I can sense it. I posted the patch with the scan daemon and 1-st party copy implementation today. There's still a lot of work remaining, in particular I need Chunk to perform self-tests and 3-rd party transfers.

The biggest issue, however, is tabled itself. Or, actually, its database. In order to support nodes coming down efficiently, it needs to know what keys were at a given node, and the information is not indexed by node. So, the alternative is either changing the whole database scheme, or severely limiting the supported scale so that the limitations of the whole-database scan do not come forward too forcefully. Given that there's no time, I opted for proclaiming no more than 10 Chunk nodes and 1 million keys. Such small limits make tabled completely useless and a toy (until the new database), but then if anyone thinks otherwise he's deluding himself anyway.
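
To make the trade-off concrete, here is a toy Python sketch contrasting a whole-database scan with a per-node index. It is not tabled's actual schema: the dictionary layout and both function names are assumptions made purely for illustration.

```python
from collections import defaultdict

# Toy model (an assumption, not tabled's schema): each key maps to the set of
# Chunk nodes that hold a copy of its data.


def keys_on_node_by_scan(key_to_nodes, node):
    """Whole-database scan: visits every key, so cost grows with total keys."""
    return sorted(key for key, nodes in key_to_nodes.items() if node in nodes)


def build_node_index(key_to_nodes):
    """Secondary index by node: the kind of structure a new schema could keep."""
    index = defaultdict(set)
    for key, nodes in key_to_nodes.items():
        for node in nodes:
            index[node].add(key)
    return index


if __name__ == "__main__":
    db = {"a": {1, 2}, "b": {2, 3}, "c": {1, 3}}
    print(keys_on_node_by_scan(db, 2))       # scan on every node failure
    print(sorted(build_node_index(db)[2]))   # direct lookup once indexed
```

The scan is acceptable at 10 nodes and a million keys; the index is what removes the limit, at the price of maintaining it on every write.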

November 14, 2009 08:33 AM

Arjan van de Ven: Some PowerTOP updates

In the last few days I’ve been working with Auke Kok on adding some new features to the PowerTOP tool.

There are three new power saving checks in the current git version of PowerTOP. Each feature needs a small kernel patch to give PowerTOP the information it needs; the three patches are currently making their way through the various maintainer trees and code reviews, hopefully for inclusion in the 2.6.33 kernel.

Test 1: Who is spinning up the disk
Not everyone has an SSD in their system yet (I feel sorry for those who don’t). The mechanical rotation of a disk seems to cost in the order of 0.5 Watts to 1 Watt of power (depending on the disk).
For this reason, pretty much all laptop disks will stop spinning after several seconds of inactivity.

Now, to enjoy this power savings, it is essential that the disk gets to actually spin down, that is, that there are long periods of inactivity. In a typical Linux distribution, there are various programs that, unfortunately, write things to the disk all the time. Until now, the only way to find out which programs were doing this was to use the blockdump kernel feature.

blockdump is not very user friendly, and while powerful, has several caveats.

Using the perf kernel infrastructure, the git version of PowerTOP now includes the equivalent of the blockdump feature, and will report disk-waking applications both in the regular interactive view and in the diagnostic “dump” mode.

In the interactive mode, disk activity is shown with a “D” in the line like this:

Top causes for wakeups:
5.9% (441.0) <interrupt> : ahci
0.0% ( 0.0)D gcc

and in addition, a suggestion will show up in the bottom pane like this:


The program 'cc1' is writing to file '.oom_kill.o.d' on /dev/sda3.
This prevents the disk from going to powersave mode.

While in the “dump” mode a full list of file accesses is shown:


Disk accesses:
The application 'gcc' is writing to file 'ef7f9d8f6d628ba196edb8882ed560-' on /dev/sda3
The application 'gcc' is writing to file '?' on /dev/sda3
The application 'gcc' is writing to file 'sysctl.o.BvmMr0' on /dev/sda3

We used this feature to make sure that the Moblin OS keeps the disk in low power mode as long as possible.

Test 2: Is the audio power saving working

PowerTOP has been suggesting to enable the power saving feature of the HD Audio chipset. In power save mode, the so called “codec” (the chip that turns the digital signal into the analog voltages that speakers can then turn into sound) will be turned off during periods when no applications use the sound subsystem. This can easily save 0.25 Watts to 0.5 Watts.

However, as usual, this savings is only achieved when no application is actually using the sound subsystem. The git version of PowerTOP will now show statistics
about how much of the time the audio codec is in power save mode.

In dump mode, the output looks like this:

Recent audio activity statistics
Active Device name
0.0% hwC0D2

while equivalent information is shown as a suggestion in interactive mode if there is activity on the audio chip.

Test 3: Is SATA Link power management working

PowerTOP has checked the ALPM AHCI Link power saving feature for quite some time now. This feature is good for between 0.5W and 1W of power per Serial ATA link when the link is quiet. Now, of course this only helps if the link is actually quiet….

The git version of PowerTOP will now display actual statistics of the various Serial ATA links in the system to measure the effectiveness of this feature,
as well as pointing out potential tuning spots.

In dump mode, on a busy system this will look like this:

Recent SATA AHCI link activity statistics
Active Partial Slumber Device name
86.5% 13.5% 0.0% INTEL SSDSA2MH08

While on a more idle system it looks like this:

Recent SATA AHCI link activity statistics
Active Partial Slumber Device name
0.8% 99.2% 0.0% INTEL SSDSA2MH08

In interactive mode, the same information will be displayed in the suggestion portion of the screen.

November 14, 2009 06:23 AM

Harald Welte: India setting up service stations to program IMEI into phones

This is not really current news, as it was released much earlier this year. However, I'm not following Indian news that closely so it has slipped my attention:

India's COAI is setting up hundreds of service centers where end users can have an IMEI programmed into their phone. This apparently relates to the fact that there are plenty of phones of Chinese origin with an all-zero IMEI in India.

Since there is a government law that requires every phone to have a unique IMEI number, operators have been ordered to refuse phones with an all-zero IMEI onto their network.

I personally find all of this very funny:

So from a real IT security point of view, this entire exercise is nothing but an annoyance to keep people busy and create employment for the staff operating those IMEI programmers.

To those involved: Work smarter, not harder ;)

November 14, 2009 01:00 AM

November 13, 2009

Evgeniy Polyakov: Completed static content elliptics network implementation

I've completed installation of the small distributed hash table storage with static content delivered via direct URLs. This whole setup slightly differs from more common and expected one in one detail: how data is fetched and accessed by the client.

In the common case it is supposed that elliptics network powered applications will fetch data from the network according to transaction history of the object (optionally in parallel). This requires client code to be linked with elliptics network library and modified according to its API.

But there is another way, which is much simpler although a bit limited - split data lookup and reading itself, and implement the former in the special small application, while rely on other facilities (like HTTP servers) to get the data.

This is what was made. I wrote a simple FastCGI application which starts data lookups and forms URLs which are returned to the client application, which in turn fetches data from storage HTTP servers. There is a one-to-one relation between any potentially failing object within the storage cluster (one can install one elliptics node per disk or per server, or even per datacenter) and elliptics network nodes. FastCGI daemons (which can live on a separate set of machines if needed) are persistent clients of that network, and the only task they do is looking up the elliptics network node's IP address, which is then extended to a static URL to actually get the data.
This URL is returned from the fastcgi daemon as a redirect, but this is configurable.

I extended the lookup message to optionally stat local storage on the node to actually test whether the object is present on the given node. Using multiple IDs for the same data object allows storing multiple copies redundantly, so that the client can switch to another copy if the object cannot be found using the previous ID. Elliptics network storage servers will take care of data relocation when servers go offline and online.
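
A minimal Python sketch of that fallback, under stated assumptions: deriving the IDs by hashing the name with several different functions is my illustration of "multiple IDs", and the lookup callable stands in for the real elliptics lookup-with-stat request. Neither candidate_ids nor find_copy is an elliptics API.

```python
import hashlib

# Assumption for illustration: each redundant copy is addressed by hashing the
# object name with a different function. The text only says that multiple IDs
# are used for the same object.
TRANSFORMS = ("sha1", "sha256")


def candidate_ids(name, transforms=TRANSFORMS):
    """Return one ID per transformation function for the given object name."""
    data = name.encode("utf-8")
    return [hashlib.new(transform, data).hexdigest() for transform in transforms]


def find_copy(name, lookup, transforms=TRANSFORMS):
    """Try each ID in turn; return (id, node) for the first copy that exists.

    `lookup` is a hypothetical callable that returns the node holding the
    given ID, or None when the object is not present there.
    """
    for object_id in candidate_ids(name, transforms):
        node = lookup(object_id)
        if node is not None:
            return object_id, node
    return None
```

For example, passing the get method of a dictionary that contains only the second ID makes find_copy skip the missing first copy and return the second one.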

The only problem for this setup is how data is treated by the client and storage. Client expects dataflow from the single node starting from the beginning to the end, while elliptics storage uses transactions with its own protocol and on-disk storage format, which is processed by the library when appropriate IO API is called from the client code.

The problem can be solved if we upload data not via elliptics network, but directly into the storage, while using the same name conversion that would be done by elliptics internals. That is, we manually create the directory structure and put objects there with names equal to hash transformations of the real names, and the same transformation is then made in the fastcgi elliptics network daemon.
Let me show an example of how this is done. Let's consider an object called '/tmp/passwd.c' to be placed into the storage, which will use sha1 transformation function.

sha1('/tmp/passwd.c') = 8c23ac86ef943021cf6524f475c15f3d5d575deb
so we manually put this object into the storage network on the appropriate node (which handles covering ID range) with that 8c23... name.

FastCGI daemon configured to use sha1 transformation will receive URL like

GET some.host.net/blah?name=/tmp/passwd.c

take name part, hash it, lookup object with above 8c23... id and return following header:

Status: 301
Location: http://some_other_host.net/8c/8c23ac86ef943021cf6524f475c15f3d5d575deb

Simple. We only need to make an appropriate script for data upload. If we would use elliptics network for data upload, then above 8c23... transaction will contain history for data updates, and actual data transaction (or there could be multiple transactions if object was split into multiple parts to allow parallel reading) should be read from the history and then fetched from some other nodes using elliptics network API.
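
The name-to-URL step of the example above could look roughly like this in Python. It is a sketch under assumptions: the real daemon is a FastCGI application that obtains the storage host from an elliptics lookup, whereas here the host is simply passed in, and redirect_headers is a hypothetical name.

```python
import hashlib


def redirect_headers(name, storage_host):
    """Build redirect headers for an object name, following the layout above.

    storage_host is an assumption: in the real setup it comes from the
    elliptics network lookup, not from the caller.
    """
    object_id = hashlib.sha1(name.encode("utf-8")).hexdigest()
    # The first two hex digits select the subdirectory, as in /8c/8c23...
    location = "http://%s/%s/%s" % (storage_host, object_id[:2], object_id)
    return "Status: 301\r\nLocation: %s\r\n\r\n" % location


if __name__ == "__main__":
    print(redirect_headers("/tmp/passwd.c", "some_other_host.net"), end="")
```

A manual upload script would use the same transformation to decide the file name and subdirectory on the storage node.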

I will write such helper script and upload some content (currently I do this manually via ssh/scp :), so that I could stress-test setup before it goes up. So far things went pretty smooth.
Interested parties can check example directory in the git tree.

Stay tuned!

November 13, 2009 07:41 PM

Matthew Garrett: Legacy PC design misery

I've spent chunks of the last couple of days fighting a problem that's existed for about 25 years. The 8086 was a 16-bit processor with a 20-bit address space, limiting the maximum physical address that could be accessed to 1MB. However, quirks of the segmented memory system meant that addresses greater than 1MB could be constructed - these would wrap around to the bottom of memory. Because loading the segment registers was a time consuming operation, some programmers used this behaviour as a performance optimisation.

The 80286 introduced 24 bit address space. Unfortunately, this meant that the addresses that previously wrapped to the bottom of memory now pointed at real addresses - not ideal if you were expecting the old behaviour. IBM fixed this by tying the 21st address line (A20 - they're zero indexed) through an and gate, with the default behaviour being to keep it tied at 0 and thus maintaining the old wraparound behaviour. Applications that wanted to access the full address space needed to enable the A20 logic gate. IBM didn't want to add any extra hardware to their system if they could avoid it, so tied the other side of the and gate to a spare pin on the keyboard controller. By writing a couple of bytes to the keyboard controller, your PC-AT stopped pretending to be an XT and gave you access to all of the insanely expensive RAM it had stuffed in it. Hurray!
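
The arithmetic behind the wraparound and the gate can be worked through in Python. This is a model of the address calculation only, not of any real hardware interface, and the function names are mine.

```python
A20_BIT = 1 << 20  # the 21st address line, counting from A0


def linear_address(segment, offset):
    """Real-mode address: segment shifted left by 4 bits plus the offset."""
    return (segment << 4) + offset


def bus_address(segment, offset, a20_enabled):
    """Address seen by memory on an AT-class machine with an A20 gate."""
    address = linear_address(segment, offset)
    if not a20_enabled:
        address &= ~A20_BIT  # the gate holds address line 20 at zero
    return address


if __name__ == "__main__":
    # FFFF:0010 is the first address past 1MB that real mode can construct.
    print(hex(linear_address(0xFFFF, 0x0010)))
    print(hex(bus_address(0xFFFF, 0x0010, a20_enabled=False)))
    print(hex(bus_address(0xFFFF, 0x0010, a20_enabled=True)))
```

Since the highest address real mode can form is FFFF:FFFF (0x10FFEF), forcing bit 20 to zero is enough to reproduce the 8086 wraparound exactly.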

PCs have been emulating this behaviour since the AT was first cloned. Of course, this being the PC industry, many have got it wrong. There's a set of approaches for controlling the A20 gate that may work, varying in terms of performance and desirability. Most hardware will give the desired result (ie, I have no desire to run DOS executables from 1982, make my A20 work damnit) using any of the various methods of A20 enabling. Some hardware doesn't. The most common method used in bootloaders (where we still have access to system BIOS services) is to call int 15h with an ax of 0x2401, which asks the BIOS to enable A20 for us. This isn't implemented on all hardware, but we should get a failure back that lets us go and bang on the keyboard controller in an attempt to get it to pay attention[1].

Enter the Kohjinsha SC3.

I picked this up second hand in Japan. It's a ridiculously cute little tablet, only slightly larger than hardware that's comfortably in the MID range. It booted a Fedora liveCD perfectly, though having GMA500 graphics meant that what appeared wasn't terribly attractive. Installation proceeded happily enough, followed by a reboot and... nothing. Grub loaded the kernel and initrd, jumped to the kernel and everything hung.

So, for the past couple of days, I was stepping through the kernel setup code, trying to work out where and why it was hanging. I'd got it narrowed down to the region where the kernel tried to free the memory used by the initramfs, but the failure hopped around depending on my kernel build. Something was clearly very wrong. The strangest thing about this was that if I booted the liveCD boot menu and selected "Boot from local drive", everything worked perfectly. isolinux was clearly doing something that grub wasn't, but there's rather a lot of code to step through there.

Things became a lot easier once I found that the OpenSuse version of grub worked. Their grub has a rather smaller set of patches than ours, and only a few looked even plausibly relevant. It only took ten minutes or so to figure out that it was one that altered the A20 code. Things became much clearer then.

The main functional difference between the Suse A20 implementation and the upstream one[2] is that the Suse one explicitly tests whether the A20 enabling worked by putting values at two different addresses that would be the same if A20 is disabled. By comparing them, we know whether A20 is working properly or not. If not it can then fall back to other mechanisms. The Fedora code trusted the BIOS's claim that the int 15 call had worked. The Kohjinsha's BIOS lied, A20 remained disabled, grub copied the kernel and initramfs to chunks of address space that contained lies rather than RAM and everything fell over horribly.
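
The paranoid check is easy to model. The Python sketch below simulates memory behind an A20 gate and performs the two-address comparison; it is not the grub code, and the class, the function name and the probe address 0x500 are all illustrative assumptions.

```python
A20_BIT = 1 << 20


class SimulatedMemory:
    """Toy memory model: with A20 disabled, address bit 20 is forced to zero."""

    def __init__(self, a20_enabled):
        self.a20_enabled = a20_enabled
        self.cells = {}

    def _translate(self, address):
        return address if self.a20_enabled else address & ~A20_BIT

    def write(self, address, value):
        self.cells[self._translate(address)] = value

    def read(self, address):
        return self.cells.get(self._translate(address), 0)


def a20_really_enabled(memory, low=0x000500):
    """Write different values 1MB apart; they alias only if A20 is disabled."""
    high = low | A20_BIT
    memory.write(low, 0x55AA)
    memory.write(high, 0xAA55)
    return memory.read(low) != memory.read(high)


if __name__ == "__main__":
    print(a20_really_enabled(SimulatedMemory(a20_enabled=True)))
    print(a20_really_enabled(SimulatedMemory(a20_enabled=False)))
```

A real bootloader would also save and restore whatever was stored at the probed addresses; that is omitted here to keep the comparison itself visible.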

Thankfully, not a difficult fix once the problem was identified. But seriously, people. How hard can it be not to screw this up?

(For an excruciatingly detailed analysis of how hard it can be not to screw this up, see here)

[1] the Intel Macs don't implement the int 15 approach, but return a failure. They also don't have a legacy keyboard controller, so attempting to hit that resulted in grub falling over. The magic IO port approach works. Another example of how the Intel Macs aren't really PCs...

[2] grub2 implements the more paranoid check

November 13, 2009 02:32 PM

Dave Jones: setting signature based on From/To in mutt.

Lazyweb, help me out.
I use mutt to read multiple mail accounts. Using alternates, I have it rigged so that when I reply to someone it sets my From: to the same as the address that they sent the mail To: Straight-forward stuff.

What I’d like to do next is set my .signature based upon the same rule.
I thought that this can be done easily enough with send-hooks like..
send-hook mutt- 'set signature=~/.signature1; my_hdr From: Dave Jones <emailaddress1>'
send-hook mutt- 'set signature=~/.signature2; my_hdr From: Dave Jones <emailaddress2>'

But that just seems to make it pick signature1 regardless of any header.

I googled a while, but turned up dead ends. profiles look interesting, but I’d rather not have to swap between things manually. Likewise, it looks possible to do it on a per-folder basis, but I want this to work with =mbox where all accounts land unfiltered.

Anyone set up something similar ?

setting signature based on From/To in mutt. is a post from: codemonkey.org.uk

November 13, 2009 03:55 AM

November 12, 2009

Kernel Podcast: 2009/11/11 Linux Kernel Podcast

AUDIO: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091111.mp3

For Veterans Day (November 11th) 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Floppy, LZO, Resume, and Wakeup.

Floppy. Just when you thought nobody used floppy disks any more. Stephen Hemminger posted to let everyone know that “mount -o sync” support has a regression for floppy disk use cases in kernel 2.6.31. Some time between 2.6.30 and 2.6.31-rc1, the anticipated behavior of writes immediately completing and blocking until they hit the ext2-formatted disk broke and a copy followed by disk removal followed by unmount results in errors. This potentially may affect USB thumbdrive users, so has some wider relevance.

LZO. Albin Tonnerre posted version 3 of a patch series implementing generic LZO compression for kernel binary images on x86, ARM. The patches include support both for building and using these images, and their initramfses.

Resume. Rafael J. Wysocki and Linus Torvalds chimed in on Rafael’s previous posting concerning broken resume-from-suspend. After applying a patch intended to help diagnose the problem, Rafael reported that errors were being generated by btusb_waker, which Linus said matched his “observation that only a few [Bluetooth] drivers seem to use workqueues, and btusb_disconnect() isn’t doing any work cancel”. Marcel Holtmann and others began discussing solutions.

Wakeup. Yinghai Lu posted version 2 of a patch intended to make doubly sure that ACPI wakeup code is located below 1M in physical memory on x86. The patch attempts to find a suitable region in the BIOS/EFI/firmware specified “e820” area (a table of memory mappings on such systems) and reserve it early on.

The latest kernel release is 2.6.32-rc6.

Stephen Rothwell posted a linux-next tree for November 11th. Since Tuesday, the i2c tree lost a conflict, the new tree gained a conflict, the wireless tree lost a build failure, the rr tree gained a build failure, the pcmcia tree gained a conflict, the tip tree gained a build failure, the percpu tree gained a conflict, and the usb tree also gained a conflict. The total sub-tree count is now at 148 trees, since the previous issues with pulling trees resolved.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 12, 2009 07:10 PM

Kernel Podcast: 2009/11/10 Linux Kernel Podcast

AUDIO: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091110.mp3

For Tuesday, November 10th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: AppArmor, Changing task UIDs, SECURITY_FILE_CAPABILITIES, and Stable tags and git workflow.

AppArmor. John Johansen posted version 3 of a 12-part patch series intended to re-implement the AppArmor security module (which was previously maintained out of tree by Novell, until it wasn’t, then seemed to die shortly after Canonical began to support it, and now has returned in a new form in a posting from John, who is a Canonical engineer) upon the security_path hooks instead of the previous VFS hack. AppArmor is a path-based alternative to SELinux that is sometimes seen as being less complicated to set up, although this is debated. In any case, these patches seem more supportable for upstream inclusion.

Changing task UIDs. Enrico Weigelt, who is working on plan9 patches, inquired as to the best way to implement plan9-style support for changing the UID of running tasks, perhaps through a new /proc entry. He then proceeded to post various replies to other threads he had not previously been involved with – amongst other things criticizing hald and dbus design, and espousing the virtues of plan9 (if only it had more users to sell us on its features?).

SECURITY_FILE_CAPABILITIES. Serge E. Hallyn posted suggesting that the Kconfig option SECURITY_FILE_CAPABILITIES be removed, specifically invalidating the case of SECURITY_FILE_CAPABILITIES=n, and meaning that such capabilities would always be enabled unless the user specified no_file_caps on system boot. The reason behind this suggestion stems from an apparent misunderstanding amongst a growing number of application developers that such support is always present, leading Serge to wonder if it might as well just be by now.

Stable tags and git workflow. Ingo Molnar posted an RFC concerning stable tree git commit workflow. He noted that previously, he would have to email (cherry pick) the specific pre-requisite dependencies for any stable patch forwarded to the stable team (or wait for an email when things didn’t apply to the stable tree), but felt that this could be optimized. So, Ingo has begun adding comments on “CC” lines in the patch indicating additional commits that should be included, e.g. “# .32.x: : “. These commits are added to a new “-stable” tag on Ingo’s -tip tree. He seeks comments.

Finally today, Chris Friesen had asked about correct use of IANA-registered ports on systems running sunrpc. Specifically, the RPC implementation as used by NFS can make use of ports that are reserved for other services, if a range has not been set aside ahead of time (and even then, it’s not optimal if you really want to run every service). But Trond Myklebust asked the obvious: “The people who are trying to run absolutely all IANA registered services on a single Linux machine that is also trying to run as an NFS client may have a problem, but then again, how many setups do you know who are trying to do that?”. The answer, one assumes, is less than one.

In today’s announcements: 2.6.31.6-rt19. Thomas Gleixner announced preempt-rt patch series release number rt19 for the 2.6.31.6 stable kernel. This was mostly a forward port to the latest stable tree, but also adds a missing pre-emption point in ksoftirqd. The patches are available in the usual locations, amongst them: http://www.kernel.org/pub/linux/kernel/projects/rt.

The latest kernel release is 2.6.32-rc6.

Stephen Rothwell posted a linux-next tree for November 10th. Since Monday, there was a new sysctl tree from Eric W. Biederman that contains only the generic compat_sys_sysctl patches now that binary sysctls are going away. The net tree lost both a conflict and build failure, the wireless tree still has a build failure, and the trivial tree lost a conflict. Stephen reports the sub-tree count at 146, but that is incongruent with the new tree.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 12, 2009 07:09 PM

Kernel Podcast: 2009/11/09 Linux Kernel Podcast

AUDIO: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091109_corrected.mp3

For Monday, November 9th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: CFS, Cisco VPN, Fsck, Resume, and Too Many Signals.

CORRECTED: Indeed, I screwed this up by mixing up two patches. The following is about CFS task limit scheduling, not CFQ, patches for which I was reading about at the time also.

CFS Hard Limits. Bharata B. Rao posted version 3 of his CFS Hard Limits patch series. This is intended to allow for configurable hard limits on CPU used by task groups.

Cisco VPN. Mariusz Smykula, noting that this was “not yours problem” posted to let everyone know that kernels after 2.6.29 seem to break support for the proprietary Cisco VPN client, apparently needed on some “certified” systems that by implication cannot run vpnc or similar. The posting included a variety of links to users discussing the issues, though it does seem unlikely that the kernel community will rush to help Cisco with proprietary software.

Fsck. Ted Ts’o pointed out (in a thread entitled “document conditions when reliable operation is possible”) that “as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.” This led David Lang to question where that documentation was, to which Ted replied that it was in the LKML archives. Apparently, the lack of documentation that explicitly mentions this was a contributing factor in “SUSE11-or-so” ceasing to perform periodic fscks on its own because Pavel Machek could not find sufficient documentation justifying this when the decision was made.

Resume. Rafael J. Wysocki posted a request for help diagnosing a problem with the suspend and resume code in 2.6.32-rc. For several days, he has been trying to debug resume problems (that obviously might be suspend problems) on his Toshiba Portege R500. Apparently, it seems to be caused by leakage of preempt_count in the events kernel thread, but Rafael has never been able to capture a full oops message, so that is based only upon some detective work performed using gdb and a partial trace output. He did find a commit (from himself) that upon removal would make the issue unreproducible, but he believes that commit (preparing for early/late parts to resume) simply exercises code paths that make the problem more easily triggerable. He also found an earlier commit in which the leak led to a warning (run_workqueue) that didn’t kill the box, but might be responsible for the hard lockup seen later on. Later, he found and posted a full trace, stuck in run_workqueue.

Too many signals. Naohiro Ooiwa posted a patch to the handling of the print-fatal-signals kernel boot parameter such that sysadmins will receive a warning when RLIMIT_SIGPENDING is exceeded and can choose to enable the additional logging facility to diagnose what is really going on.

Finally today, are you feeling motivated? Mark Pith announced that his research team (at the University of Amsterdam) were researching the “motivation factors of Open Source software programmers”. He would like you to complete a short survey that won’t exceed 15 minutes in length. The link to the survey is: http://bit.ly/Survey_Developers_Motivation.

In today’s announcements: Linux 2.6.27.39 and Linux 2.6.31.6. Greg Kroah-Hartman announced the release of kernels 2.6.27.39 and 2.6.31.6. These were in review over the weekend.

LTTng 0.167. Mathieu Desnoyers announced the latest LTTng patch for 2.6.31.6, encouraging all users to upgrade to the latest .31 series kernel since it contains security fixes.

The latest kernel release is 2.6.32-rc6.

Stephen Rothwell posted a linux-next tree for November 9th. Since Friday, several trees are feeling less “angry” (they’re always “in conflict” you see, according to Katherine). The sparc tree lost its build failure, the net tree lost a conflict but gained another for which Stephen applied a patch. The wireless, pcmcia, trivial and staging trees also gained similar conflicts. The total number of subtrees in linux-next remained steady at 146 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 12, 2009 07:07 PM

Dave Airlie: what a difference having hardware makes....

So I (and other radeon developers) debug a lot of radeon problems, both locally and with people over irc/bugzilla, and I often am quite slow to deal with bugs that I can't reproduce locally. It's usually a last resort to do remote debugging, and it's unfortunate for people who have hw bugs that we can't reproduce locally.

So what prompted this post?
https://bugzilla.redhat.com/show_bug.cgi?id=527874
KMS:RV515:X1400 Thinkpad T60 resume fails

So first up, my local Thinkpad T60p with an rv530 always resumed fine, my local T42 with 7500 mostly works okay as well, so there goes my local reproduction.

So Peng works for Red Hat in Beijing and the week before kernel summit I sat down on irc for most of two days with him running various tests. We tracked it down in those two days to the fact that his video RAM wasn't getting set up properly on resume. The NMI on resume let me track down that when the gpu accessed the ring, it generated an error on the PCI bus. This led to checking the contents of the PCIE gart table (with a detour through kernel vmap page handling). The PCIE gart table is in VRAM, and upon checking it on resume we noticed that when we copied it back into VRAM it was getting mangled. So we could deduce VRAM was broken.

I handed off to Jerome, and he got traces of the BIOS posting using VBE and using ATOM (which we use in the kernel). Jerome ruled out different parts of the engines, and we got reports of it working for some people when powered down for a long time, or other randomness. We were going back and forth with ideas on what might be going wrong, and had started thinking that we should power off various parts of the hw before suspend and that the problem was due to inconsistent hw state on resume.

Peng happened to visit Brisbane for a Red Hat meeting this week and brought the T60, and this morning I swapped my laptop for his. First of all I tried to play by plugging in a VGA monitor; this produced another bug where the LVDS would die when starting KMS, so I fixed that quickly first. So the VGA screen was also corrupted, and VRAM wasn't enabled at all. Next I tested that vbetool posting worked; suspend/resume to corruption, unload radeon, vbetool post and load radeon also worked. Then I started testing with Jerome's userspace atom init tool: doing a s/r, unload radeon, atom post, load radeon also worked fine. This is where it started to make no sense, since Jerome's tool was doing the exact same thing as the kernel parser. I started by blaming the atom delay code in the kernel but that proved a dead end after an hour or two. Next thing I enabled the kernel atom debugging and all of a sudden it resumed fine. So it was a timing issue somewhere in the atom parser running the init code.

So enabling debugging put enough of a delay between operations that something that wasn't working before now succeeded. I started bisecting the debug messages: I removed half the debugging at a time until, after about 3 hrs, I got it down to one printk happening between two atombios commands. The surrounding code was reading and writing one of the memory controller setup registers on the GPU, so it pointed to some register write not getting fully into the hw before we read it back and write it again later. I changed the atom code to do a read back before writing regs for certain operations and voilà, all resumed fine.

So this took the best part of 8 hours. I reckon if I'd been doing the same over irc with Peng it would have taken at least a month of back and forth to figure it out. Having the hardware locally even for a day made it possible to track it down and figure it out so much more quickly and efficiently. So the bad news for anyone with bugs we can't reproduce locally is that we generally will fix any bugs we can reproduce locally first, just from an efficiency point of view, since we can fix them so much faster.

November 12, 2009 10:00 AM

Mel Gorman: Page allocator failure warnings in recent kernels

So, in recent kernels since 2.6.31-rc1, there is a seemingly benign problem whose apparent manifestation is page allocation failures of GFP_ATOMIC allocations. The system recovers but there are large stalls even though on server systems, everything goes faster overall. The problem is particularly pronounced when using certain wireless cards but manifests in harder-to-diagnose stalls on machines with low memory under stress. The development methodology means that kernels come out very quickly even though right now, I would really prefer if the world would slow down while my poor test machines try to catch up.

I think I have a solution to this but it takes several hours each time to figure out if forward progress has been made or not.

The lesson learnt here? Panic makes for poor decisions. I sent one patch that looked great at the time but have found out in the last few hours that it really sucks. While figuring this out for sure, I have to wait, watching a screen update painfully slowly. To help the waiting, I found some beer, it's the Irish thing to do. Wonder what the rest of ye do :/

November 12, 2009 12:36 AM

November 11, 2009

Evgeniy Polyakov: Elliptics network cloud and lookup requests

Currently elliptics network only supports a plain lookup command, which only tells you the address of the node where an object is supposed to live in the given configuration, but does not check whether the object is actually placed there.

So, to find out the precise location of the data one has to perform at least two steps: look up the node where it is supposed to be, and issue a direct request to that node to read part of the data. This is not very efficient, so the plan is to extend the lookup command to perform an actual object 'touch' if requested.

To date I have more or less implemented the needed FastCGI elliptics network daemon, which looks up nodes according to the URL it got, but it is rather limited from a configuration point of view, i.e. a fair amount of data is hardcoded in the daemon which should actually be read from config. I plan to use the various environment variables the FastCGI daemon is able to read to hold the remote node addresses to contact, the transformation functions to use, various timeouts and so on.

Also this daemon will be the first user of the extended lookup command (yet to be written :).

Stay tuned, I plan to complete this whole setup and implementation this week and run some tests to find out whether solution with the direct data URL instead of getting it via elliptics network API is a good idea and what problems can be found in such installation.

November 11, 2009 09:38 PM

Dave Jones: Google wave observations.

I’ve not seen google wave yet, but here are my impressions based upon reactions from those who have.

Phase 1: HEY WHO HAS A GOOGLE WAVE INVITE THEY CAN GIVE ME?
This begging phase increases in popularity as more people actually get on the site.

Phase 2: THIS IS COOL, BUT WHAT THE HELL DO I DO WITH THIS ?
Turns out that the “I have something you don’t” novelty wears off quickly when people realise they didn’t actually need it.

Phase 3: HEY WHO ELSE IS ON GOOGLE WAVE ?
Belief that others may be able to enlighten them, and get them out of phase 2.

Phase 4: HEY THIS IS PRETTY USELESS.
Followed by neglect and forgetting they even have an account.

I’ve yet to see anyone praising google wave, but I’ve seen a lot of people go quickly through the above phases. I’m sure there are some people reading this through planet.*.com using it. Is it really “all that” ? Or is this google app going to be the next orkut ? (Remember when everyone thought that was the future? Ah, 2003 I miss you).

(This post isn’t actually me in phase 1. I couldn’t care less, and will probably skip straight to phase 4 when it goes public).

Google wave observations. is a post from: codemonkey.org.uk

November 11, 2009 06:52 PM

Stephen Hemminger: Powerpoint® Karaoke contest

Anyone in the Portland area interested in a fun and creative event is invited to the 1st Timbertalkers Powerpoint® Karaoke contest on Tuesday 11/24 at noon.

Meeting location is: 9403-B SW Nimbus Ave., Beaverton, Oregon

If you have never done PPTK, here are the rules:



In the spirit of open source, it will really be an OpenOffice Impress contest, and the slides will be drawn from Creative Commons licensed decks.

November 11, 2009 04:58 PM

November 10, 2009

Pete Zaitcev: Dehalification

Is it just me, or did nobody in GNOME/Freedesktop/etc. know what they were doing for years (press "thread next" for more fun)? I wonder what David Zeuthen has to say for himself about HAL. Also, what are the chances that whatever is being done now won't get rejected in, oh, a year?

November 10, 2009 06:35 PM

Dave Jones: a common hyperthreading misconception.

Despite having been around for seven years now, I still see a common misconception surrounding hyper-threading. People look at /proc/cpuinfo, see ‘ht’ in the flags line, and think “hey, I don’t have hyper-threading, /proc/cpuinfo is wrong!”.

But this isn’t the case. The ‘ht’ flag doesn’t signify the presence of hyper-threading or not. It signifies the presence of the ability to say yes or no as to whether the processor has any siblings. Basically “If I call this cpuid function, can I trust the results?”. This cpuid function is present in all Intel processors since the Pentium 4.

So seeing a cpu with ‘ht’ in the flags, but no siblings is perfectly normal, and has been for all this time, but some people just don’t seem to get it.
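
A minimal sketch of the distinction, assuming a /proc/cpuinfo-style text block. The sample text below is made up for illustration, and comparing "siblings" with "cpu cores" is a common heuristic rather than anything stated in the post:

```python
# Hypothetical sample in the style of /proc/cpuinfo (not from a real machine).
SAMPLE_CPUINFO = """\
processor\t: 0
model name\t: Example CPU
siblings\t: 1
cpu cores\t: 1
flags\t\t: fpu vme de pse tsc msr pae mce ht sse sse2
"""


def parse_first_cpu(text):
    """Return the key/value fields of the first processor block."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the first processor block
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def describe_ht(fields):
    """Return (has_ht_flag, smt_in_use) for one processor block."""
    has_flag = "ht" in fields.get("flags", "").split()
    siblings = int(fields.get("siblings", "1"))
    cores = int(fields.get("cpu cores", "1"))
    # More logical siblings than physical cores means threads share a core.
    smt_in_use = siblings > cores
    return has_flag, smt_in_use


fields = parse_first_cpu(SAMPLE_CPUINFO)
has_flag, smt_in_use = describe_ht(fields)
print(f"ht flag: {has_flag}, hyper-threading in use: {smt_in_use}")
```

With this sample the flag is present while no threads share a core, which is exactly the "perfectly normal" case described above.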

a common hyperthreading misconception. is a post from: codemonkey.org.uk

November 10, 2009 04:03 PM

Matthew Garrett: The ACPI Embedded Controller

Of course, the event model I described before is far too simple to be worthy of a place in the ACPI spec. At the most basic level, there's more possible events than there are GPEs to attach them to, so there's a need for some further complexity. This manifests itself in the form of the ACPI embedded controller (EC).

The EC is typically a small microprocessor sitting on your motherboard, often implemented in the same hardware as the keyboard controller. It shares a lot in common with the keyboard controller - on PCs it'll usually appear in system io space, with one register for writing a command or reading a status, and a second register for passing data back and forth[1]. There's 256 registers available, so a typical interaction might be to write the READ command (0x80) to the command register, write the EC register address to the data register and then read back from the data register to get the EC register contents.
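
The read transaction described above can be sketched against a toy model. FakeEC is a hypothetical stand-in written for this illustration, not a real driver interface, and it ignores status-bit handshaking entirely; only the 0x80 READ command and the command/data register pair come from the text:

```python
EC_CMD_READ = 0x80  # READ command, as given in the text


class FakeEC:
    """Toy stand-in for an embedded controller with 256 registers."""

    def __init__(self):
        self.registers = [0] * 256
        self._pending = None  # command waiting for its address byte
        self._output = 0      # byte the next data-register read returns

    def write_command(self, value):
        self._pending = value

    def write_data(self, value):
        if self._pending == EC_CMD_READ:
            # The byte after a READ command is the register address.
            self._output = self.registers[value & 0xFF]
            self._pending = None

    def read_data(self):
        return self._output


def ec_read(ec, address):
    ec.write_command(EC_CMD_READ)  # 1. command register <- READ
    ec.write_data(address)         # 2. data register <- EC register address
    return ec.read_data()          # 3. data register -> register contents


ec = FakeEC()
ec.registers[0x2A] = 0x37  # hypothetical register and value
print(hex(ec_read(ec, 0x2A)))
```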

The embedded controller will often be responsible for tracking information about the hardware, such as the temperature. Attempting to read the temperature through ACPI will execute an ACPI method - in the case of the temperature being monitored by the embedded controller, this method will attempt to read from an EC register. The EC driver then performs the read and returns the result, which gets converted into decidegrees kelvin and passed back to whatever made the temperature query.
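
As a small worked example of the unit involved: ACPI temperatures are expressed in tenths of a kelvin, so a consumer that wants Celsius has to convert. The function name and the sample value below are assumptions made for illustration:

```python
def decikelvin_to_celsius(raw):
    """Convert a temperature in tenths of a kelvin to degrees Celsius."""
    return raw / 10.0 - 273.15


# Hypothetical reading: 3232 tenths of a kelvin is 323.2 K.
reading = 3232
celsius = decikelvin_to_celsius(reading)
print(round(celsius, 2))
```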

But, as mentioned above, the EC also generates events. These may be in response to a user initiated event like a hotkey press, or may be triggered by some change in hardware state like a thermal trip point being passed. The embedded controller will then raise a GPE.

Unlike normal GPEs, the EC GPE is not handled by looking for a _Lxx or _Exx method. Instead, the ACPI tables provide information about the GPE that the EC is using. This may be in the form of a _GPE definition in the EC object in the main ACPI tables, or alternatively may be provided in an ECDT (Embedded Controller Descriptor Table), an optional table that provides all the EC information. In either case, the OS knows which GPE will be triggered by the EC. It then installs a handler that will be called whenever the EC raises that GPE.

Things get a touch confusing at this point. The first thing this handler does is read the command byte, which functions as a status byte on reads. It then checks whether the SCI_EVT bit is set. This informs the system that the GPE was in response to a hardware event, and so the EC handler writes a query command to the EC command register and then reads back a value between 0 and 255 from the data register. This is then mapped to a _Qxx method, with xx representing the number of the EC event read from the data register. Like the _Lxx and _Exx methods, the _Qxx method is then executed.
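
The mapping from a query result to a method name can be sketched as follows. The SCI_EVT bit position (bit 5 of the status byte) is taken from the ACPI specification rather than from the text, and the function names are hypothetical:

```python
EC_STATUS_SCI_EVT = 0x20  # bit 5 of the status byte (assumed, per the ACPI spec)


def query_method_name(event):
    """Map an EC event number (0-255) to its _Qxx method name."""
    if not 0 <= event <= 0xFF:
        raise ValueError("EC event numbers fit in one byte")
    return f"_Q{event:02X}"


def handle_ec_gpe(status, read_query_event):
    """Return the _Qxx method to run, or None if no event is flagged.

    read_query_event stands in for issuing the query command and
    reading the event number back from the data register.
    """
    if not status & EC_STATUS_SCI_EVT:
        return None
    return query_method_name(read_query_event())


print(handle_ec_gpe(0x20, lambda: 0x1C))
```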

The problem with all of this is that the EC isn't that fast. When a byte is written to it, it's necessary to read back the status byte and check whether the IBF bit is set. This is set when the OS writes a byte to the data register, and cleared once the EC has processed it. The straightforward way to deal with this is to poll the status byte until the bit is cleared, and then write the next byte, but polling is slow and wastes CPU time. The EC can instead be set to interrupt mode, where it'll fire a GPE when the IBF bit clears.
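
A rough sketch of the polling approach, assuming IBF is bit 1 of the status byte (from the ACPI specification, not the text). The read_status callable is a hypothetical stand-in for reading the status register, and the canned status sequence simulates an EC that stays busy for a few polls:

```python
EC_STATUS_IBF = 0x02  # bit 1 of the status byte (assumed, per the ACPI spec)


def wait_ibf_clear(read_status, max_polls=1000):
    """Poll until IBF clears; return how many polls saw the EC busy."""
    for busy_polls in range(max_polls):
        if not read_status() & EC_STATUS_IBF:
            return busy_polls
    raise TimeoutError("EC did not consume the last byte in time")


# Simulated status reads: busy three times, then ready.
statuses = iter([0x02, 0x02, 0x02, 0x00])
busy = wait_ibf_clear(lambda: next(statuses))
print(f"EC was busy for {busy} polls")
```

Every busy poll is CPU time spent doing nothing useful, which is the cost that interrupt mode avoids.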

The EC has one additional function. The ACPI spec allows for an i2c bus to be implemented through the EC, with EC registers mapping to i2c registers. The observant among you will realise that this means that there's an indexed access protocol being implemented on top of indexed access hardware, which is more layers of indirection than seem sane. For additional humour, this is usually only used to add support for ACPI smart batteries. ACPI batteries are generally abstracted behind a set of ACPI methods that provide information. Smart batteries instead speak i2c directly to the OS[2] for no real benefit. Linux handles these devices fine, and while the chances are you probably don't have one, the chances are also that if you do you haven't noticed.

The final quirk of ACPI events is that there's yet another means of delivering events. The term "fixed feature" is used to describe an ACPI device that isn't described in the ACPI tables. A power button may be implemented as a fixed feature device rather than a normal ("control method") device. This is indicated by a flag in the fixed feature block. Hitting a fixed feature power button will generate an ACPI interrupt, but no GPE. Instead the OS has to read the fixed feature block and note that the power button flag is set there. It then notifies userspace appropriately. Sleep buttons can also be implemented this way, but other devices will be in the normal ACPI tables and will generate either GPEs or EC events.

[1] On my laptop, these are ports 0x62 and 0x66 - compare to the keyboard controller's use of ports 0x60 and 0x64

[2] As directly as indirection via the EC can be...

November 10, 2009 02:58 PM

Dave Jones: Sharp TV violating the GPL

After years of using an on-its-last-legs 720i (i stands for i-strain) TV, this week I decided it was time to let go of the last big CRT I have, and pick up a shiny new LCD TV. After a few visits to stores to check out the options, I ended up choosing a Sharp AQUOS LC46LE700UN 46-Inch 1080p LED HDTV. I managed to find a place that had it on offer for the weekend, which saved me a few hundred dollars. It arrived today. It’s freaking huge. Somehow the TV looks even bigger at home than it did in the store.

After setting it up, and reading through the manual, I stumbled across a GPL notice, and an acknowledgement that it uses the following open source software components.
“linux kernel/busybox/uClibc/zlib/libpng/libjpeg/libiconv/DirectFB/OpenSSL/XLRPC-EPI”
It points you to where you can download the source code.

The 52MB helena-kernel.tgz is what I focused on first. It’s a 2.6.18 Linux kernel with a slew of stuff removed to keep the source tree small I guess. So arch only contains arm and um for example. What else is in there ? Some preempt-rt hacks, ltt, some debug stuff, a load of cvs ident damage from a broken import into their tree, and a bunch of other noise. There’s also a load of changes to files they never even compile. Like ipv6, and decnet, and sound drivers, and, and.. etc.

There’s nothing particularly notable in the tree that isn’t already upstream in some form or other. Except for an ARM port to the ‘mt5391′ platform. Work that was apparently conducted by an upstream vendor named mediatek. Why is it notable ? Well the copyright header for one thing ..

/*
 * linux/arch/arm/mach-mt5391/core.c
 *
 * Copyright (C) 2006 MediaTek Inc
 *
 * This program is not free software; you can not redistribute it
 * or modify it without license from MediaTek Inc.
 *
 * CPU core init - irq, time, baseio
 */

Unless I’m mistaken, this isn’t permitted under the GPL. If you give me GPL software, I have the right to redistribute it. If you add proprietary code to GPL code, it becomes GPL licensed. After discussion with Harald Welte, it turns out I was mistaken. It’s still not permitted, but it doesn’t automatically become GPL, the resulting work isn’t distributable by anyone.

I’ll bet Sharp don’t even know they’re doing this. They probably took the code sold-as-seen from mediatek, and assumed all was ok without even thinking of licensing concerns. I don’t know if gpl-violations.org already had this one in their queue, but they do now.

Who knows, maybe MediaTek will wise up and actually contribute back to the source code they parasitically develop against.

Despite all this, I still like my new tv very much.

Sharp TV violating the GPL is a post from: codemonkey.org.uk

November 10, 2009 04:02 AM

Matthew Garrett: ACPI general purpose events

ACPI is a confusing place. It's often thought of as a suspend/resume thing, though if you're unlucky you've learned that it's also involved in boot-time configuration because it's screwed up your interrupts again. But ACPI's also heavily involved in the runtime management of the system, and it's necessary for there to be a mechanism for the hardware to alert the OS of events.

ACPI handles this case by providing a set of general purpose events (GPEs). The implementation of these is fairly straightforward - an ACPI table points at a defined system resource (typically an area of system io space, though in principle it could be something like mmio instead), and when the hardware fires an ACPI interrupt the kernel looks at this region to see which GPEs are flagged. Then things get more interesting.

The majority of GPEs are implemented in the ACPI tables via methods with names like _Lxx or _Exx. The xx is the number of the GPE in hex, while the leading _L or _E indicates whether the GPE is level- or edge-triggered. If an ACPI interrupt is fired and GPE 0x1D is flagged as being the source of the interrupt, the ACPI interpreter will then look for an _L1D or _E1D method. Upon finding one, it'll execute it. What this method does is entirely up to the firmware - on most HP laptops, GPE 0x1D is hooked up to the lid switch[1] and so executing it will send a notification to the OS that the lid switch has changed state. The OS will then evaluate the state of the lid switch (generally by making another ACPI query) and send the event up to userspace.
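
The naming rule can be sketched as a lookup. The dictionary standing in for the ACPI namespace and the function names are hypothetical; only the _Lxx/_Exx convention and the 0x1D example come from the text:

```python
def gpe_method_candidates(gpe):
    """Return the level- and edge-triggered method names for a GPE number."""
    return [f"_L{gpe:02X}", f"_E{gpe:02X}"]


def find_gpe_method(gpe, namespace):
    """Return the first handler name present in the namespace, else None."""
    for name in gpe_method_candidates(gpe):
        if name in namespace:
            return name
    return None


# Hypothetical namespace with a level-triggered handler for GPE 0x1D.
namespace = {"_L1D": "notify lid device"}
print(find_gpe_method(0x1D, namespace))
```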

How does the lid end up triggering GPE 0x1D? Things get pretty hardware specific at this point. Intel motherboard chipsets have a set of general purpose io (GPIO) lines that can, for the most part[2], be used by the system vendor for anything they want. For a lid switch, one of these lines is hooked to the switch and the BIOS configures the GPIO as an input. Pressing the switch will cause the GPIO line to become active. The GPIO lines are mapped to GPEs in a 1:1 manner, though with an offset of 16 - ie, GPIO 0xd will map to GPE 0x1d. If GPIO 0xd becomes active, GPE 0x1d will be flagged and an ACPI interrupt sent. The ACPI code will then do something to quash the interrupts, such as inverting the polarity of the GPIO[3], as well as send the notification to the OS.
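
The 1:1 mapping with its offset of 16 is simple arithmetic. A sketch, with hypothetical function names and the Intel-specific offset taken from the text:

```python
GPIO_GPE_OFFSET = 16  # offset described in the text for Intel chipsets


def gpio_to_gpe(gpio):
    """Map a GPIO line number to the GPE it raises."""
    return gpio + GPIO_GPE_OFFSET


def gpe_to_gpio(gpe):
    """Map a GPE back to its GPIO line, or None for the lower 16 GPEs."""
    if gpe < GPIO_GPE_OFFSET:
        return None  # the lower 16 GPEs are not backed by GPIO lines
    return gpe - GPIO_GPE_OFFSET


print(hex(gpio_to_gpe(0xD)))
```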

Why are the GPIOs offset by 16 relative to the GPEs? The lower 16 GPEs (again, talking about Intel hardware) have pre-defined purposes[4]. These range from things like "Critically low battery" to "PCIe hotplug event" down to "This device triggered a wakeup". And the latter is what I'm most interested in here.

Various pieces of modern hardware can be placed into power saving states when not in use. The problem with this is that the user experience of having to turn on hardware before you can use it is not a good one, so in order to make this the default behaviour we need the hardware to tell us that something happened that requires us to wake the hardware up.

There's something of a chicken and egg problem here, but thankfully most of the relevant modern hardware has out of band mechanisms to tell us about things going on. The PCI spec defines something called Power Management Events (PME), which are driven by an additional current that's supplied to the hardware even when it's otherwise turned off. On plug-in PCI Express cards, firing a PME generates an interrupt on the root bridge and a native driver can interpret that, but for legacy PCI devices and integrated chipset devices the notification has to come via ACPI.

The example I've been working on is USB. It's a good choice for various reasons - firstly, there's already support for detecting when the USB controller is idle. Secondly, modern USB host controllers have support for generating PMEs on device insertion, removal or (and this is important) remote wakeup. In other words, as long as the USB bus is idle we can power down the entire USB controller. If the OS tries to access a USB device, we'll power it back up. If the user unplugs or plugs a device, we'll power it back up. If a previously idle device suddenly responds to some external input, we'll power it back up. And it's all nicely invisible to the user.

How does this work? The controller retains a small amount of power even when nominally powered down. This is used to keep the detection circuitry alive. When it receives a wakeup event, it asserts the PME line. The chipset detects this and fires a GPE. The OS runs this GPE and receives a device notification on the ACPI representation of the USB controller, telling us to power it back up. We do so and process whatever woke us - if the bus then goes idle again, we can power down once more.

The astonishing thing is that this all works. The only problem we have is that it relies on the machine vendor to have provided the ACPI methods that are associated with the GPEs. If they haven't, we can't enable this functionality - even though the hardware is capable of generating the GPEs, we have no method to execute to let us know which device has to be woken up. The GPE is never answered, we never acknowledge the PME and the hardware keeps on screaming for attention without getting any. And, more to the point, it never gets powered up and your mouse doesn't work.

There's a pretty gross hack to deal with this. In general, we know what the GPE to device mappings are - they're pretty static across Intel chipsets, and while AMD ones can be programmed differently by the BIOS we can read that information back and set up a mapping ourselves. This trick also comes in handy when some vendors (like, say, Dell) manage to implement one of the GPE events wrongly. Everything looks like it should work, but the method never sends a notification because it's buggy. In that case we can unregister the existing method and implement our own instead.

This code isn't upstream yet, but patches have been posted to the linux-acpi mailing list and with luck it'll be there in the 2.6.33 timeframe. My tests suggest about 0.2W saving per machine, which isn't going to save all that many polar bears but seems worth it anyway.

[1] _L1D = lid. Sigh.

[2] There's a few that are reserved for specific purposes

[3] So where before it had to be high to be active, it now has to be low to be active - this means that it'll now trigger on the switch being opened rather than closed, so you'll get another event when you open the lid again.

[4] You can find a list in the documentation for the appropriate ICH chip - the relevant section is "GPE0_STS" under the LPC interface chapter.

November 10, 2009 03:08 AM

November 09, 2009

Evgeniy Polyakov: Static HTTP content elliptics network

I decided to use a FastCGI application to look up the needed node and a static-content web server to actually give the data away.

To simplify things as much as possible, the following setup will be created: multiple server nodes will be combined into an elliptics network cloud, where they will be 'separated' into multiple groups, with each group corresponding to some datacenter. To find out how this will be implemented, one can check the homepage description related to redundancy and load balancing.

Each server will host a lighttpd server, which will be configured to give away some static data. URLs for that data will be generated by the appropriate fastcgi daemons running on the same servers. Each fastcgi daemon will be connected to the cloud as an actual node, so it will be able to forward requests. Their main work will actually be to form a static URL. Given that there will be no updates in this setup at all (data objects will be loaded via external scripts into a location which elliptics network nodes can find), the fastcgi daemon will be rather trivial.

So it will get a data request, hash it into an ID, look up the corresponding server within the cloud and return a direct URL to that node along with a redirect status. The browser will receive a 301/302 status reply and will connect to the given node to actually fetch the data.
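The hash, lookup and redirect steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions: SHA-1 as the hash, the `NODES` table, the example.com host names and the "first byte of the ID picks the node" rule are all hypothetical. The real lookup is a request to the elliptics network, not a local table.

```python
import hashlib

# Hypothetical node table: each node serves IDs whose first byte is at or
# above its start value (and below the next node's start value).
NODES = [
    (0x00, "node-a.example.com"),
    (0x55, "node-b.example.com"),
    (0xAA, "node-c.example.com"),
]


def object_id(name):
    """Hash the requested object name into a hex ID (SHA-1 is an assumption)."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


def lookup_node(obj_id):
    """Local stand-in for the network LOOKUP: pick a host by the ID's first byte."""
    first_byte = int(obj_id[:2], 16)
    host = NODES[0][1]
    for start, candidate in NODES:
        if first_byte >= start:
            host = candidate
    return host


def redirect_for(name):
    """Return (status, location) that the daemon would send to the browser."""
    obj_id = object_id(name)
    return 302, "http://%s/%s" % (lookup_node(obj_id), obj_id)
```

The daemon itself never touches the file contents here; it only computes where the browser should go next.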

Currently there is no way to determine in a LOOKUP request whether an object is actually present on a given node, so the fastcgi daemon can issue multiple lookup commands and return a list of URLs where the object can be found; it will then be the client's job to try them one after another if no object was found or a host is inaccessible. Or I can use the read command and fetch 0 or 1 byte in the fastcgi daemon to really find where the object lives and determine whether it is inaccessible by one or another ID. Getting the whole data object in the fastcgi daemon is not a good idea, since files are rather big, the number of daemons is very limited, clients can be connected over slow links and so on. A better model for such a case could be a lighttpd plugin, but that is a different story.
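The client-side fallback over a list of candidate URLs could look something like this. It is a sketch only: `fetch_first_available` and the injected `fetch` callable are hypothetical names, and a real client would plug in its own HTTP code.

```python
def fetch_first_available(urls, fetch):
    """Try each candidate URL in order and return (url, data) for the first hit.

    `fetch` is any callable that returns the object's bytes, or raises
    OSError when the host is unreachable or the object is missing there.
    """
    for url in urls:
        try:
            return url, fetch(url)
        except OSError:
            continue  # no object on this node, or host down: try the next copy
    return None, None
```

Passing the transport in as a parameter keeps the retry logic separate from how the data is actually fetched.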

In any case, the fastcgi daemon's job in this setup is to find the server and form a direct URL to the object, which means that data can not be stored in a database, but should live in files only, which will be processed by the static web server.

That's the plan. So far I have written my first ever fastcgi application, which so far only writes query parameters into the log. It is rather trivial, but I also managed to configure lighttpd and understand how the hell it can process different queries, which actually took a little bit more time.

I believe most of the task is done. Almost :)
It's time to move home, omg, it's tomorrow in Moscow already...

November 09, 2009 09:01 PM

Matthew Garrett: Looking to the past

It’s an oft-voiced suggestion that rather than looking at the bad things that happen in our communities, we should focus on the good things. There’s a number of highly successful geek women already – should we not be concentrating on encouraging more of them, rather than scaring people away with tales of thoughtlessness, discrimination and outright abuse?

Let’s draw an analogy. One day, a $20 charge appears on your credit card. You didn’t make it. You report it to your credit card company, who assure you that they take fraud seriously and then do nothing. A few days later, another $20 charge. Your credit card company tells you that such events are rare, unrepresentative of the general credit card experience and continue to do nothing. A week afterwards, another charge. This time your credit card company describes how they’re planning on implementing a brand new anti-fraud system, but that this is unrelated to any events that may currently be occurring and will give no details as to when it’s going to be rolled out. And proceed to ignore any further reports you make about fraudulent transactions.

Would you stay with this company? Or would you take your business somewhere else?

The problem with the “Let’s look to the future rather than spending too much time getting stuck in the present” argument is that it assures people that things will get better without providing a roadmap for getting there. It does nothing to validate their concerns or make them feel wanted within a community. It assumes either that people will stick with a community that doesn’t respond to their complaints, or that it’s possible to construct a community that’s welcoming to an assortment of genders, ethnicities and lifestyles without any of those people being represented in the first place.

Ignoring people’s concerns is an excellent way to drive them away from your community. Doing so because of a potential future that’s probably conditional on you having those people in your community is short sighted and self defeating. Ignoring the present doesn’t benefit the future. It benefits the status quo.

(Originally posted here)

November 09, 2009 08:56 PM

Dave Jones: So I took a week off.



So I took a week off. is a post from: codemonkey.org.uk

Related posts:

  1. boot/init miniconf at plumbers next week. I’m MC’ing the boot/init miniconf next week at the Linux...

November 09, 2009 03:03 PM

Evgeniy Polyakov: Walking on Moscow quay

More under the cut.

November 09, 2009 10:57 AM

Pavel Machek: 5 years in jail for wrong bits of data

Is possible in the U.S.. Time to fix your stupid child porn laws? Guilty until proven innocent is stupid...

November 09, 2009 08:36 AM

Kernel Podcast: 2009/11/08 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091108.mp3

For the weekend of November 8th, 2009, I’m Jon Masters with a summary of the weekend’s LKML traffic.

In today’s issue: AMD Geode, Ftrace, IO bandwidth control, modules, nconfig, per-cpu mm counters, regressions, and sysctls.

AMD Geode. Geode is a (relatively) low power SoC-type x86-compatible CPU from AMD that many have heard of due to its use by the OLPC XO laptop project. Although the chip is now discontinued, there are a number of users, and Matteo Croce noted that although the kernel has always treated it as a 586-class CPU, the Geode is “technically an i686” processor. Given a few quirks (it lacks the “long NOP” or “NOPL” instruction, which can be emulated instead), it can be made to run as an i686 processor. Debate centered around whether it was “really worth it”, as Peter Anvin put it.

Ftrace. Michal Simek (who has been working on Ftrace support for Microblaze for some time) posted some example output and asked a number of questions of Steven Rostedt in relation to the implementation of the mcount function that is necessary for Ftrace to function correctly. Mcount is a function usually provided by GCC for applications that Steven intentionally replaces when compiling the kernel with profiling support such that he can hook into mcount and capture various information about functions as they are called.

IO Bandwidth Control. Vivek Goyal (who has an incredible amount of patience) posted an RFC in regard to the upcoming 2.6.33 merge window. Vivek is working on support for bandwidth limiting of IO and notes that recent CFQ changes actually add another layer of grouping of IO – this time within the CFQ scheduler – assigning IO based on the workload type (sync-idle, sync-noidle, and async IO). The question is whether to do bandwidth control at the outer level, or within each of these three workload type groups.

Modules. Rusty Russell noted that he has now applied Alan Jenkins’ whole series of patches to improve module loading speed through pre-sorting the symbol table and using a binary search on module load.

nconfig. Nir Tzachar posted version 5 of a “menuconfig” replacement, written using the most modern versions of the ncurses interface toolkit. The patch isn’t huge, and comes with documentation, so it might well be a candidate, if the kernel developers consider a replacement is necessary.

per-cpu mm counters. Kamezawa Hiroyuki followed up to Christoph Lameter’s previous posting of a new per-cpu array implementation for various counters currently living within the mm_struct. His concern was the overhead incurred in compiling summary statistics when userspace attempted to read the data, as is done by a variety of utilities, including both top and ps.

Regressions. Yanmin Zhang raised the issue of a 5% performance regression between 2.6.31 and the current 2.6.32-rc release. Much of this is attributed to recent scheduler changes, but not all. Mike Galbraith noted that there were some locking issues that were being fixed up that might have skewed benchmarks overly negatively, and that a fix was in the pipeline. Ingo Molnar wanted some more information about the precise setup Yanmin was using. Certainly, recent developments on Performance Events and “perf bench” will help.

Sysctls. Eric W. Biederman is currently working on various cleanups. Not content merely to have cleaned up VFS cache handling for sysfs, he decided at the same time to also take on binary sysctl support in the kernel. His 23 part patch series on that front will remove existing sysctl handling from all over the kernel tree and instead implement sys_sysctl as a wrapper over /proc/sys. Users shouldn’t notice, but kernel developers should take note.

Finally today, in replying to the AMD Geode debate over whether it was worth promoting Geode to be i686 (albeit with quirks), Alan Cox noted that checkpatch minor formatting warning output really is not intended to be useful until a patch has a serious likelihood of being accepted. i.e. while things are under development, it’s ok to take a chill pill and relax a little.

In today’s announcements: Linux 2.4.37.7. Willy Tarreau announced the latest release of the venerable 2.4 series kernel. Specifically, this latest release includes a number of potential NULL pointer dereference bug fixes that all users should consider as potential issues, even if they have set mmap_min_addr to disallow the kernel from mapping the NULL page to userspace.

2.6.31.5-rt17. Thomas Gleixner announced version 2.6.31.5-rt17 of the preempt-rt kernel patch. The latest release is forward ported to 2.6.31.5, has some scheduler improvements, security fixes, and some tracer enhancements also. It is available from the usual places, including http://www.kernel.org/pub/linux/kernel/projects/rt.

Luis R. Rodriguez announced the stable compat-wireless tree for 2.6.32-rc6. This is a wireless tree backported to older kernels and allows users to make use of newer wireless drivers on older systems.

The latest kernel release is 2.6.32-rc6.

Greg Kroah-Hartman posted stable series review patches for 2.6.27.39 and 2.6.31.6. The former had 99 patches, while the latter had 16. And as usual, responses were requested by Sunday afternoon.

Stephen Rothwell posted a linux-next tree for November 6th. Since Thursday, the PowerPC KVM fix went away at last, the sparc tree had a build failure for which he applied a patch, and the kvm and net trees had conflicts. The total sub-tree count remains steady at 146 trees in linux-next.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 09, 2009 05:32 AM

November 08, 2009

Valerie Aurora: Migrated to WordPress

My LiveJournal blog name - valhenson - was the last major holdover from my old name, Val Henson. I got a new Social Security card, passport, and driver's license with my new name several months ago, but migrating my blog? That's hard! Or something. I finally got around to moving to a brand-spanking-new blog at WordPress:

Valerie Aurora's blog

Update your RSS reader with the above if you still want to read my blog - I won't be republishing my posts to my new blog on this LiveJournal blog.

If you're aware of any other current instances of "Val Henson" or "Valerie Henson," let me know! I obviously can't change my name on historical documents, like research papers or interviews, but if it's vaguely real-time-ish, I'd like to update it.

One web page I'm going to keep as Val Henson for historical reasons is my Val Henson is a Man joke. Several of the pages on my web site were created after the fact as vehicles for amusing pictures or graphics I had lying around. In this case, my friend Dana Sibera created a pretty damn cool picture of me with a full beard and I had to do something with it.



It's doubly wild now that I have such short hair.

November 08, 2009 11:36 PM

November 07, 2009

Evgeniy Polyakov: Lytdybr and boring lifestyle

This week was sport-free, since my climbing partner was sick, so I devoted this time to hacking and mostly thinking.
Apart from programming, I think about music - I see the progress and especially like how I learn new parts of melodies, usually simple enough like "Strangers in the night" or "Samba De Janeiro". But the teacher always finds some scales or strokes which I can not play easily at the first attempt, so I need to spend some time on my morning exercises, and thus each lesson actually becomes a time which highlights only the weak sides of my trumpet playing. I do not like this, but given the dynamics of the progress I do not expect it to last long. The nearest goal is to extend the high range to the high C, and better if it will be 3 octaves long, since right now I can rather easily play only from small piano E up to second octave E, maybe a bit higher, but not always stably enough.
Will move

I have not yet started to take piano lessons, but I think about it more and more. Instead I try to train my ears by reading notes and playing solfeggio. Not very successfully yet though, and very slowly so far...

Having a car can play a dirty trick - suddenly you can find yourself far away from civilization, ATMs and without money in the pocket

I made a nice 3 hours long (fortunately car makes this fun quite simple, since it is rather cold here already) photo session of the night Moscow quay. It is so beautiful, that I doubt my photos will allow to get it, but I believe there are some interesting shots. I will post them when they are ready, that's what I have handy yet without editing.

Those are recent happenings, which were not yet erased by more boring stuff... Stay tuned and, who knows, maybe there will be something more.

November 07, 2009 11:16 PM

November 06, 2009

Kernel Podcast: 2009/11/05 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091105.mp3

For Thursday, November 5th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: CVE-2009-2584, Generic per-cpu counter arrays, MM locking, page types, performance events, and the scheduler.

CVE-2009-2584. A security issue was recently found in a procfs function contained within the sgi-gru driver. It involved unsafe use of strncpy_from_user. Various people posted fix suggestions for it, while Linus noted that most of the logic in the offending function (options_write) was “utter sh*t as far as I can tell”. He posted a couple of entirely untested patches (Linus style) for others to take a look at. Meanwhile, it was also noted that few people had the hardware, which helped to mitigate the issue.

Generic per-cpu counter arrays. Kamezawa Hiroyuki, noting that the patch had been “on my queue for a month”, posted an RFC patch intended to add support for generic percpu counter arrays. His patch uses the recent dynamic percpu support to create arrays of per-cpu data on the fly, using some macros such as DEFINE_COUNTER_ARRAY, and functions such as counter_array_init, and counter_array_add to manage entries being added to an existing array.

MM locking. Christoph Lameter posted an RFC MM patch implementing a variety of “accessors for mm locking”. Essentially, the idea is to abstract and wrap up use of mmap_sem such that it could eventually be ripped out and replaced without having to touch a lot of MM code once again. Christoph notes that the patch is “currently incomplete” but it does at least build.

Page Types. Fengguang Wu posted a followup to his previous patch enabling one to specify new page type information on the command line of the “page-types” utility (used to decode various VM data) with an example of how one could educate page-types about new types of page flags on the command line.

Performance Events. Hitoshi Mitake posted version 5 of a 7 part patch series implementing the “perf bench” command, and incorporating Rusty Russell’s original “hackbench” scheduler benchmark code.

Scheduler. Lai Jiangshan noted that a previous patch from Mike Galbraith didn’t seem to be mitigating the problems with the scheduler running tasks on the wrong CPU. In his case, the built-in kernel thread named “events” for CPU 1 was in fact shown (by using Ftrace) to be running on CPU0. Mike noted that the problem was likely to be in the migration code not holding the runqueue lock and thus not being safe against pre-emption and subsequent chaos.

In today’s announcements: AlacrityVM version 0.2. Gregory Haskins announced the 0.2 release of his AlacrityVM project. This is a modified KVM that uses a replacement virtualized IO bus for improved performance of, for example, network packet transfer between host and guest. The latest version includes some nice features, such as zero-copy transmits in the VENET driver. For further information, visit:
http://developer.novell.com/wiki/index.php/AlacrityVM.

The latest kernel release is 2.6.32-rc6.

Stephen Rothwell posted a linux-next tree for November 5th. Since Wednesday, the PowerPC KVM fix was still around, while the pcmcia, drbd, and catalin trees lost their issues, and the sparc tree gained a build failure for which Stephen applied a patch. The total sub-tree count remained at 146 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 06, 2009 01:30 PM

Evgeniy Polyakov: Elliptics network goes production

Kind of goes - there is a perfect task for this solution, which I can try to hook into this year. The New Year deadlines all deadlines, so there is about a month and a half for the task.

Task is quite simple actually - there is a huge library of files, which does not fit on a single storage machine. And although it is not that large, about 5-10 Tb of data for starters, the next step is to suck in close to 200 Tb of data. The task is to allow on-demand reading without updates of the existing files; only new ones will be added with time. I expect millions of reads per day.

Files should be spread over multiple machines for read balancing, and there should be multiple copies of each for redundancy. System should transparently handle failures (storage machines will be spread over multiple data centers). And the main request is to allow to fetch files over direct links, i.e. elliptics network provides data location and some usual HTTP server will give them away.

While I wrote this entry another cool task (re)appeared: clusterize some very popular monitoring system, which to date does not scale very well to existing amount of notification writers (about 200k small writes per second per small cluster). I need to provide fault-tolerant storage which will be able to suffer this load and allow simple horizontal scaling on demand.

Existing performance numbers show that elliptics network can easily handle all those tasks, but some obscure numbers created by the project author are usually not enough for those who deploy a new system. As in any other business, people are not eager to try something new. New, shiny and likely buggy...

Well, let's show what we can do. I will post results and setup systems here.

November 06, 2009 01:21 PM

Pete Zaitcev: The litl thing

Apparently, it uses S3. My plan to take over the world is proceeding as I have foreseen.

November 06, 2009 12:08 AM

November 05, 2009

Kernel Podcast: 2009/11/04 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091104.mp3

For Wednesday, November 4th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Cgroups, FatELF, PerCPU MM counters, and Swap.

Cgroups. Balbir Singh posted to let everyone know that discussion is happening concerning the most appropriate place to mount the cgroup filesystem. Since the Linux Filesystem Hierarchy Standard (FHS) was written prior to the existence of cgroups, it has no specific advice, which leads to three alternatives. These are /dev/cgroup, /cgroup, or some place under /sys. Balbir prefers the first option, but that will require some co-operation with udev. He asks for advice from others as to the best place for this to live. Several people seem to be quite happy with /sys/kernel/cgroup (which is not the only filesystem that gets mounted there).

FatELF. Continuing the discussion on the relative merits of “FAT” image files containing multiple ELF objects, Mikulas Patocka made some interesting comments on Linux package managers, describing them as “evil”. In his opinion, FatELF might provide a means to ship single image files containing all of the files an application needs to execute in one object, similar to how Apple and other operating systems already do today. Mikulas is concerned about the relative difficulty Linux users face in installing software not provided by their distribution using package management software. He makes a good point, although FatELF may not be the solution to that particular problem.

PerCPU MM counters. Christoph Lameter, noting that support for generic per-cpu operations is now in the “percpu” and linux-next trees, posted a patch implementing per-cpu mm counters for tasks rather than single entries in mm_struct. This obviates the need for larger SMP systems to perform atomic updates to mm counters and (intuitively) implies a performance improvement. The only downside is occasionally having to iterate over each of these per-cpu values when the actual count values are being requested.
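The trade-off described here (cheap updates, more expensive reads) can be illustrated with a small model. This is a Python sketch of the general per-cpu counter idea, not the kernel patch: the class `PerCpuCounter` and its methods are hypothetical names.

```python
class PerCpuCounter:
    """Toy per-cpu counter: one slot per CPU, summed only when read."""

    def __init__(self, nr_cpus):
        self.slots = [0] * nr_cpus

    def add(self, cpu, delta):
        # Fast path: each CPU touches only its own slot, so no shared
        # atomic update is needed in the real thing.
        self.slots[cpu] += delta

    def read(self):
        # Slow path: the reader has to iterate over every CPU's slot
        # to produce the actual total.
        return sum(self.slots)


counter = PerCpuCounter(nr_cpus=4)
counter.add(0, 3)
counter.add(2, 5)
counter.add(0, -1)
total = counter.read()
```

The cost of `read` grows with the number of CPUs, which is the overhead raised for tools that read these statistics frequently.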

Swap. Following on from the recent discussion about OOM killer behavior and the various metrics that might be used in the future, Kamezawa Hiroyuki posted a patch that exports per-process (task) swap usage statistics via procfs. This happens through the addition of a new “VmSwap” entry in /proc/pid/status.
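Reading such an entry from userspace could look roughly like this. It is a sketch that assumes the new field follows the same "Name: value kB" layout as the existing Vm* lines in /proc/pid/status; the `SAMPLE` text is made up for illustration and `parse_vm_swap_kb` is a hypothetical helper.

```python
def parse_vm_swap_kb(status_text):
    """Return the VmSwap value in kB from status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            # Assumed layout: "VmSwap:  <number> kB"
            return int(line.split()[1])
    return None


# Made-up sample in the style of /proc/<pid>/status.
SAMPLE = "Name:\tbash\nVmRSS:\t    1234 kB\nVmSwap:\t      56 kB\n"
swap_kb = parse_vm_swap_kb(SAMPLE)
```

On a kernel that exports the field, the same function could be fed the contents of a real status file; on one that does not, it returns None.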

The latest kernel release is 2.6.32-rc6.

Stephen Rothwell posted a linux-next tree for November 4th. There had been no tree the previous day due to a national holiday in Australia, where he is based (and one trusts the horse race went well, too). Since Monday, there was a new “msm” tree (which is an ARM platform), the PowerPC KVM fix was still required, and a couple of other conflicts went away. The total sub-tree count increased today to 146 trees with the addition of the “msm” tree.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 05, 2009 12:16 PM

Kernel Podcast: 2009/11/03 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091103.mp3

For Tuesday, November 3rd, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Block IO controller, FatELF, Ftrace, Performance, and Sysctls.

Block IO controller. The ever patient Vivek Goyal, fresh from the IO minisummit in Tokyo, posted the first version of a new IO bandwidth control patchset entitled “Block IO Controller”. This RFC patch series aims to address the problem of there being no “one size fits all” IO control policy, and the need for different policies to be implemented for different uses. The patch introduces what Vivek calls the blkio cgroup controller, through which a management interface is provided that can be used to switch policies.

FatELF. Eric Windisch posted some example use cases for FatELF that he felt others should know about, in an attempt to counter some of the points made by Alan Cox previously. In particular, it would seem that Eric is into Cloud Computing in a big way and looks forward to having virtual machine images that can simultaneously run on a variety of different hardware. Although there is certainly some benefit provided by FatELF, it wasn’t clear how these problems couldn’t be solved as Alan had suggested – with different directories containing versions of the same binaries for the different arches.

Ftrace. Michal Simek posted to let everyone know that he is currently working on Ftrace support for the Microblaze CPU architecture (an FPGA-based soft core from the folks at Xilinx). In particular, he is looking at function trace support at the moment and how the mcount function is used to record entry into each individual function. He has a number of questions, and Steven Rostedt (the Ftrace author) was happy to help answer a number of them.

Performance. Alex Shi posted with an observation that performance testing had yielded results with a 20-30% drop off in the 2.6.32-rc5 timeframe. This seemed to be due to a cfq-iosched patch from Jens Axboe. Alex attached an example run of perf stat both with and without the patch, showing a clear difference between the two sets of data.

Sysctl. Eric Dumazet recently observed that sysctl table entries were quite expensive, due to a sentinel value added after each one in order to detect and avoid corruption of table entries. Eric noted that the sentinel need actually only contain a couple of pieces of data, and so he created a special sentinel entry struct called ctl_table_sentinel that was smaller in size. This would apparently reduce RAM utilization of such entries by 40%.

In today’s announcements: Userspace RCU. Mathieu Desnoyers posted to let everyone know that version 0.3.0 of his Userspace RCU patches is now available. This is an RCU implementation using the POSIX pthread functions that applications can use to take advantage of the same features as the kernel has done for some time. The latest version removes a function (call_rcu) for which he had provided differing arguments and semantics than the kernel.

The latest kernel release is 2.6.32-rc6. Linus Torvalds announced version 2.6.32-rc6 of the Linux kernel at 12:05pm US Best Coast Time (PDT). In his announcement, Linus noted that there had been a longer gap since rc5, due in large part to the number of kernel developers who have been away at the kernel summit in Japan or traveling to and fro. There was also an ext4 filesystem corruption problem that required additional time, and that had turned out to be due to enabling checksum testing of journal transactions during recovery. Linus thanked Eric Sandeen for tracking down that particular problem. He also seemed pleased at the number of regressions addressed since 2.6.31.

Stephen Rothwell announced that there would be no linux-next tree for November 3rd due to a public holiday in Australia where he is based, which apparently also has “nothing to do with a horse race in Melbourne”.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

November 05, 2009 11:49 AM

Kernel Podcast: 2009/11/02 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091102.mp3

For Monday, November 2nd, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: BKL, FatELF, Fast symbol resolution, OOM, and Performance benchmarks.

BKL. There is an ongoing effort to remove the BKL (Big Kernel Lock), which is the last stayover from early Linux support for SMP. Discussion of BKL removal was revived during the recent Real Time pre-emption mini-summit, and Jan Blunk is amongst those who have been looking at this from the filesystem level. He posted a series of patches intended to push BKL use down into individual filesystems from the generic kernel code (for example do_new_mount()) that it lives in today. He requests comments.

FatELF. There was some ongoing (and quite considerable) push back against the notion of supporting FatELF binaries. Chris Adams wondered aloud just what the target audience really was? As he sees it, embedded users don’t want the bloat, Enterprise distributions already have specific support processes in place for different architectures, and community distributions aren’t likely to want to deal with the increased build complexity and space requirements. Meanwhile, Alan Cox congratulated Ryan C. Gordon on re-inventing the concept of a directory – since directories already allow one to have multiple versions of a binary installed on a given system and to pick and choose between them. Sure that’s not as shiny as an Applesque approach, but it has worked for many decades at this point, and most of the distributions implement multi-arch (sometimes called multi-lib) using some kind of similar approach.

Fast symbol resolution. Alan Jenkins posted the latest version of his fast LKM symbol resolution patches. These take advantage of a binary search for symbol resolution at module load time, using a pre-generated (at build time) sorted table of exported kernel symbols. Using this approach, Alan has once again succeeded in reducing overall system boot time slightly on his netbook. The latest version of the patches has seen some limited testing on ARM and has also been built for Blackfin, so it’s not just x86 at this point.
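The general technique (sort once ahead of time, then binary search at lookup time) can be sketched as follows. This is a Python illustration of the idea rather than the patch itself; the symbol names and addresses below are made up, and `build_symbol_table` and `resolve` are hypothetical helpers.

```python
import bisect


def build_symbol_table(symbols):
    """Sort the exported symbols once (the 'build time' step)."""
    items = sorted(symbols.items())
    names = [name for name, _ in items]
    addrs = [addr for _, addr in items]
    return names, addrs


def resolve(names, addrs, name):
    """Binary search the sorted names; return the address or None."""
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return addrs[i]
    return None


# Made-up symbols and addresses, for illustration only.
names, addrs = build_symbol_table({
    "sym_write": 0x3000,
    "sym_alloc": 0x1000,
    "sym_free": 0x2000,
})
```

Each lookup then costs O(log n) comparisons instead of a linear scan over every exported symbol.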

OOM. Kamezawa Hiroyuki posted to let everyone know that he was putting code where his mouth was with a “total renewal” of the OOM killer code. This isn’t complete at this stage, but it is intended to keep the conversation moving. The first patch lays groundwork (including new OOM type classifications), while the second and subsequent patches add the ability to count swap use per process and implement a newly updated badness calculation that uses rss+swap as the base value but also factors in cpusets, and gives tasks a bonus for how far in the past their last allocation occurred, and their runtime.

Performance benchmarks. Hitoshi Mitake posted to let everyone know that he has been working on integrating a benchmark subsystem into the existing – and already fairly extensive – “perf” (or performance events) utility. He asked Rusty Russell for permission to pull Rusty’s hackbench code directly into the kernel tree as part of this effort, which can be used by calling “perf bench sched” with whatever parameters one might wish to specify.

Finally today, Tilman Schmidt requests that we draw attention to the Kernel Cleanup wiki that Robert P J Day has been working on. The page at www.crashcourse.ca/wiki/index.php/Kernel_cleanup includes information about unused Kconfig variables, badly referenced ones, and general problems with kernel code that need further investigation in general.

In today’s announcements: LTP. Subrata Modak posted announcing that the Linux Test Project for October 2009 has been released. The latest version includes fixes, 119 test scenarios for EXT4 testing, new GETUID16/GETUID64/GETEUID16 and PTRACE system call tests, and much more. As usual, it is available at http://ltp.sourceforge.net/.

Sysprof. Soeren Sandmann announced version 1.1.4 of the sysprof CPU profiler. This is the latest version to be based upon the rewrite to make use of the new performance counters interface for exposing the low-level hardware counters. Since the previous 1.1.2 release, there have been a number of fixes. A download is available at http://www.daimi.au.dk/~sandmann/sysprof/.

The latest kernel release was 2.6.32-rc5.

Stephen Rothwell posted a linux-next tree for November 2nd. Since Friday, his fixes tree still has that PowerPC KVM fix, while there were a number of arch issues affecting ARM and OMAP in particular. The sub-tree count remains steady today at 145 trees in linux-next.

That’s a summary of today’s Linux Kernel Mailing List traffic; for further information visit www.kernel.org. I’m Jon Masters.

November 05, 2009 03:57 AM

Kernel Podcast: 2009/11/01 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091101.mp3

If at first you don’t succeed. Welcome to version 2.0 of the LKML summary podcast. In this revamped version I will concentrate on the major issues under discussion on a given day, rather than commenting on every single patch, which had become an unsustainable load. I am still interested to hear from volunteers who might help to make the podcast workload less challenging on a daily basis.

For the weekend of November 1st 2009, I’m Jon Masters with a summary of the weekend’s LKML traffic.

In today’s issue: Fanotify, FatELF, Futexes, KVM, Memory Overcommit, Regressions, and Thread Naming.

Fanotify. Eric Paris posted a patch series implementing a new file mode entitled FMODE_NONOTIFY, which can only be set by the kernel itself. Its job is to indicate that an fd was opened by fanotify itself and should not cause future fanotify events. This allows one to obviate such livelock scenarios as would otherwise occur from fanotify close events resulting in repeated opens on a file that would then be closed and cause another event to be emitted.

FatELF. Ryan C. Gordon posted what he hoped would be his final round of FatELF patches. These extend the Linux kernel’s ELF binary format loader code to accept “FAT” images containing multiple ELF binaries, allowing for such features as multi-arch code encapsulated within a single binary. In some respects, the feature behaves similarly to Apple’s Universal Binary format, which, it was noted, is covered by several patents. More information on FatELF can be found at http://icculus.org/fatelf/.

Futexes. Darren Hart, known for his involvement in the RT kernel community, recently posted an RFC patch series intended to make futex_lock_pi into a fully interruptible syscall. This would allow for canceling of locking requests, while preserving FIFO ordered wakeup and Priority Inheritance requirements, and without having to try to emulate this behavior in userspace. He included a test case demonstrator, which used an RT signal handler to abort the futex locking attempt. Arnd Bergmann responded that it should be possible to simply longjmp out of the test application signal handler and avoid modifying the kernel, something that Darren confirmed did work, but he was apprehensive as to whether there might be unintended issues in doing this.

KVM. Gleb Natapov posted a patch series implementing asynchronous page faults for paravirtualized KVM guests. Typically, a guest encountering a page fault becomes blocked until the faulting page is made available by KVM and the guest can be resumed. But paravirtualized guests are aware of the hypervisor and can interact with it, in this case by blocking only the faulting task within the guest and not the entire guest VM. The faulting page can then be swapped in while the guest is still running, using the assistance of a parallel thread within the hypervisor.

Memory Overcommit. Here comes the annual OOM killer discussion. Back in the middle of October, Vedran Furac sent a message entitled “Memory overcommit”, in which he pointed out that even today a trivial C program run by an ordinary user that attempts to perform large memory allocations can trigger the OOM killer and really take down a system (by killing many essential system services other than the guilty task) once overcommit_memory is disabled. In the example, Vedran cited how 8 processes were killed, including the X server and some long-running system daemons. He felt that the OOM killer really only served to give Linux a bad reputation amongst some users and that it was better to simply disable it by default – enforcing strict allocation only of the available free pages. Others disagreed, although Vedran had a point in saying the OOM killer might as well be renamed to TRIPK – Totally Random Innocent Process Killer.

Kamezawa Hiroyuki had made several mitigation suggestions against overcommit problems, including the use of oom_adj and explicit cgroups. But Vedran was more concerned with how the OOM killer algorithm seemed to be making the wrong choices in the first place as to which tasks should die. This is an issue that comes up every once in a while. Vedran and Kamezawa had previously taken the discussion off-list (to the mm list instead) but it now returned to LKML, Kamezawa having written a script to analyze the oom_score of existing processes on his own system and discovering (for example) that his GNOME desktop processes were being considered worse by the OOM killer than the sample “allocate 1GB of memory” task that had taken down Vedran’s box.
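Kamezawa's script is not reproduced in the discussion summarized here, so the following is only a minimal sketch of the same idea: walk a procfs tree, read each process's oom_score, and list the most "killable" tasks first. The function name read_oom_scores and the proc_root parameter are illustrative assumptions, and reading the per-process comm file assumes a reasonably current kernel.

```python
import os


def read_oom_scores(proc_root="/proc"):
    """Return a list of (oom_score, pid, comm) tuples, highest score first."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                comm = f.read().strip()
        except (OSError, ValueError):
            continue  # the process exited, or the files are unreadable
        results.append((score, int(entry), comm))
    return sorted(results, reverse=True)
```

On a Linux machine, printing the first ten entries of read_oom_scores() gives a quick view of which processes the kernel currently considers the best OOM victims.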

Kosaki Motohiro suggested that the problem was the number of libraries the average desktop application is linked against, and also suggested that the OOM killer should not account for evictable file-backed mappings (such as libraries) in calculating the oom_score. This led to a discussion as to the best metric to consider in making OOM kill decisions. It was deemed necessary to consider the VM size in order to catch swapped-out fork bomb attacks, but Kosaki noted that basing oom_score on RSS + swap-entries figures would be acceptable to him as an alternative. This led on to a lengthy discussion thread (and a number of patch iterations – including a nice analysis from Hugh Dickins), concerning the best ways to overhaul the OOM killer for modern systems and what exactly the criteria should be. Should it be that the biggest resident memory eater is always killed (which is hard to predict)? Or should the total VM size (including resident and non-resident pages) factor into the decision?

Regressions. Caleb Cushing posted to let everyone know that his network performance has dropped off considerably since moving to 2.6.31.x. But the problem seems elusive, having bitten in 2.6.30.x previously, then seeming to vanish before apparently re-appearing in 2.6.31.x. Having never performed a bisection before, Caleb wasn’t entirely sure of the process, but did post the log from a bisection hoping that others might chime in with some input.

Thread naming. John Stultz posted another iteration of a patch he has been working on that allows threads to rename their siblings by writing into /proc/pid/task/tid/comm. This will allow thread managers to nicely set the task name of their children, for logging as well as for appearance.
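As a rough illustration of how such an interface is used: on current kernels the per-thread name file lives at /proc/<pid>/task/<tid>/comm, and writing a short string to it renames the thread. The helper names below (comm_path, set_thread_name, get_thread_name) and the proc_root parameter are assumptions made for this sketch, not part of the patch.

```python
import os


def comm_path(pid, tid, proc_root="/proc"):
    """Build the path of the per-thread name file."""
    return os.path.join(proc_root, str(pid), "task", str(tid), "comm")


def set_thread_name(pid, tid, name, proc_root="/proc"):
    """Write a new name for thread `tid` of process `pid`.

    The kernel keeps at most 15 bytes of a task name (TASK_COMM_LEN is 16,
    including the trailing NUL), so longer names are cut short here.
    """
    with open(comm_path(pid, tid, proc_root), "wb") as f:
        f.write(name.encode()[:15])


def get_thread_name(pid, tid, proc_root="/proc"):
    """Read back the current name of thread `tid` of process `pid`."""
    with open(comm_path(pid, tid, proc_root), "rb") as f:
        return f.read().decode(errors="replace").rstrip("\n")
```

A thread manager on Linux might call set_thread_name(os.getpid(), tid, "logger") for each worker it starts, where tid is the worker's kernel thread ID (for example from threading.get_native_id() in Python 3.8 or later).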

In today’s announcements: The kerneloops.org report for the week of October 31 2009. Arjan van de Ven posted this week’s summary of recorded kernel oops logs from his kerneloops.org online service. A total of 18,023 oopses and warnings were logged over the past week, more than a 200% increase over the previous week, though this week’s report coincides with the latest Ubuntu release (which includes the ability to file such reports for the first time). The top warnings were in suspend_test_finish, acpi_idle_enter_bm and dev_watchdog.

The latest kernel release was 2.6.32-rc5.

Andrew Morton posted an mm-of-the-moment for 2009-11-01-10-01. It contains a fair number of patches against the 2.6.32-rc5 kernel.

Stephen Rothwell posted a linux-next tree for Friday. Since Thursday, he had a PowerPC KVM fix, some architectural fixes, and network and percpu conflicts that needed to be resolved. There are currently 145 sub-trees in linux-next.

That’s a summary of today’s Linux Kernel Mailing List traffic; for further information visit www.kernel.org. I’m Jon Masters.

November 05, 2009 03:13 AM

November 04, 2009

Harald Welte: German news site Spiegel Online has video of my torched car

Some 9 months after some idiots set my car on fire, the German news site Spiegel Online reports on a court trial unrelated to my car, but showing a video of my car.

Quite funny how they always dig out that footage. The court case was about an alleged failed attempt to torch a car, so showing two completely burnt cars in that article is not really sensible anyway.

As you can see from the article, there are already more than 250 burnt vehicles this year in Berlin.

November 04, 2009 01:00 AM

Harald Welte: Android Mythbusters (Matt Porter)

Some weeks ago I was attending Embedded Linux Conference Europe. My personal highlight at this event was the excellent Android Mythbusters presentation given by Matt Porter.

As you may know, Matt Porter was heavily involved in the MIPS and PPC ports of Android, so he and his team have seen the lowest levels of Android, more and deeper than even cellphone manufacturers ever have to look into it.

The slides of his presentation are now available for download. I would personally recommend this as mandatory reading material for everyone who has some interest in Android.

The presentation explains in detail why Android is not what most people refer to when they say Linux. What most people mean when they say Linux is the GNU/Linux system with its standard userspace tools, not only the kernel.

The presentation shows how Google has simply thrown 5-10 years of Linux userspace evolution into the trashcan and re-implemented it partially for no reason. Things like hard-coded device lists/permissions in object code rather than config files, the lack of support for hot-plugging devices (udev), the lack of kernel headers. A libc that throws away System V IPC that every unix/Linux software developer takes for granted. The lack of complete POSIX threads. I could continue this list, but hey, you should read those slides. now!

Just one more practical example: You cannot even plug a USB drive into an Android system, since /dev/sd* is not an expected device name in their hardcoded hotplug management.

Executive summary: Android is a screwed, hard-coded, non-portable abomination.

I can't wait until somebody rips it apart and replaces the system layer with a standard GNU/Linux distribution with Dalvik and some Android API simulation layer on top. To me, that seems the only way to thoroughly fix the problem...

November 04, 2009 01:00 AM

November 03, 2009

Valerie Aurora: ZFS gets deduplication - the right way

ZFS now has data deduplication - with the right configuration options for safety and performance in a compare-by-hash based storage system. From Jeff Bonwick's ZFS deduplication blog entry:

Given the ability to detect hash collisions as described above, it is possible to use much weaker (but faster) hash functions in combination with the 'verify' option to provide faster dedup. ZFS offers this option for the fletcher4 checksum, which is quite fast:

zfs set dedup=fletcher4,verify tank

The tradeoff is that unlike SHA256, fletcher4 is not a pseudo-random hash function, and therefore cannot be trusted not to collide. It is therefore only suitable for dedup when combined with the 'verify' option, which detects and resolves hash collisions. On systems with a very high data ingest rate of largely duplicate data, this may provide better overall performance than a secure hash without collision verification.

What I like is (1) the user chooses the hash function based on their security and performance needs, (2) the system can optionally check for hash collisions, and (3) the ZFS storage pool design makes it easy to migrate data to a new hash function if necessary. ZFS is the first deduplicating storage system I know of with these features. (Do let me know if there are others out there!)
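To make the checksum-plus-verify idea concrete, here is a toy in-memory sketch of a block store that deduplicates by checksum and, when verify is enabled, compares the bytes before sharing a block. It is not how ZFS implements dedup: the DedupStore class is hypothetical, and zlib.adler32 merely stands in for a fast, weak checksum such as fletcher4.

```python
import zlib


class DedupStore:
    """Toy block store that deduplicates by checksum, optionally verifying."""

    def __init__(self, checksum=zlib.adler32, verify=True):
        self.checksum = checksum  # weak, fast checksum (stand-in for fletcher4)
        self.verify = verify
        self.blocks = {}          # checksum -> list of distinct blocks
        self.stored = 0           # number of blocks physically stored

    def write(self, block):
        """Store `block` and return a (checksum, index) reference to it."""
        key = self.checksum(block)
        bucket = self.blocks.setdefault(key, [])
        if bucket and not self.verify:
            # Trust the checksum alone: a collision silently aliases
            # different data to the same stored block.
            return (key, 0)
        for i, existing in enumerate(bucket):
            if existing == block:  # byte-for-byte verification
                return (key, i)
        bucket.append(block)       # new data, or a detected collision
        self.stored += 1
        return (key, len(bucket) - 1)

    def read(self, ref):
        """Return the block that `ref` points to."""
        key, index = ref
        return self.blocks[key][index]
```

With verify enabled, two different blocks that happen to share a checksum are both kept; with it disabled, the second write would read back as the first block's data, which is why a weak checksum is only safe in combination with verification.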

November 03, 2009 01:23 AM

November 02, 2009

Evgeniy Polyakov: Data de-duplication in ZFS and elliptics network (POHMELFS)

Jon Smirl sent me a link describing a new ZFS feature - data deduplication.

This is a technique which allows multiple data objects to be stored in the same place when their content is the same, thus effectively saving space. There are three levels of data deduplication - files (objects actually), blocks and bytes. Every level allows a single entity to be stored for multiple identical objects, like a single block for several equal data blocks or byte ranges and so on. ZFS supports block deduplication.

This feature has effectively existed from the beginning in the elliptics network distributed hash table storage, but it has two levels of data deduplication: object and transaction. Well, actually we have transaction only, but the maximum transaction size can be limited to some large enough block (like megabytes or more, or can be unlimited if needed), so if an object is smaller than that, it will be deduplicated automatically.

Which basically means that if multiple users write the same content into the storage and use the same ID, no new storage space will be used; instead the transaction log for the selected object will be updated to show that two external objects refer to the given transaction.

Depending on the transaction size it may have a negative impact; in particular, when the transaction size is smaller than a log entry, it will actually be a waste of space, but transactions are required for the log-structured filesystem and to implement things like snapshots and update history. By default the log entry size is 56 bytes, so it should not be a problem in the common case.

POHMELFS, as an elliptics network frontend, will support this feature out of the box without any additional steps.

November 02, 2009 03:34 PM

Pavel Machek: Debugging MMC is easier...

...if you have an MMC card inserted. Oops. I added enough registration infrastructure and GPIO support on the Dream that the mmc controller is detected, but it is still not enough to get the card recognized.

November 02, 2009 01:23 PM

Pavel Machek: Dream booting

With Brian's help, I got a recent kernel to boot on the HTC Dream. Patches will follow.

November 02, 2009 10:38 AM

November 01, 2009

Paul E. McKenney: Hunting Heisenbugs

My children's opinions notwithstanding, I recently found myself pursuing some nasty concurrency bugs in Linux's TREE_RCU implementation. This was not particularly surprising, given that I recently added preemption support, and the code really hadn't put up that much of a fight. In fact, I was getting the feeling that the bugs had gotten together and decided to hide out, the better to ambush me. This feeling wasn't far from wrong.

My first hint of trouble appeared when I began running longer sequences of rcutorture runs, seeing an occasional failure on one-hour runs. My first reaction was to increase the duration to ten hours and attempt to bisect the problem. Of course, even with bisection, finding the bug takes quite some time given ten hours for each probe, so rather than use “git bisect”, I manually ran parallel runs and (for example) quadrisected. I also ran multiple configurations. The results initially hinted that CONFIG_NO_HZ might have something to do with it, but later runs showed no shortage of failures in !CONFIG_NO_HZ runs as well.
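The trade-off behind running several probes in parallel can be sketched with a little arithmetic. The helper below (ksection_rounds, a name invented for this illustration) counts how many rounds of testing are needed to isolate one bad commit when each round tests several commits at once. It assumes every probe gives a reliable pass/fail answer, which the story here shows is optimistic for a heisenbug.

```python
def ksection_rounds(num_commits, parallel_probes):
    """Rounds needed to isolate one bad commit among `num_commits` candidates.

    Each round tests `parallel_probes` commits at once, which splits the
    remaining range into `parallel_probes + 1` roughly equal segments.
    """
    rounds = 0
    remaining = num_commits
    segments = parallel_probes + 1
    while remaining > 1:
        remaining = -(-remaining // segments)  # ceiling division
        rounds += 1
    return rounds
```

For a hypothetical range of 1000 commits, plain bisection (one probe per round) needs 10 rounds, while quadrisection (three probes per round) needs 5. At ten hours per round that halves the wall-clock time, at the cost of more total machine hours.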

The initial results of the bisection were quite puzzling, converging on a commit that could not possibly change RCU's behavior. Then I noticed that one of the machines seemed to be generating more failures than others, and, sure enough, this difference in failure rate was responsible for the false convergence. I therefore started keeping more careful records, including the machine name, test duration, configuration parameters, commit number, and number of errors for each run. These records proved extremely helpful later on.

Further testing showed that 2.6.32-rc1 (AKA 2.6.32-rc2) was reliable, even for the error-prone machine, and that 2.6.32-rc3 was buggy. Unfortunately, there are no RCU-related commits between 2.6.32-rc1 and 2.6.32-rc3. Unless you count commit #828c0950, which simply applies const to a few data structures involved in RCU tracing, which I don't and you shouldn't. So I ran a few more runs on 2.6.32-rc1, and eventually did trigger a failure. In contrast, 2.6.31 was rock solid.

Now there are quite a few RCU-related patches between 2.6.31 and 2.6.32-rc1, so I started searching for the offending commit. However, by this time I had written scripts to analyze rcutorture output, which I used to check the status of the test runs, stopping runs as soon as they generated an error. This sped things up considerably, because failed runs now took on average only a few hours rather than the 10 hours I was using as a (rough) success criterion.

Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?

Testing eventually converged on commit #b8d57a76. By this time, I was getting a bit paranoid, so I ran no fewer than three ten-hour runs at the preceding commit on the most error-prone machine, none of which failed. However, this commit does nothing to RCU itself; rather, it makes rcutorture testing more severe, inserting delays of up to 50 milliseconds in RCU read-side critical sections. I therefore cherry-picked this commit back onto 2.6.31 and 2.6.30, and, sure enough, got failures in both cases. As it turned out, I was dealing with a day-one bug in TREE_RCU.

This did simplify matters, permitting me to focus my testing efforts on the most recent version of RCU rather than spreading them across every change since 2.6.31. In addition, the fact that long-running RCU read-side critical sections triggered the bug told me roughly where the bug had to be: force_quiescent_state() or one of the functions it calls. This function runs more often in the face of long-running RCU read-side critical sections. This also explained the earlier CONFIG_NO_HZ results, because one of the force_quiescent_state() function's responsibilities is detecting dyntick-idle CPUs. Finally, it raised the possibility that the bug was unrelated to memory ordering, which motivated me to try a few runs on x86, which, to my surprise, resulted in much higher failure rates than did the earlier tests on the Power machines.

I stubbed out force_quiescent_state() to check my assumption that it was to blame (but please, please do not do this on production systems!!!). Stubbing out force_quiescent_state() resulted in a statistically significant 3x decrease in failures on the x86 machine, confirming my assumption, at least for some subset of the bugs. Now that there was a much smaller section of code to inspect, I was able to locate one race involving mishandling of the ->completed values. This reduced the error rate on the x86 machine by roughly the same amount as did stubbing out force_quiescent_state(). One bug down, but more bugs still hiding.

I was also now in a position to take some good advice from Ingo Molnar: when you see a failure, work to increase the failure rate. This might seem counter-intuitive, but the more frequent the failures, the shorter the test runs, and the faster you can find the bug. I therefore changed the value of RCU_JIFFIES_TILL_FORCE_QS from three to one, which increased the failure rate by well over an order of magnitude on the x86 machine.
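One way to see why a higher failure rate pays off is to ask how long a failure-free run must last before it is believable that a bug is gone. The sketch below assumes failures arrive as a Poisson process at a constant rate, which is a modelling assumption and not something measured here; the function name and the example rates are likewise hypothetical.

```python
import math


def clean_run_hours(failures_per_hour, confidence=0.95):
    """Hours a failure-free run must last to give `confidence` that a bug
    with the given failure rate is absent, assuming Poisson failures."""
    # P(no failure in T hours) = exp(-rate * T); solve for T such that
    # this probability drops to 1 - confidence.
    return -math.log(1.0 - confidence) / failures_per_hour
```

Under this model, a bug that fails once per ten hours on average needs a clean run of about 30 hours for 95% confidence, while raising the failure rate tenfold cuts that to about 3 hours.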

Quick Quiz 2: How could increasing the frequency of force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?

Given that the race I found involved unsynchronized access to the ->completed values, it made sense to look at other unsynchronized accesses. I found three other such issues, and testing of the resulting patches has thus far turned up zero rcutorture failures.

And it only took 882.5 hours of machine time to track down these bugs. :–)

This raises the question of why these bugs happened in the first place. After all, I do try to be quite careful with RCU-infrastructure code. In this case, it appears that these bugs were inserted during a bug-fixing session fairly late in the TREE_RCU effort. Bug-fixing is often done under considerably more time pressure than is production of new code, and the mistake in this case was failing to follow up with more careful analysis.

Another question is the number of bugs remaining. This is of course hard to say at present, but Murphy would assert that, no matter what you do, there will always be at least a few more bugs.

Answer to Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?

Answer to Quick Quiz 2: How could increasing the frequency of force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?

November 01, 2009 11:35 PM

Evgeniy Polyakov: Comparing Key/Value Stores

From pl.atyp.us.

The following storage systems were checked:
* tabled (git clone on 10/27) using boto
* Cassandra 0.4.1 using thrift
* Riak (hg clone on 10/27) using jiak
* Voldemort 0.56
* Tokyo Tyrant 1.1.37 (Cabinet 1.4.36) using pytyrant
* chunkd (git clone on 10/27) using own chunkd.py based on Python’s ctypes module
* Keyspace 1.2 using the built-in Python interface

Results can be found in a spreadsheet, but for the lazy ones I want to note that Tokyo Tyrant was far ahead of every other competitor (on the order of 4-20 times). But since it is single-server storage, it would not be fair to compare it against the others, which can scale.

Actually I need to say 'could scale', since I did not find any numbers even remotely resembling fair scaling; most of the applications behave worse when running on a 2-3 node cluster.

One can compare them against the elliptics network numbers, but given that those are my own results, one can assume it is an unfair comparison. I'm pretty sure the authors of all the above storage systems had their 'nice' results too.

November 01, 2009 03:34 PM

October 31, 2009

Harald Welte: Enabling jabber in WebOS on the Palm Pre using a binary patch

One of my main complaints about the Palm Pre is that there is no support for the major IM protocols such as jabber, icq, aim, msn, ...

As I discovered, they're actually using a library (libpurple) that supports all those protocols. It's just the UI and the intermediate LibpurpleAdapter program which artificially restrict the features that this library offers.

So it sounds to me like Palm is getting money or other favors from Google to artificially restrict the capabilities of the WebOS messenger.

As I have described in this mail to the webos-internals mailing list, you can actually use a very simple one-byte binary patch to LibpurpleAdapter to enable jabber support.

After that binary patch, you can add jabber contacts with the regular user@jabber-server.doma.in address and use the regular messenger application for keeping in touch with your jabber contacts. Just like how it is supposed to be.

Legal notice: Making this binary patch is legal, since LibpurpleAdapter is actually released under LGPL. If you have a working build environment for the Pre with all the libpurple headers, you can of course modify the source code and recompile it (as explained in the mail).

Side note: The libpurple-adapter source code that Palm has published on opensource.palm.com does not correspond to the actual binary code. This is an LGPL violation. However, since Palm is the copyright holder, nobody can really do anything about it. But it once again shows that the software build/release process does not automatically generate the source code packages and that there is an erroneous manual process involved :(

October 31, 2009 01:00 AM

October 30, 2009

Pete Zaitcev: Blog-resident development in the clouds

In case folks don't know, I'm a massive blogger, but it's not blogging about programming and most especially not blogging while programming. I dabbled in it, but it became very obvious to me that it was a province of douchebags and Rusty Russell (who blogged good things about lguest and other projects). The end of my dabbling occurred when jbj declared that he's "taking development of RPM 5 to the blog". Seeing that put a capstone on my communication philosophy. We kernel programmers do the business on mailing lists, Jon Corbet summarises the results.

But it looks like outside of the kernel, a different way of life arose, congealed, or whatever. I cooperate on Hail with Jeff Darcy, and I learned today that he has what is a programming blog. Darcy is not as exceptional as Rusty among kernel hackers. Cloud-y folks, they all blog. But I never knew what to make of that, if it was Sturgeon's law. However, Jeff is not a random wanker, he codes good things. He's also fully versed in good e-mail: no top-posting or HTML from him.

Not sure if this blog is going to explode with programming detail, but even if I'm not as cool as Jeff Darcy or Rusty Russell, why the heck not. It may be worth documenting the thinking missing from commit logs.

In case anyone asks, I still hate Twitter.

October 30, 2009 07:37 PM

October 29, 2009

Valerie Aurora: Bay bridge workaround

For my money, the Bay bridge can stay closed. I couldn't believe what a difference it made when the Bay bridge was closed over Labor Day weekend. My crappy, noisy, stressful SOMA neighborhood became quiet and pedestrian-friendly. Birds sang. Property values would skyrocket. Even just closing half the lanes would make a huge difference.

Anyway, to do my teensy-tiny part in making this a possibility, I just want to remind people that you can work around the Bay bridge closure even if your ultimate destination isn't on public transit. Just take BART across and get a Zipcar the rest of the way.

October 29, 2009 05:06 PM

Harald Welte: India prohibits import of GSM handsets without IMEI

As has been reported at telecomtiger.com, the Commerce Ministry of India has banned the import of mobile phones with no IMEI.

This is somewhat funny, as the IMEI is stored in flash memory in all the phones that I have seen in recent years. Tools to erase or change the IMEI can be found for many popular phones, including (but not limited to) the many MTK based inexpensive phones from China.

So sure, you can now no longer import a device legally with no IMEI, but well, any self-respecting organized criminal will find a way to erase or alter the IMEI anyway ;)

October 29, 2009 01:00 AM

October 28, 2009

Matthew Garrett: More GMA500

But is Intel really the party at fault, here?

For shipping a GPU without open drivers? Given that the alternatives involve someone else designing, fabbing and releasing a piece of hardware under Intel's name without being sued in the process, I'm going to have to say "Yes".

(Note that while Moblinzone.com is a website owned by Intel, the writers don't appear to be Intel employees)

October 28, 2009 06:05 PM

October 27, 2009

Dave Jones: An update on the state of my head.

First off, thanks to everyone who commented on my last post, or sent email expressing concern etc. Much appreciated. Though it did make me feel like I was in an episode of House, with the number of diagnoses I got from everyone who had had something similar, or known someone, or known a doctor etc.

So I had my head scanned last Friday, and got the results today. It showed up nothing of concern. (Which shot down the majority of the suggestions I got from people, Dr House would not be impressed with you). While a clear report in some ways was a relief as it ruled out so many things, in other ways it was annoying because I still didn’t know for sure what has been going on with the headaches over the last month.

The current theory is that I’m suffering from cluster headaches. The symptoms sure do sound familiar. (Right down to the cute graphic, though mine is the right eyeball mostly). So I got a prescription today for some naproxen and imitrex. The latter reminded me why high-deductible insurance is a bad idea. $149 for a month’s worth. Suck.

Hopefully they will at least make the pain manageable. How long I’ll have to take them for is currently unknown.

An update on the state of my head. is a post from: codemonkey.org.uk

October 27, 2009 10:45 PM

Pavel Machek: umount: /mnt2: device is busy.

I hate this part of unix behaviour. I'm root, yet some forgotten bash in some xterm somewhere prevents me from unmounting the device. Yes, lsof exists, and it often works, but... I hope we can get revoke support soon and introduce a working umount -f...

October 27, 2009 10:43 PM

Stephen Hemminger: Ubuntu 9.10 hates kernel developers?

Ubuntu has never been the easiest distribution for kernel development, but it looks like with 9.10 it has made things too painful. I need to build and install kernels all the time, and usually just update the grub menu manually. But now with grub 2 in Ubuntu 9.10 they have wrapped the grub menu in grub-mkconfig. Why?

It would be great if the system was set up so that just doing 'make install' in the kernel source put in the kernel and updated the grub.cfg, but no, that would make too much sense.

P.S.: they managed to break the sky2 driver somehow; the connection won't come up and negotiates the wrong speed. It turned out not to be a kernel problem: a wiring issue (speed), combined with some Network Manager changes.

October 27, 2009 10:02 PM

Rusty Russell: Not Always Lovely Blooms…

So, with my recent evangelizing of Bloom Filters, Tridge decided to try to apply them to a problem he was having.  An array of several thousand unsorted strings, each maybe 100 bytes, which needed checking for duplicates.  In the normal case, we’d expect few or no duplicates.

A Bloom Filter for this is quite simple: Wikipedia tells you how to calculate the optimal number of hashes to use and the optimal number of bits given (say) a 1 in a million chance of a false positive.
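For reference, the standard sizing formulas are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n items and false-positive rate p. A minimal sketch follows; the function name bloom_parameters is invented for this illustration, and the figure of 5000 strings used afterwards is an assumed stand-in for "several thousand".

```python
import math


def bloom_parameters(num_items, false_positive_rate):
    """Return (bits, hashes) for a Bloom filter using the standard formulas
    m = -n ln(p) / (ln 2)^2  and  k = (m / n) ln 2."""
    bits = math.ceil(
        -num_items * math.log(false_positive_rate) / math.log(2) ** 2
    )
    hashes = max(1, round(bits / num_items * math.log(2)))
    return bits, hashes
```

For 5000 strings and a 1 in a million false-positive rate, this works out to roughly 144,000 bits (about 18 KB) and 20 hash functions.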

I handed Tridge some example code and he put it in alongside a naive qsort implementation.  It’s in his junkcode dir.  The result?  qsort scales better, and is about 100x faster.  The reason?  Sorting usually only has to examine the first few characters, but creating N hashes means (in my implementation using the always-awesome Jenkins lookup3 hash) passing over the whole string N/2 times.  That’s always going to lose: even if I coded a single-pass multihash, it’s still having to look at the whole string.
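The sort-based approach is short enough to sketch. This is a Python analogue written for illustration, not Tridge's C code from his junkcode directory: sort the strings, then compare each one with its neighbour.

```python
def find_duplicates(strings):
    """Return the set of values that occur more than once in `strings`."""
    # String comparisons during the sort stop at the first differing
    # character, so most of them never touch the full 100 bytes.
    ordered = sorted(strings)
    # After sorting, equal strings are adjacent, so one pass over
    # neighbouring pairs finds every duplicate.
    return {a for a, b in zip(ordered, ordered[1:]) if a == b}
```

The design point is the one made above: the sort does O(n log n) comparisons, but each comparison is usually cheap, whereas every hash has to read its whole input.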

Sometimes, simplicity and standard routines are not just clearer, but faster.

October 27, 2009 04:46 AM