Kernel Planet
March 19, 2010
A new weekend is coming, and there is fresh snow in Moscow.
But unfortunately I'm unlikely to head to the 'mountains' this weekend: I hurt my leg this week, badly enough that I walked with a crutch for three days. It was a fairly warm day, although it was below -10C in the early morning when I set off for the ski resort.
But on the slope the weather was noticeably warmer, about 0 degrees Centigrade or so, and quite moist, so my new skis felt quite uncontrollable in the high and middle stance. When I sat rather low I was able to control the skis at quite high speeds, although this requires substantial muscle effort.
I managed to film a small porn video of how I ski down the red run in Stepanovo. A phone in the left hand is not the best way to fight for an Oscar, but it was fun. There is no feeling of speed at all in the video, although the speed was substantial, for me at least: more than 40 km/h (about 11 meters per second). It is calculated by dividing the run length by the moving time, so it does not take into account the arc length of the turns, which I prefer to keep small to medium.
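The estimate above is plain length-over-time arithmetic. A minimal sketch follows; the 1000 m run length and 90 s descent time are made-up illustration values, not measurements from the slope.

```python
def average_speed_kmh(length_m: float, time_s: float) -> float:
    """Average speed over the straight run length, in km/h.

    Turns make the real path longer than length_m, so this
    underestimates the actual speed along the arcs.
    """
    return length_m / time_s * 3.6


def kmh_to_ms(speed_kmh: float) -> float:
    """Convert km/h to meters per second."""
    return speed_kmh / 3.6


# Hypothetical numbers: a 1000 m run covered in 90 s.
speed = average_speed_kmh(1000, 90)
print(f"{speed:.1f} km/h = {kmh_to_ms(speed):.1f} m/s")
```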
At such speeds I manage to outrun many of the skiers and almost all snowboarders. But since I have essentially no technique (I have been on outdoor runs three times, each time spending about 3-5 hours on the slope), it is likely that I move quite wrongly. And this can explain the problems I sometimes get while moving down the slope.
Add the weather and wet snow, and the result is quite simple: I fall. I do not care about that unless I feel the pain for longer than a day or so. And this week was the first time the pain was that strong and lasted that long.
I managed to outrun another boarder but was not able to control the skis, so I fell and flew several meters off the run, breaking through the boundary net :)

Well, it was quite simple to break through that boundary net, but there was a noticeable gap beyond the run, where I moved several meters crawling over the snow. The leg did not hurt that much on the run, but when I went home the pain started to show up.
Currently I feel mostly OK, although I play table tennis quite slowly and cannot move without a slight limp. Well, until recently I could move only with a crutch :)
So, things are getting better.
In the meantime I added a fair number of tasty things to the elliptics network project. Namely, I broke the storage addressing model: each node now stores the IDs which are greater than the node's own ID. This breaks compatibility but allows a simple human understanding of how objects are spread over the storage.
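As a rough illustration of that addressing rule (not the actual elliptics code), a lookup could be sketched like this. The integer IDs, the `owner` helper, the wrap-around for IDs below the smallest node, and the handling of an ID exactly equal to a node ID are all assumptions made for the example.

```python
from bisect import bisect_right


def owner(node_ids: list[int], object_id: int) -> int:
    """Return the node that stores object_id.

    node_ids must be sorted. The owner is the node with the largest
    ID that does not exceed object_id. IDs below the smallest node
    wrap around to the last node (an assumption of this sketch).
    """
    idx = bisect_right(node_ids, object_id) - 1
    return node_ids[idx]  # idx == -1 selects the last node (wrap-around)


nodes = [10, 50, 90]
for obj in (5, 10, 60, 95):
    print(obj, "->", owner(nodes, obj))
```

With a sorted node list, a human can read off the owner of any object at a glance, which is the "simple understanding" the new model is after.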
I also implemented random transformation function selection for read IO requests in the fastcgi frontend, so now we can balance reading among multiple data copies. I dropped BerkeleyDB support: Tokyo Cabinet performs way faster, so I do not see any reason to support both. I made a big step towards complete merge support; I expect it to be finished very soon, and that will be the first 2.7.x release, since a fair number of changes have accumulated already.
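The read balancing idea can be sketched as follows, assuming each transformation function is a hash that maps the object name to the ID of one copy. The `TRANSFORMS` tuple and both helper names are hypothetical and are not taken from the elliptics sources.

```python
import hashlib
import random

# Hypothetical set of configured transformation functions;
# each one yields the ID under which one copy of the object is stored.
TRANSFORMS = ("sha1", "sha256")


def object_id(name: bytes, transform: str) -> str:
    """ID of the copy produced by the given transformation function."""
    return hashlib.new(transform, name).hexdigest()


def pick_read_id(name: bytes, rng: random.Random) -> tuple[str, str]:
    """Choose one transformation at random so reads spread over all copies."""
    transform = rng.choice(TRANSFORMS)
    return transform, object_id(name, transform)


rng = random.Random()
print(pick_read_id(b"some/object", rng))
```

Writes would still have to go through every transformation so that all copies exist; only reads are randomized in this sketch.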
And as a tasty project to warm up the brain I decided to implement a rhyme generator based on the Damerau-Levenshtein distance and a sound-syllable similarity algorithm. It is not formalized even in my head yet, but it is an interesting thing to think about.
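For reference, here is a minimal sketch of the distance part only. It implements the restricted variant of Damerau-Levenshtein (often called optimal string alignment), which counts insertions, deletions, substitutions and swaps of adjacent characters; it is not the rhyme generator itself, and the function name is made up.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("kitten", "sitting"))
```

For rhymes one would probably compare word endings (or their phonetic transcriptions) rather than whole words, but that part is exactly what is not formalized yet.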
I also managed to win a judgement against the development company which built my house (without the judge and the defendant present, though). I'm quite close to finally getting property rights to my apartment and to selling it for good. I believe it's time to get a bigger place to live.
So far so good. Stay tuned!
March 19, 2010 08:28 PM
I noticed this article about "India's newly rich farmers". The article describes at least a few of the current modes of woefully disgusting wasteful behaviour practised by some rich Indians quite accurately. But most of those individuals described there are not and never were farmers. According to most dictionaries, a farmer is one who works the land for agriculture. The persons in that article and their obese flesh-eating Jabba-the-Hutt Hindi-song singing Bollywood-belting mothers have never ever worked the land with their _own_ hands. They're not farmers and do not qualify for the respect that that term ought to deserve.
March 19, 2010 07:59 PM
Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100314.mp3
For the weekend of March 14th 2010, I’m Jon Masters with a summary of the week’s LKML traffic.
In today’s issue: The 2.6.34 merge window, anonymous inodes, ATA 4KiB sector issues, cpuhogs, ext4, PCI, and USB console support.
The 2.6.34-rc1 merge window. Linus Torvalds announced the release of the first 2.6.34 RC kernel on Monday, March 8th 2010 at 12:33pm Best Coast Time (PST). In closing the merge window early, he hoped to make a point in line with previous comments on the issue of getting merge requests submitted in a timely fashion. Quoting Linus, “but in general the merge window is over. And as promised, if you left your pull request to the last day of a two-week window, you’re now going to have to wait for the 2.6.35 window.” According to Linus, nearly two thirds of the changes are in drivers (when factoring in 50% drivers/ code, 5% sound/ code, and 10% firmware). Of the remaining bits, about half is architectural and the rest is, well, the rest. So far, about 850 developers are involved. Linus again referred to his Fedora Nouveau rant in ending with a reference to the need to upgrade libdrm/nouveau_drv versions if using that driver.
Several architecture maintainers gave their excuses and requested pulls later, but Linus drew the line at a request from James Bottomley to pull SCSI pieces two days later, on March 10th. James noted that he had been en route back from India, nobody had told him the merge window would close early, and that the only commit added to his tree since the merge window closed on Monday was a bug fix. Linus said he was “not going to pull” and that the whole point behind closing the merge window early was because of people posting pull requests late that “should have been ready when the merge window _opened_”. James objected to the unpredictability of the merge window closing, but Linus said that “WAS THE WHOLE F*CKING POINT!”, in order to avoid last minute pull requests, and added that he would in future not even say how long the merge window was going to be in order to have requests ready the moment the window opened. Unfortunately for James, Linus wanted to make a point and he seemed to meet Linus’ criteria for doing so. Doug Gilbert later pointed out that people should not attack James just because he was the subject of “yet another Linus rant”.
Anonymous inodes. Dmitry Torokhov recently started a thread entitled “S[E]Linux going crazy in 2.6.34-rc0” (but note the corrected capitalization of “SELinux”). He was experiencing a side effect of some recent work by Al Viro, as well as others, to switch various subsystems such as inotify over to use anon inodes rather than their own “filesystem” type. Previously, inotify had used its own filesystem called simply and obviously “inotifyfs”. This allowed for SELinux rules to match on various notification events on an “inotify_t” type of filesystem. But with the trend to convert to anonymous inodes, there is no longer an easy way to write SELinux rules to confine applications (if that is what you actually want to do), and the existing rules go insane, as this author recently saw on a rawhide system that happened to be running SELinux. Eric Paris proposed two kinds of workaround: either “revert” everything back to how it used to be, or create support for differing security contexts for anonymous inodes. The latter seems more likely to happen, though the thread dried up at that point and nothing further was said on the topic until Eric Paris sent a pull request for some notify bits a week later.
ATA 4 KiB sector issues. Tejun Heo started a new thread entitled “ATA 4 KiB sector issues”, in which he lamented the current state of support for larger sector size ATA devices (those using 4K rather than 512 bytes as their natural unit of size – someone please add a comment to this article with a description for the term used to describe the natural size of a disk, its “word size”). Apparently, the transition will be “quite painful”. In his lengthy email, the gist of which is covered by an article on the kernel.org wiki at: http://ata.wiki.kernel.org/index/php/ATA_4_KiB_sector_issues, Tejun covers the issue of backwards compatibility, DOS partition table support, and that beast of beasts – Windows. Interestingly, I didn’t see a specific mention of the issue of unaligned writes when using journalled filesystems and ensuring commits have hit the disk, but I’m sure that’s covered somewhere in there. I suspect this is now required reading if you work on disk and block bits. James Bottomley added some useful notes about the lack of bootloader support, etc.
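To make the alignment concern concrete, here is a small sketch that checks whether a partition start lands on a 4 KiB physical sector boundary. It assumes a drive that still exposes 512-byte logical sectors on top of 4 KiB physical ones, and it uses the traditional DOS partition start at sector 63 as the example; neither detail is spelled out in the podcast summary, and the helper names are made up.

```python
LOGICAL_SECTOR = 512    # bytes per logical sector (assumed 512-byte emulation)
PHYSICAL_SECTOR = 4096  # bytes per physical sector on a 4 KiB drive


def misalignment_bytes(start_lba: int) -> int:
    """Byte offset of a partition start past the previous 4 KiB boundary."""
    return (start_lba * LOGICAL_SECTOR) % PHYSICAL_SECTOR


def is_aligned(start_lba: int) -> bool:
    """True if the partition starts exactly on a physical sector boundary."""
    return misalignment_bytes(start_lba) == 0


for lba in (63, 64, 2048):
    print(lba, is_aligned(lba), misalignment_bytes(lba))
```

A misaligned start means every 4 KiB filesystem block straddles two physical sectors, so a single write can turn into a read-modify-write of both.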
CPU Hogs. Tejun Heo posted a patchset intended to generalize the case of monopolizing a CPU (or a set of CPUs) with a single kernel thread. The cpuhog functionality can be used by any kernel code that needs to grab one or more CPUs exclusively for some period of time, such as [k]stop_machine, which does just this during module load in order to ensure that it is safe to fiddle with the kernel symbol table. For good measure, Tejun also fixes the kernel migration threads to use cpuhog while he’s at it. LWN had a writeup on this topic later, and your author has a pet project in mind that should benefit already from using this patchset. Thanks Tejun Heo!
ext4. Christian Borntraeger posted asking about e4defrag support for compatible ioctls (as in the case on his system, with a 64-bit x86_64 kernel and 32-bit IA32 userspace environment). He suggested, “[l]et[']s just wire up EXT4_IOC_MOVE_EXT for the compat case.” This led Jeff Garzik to wonder aloud what the overall status was of ext4 defragmentation support. Jeff noted that he had actually poked at defragmentation support himself in the past and was “hopeful that I will see defragging in a Linux distribution sometime in my lifetime”. Eric Sandeen noted that such support had previously been in Fedora (briefly) but was removed because he (Eric) wasn’t so happy with the code. Since I happen to know Jeff has a good many years ahead of him, one hopes that he will get to see many great things, including ext4 defragmentation. Separately, Michael Tokarev pointed out another 32-bit userspace on 64-bit kernel issue with compatible ioctls, this time affecting AIO. Jeff Moyer was on the case with an initial test patch that he could use successfully with the libaio test harness built with -m32 while he continues to work in general on further AIO cleanups for the longer term.
PCI. Alex Chiang posted an updated patch based upon some awesome work that Matthew Wilcox had done to provide sysfs PCI slot to device mapping directory entries that can be used to determine which physical slot a device is actually installed in within the chassis of a given system. This will be of use to a number of projects, including efforts to name network interfaces according to the slot they reside in (rather than their MAC address) for distributions needing to support single system images – at least, that’s one possibility that comes to mind. I have pinged a few people myself to see if this will be of use to that effort in general, and there are bound to be many more.
USB Console. Jason Wessel posted a 6 part patch series entitled “usb console improvements series”, containing “aggregated and ported…usb patches I have previously posted which are not mainlined into a single series aimed at providing a stable [USB] console”. Jason began with a recap about what the problem with USB consoles currently is – that they are not synchronous (as opposed to regular serial UART consoles which are) and so will drop data on the floor if there is no room to buffer it when interrupts are disabled. The new code introduces intentional delay loops calculated through empirical testing using an FTDI USB part (a common part on many embedded boards, such as the BeagleBoard JTAG debugger sitting on this author’s desk).
In today’s miscellaneous items:
* some early dev_name() patches from Paul Mundt allowing early platform device code to use dev_name() before the guts of the driver core are online.
* This author was bitten by a recent bad commit from Al Viro that caused opendir() to succeed on regular files. I posted a question about it and was told that it had already been fixed. Indeed, it had.
* Ongoing debate happened about reducing the number of memory allocators in use on x86 systems, per a previous note from Ingo that there were 5 possibilities depending upon phase of boot and this needed to be reconciled.
* A rant from Finn Thain about a “coding style” fix patch for Macintosh that reduced a comment length to fit in 80 characters. Finn thought this was an utter waste of time, and repeated a comment often heard elsewhere, “checkpatch.pl is great but code that fails it is NOT always wrong”, and, “‘Check patch’ is a good idea but ‘check existing code’ is a waste of everyone’s time.” Sometimes, cleanup patches do more harm than good; for example, a well intentioned “if” cleanup this week completely misunderstood how the indentation is supposed to work and was also summarily rejected. Ben Herrenschmidt’s only response to this mini-rant was “Amen !”.
* Mitake Hitoshi concurred with Guangrong Xiao’s posted results showing an *improvement* in performance of userspace mutexes when lock trace events were enabled. Reproducer code was posted and confirmed.
* Some useful documentation was provided on Linux’s circular buffering and memory barriers support from David Howells.
* Support for specifying in the environment variable context of a kernel emitted uevent whether it came because of a request_firmware() or a request_firmware_nowait() request was postulated by Johannes Berg (to handle the case of built-in drivers requesting firmware not in an initramfs). Kay Sievers pointed out that many events are re-triggered during boot and so the firmware loader cannot know what state the system is in, and therefore it might be better to leave requests for unsatisfiable firmware around “forever” until they are cancelled from userspace rather than trying to cunningly work around the issue of firmware not being present in an initrd context with special uevent environment variables.
* and the jabs at SELinux security labeling continued with Al Viro coming up with a few amusing retorts in the “Upstream first policy” thread and Ingo Molnar comparing SELinux relabeling wait times to fire doors, “we should prefer a one inch thick fire door that opens and closes fully automated to a five inches thick fire door that people keep always-open with a chair”. Ingo contends that all too often, people “turn off the whole thing” because of various frustrations and so there is less overall security than might be the case with a slightly less perfect system. Dave Airlie called SELinux relabels “the new fsck” and called for journalling.
In today’s announcements:
Benchmarks. Anca Emanuel announced some new Phoronix benchmarks for kernels 2.6.24 through 2.6.33, showing that performance has generally improved by 770% from 2.6.29 to 2.6.30 and only regressed very slightly in 2.6.32. Regretfully, however, 2.6.33 does not perform nearly so well, and, according to the Phoronix quote, “PostgreSQL performance atop the EXT3 file-system has falled off a cliff”. Full details are available on the http://www.phoronix.com/ website.
RT 2.6.33-rt6. Thomas Gleixner announced the release of version 2.6.33-rt6 of the RT patchset that he and others are continuing to develop against the 2.6.33 series kernel. As he mentions, there was an -rt5, but it was more of a separation point in the git tree. With the merging of some bits into that older tag, MIPS support rejoins the RT tree thanks to Wu Zhangjin. As usual, the RT patch is available on the kernel.org website, in the section devoted to such projects, or in the head (rt/head) and stable (rt/2.6.33) branches of the “tip” tree maintained by Ingo Molnar. Details: http://www.kernel.org/pub/linux/kernel/projects/rt/
The latest kernel release is 2.6.34-rc1.
Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-03-09-19-15. Hiroyuki Kamezawa posted an updated version of his OOM notifier memory cgroup patches against this latest tree. Andrew later posted an mmotm for 2010-03-11-13-13. And in other “mm” news, Mel Gorman posted the 4th version of his “memory compaction” patches.
Greg Kroah-Hartman posted some review patches for stable kernels 2.6.33.1, and for 2.6.32.10. These were subsequently released.
Finally today, Robert P. J. Day asked whether it was still worth him running his “cleanup” scripts (that look for problems with kernel config options) after each merge window closes. Randy Dunlap thought “yes”, and was even more happy that Robert had posted his scripts for him and others to use. Details: http://www.crashcourse.ca/wiki/index.php/Kernel_cleanup_scripts Robert followed up later with another email saying that most of his popular cleanup scripts have now been posted, which is great.
That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.
March 19, 2010 09:09 AM
I like seeing LWN writers pick up small patches and explain what they are and why they are important. As a developer, often the impact of a change is not obvious, and without further explanation significant changes go unnoticed. The recent story about Generalized TTL Security Measures on lwn.net is one such example.
But, when a story comes out, the writer should do research on the background. First, it is nice to give some credit to the author :-) and Vyatta, as well as some history. I did this patch based on an enhancement request for the current Vyatta version. The starting point was an (unaccepted) patch to Quagga, and an existing implementation for FreeBSD systems. It was one of those patches where the kernel change took less time than writing the test programs.
Also, the initial patch wasn't perfect (nothing ever is), since it broke time-wait sockets and missed the case of ICMP messages. Both should be fixed by the time 2.6.34-rc2 comes out. Also, the necessary support has not been integrated into upstream Quagga (yet).
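For readers who have not met the mechanism: the sender transmits with TTL 255 and the receiver drops anything that arrives with a TTL lower than expected for the configured hop distance. The sketch below is an outside illustration, not the test program mentioned above. It assumes the kernel side is exposed as the `IP_MINTTL` socket option with value 21 on Linux (used as a fallback if Python's `socket` module lacks the constant), and it uses the "255 - hops + 1" rule as one common way to derive the minimum.

```python
import socket

MAX_TTL = 255
# Assumption: Linux value of the IP_MINTTL option, used when the
# Python socket module does not export the constant.
IP_MINTTL = getattr(socket, "IP_MINTTL", 21)


def gtsm_min_ttl(hops: int) -> int:
    """Lowest TTL to accept from a peer at most `hops` hops away."""
    if not 1 <= hops <= MAX_TTL:
        raise ValueError("hops must be between 1 and 255")
    return MAX_TTL - hops + 1


def protect(sock: socket.socket, hops: int = 1) -> None:
    """Send with TTL 255 and ask the kernel to drop low-TTL packets."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, MAX_TTL)
    sock.setsockopt(socket.IPPROTO_IP, IP_MINTTL, gtsm_min_ttl(hops))


print(gtsm_min_ttl(1), gtsm_min_ttl(2))
```

A routing daemon would call something like `protect()` on its BGP session socket; spoofed packets from further away cannot arrive with a high enough TTL.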
I appreciate the review and feedback from Eric, Andi, David, and Pekka for making this work.
March 19, 2010 12:15 AM
March 18, 2010
Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100307.mp3
For the weekend of March 7th, 2010, I’m Jon Masters with a summary of today’s LKML traffic.
In today’s issue: Console, DRM, ext4, integrating tools, sensors, split function and data sections, union mounts, and versioning.
Console. Eric W. Biederman posted an intuitive patch for /dev/console opening, effectively ensuring that it is always available even if the root filesystem has no /dev. “This effectively guarantees that there will be a device node, and it won’t be on a filesystem that we will ever unmount”. Al Viro replied “hell yeah”, and took the patch “with thanks”.
DRM. This week’s thread length of the week prize goes to a thread entitled, “drm request 3” in which Dave Airlie tried to pull some patches into the 2.6.34 merge window. These contained, “[f]ixes for default y + CONFIG_STAGING + CONFIG_DRM_NOUVEAU enabled”. Linus wasn’t very happy when he booted with these patches (nouveau interface version 0.0.16) and saw an error message saying “[drm] wrong version, expecting 0.0.15”. This led to a rant about backwards compatibility, and that he hadn’t even been warned it would break existing user space (in his case, Fedora 12). Linus even found that the commit that introduced the breakage did so explicitly, but again noted, ‘why the hell wasn’t I made aware of it before-hand? Quite frankly, I probably wouldn’t have pulled it. We can’t just go around break people[s] setups. This driver is, like it or not, used by Fedora-12 (and probably other distros). It may say “staging”, but that doesn’t change the fact that it’s in production use by huge distributions. Flag days aren’t acceptable’. This led on to a thread in which Linus and others (including Jeff Garzik) noted that Fedora 12 was shipping this driver in “production” and so more should be done to ensure that the kernel could be tested on older systems, while others said the driver was all along a “use at your own risk” driver (Jesse Barnes). Personally, this author solved the problem by using another graphics chipset a long time ago. Daniel Stone probably had the best solution, “fuck it, it’s Friday. To the pub”.
The DRM thread also deviated into a discussion of “Upstream first” as a distro policy, and then onto specific patches in other distributions that aren’t in upstream. For example, Ubuntu carrying AppArmor. That led on to yet another tangent in which James Morris felt he was being personally attacked for the lack of the patches being upstream. Ingo Molnar (and later, Linus, who seemed to share a similar viewpoint – that there needn’t be only one security answer) decided to weigh in, noting that it had been “a few reasonable months after the last big security flamewar”, and wanting to see a “rehash or fair summary of the pathname versus labels arguments” (referring to the fact that SELinux uses file labeling and complex rules, while AppArmor uses simple file paths). Ingo feels that pathnames are a “far more fitting abstraction to any ‘human based security process’ on Linux than ‘labels’”. Ingo called out that there was a lot of security research based on labels but essentially said none of that mattered due to the difficulty of practically using label based security. Quoting Ingo again, “[i]n other words: [I] see [SEL]inux’s main failure in that it somewhat blindly aims for a security model that it sees as the technical most secure, while not being intellectually open to the fact that we very likely _cannot know in advance_ which of the models will make Linux more secure in the long run.” It would seem Ingo would like AppArmor to be less of a “hostile competitor” and more of a “natural ally” to SELinux. The idea is that there can be two different security mechanisms for different use cases.
Ext4 performance concerns. Justin Piszcz had recently raised the issue of the relative performance of ext4 for “large” writes vs. XFS. Justin was seeing almost half the write throughput when using ext4 as opposed to XFS and was concerned. After asking various questions, to which the replies included that he should use “nice” numbers of disks (e.g. 9 for the specific RAID case he was looking at) that made no difference, the thread seemed to dry up without any concrete conclusions other than that a performance issue exists and requires some further investigation using blktrace, etc.
Integrating tools. Ingo Molnar, in a thread entitled “Re: KVM usability”, made some remarks about the relative virtues of having “unified repositor[ies]” in which both the kernel and userspace tools are combined in one place, such as with the Performance Counters tools. Ingo believes that one reason why Apple can “consistently out-develop Linux” is “in part due to there not being a strict [C]hinese [W]all between the Apple kernel, libraries and applications – it’s one coherent project where everyone is well-connected to each piece”. This may be true, but it’s just as likely in this author’s opinion that Apple is benefitting from that, coupled with the fact that it owns every piece and can hand down edicts from on high about what every piece will do, and when. In any case, the thread is worth reading – it was surprisingly short given the potentially contentious comments that could have made great flamebait.
Sensors. Dima Zavin (Google) replied to Jean Delvare’s attempt to have the ALS (Ambient Light Sensors) subsystem pulled, saying that the kernel was on the road toward having one subsystem under drivers/ for ALS, one for Proximity sensors, one for Accelerometers, etc. all with similar interfaces, and that a better approach would be a single “sensors” subsystem. He offered to help work on just that. Jean was interested, but didn’t want to hold up having the ALS patches pulled, favoring reworking them later on. He was subsequently dismayed when Linus and others started asking why ALS wasn’t just using the input subsystem for events, saying that he didn’t care where the code went but that discussions had been ongoing for 5 months already and he didn’t want to hold things up for another 5 months when people decided to bring this up during the merge window rather than before. The conversation then took a tangent into different rate devices (some of these “sensors” can operate at many kHz, above what the “input” subsystem is intended for). Linus contended that these devices, just like joysticks, were input devices. The conversation appears to have stalled at this point without a resolution.
Split function and data sections. As some of you will know, various attempts have been made over the past year to add support for compiling the kernel with the GCC options “-ffunction-sections”, and “-fdata-sections”. These cause the compiler to generate one ELF section for each function or data related object, and make life very easy for optimization tools (that can remove whole sections) as well as kernel patching utilities such as Ksplice. Tim (Ksplice) Abbott was happy with the latest round of patches, though he did have some questions about the “rename kernel’s magic sections for compatibility with -ffunction-sections -fdata-sections” patch series, especially about where certain renames were being used. For example, he wondered aloud how renaming “.text.reset” to “.text..reset” would affect AVR32 systems, because he couldn’t see how the original “.text.reset” was being populated anyway (answer: it wasn’t). As Tim mentioned, he wanted input from Haavard Skinnemoen, who provided the comment on “.text.reset” amongst other feedback.
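The reason for the double dot is that with “-ffunction-sections” GCC names each section “.text.” followed by the function name, so a function called reset would land in “.text.reset” and collide with a hand-written section of the same name. A C identifier cannot start with a dot, so “.text..reset” can never be produced by the compiler. The sketch below classifies section names along those lines; the prefix list and the `classify` helper are illustrative assumptions, not part of the patch series.

```python
# Illustrative prefixes only; the real series touches more sections.
PREFIXES = (".text", ".data", ".bss", ".rodata")


def classify(section: str) -> str:
    """Tell compiler-generated per-symbol sections from renamed magic ones."""
    for prefix in PREFIXES:
        if section == prefix:
            return "base"
        if section.startswith(prefix + ".."):
            return "magic"       # e.g. .text..reset, safe from collisions
        if section.startswith(prefix + "."):
            return "per-symbol"  # e.g. .text.reset, could be function reset()
    return "other"


for name in (".text", ".text.reset", ".text..reset", ".init.text"):
    print(name, "->", classify(name))
```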
Union mounts. Valerie Aurora posted version 1 of an RFC patch series (against Al Viro’s for-next tree) entitled, “Union mount core rewrite”. This, as it implies, is a complete rewrite of parts of the code implementing union mounts. Val has previously written about the goals and implementation of her work in various LWN articles. Separately, Val wondered aloud whether it was now possible to have multiple read-only layers in union mounts.
Versioning. Paul McKenney posted a patch placing the SHA1 git hash of the latest commit in the kernel version line on boot if available, or “[Not git tree]” in the case that a non-git tree was used for the build.
In today’s miscellaneous items:
* Large numbers of git pull requests started to come in for 2.6.34 (including everything from core kernel to networking and sound).
* Some further nested SVM patches from Joerg Roedel.
* A large number of KVM updates (including a lot of PowerPC bits, Microsoft Hyper-V patches, and some x86 emulator cleanup).
* A new “platform-drivers-x86” git tree reference was added to the MAINTAINERS file (as maintained by Matthew Garrett, who posted a pull request for the latest bits also).
* A new generic x86 “NMI Watchdog” built upon performance events from Don Zickus (by way of Ingo Molnar actually making the pull request for Don’s previously posted patches).
* Version 3 of the memory controller groups dirty page limits patches from Andrea Righi.
* An affirmation from Andrew Morton that the “Linux Checkpoint-Restart” patches could be posted to LKML following 2.6.34-rc1 (Oren Laadan also mentioned how the patches will refuse to do a checkpoint if they believe they cannot do so safely, reporting this back to userspace).
* The latest “compat-wireless” tree for stable kernel (2.6.32) users that contains the latest 2.6.33 bits from Luis R. Rodriguez.
* Version 3 of a patch series providing for 512KB readahead rather than 128KB from Fengguang Wu.
* Various trivial and staging patches from Greg Kroah-Hartman (as an aside, Alan Stern raised some concerns about the way Greg’s scripts generate those patches).
* A request to pull the Ceph distributed file system client into 2.6.34 (along with various input about changes made since the 2.6.33 merge request) from Sage Weil.
* Some Performance (perf) Counters “live mode” patches from Tom Zanussi that allow perf data to be directly processed as it is captured “without ever touching the disk”.
* Some paravirt (PV) extension patches for HVM (Hybrid virtualization support) in Xen from Sheng Yang.
* Ted Ts’o complained about dynamic device filesystems with initramfses in a mini-rant about how 2.6.33 could not boot with an LVM root on his Ubuntu 9.10 userspace. He added that, “of course, the initramfs environment is so crappy that there are no debugging aids -- not even a working pager”.
In today’s announcements:
Git 1.7.0.2. Junio C Hamano announced the latest maintenance release of Git version 1.7.0.{1,2}. The second .2 posting had a few minor patches since .1, including fixing support for GIT_PAGER. Whether or not it is technically an SCM, I will cease using that term in this podcast, following some feedback from listeners of this podcast.
LTP. The Linux Test Project was released for February 2010. The latest release comes with a reminder that there “has been multiple changes for building/installing the test suite after the recent changes in Makefile infrastructure”. This month’s release didn’t come with any corrupt script warnings.
Userspace RCU 0.4.2. Mathieu Desnoyers announced version 0.4.2 of his Userspace RCU “urcu” library. It includes some patches from Paolo Bonzini adding generic uatomic ops support for architectures not explicitly supported by liburcu, including (effectively free) support for IA64 and Alpha when using GCC versions 4.0-4.5, and a bugfix in urcu-bp which is the “User-Space Tracing” version of the urcu library. Mathieu has asked me to point out that a patent exemption was made to cover use of RCU in LGPL code such as urcu, so my previous comments about GPL patent concerns were a little too severe.
The latest kernel release was 2.6.33.
Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-03-04-18-05.
That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.
March 18, 2010 12:11 PM
Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100228.mp3
For the weekend of February 28th 2010, I’m Jon Masters with a summary of the week’s LKML traffic.
In today’s issue: Linux 2.6.33, ACPI, Cgroups, Checkpoint and Restart, OF Device Tree, Firmware, and x86 embedded.
Linux 2.6.33. Linus Torvalds announced the final release of 2.6.33 on Wednesday February 24th at 12:06pm Best Coast Time (PST). The final release includes a relatively small number of final fixes on top of rc8. As Linus says, the most notable thing may be the Nouveau integration and modesetting support. Others may notice the mainlining of DRBD and the fact that the AS IO scheduler is now gone (“since keeping it around and just causing confusion seemed to not be worth it any more. You’re supposed to use CFQ instead”). Daniel Walker asked Linus whether he still planned to try a one week merge window this time, to which Linus said, “No. But I might do a ten-to-twelve day thing or something like that – just to make sure that anybody who tries to game the system and send their merge request late will get summarily ignored. So I’m going to stop being so predictable that people can tell that exactly two weeks after the last release is where the merge window closes, and if people want to make sure their stuff merged, I had better have a merge request in my inbox earlier than thirteen days after the release.” The pull requests started pretty much immediately, and with the usual vigor. Separately, Con Kolivas announced 2.6.33-ck1, which includes his BFS scheduler and various other “desktop” focused bits.
ACPI. Rafael J. Wysocki posted an RFC patch concerned with removing race conditions from ACPI event handlers. The first race concerns the execution of handlers while they are being removed, the second is a locking issue.
Cgroups. Andrea Righi posted an intriguing RFC patch series intended to provide per-cgroup dirty page limits. The idea is that the maximum number of dirty pages a cgroup is allowed to have can be limited, and if a cgroup exceeds this limit, it will be forced to perform write-out immediately.
Checkpoint and restart. Oren Laadan posted version 19 of his “Linux Checkpoint-Restart” patchset. As a reminder, these patches are intended to allow systems to handle failures by taking whole system checkpoints and restarting all activity from that point in the event of failure. The latest patchset is intended to address previous concerns from Andrew Morton and others, and is apparently able to checkpoint and restart both screen and vnc sessions, and support live migration of network servers between hosts. The project has a checklist of TODOs on its wiki: http://ckpt.wiki.kernel.org/.
OF Device Tree. Grant Likely asked Linus to pull in his OF device tree rework for 2.6.34. Grant has recently been working on ARM support, in addition to the PowerPC, Microblaze, and SPARC changes covered in this pull. Hopefully, OF device tree emulation will finally provide one mechanism for supplying data to the kernel that can be common across many different architectures, in addition to those that do “real” OpenFirmware in the vendor firmware.
Firmware. There was some discussion about kernel firmware versioning, and whether kernel firmware should be wrapped in a container format making it more suited to SO library style versioning. This happened in response to the folks behind the open sourcing of the Atheros WiFi firmware seeking advice on the best way to handle compatible and incompatible versions. David Woodhouse has advocated for the use of more library-like versioning, but was not a big fan of introducing the complexity of such wrappers. In the end it was decided that the kernel developer maintained linux-firmware package should provide firmware files of the form foo-$(API). Those wanting a sub-versioned file like foo-$(API)-$(VAR) could provide one if they so wish.
x86 embedded. Graeme Russ posted a very detailed and well reasoned description of his embedded x86 port, which is not in any way based upon PC hardware, in which he uses U-Boot to transition to 32-bit Protected Mode and directly calls the kernel’s “32-bit BOOT PROTOCOL” described in Documentation/x86/boot.txt. He was, though, having some issues handling kernel relocation, which turned out to be due to differences between the bzImage format documentation and the current reality. Peter Anvin was his usual very helpful self.
In today’s miscellaneous items: A fix for SPARC32 from Rob Landley (apparently, SPARC32 has been broken since 2.6.28, which isn’t surprising since this author and most other Linux SPARC users seem to be running SPARC64 kernels), various debugging from Thomas Gleixner and John Kacur on the recent 2.6.33 RT patch, version 6 of a patch series intended to add lockdep-based diagnostics to rcu_dereference() from Paul McKenney, a series of PPS implementation patches from Rodolfo Giometti (useful for those needing accurate time sources on a serial line), a patch to increase readahead size to a default of 512K from Fengguang Wu (the previous default was 128K), a bunch of s390 updates for 2.6.33 final from Martin Schwidefsky (including kernel image compression “finally…after only 10 years”), some patches intended to document the rfkill sysfs ABI from Florian Mickler, some more nested SVM patches (virtualization within virtualization on AMD compatible systems) from Joerg Roedel intended to aid running Microsoft Hyper-V with nested SVM (which doesn’t quite work yet even with these according to Joerg), a number of rather cool gdb and early debug updates from Jason Wessel (who has now split kdb and early debug out into two separate trees), version 4 of the “concurrency managed workqueue” from Tejun Heo, a discussion about order 1 allocation failures started by Frans Pop (the failures were under GFP_ATOMIC, but Frans felt that they were particularly ugly given plenty of cache was available for reclaim), a proposal from David Howells to remove EXPERIMENTAL from NFS_FSCACHE in order that it could be compiled into the standard Ubuntu kernel (since, as he says, “As Arjan van de Ven pointed out…the EXPERIMENTAL flag doesn’t mean that much any more”), and a lengthy discussion of linux-next “requirements” that is worth reading, if you have the time.
In today’s announcements:
iproute2. Stephen Hemminger announced release 2.6.33 of the iproute2 utilities that “includes bug fixes and support for all the new features in kernel 2.6.33. This integrates a number of minor bug fixes from Debian aswell”. The update is available at http://devresources.linux-foundation.org/.
RT 2.6.33-rt4. Thomas Gleixner announced version 2.6.33-rt{2,3,4} of the RT kernel patchset. This updates to Linus’ latest tree and includes a number of fixes to bugs reported by John Kacur and others. It is available from the usual location: http://www.kernel.org/pub/linux/kernel/projects/rt/ Thomas noted that “rt/2.6.33 branch is now stabilization only. The rt/head branch will follow linus tree from now on, so it will inherit all (mis)features which come in the merge window.” Separately, John Stultz announced that he had forward ported Nick Piggin’s VFS scalability patches to 2.6.33-rc8-rt2, and that it applies to 2.6.33 without any collisions. He requested feedback as he had yet to do any serious stress testing with the patchset.
The latest kernel release was 2.6.33.
Greg Kroah-Hartman released an updated stable Linux 2.6.32.9.
Finally today, Mikael Abrahamsson suggested that some TLC be given to the Wikipedia article on the Linux kernel as it “doesn’t even mention the new -rc system” (in the “development model” section of the article). He wondered if anyone who knew exactly what was going on could write up the new world order on that wiki page for the rest of the world to see. That does not seem to have happened as of this writing.
That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.
March 18, 2010 05:35 AM
So sane multi-seat handling was something I wanted to make KMS do at some point and designed for but never quite implemented.
So in an attempt to maybe help out people who are interested in this, I've gotten two seats on a single card working here to a demoable level.
http://people.freedesktop.org/~airlied/multiseat/
contains a kernel patch + libdrm patch.
The kernel patch pretty much contains 3 pieces:
(a) ability to create "render" device nodes with an attached list of output resources it controls (crtcs/encoders/connectors).
(b) hardcoded render node setup for my X1900 - two parts - core drm creates 3 device nodes, radeon driver assigns hardcoded resources to the nodes - in this case render node 0 gets a crtc + DVI + encoders, node 1 gets the other crtc/DVI/encoders, and render node 2 gets no outputs.
(c) drm mapping fixups for multiple device nodes - this is something we should probably clean up independently of this patch.
the libdrm patch just contains support to use an env var to pick the device path.
With this xorg.conf and the two startx wrappers I can run two X servers separately.
TODO:
(a) define a kernel/user interface to set seats and nodes up. The DRM control node is there specifically for this purpose but I never got around to specifying this interface. It basically needs a few methods:
1. Create new render node with output configuration.
2. Remove render node.
These would have to rely on there being no users of the render or legacy device nodes in advance. The kernel would also have to get the driver to validate the output configuration. The output configuration would be a list of IDs for crtcs/encoders/connectors.
(b) maybe add a drm device path to xorg.conf so each card section can specify one, would help get away from BusID also.
(c) make a sane userspace interface to use it all - I suspect you'd need something in gdm/ConsoleKit to configure this sort of thing, you'd have to construct per-card multi-seat profiles with a list of the outputs and stuff you want on each seat etc.
At this point I'm just trying to flesh out my backlog of projects and figure out how long they will take to do properly, so if someone is interested in picking this up and running with it, feel free.
March 18, 2010 01:59 AM
March 17, 2010
I just came across this story (http://goo.gl/EbqP) today, and given my name, and given that I fancy myself a bit of a foodie, who could resist? (Not that I considered the deep-fried, dunked-in-sugar-syrup mess that passes for General Tso’s chicken in most fast food Chinese restaurants to be gourmet food, mind you!)
Here’s the first thing you should know: The general had nothing to do with his chicken. You can banish any stories of him stir-frying over the flames of the cities he burned, or heartbreaking tales of a last supper, prepared with blind courage, under attack from overwhelming hordes. Unlike the amoeba-like mythologies that follow so many traditional dishes, the story of General Tso’s chicken is compellingly simple. One man, Peng Chang-kuei — very old but still alive — invented it.
But what’s “it”? Because while chef Peng is universally credited with inventing a dish called General Tso’s chicken, he probably wouldn’t recognize the crisp, sweet, red nuggets you get with pork fried rice for $4.95 with a choice of soda or soup. All that happened under his nose. It all got away from him…
March 17, 2010 03:38 PM
March 15, 2010
Audio: http://media.libsyn.com/medi/jcm/linux_kernel_podcast_20100221.mp3
For the weekend of February 21st, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.
In today’s issue: AMD TSC, anon_inode flags, extents, LSI MegaRAID, md RAID, SSE, UML, and XZ.
AMD TSC. Mark Langsdorf (AMD) posted a patch entitled “Option to synchronize P-states for AMD family 0xf”, in which he reminded readers that AMD Family 0xf processors (that is AMD Athlon 64s and AMD Opterons) do not have P-State and C-State invariant TSCs – that is to say the TSC increments at the current frequency of the CPU core, and not at some fixed frequency that would be more useful to those using it as a timing source. It is nonetheless possible to scale the TSC readings to be used as a time source, if all CPUs in the system adjust their frequency at the same time and to the same amount. To do this, Mark modifies the PowerNow! driver with a new “tscsync” parameter. He reminds us that there are many other possible clock sources in a system, but customers want something particularly lightweight in some situations, like the TSC.
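As a rough illustration of the scaling idea only, and not of Mark's patch: if every core is known to have ticked at the same frequency for the whole interval, a TSC delta can be turned into nanoseconds with one division. The function name and the frequency argument below are assumptions made for this sketch.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the PowerNow! driver: convert a TSC
 * delta into nanoseconds, assuming every core ticked at freq_hz for the
 * whole interval (the situation that synchronized P-states aim for). */
uint64_t tsc_delta_to_ns(uint64_t cycles, uint64_t freq_hz)
{
    const uint64_t ns_per_sec = 1000000000ULL;

    /* Split the division so the intermediate product stays below 2^64
     * for any frequency up to about 18 GHz. */
    uint64_t whole_secs = cycles / freq_hz;
    uint64_t remainder = cycles % freq_hz;

    return whole_secs * ns_per_sec + (remainder * ns_per_sec) / freq_hz;
}
```

Once the frequency changes mid-interval, or differs between cores, a single scale factor like this no longer holds, which is presumably why the synchronization matters.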
anon_inode flags. Matt Helsley noted that existing anon_inode interfaces often do not support flags that can be set by using fcntl(). He proposed a series of 4 patches to signalfd, timerfd, epoll, and eventfd that would allow the same flag behavior as their corresponding creation syscalls. Davide Libenzi, the original author of the anon_inode bits, signed off.
Extents. Jari Sundell reported an issue with sparse files on ext4 in which many extents nonetheless sequentially placed on disk were not merged by the filesystem. This manifested in the form of 3000 or more extents for a 250MB bittorrent download file (aside: bittorrent pulls many file pieces at once from many different sources and so relies heavily on sparse files).
MegaRAID. LSI posted to let everyone know that they were interested in an overhaul of the MegaRAID driver to support future HBAs. Rather than make a lot of changes to the existing code, they were interested in, and were encouraged to create, a new driver for the newer parts. Matthew Wilcox may have detected a hint of the reasoning behind their reluctance to give up on a single heavily hacked driver, and suggested an approach that could be used to “make your management happy”: effectively combining two drivers into a single object file, with two separate sets of PCI tables being handled and different functions within. Whatever the eventual decision, the thread ended there with no followup.
md. Justin Piszcz started a discussion thread entitled “Linux mdadm superblock question”, in which he asked about RAID superblock types. The older version 0.90 superblock format supports autoassemble within the kernel, whereby the kernel can automatically create the appropriate RAID device without having to use tools within an initrd/initramfs (the initramfs itself is not required in that case, otherwise it is if you want to use RAID). Justin wanted to know whether there were any benefits for a 2TB RAID1 boot volume in moving to a higher versioned superblock without autoassemble support.
The conversation led Peter Anvin to point out some issues with a recent change in mdadm, which now apparently creates 1.1 version superblocks by default. Peter noted that the 0.90 superblock format doesn’t make it possible to easily distinguish RAID partitions from whole volume RAID devices, but the problem with migrating to 1.1 is that 1.1 uses the bootblock for its superblock and so can cause problems with bootloaders such as grub that result in people having to regenerate their entire disk if they want to easily boot with it. Version 1.2 of the md RAID superblock uses the same 1.1 superblock format but at a different location than the bootblock, and so Peter favors a default of using 1.0 or 1.2, but not 1.1 as the mdadm default.
The entire md RAID thread is worth reading because it took a tangent off into a lengthy debate about the merits of using (or being required to use) initramfses, time taken to boot using an initramfs (or if not using one – the plan is to remove autoassembly from the kernel for good, so good luck booting without an initramfs if you want RAID in the longer term), and tools such as AEUIO that can build a customized initramfs image. Of course, every distro and his dog have also re-invented initramfs creation.
SSE. There’s a long-standing philosophy of avoiding floating point (FP) or other general usage of optional compute units such as SSE, SSE2, and so forth from within the kernel itself. Using these units requires saving state, and that isn’t typically done (for performance reasons). However, these optional units can often handle very large word sizes and so can be useful for those seeking to optimize existing kernel routines. Luca Barbieri posted, starting a new thread entitled “use SSE for atomic64_read/set if available” to do just that on x86-32 systems as an alternative to some of the more complex code being used today (including disabling pre-emption very briefly). Peter Anvin and Luca got into a somewhat lengthy debate about FPU etiquette (especially with regard to Peter’s view that kernel_fpu_begin() and kernel_fpu_end() be wrapped around kernel calls to the FPU, and Luca’s view that this expensive state change could be skipped in the case that only specific registers need to be saved and restored in such situations as in his patch). Peter Zijlstra, though not objecting to a cleanish implementation, suggested that one might want to “run a 64bit kernel already”. In the end Luca decided to re-write his other patches explicitly in assembly to avoid future complications with GCC changes, and to hold off on the SSE piece in question until another day.
UML. Remember the work a few weeks back to bring initial task userspace stack sizes in line with those permitted by rlimit? Well it turns out that the patch was a little too restrictive and was causing UML (User Mode Linux) to segfault on startup. The issue was raised by a number of people, including Adam Nielsen, who was also told that it is not possible to run 32-bit UML instances on a host 64-bit kernel or vice versa. They must match.
xz. Discussion continued on the potential for migrating kernel.org over to use XZ format compressed files. Phillip Lougher offered some defense of the venerable gzip format, emphasizing its cross-platform nature (there are even completely separate implementations available in Java for the inclined), and Andi Kleen pointed out the relative availability of tools that handle gzip files or bzip2 vs. xz, but others seemed to agree that various contrived scenarios not that relevant directly to kernel developers don’t warrant holding off an eventual migration to some better compression format.
In today’s miscellaneous items: An updated version of the OOM killer rewrite was posted by David Rientjes (including a patch that treats tasks running on different sets of CPUs as unlikely to be interfering with one another), the third round of KVM patches for 2.6.34 from Avi Kivity (including 1GB page size support, and an initial implementation of “Hyper-V” support for those desperate enough to need or want to run a Microsoft virtual machine guest), some seqlock implementation cleanups from Thomas Gleixner, a “foruth [sic] general posting of the newest version of the AppArmor security module” that is essentially a rewrite of the existing AppArmor code to use the existing hooks in the LSM security infrastructure rather than custom VFS patching, Grant Likely posted “basic ARM device tree support” (yaaaay!), Denys Vlasenko posted another attempt at supporting split out function and data ELF sections (one section per function or data item – something that is great for Ksplice), and Microsoft revived their work in Hyper-V recently (Hank Janssen seems to be trying really really hard to do the right things).
In today’s announcements:
Gujin 2.8. Etienne Lorrain announced a new release of the Gujin bootloader. It has some really nice options for device emulation, El-Torito emulation for booting Live-CD images, and a lot more besides.
RT patchset 2.6.32.12-rt21. Thomas Gleixner announced an updated RT patchset containing “fixes and cherry-picks from all over the place”, as well as some tracer fixes. The short log includes two scheduler fixes, some futex fixes, and some architectural stuff for ARM support.
RT patchset 2.6.33-rc8. Thomas Gleixner also announced the first RT release for the 2.6.33 stable series kernel. Thomas says he is pretty excited about the stability of this latest patch series, and the overall patch size is still falling quite considerably. He ends, “We are zooming in, but there is still a way to go”.
util-linux-ng 2.17.1. Karel Zak announced the release of util-linux-ng 2.17.1. This latest release includes an option to fdisk to disable DOS-compatible mode from the command line.
The latest kernel release was 2.6.33-rc8.
Finally today, the end of an era. Christine Caulfield announced that she is orphaning DECnet support in the kernel, due to “lack of time, space, motivation, hardware and probably expertise”. Apparently, “judging from the deafening silence on the linux-decnet mailing list [she] suspect[s] it’s either not being used anyway, or the few people that are using it are happy with their older kernels”.
That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.
March 15, 2010 09:23 AM
Just in case you are wondering why there are no updates here: I'm
currently on holidays in Taiwan and thus not working much on my various
projects, i.e. no major updates on this blog until some early/mid April.
March 15, 2010 01:00 AM
March 13, 2010
Suddenly decided to make something bad, while waiting for other things to settle.
Years ago I used to believe that I know something about hacking. Not kernel hacking, but the one, related to cracking of various supposed to be secure systems. Starting from ciphers down to code 'issues' (described in phrack, yup). It is rather laughable right now :)
But today I know that reading (and even understanding) smart and hardcore articles is really far from being able to apply the given knowledge to practical problems. So I decided to start (and complete a practical implementation) with really simple things: various mono- and polyalphabetic ciphers. Caesar/rot13 and Vigenere are good choices afaics.
Bruce Schneier's Self-Study Course in Block Cipher Cryptanalysis does not allow me to easily fall asleep, although it's a little bit fun to compare my alphabet cipher analysis with even the simplest crackdowns described in the article.
Anyway, it took me a day or less to hack up a semi-automatic Vigenere and monoalphabet cipher cracker in LISP. In Vigenere code there are two steps: find out key length and split and decode monoalphabet enciphered text blocks.
The former task is performed manually - the only application created searches for trigrams in the text and shows their frequencies. When there are 3 or more repeated trigrams it is possible to find the key length with rather high probability. Bigrams will work too, but the error rate is higher. One has to find the distances between occurrences of the same trigram and compute their greatest common divisor, which will likely be equal to the cipher key length.
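The distance-and-GCD step can be sketched in C roughly as follows. This is not the LISP tool described here; the function names are made up for the sketch, and it naively folds every repeated trigram into the GCD, so a single coincidental repeat in a real ciphertext can drag the result down to 1.

```c
#include <stddef.h>
#include <string.h>

/* Greatest common divisor by Euclid's algorithm. */
static size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper: GCD of the distances between every pair of
 * identical trigrams in text. Returns 0 when no trigram repeats. */
size_t kasiski_guess(const char *text)
{
    size_t len = strlen(text);
    size_t guess = 0;

    if (len < 3)
        return 0;

    /* Quadratic scan, which is fine for short texts. */
    for (size_t i = 0; i + 3 <= len; i++) {
        for (size_t j = i + 1; j + 3 <= len; j++) {
            if (memcmp(text + i, text + j, 3) == 0)
                guess = gcd_size(guess, j - i);
        }
    }
    return guess;
}
```

In practice one would look only at the most frequent trigrams, as done manually here, rather than trusting every repeat.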
Vigenere cipher is theoretically uncrackable when a unique key long enough to cover the input message is used. But in practice shorter keys are used, which are then repeated a number of times so that the resulting key string length becomes equal to that of the plaintext message. So, when the key starts to repeat itself, the same letters at the same key offset will be encrypted into the same ciphertext.
This method is named after Friedrich Kasiski, who found it 150 years ago. When the key length is found, we split the ciphertext into separate strings, each of which is encrypted with a monoalphabet cipher. Each one consists of letters separated by the key length, i.e. each string will be formed from the letters at positions $string_number + $i * $key_length, where $i runs over the ciphertext until it is fully covered.
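A minimal C sketch of that split, with a hypothetical function name and a caller-supplied output buffer (both are assumptions of this example, not part of the original tool):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy every key_length-th letter of text, starting
 * at offset column, into out. out must hold at least
 * strlen(text) / key_length + 2 bytes. Returns the number of letters copied. */
size_t extract_column(const char *text, size_t key_length, size_t column,
                      char *out)
{
    size_t len = strlen(text);
    size_t n = 0;

    if (key_length == 0) {
        out[0] = '\0';
        return 0;
    }
    for (size_t i = column; i < len; i += key_length)
        out[n++] = text[i];
    out[n] = '\0';
    return n;
}
```

Calling it once for each column from 0 to key_length - 1 yields the separate strings, each of which was enciphered with a single key letter.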
Monoalphabet ciphers are rather trivial to crack using frequency analysis. Namely we should find the most frequent letters in the encrypted text and they will correspond to the most frequent letters in the language used for plaintext. Difference between those letters is a cipher shift, so it is trivial to recover the text just by replacing each letter with the one shifted by calculated number.
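For a single shifted column this could look like the C sketch below. It assumes lowercase English text in which 'e' is the most frequent letter, which is exactly the assumption that breaks down on short columns; the helper names are invented for the example.

```c
#include <stddef.h>

/* Hypothetical helper: rotate every lowercase letter of text forward by
 * shift positions (0..25), in place. Other characters are left alone. */
void rotate_letters(char *text, int shift)
{
    for (size_t i = 0; text[i] != '\0'; i++) {
        if (text[i] >= 'a' && text[i] <= 'z')
            text[i] = (char)('a' + (text[i] - 'a' + shift) % 26);
    }
}

/* Hypothetical helper: guess the shift of a Caesar-enciphered lowercase
 * text by assuming its most frequent letter stands for 'e'. */
int guess_shift(const char *text)
{
    size_t counts[26] = {0};
    int best = 0;

    for (size_t i = 0; text[i] != '\0'; i++) {
        if (text[i] >= 'a' && text[i] <= 'z')
            counts[text[i] - 'a']++;
    }
    for (int c = 1; c < 26; c++) {
        if (counts[c] > counts[best])
            best = c;
    }
    return (best - ('e' - 'a') + 26) % 26;
}
```

Decoding is then rotate_letters(text, (26 - guess_shift(text)) % 26). A more careful cracker would compare the whole letter distribution instead of only the top letter.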
Here is the example I used (a random New York Times article, cut just to show the idea; I also converted it to lowercase and dropped all non-letter characters, since the gcipher application does not understand them):
After spending months searching for a bipartisan consensus on financial regulatory reform, Senator Christopher Dodd, chairman of the banking committee, is expected to unveil his own bill on Monday, without one Republican supporter.
afterspendingmonthssearchingforabipartisanconsensusonfinancial
regulatoryreformsenatorchristopherdoddchairmanofthebanking
That's how it was encrypted, and the resulting ciphertext:
$ gcipher -c Vigenere -k asknfalwiwf v1.enc v2.enc
axdrwsaavznnywbstsoaafrurvsgqkzwgihkeyidwvytnkoaxudkvbnnsxpn
awnmczlsdbwycankwmkoaftznkdwikdbuhpnlkidurnnrxwvkktzoofnv
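The encryption above is consistent with the textbook Vigenere rule (add the key letter to the plaintext letter modulo 26, repeating the key). The following C sketch is not gcipher's code, and its function name is made up, but fed the key and the first word or two of the plaintext from this example it produces the same leading ciphertext letters:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of Vigenere encryption for lowercase a-z input only: each
 * plaintext letter is shifted by the matching key letter, and the key
 * repeats once it runs out. out must hold strlen(plain) + 1 bytes. */
void vigenere_encrypt(const char *plain, const char *key, char *out)
{
    size_t key_len = strlen(key);
    size_t i;

    for (i = 0; plain[i] != '\0'; i++) {
        int p = plain[i] - 'a';
        int k = (key_len > 0) ? key[i % key_len] - 'a' : 0;
        out[i] = (char)('a' + (p + k) % 26);
    }
    out[i] = '\0';
}
```

For instance, encrypting "afterspending" with the key "asknfalwiwf" gives "axdrwsaavznny".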
Trigram search uncovers that the most frequent ciphertext trigram 'tzo' is spread in the text with the distances being multiples of 11, which does correspond to the above key length.
Splitting the text and calculating frequencies is a rather simple technical task. After performing decoding we got the following result:
lfterwpeydiygmonxhsdeacchinkfocabtpartmsaycoysensysoyfiyancielrpguwatorcreqorxsenaxornhrtstopler
afterspendingmonthssearchingforabipartisanconsensusonfinancialregulatoryreformsenatorchristopher
The second line is the plaintext message, which differs in less than half of the symbols. I was not able to decode some parts of the text because the text is rather short, so frequency analysis did not always provide correct data. But even as is, the decoded text can be read and the data recovered.
On this positive note I will start preparing for the weekend skiing.

Stay tuned!
March 13, 2010 11:48 PM
So, after long time I have found real sci-fi... and yes, I enjoy it better than "Blade runner". It is CC-licensed, too :-). Second book is here, but starting with Starfish is probably better. Enjoy!
And... zaurus is now usable without X. As in -- you can actually use Fn-3,4 to adjust brightness and power button to suspend it... Only in git for now, and big thanks to metan.
March 13, 2010 07:44 AM
March 12, 2010
THIS IS A PROOF OF CONCEPT - it's not going to be upstream unless someone else dedicates their life to it (btw, anyone know anyone in ASUS?)
So NVIDIA unveiled their optimus GPU selection solution for Windows 7, so I decided to see what it would take to implement something similar under DRI. I've named it PRIME for obvious reasons.
Goals:
1. Allow a second GPU to render 3D apps onto the screen of the first, pickable from the client side.
2. Just target the rendering side, I'm assuming the GPU power up/down is similar to what was done for the older switching method.
Restrictions + limitations:
1. Must have compositing manager running
2. Must have second screen configured for slave card (doesn't need to be used)
Test system:
Intel 945 IGP + radeon r200 PCI card - yes this won't be a speed demon.
Terms:
Master: the IGP displaying the output - intel
Slave: the GPU rendering the app - radeon r200 in this case.
Step 1: kernel support
http://git.kernel.org/?p=linux/kernel/git/airlied/drm-testing.git;a=shortlog;h=refs/heads/drm-prime-test
http://cgit.freedesktop.org/~airlied/drm/log/?h=prime-test
The kernel requirements were simple, we needed a way to share a memory managed object between two kernel device drivers.
The kernel has a GEM namespace per device, however this isn't good enough to share with other devices, so I introduced a new PRIME namespace with two ioctls. One ioctl allows the master device to associate a device buffer handle with a name in the prime namespace, and the other allows the slave device to associate a prime namespace handle with a buffer. When the master creates a prime buffer the kernel associates the list of pages with the handle, and when the slave looks up the same handle it retrieves the list of pages and fakes up a TTM buffer populated with those pages as backing store. I've added the concept of slave object to TTM to allow for this.
The drm repo contains the API wrappers + intel + radeon pieces to call the association functions for buffer objects.
Step two: DRI2 Protocol
http://people.freedesktop.org/~airlied/prime/0001-dri2proto-add-prime-token.patch
http://people.freedesktop.org/~airlied/prime/0001-prime-support-for-mesa.patch
From the X server point of view a recent change to the DRI2 layer allowed for multiple device driver names to be associated with a DRI2 end point. The client can request either a DRI or VDPAU device name currently. I firstly extended the DRI2 protocol, to add a new buffer type, called PRIME, and added a hack to mesa's glx loader to request the prime driver if an environment variable was specified.
Step 3: X server DRI2 module + drivers
http://people.freedesktop.org/~airlied/prime/0001-intel-add-prime-master-support.patch
http://cgit.freedesktop.org/~airlied/xf86-video-ati/log/?h=prime-test
http://people.freedesktop.org/~airlied/prime/0001-dri2-prime-hackfest.patch
This was the messiest bit and still requires a lot of change. First up I added an interface for the drivers to register as PRIME master and slaves. Intel driver registers as master, radeon as slave for my demo. We store these in an array. When a client connects and requests prime driver, we mark the drawable and redirect the dri2 buffer creation requests to the slave screen driver. Also the drm authentication is sent to both kernel drms. It then hooks the swapbuffers command where it does a region copy, and redirects this to the slave driver, and damages the pixmap in the master driver. Now the "interesting" part, my original implementation simply grabbed the window pixmap at the dri2 create buffers time, however there is an ordering issue with compositing, this pixmap is pre-composite redirection so isn't actually the pixmap you want to tell the kernel to bind to both gpus. This turned out to function badly, I could see gears all stretched over the front buffer.
So a quick coke + chocolate break later, I had enough sugar to bash out the hack that now exists. DRI2 calls the slave driver copy region callback, which checks if the drawable pixmap is on the same screen; if it's not, it checks if we've marked the pixmap as a prime pixmap (i.e. one that belongs to the master). If it is, it swaps in the slave's copy, otherwise it calls back into DRI2. This callback calls the Intel driver to make the buffer object backing the pixmap shareable, and returns the handle, then calls into radeon with the handle to create a new pixmap pointing at the shared buffer object. Once all that is done, radeon copies the back buffer to the shared front pixmap, we return and damage is posted and the compositor grabs the window pixmap and displays it.
So does it work?
On my blistering fast test system with X + xcompmgr running glxgears was going at 150fps from the r200 PCI card. Hopefully I can get some time on a faster system or one of the dual laptops.
Caveats:
- When a window manager is running the gears get all corrupted, this looks like the clipping and/or stride matching between the drivers isn't correct. I suspect something with reparenting and decorations, I'm not enough of an X guru to understand this yet, hopefully one of the other hackers can fill me in. Also before it gets reparented and redirected a frame can land on the real front buffer, again clipping should take care of this, but isn't working yet. I need to work out how clipping and that stuff works in X/DRI2. - talk to ppl about clipping then JDI.
- Once a client has connected as a prime, we don't tear it down properly, so later clients can end up marked as prime. - work out some sort of resources to turn stuff off
- Reference counting on the pages in the kernel is iffy, currently i915 ups the page list refcount but never drops it. solution JDI
- hardcoded /dev/dri paths in dri2 for slave device - solution JDI
- radeon driver could in theory be a prime master - solution JDI
- nouveau could support prime master/slave also. - solution nouveau guys JDI
- requires an ugly second screen in xorg.conf to load the slave driver. Can we have a 0 sized screen or maybe a rootless second screen. - solution : rearchitect X server to allow drivers without screens (6m-1yr work)
- pageflipping needs to be hacked off in intel driver. - work out and then JDI
Where is the video?
Once I get it working with a window manager on a useful machine I might do a video of two gears going.
Where now?
Well this is a purely academic exercise so far, after a week of kernel fighting I decided to do something new and cool. To make this as good as Windows we need to seriously re-architect the X server + drivers. At the moment you can't load an X driver without having a screen to attach it to, I don't really want a screen for the slave driver, however I still have to have one all setup and doing nothing and hopefully not getting in the way. We'd need to separate screen + drivers a lot better. Having some sort of dynamic screens would probably fall out of this work if someone decides to actually do it.
The kernel bits aren't as ugly as I thought but I'm not sure if upstreaming them is a good idea without the other bits. The refcounting definitely needs work, as does the cleanup when clients exit.
DRI2 needs some more changes, I might try and flesh it out a bit more and then talk to krh about a sane interface.
I'm probably going to get a forced task switch quite soon, so I might just get to having this running on a W500 or T500, before dropping it for 6 months, so if anyone wants a neat project to play with and has the hw feel free to try and take this on.
ASUS feel free to send me one of the real optimus laptops and I'll get nouveau guys hooked up and try and RE the nvidia DMA engine.
March 12, 2010 06:16 AM
March 08, 2010
...sponsored by Microsoft, tomorrow at 18h. I decided to take a look, so there will be some fun ;-). [And I guess I can always run away when it gets too bad.]
March 08, 2010 09:09 PM
I've been going through the glibc sparc optimized assembler routines
to see if anything can be improved. And I took a stab at seeing if
strlen() could be made faster. Find first zero byte in string, pretty
simple right?
The first thing we have to discuss is the infamous trick coined by
Alan Mycroft, way back in 1987. It lets you check for the presence of
a zero byte in a word in 3 instructions. There are 2 magic constants:
#define MAGIC1 0x80808080
#define MAGIC2 0x01010101
If you're checking 64-bits at a time, simply expand the above magic
values to 64-bits.
Then, given a word the check becomes:
if ((val - MAGIC2) & ~val & MAGIC1)
goto found_zero_byte_in_word;
Essentially we're subtracting MAGIC2 to induce underflow in each
byte that has the value zero in it. Such underflows cause the top
bit (0x80) to get set in that byte. Then we want to see if the top bit
is set after subtraction in any byte where it wasn't set before
the subtraction.
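As a rough illustration, here is the 32-bit test as a small C function. The function names and the example word values are made up for this sketch; only the constants and the expression come from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define MAGIC1 0x80808080u  /* top bit of every byte */
#define MAGIC2 0x01010101u  /* low bit of every byte */

/* Nonzero if and only if some byte of val is 0x00. */
static uint32_t has_zero_byte(uint32_t val)
{
    return (val - MAGIC2) & ~val & MAGIC1;
}

/* A few spot checks on arbitrary example words. */
static void has_zero_byte_examples(void)
{
    assert(has_zero_byte(0x41424344u) == 0);  /* "ABCD": no zero byte        */
    assert(has_zero_byte(0x41420044u) != 0);  /* zero byte in bits 15..8     */
    assert(has_zero_byte(0x80818283u) == 0);  /* top bits were already set   */
}
```

The third check is the reason for the "& ~val" term: bytes that already had their top bit set before the subtraction must not count as hits.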
To get the most parallelization on multi-issue cpus, we want to
compute this using something like:
tmp1 = val - MAGIC2;
tmp2 = ~val & MAGIC1;
if (tmp1 & tmp2)
goto found_zero_byte_in_word;
to reduce the number of dependencies such that the computation
of tmp1 and tmp2 can occur in the same cpu cycle.
Then there is all the trouble of getting the source buffer aligned
so we can do the fast loop comparing a word at a time. The most
direct implementation is to read a byte at a time, checking for zero,
until the buffer address is properly aligned. This is also the
slowest implementation.
The powerpc code in glibc has a better idea. If dereferencing a
non-word-aligned byte at address 'x' is valid, so is reading the
word at 'x & ~3' (or 'x & ~7' on 64-bit). This is because page
protection occurs on page boundaries, and x and 'x & ~3' are on
the same page.
The only thing left to attend to is to make sure we don't match the
alignment pad bytes with zero. This is solved by computing a mask
of 1's and writing those 1's into the word we read before we do
the Mycroft computation above. In C it looks something like:
orig_ptr = ptr;
align = (unsigned long) ptr & 3;
mask = ~0UL >> (align * 8);
ptr = (void *) ((unsigned long) ptr & ~3UL);
val = *ptr;
val |= ~mask;
if ((val - MAGIC2) & ~val & MAGIC1)
goto found_zero_byte_in_word;
At which point we can fall into the main loop.
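To make the pad-byte masking concrete, here is a minimal sketch that works on a word value rather than on real memory, so it sidesteps the aligned-load part entirely. It assumes a big-endian 32-bit word (as on sparc and powerpc), where the bytes that precede the string start are the most significant ones; the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Force the 'align' leading (most significant) bytes of a big-endian
 * 32-bit word to 0xff, so bytes that lie before the start of the string
 * can never be mistaken for the terminating zero. */
static uint32_t mask_pad_bytes(uint32_t word, unsigned align)
{
    assert(align < 4);
    uint32_t mask = UINT32_MAX >> (align * 8);  /* ones over the real bytes */
    return word | ~mask;                        /* ones over the pad bytes  */
}
```

For example, a string starting two bytes into a word whose pad bytes happen to be zero, 0x00004142, becomes 0xffff4142, and the Mycroft test no longer fires on the padding.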
Once we find the word containing a zero byte, we have to iteratively
look for where it is in order to compute the return value. How to
schedule this is not trivial, and it's especially cumbersome on 64-bit
(where we have to potentially check 8 bytes as opposed to 4).
Anyways, let's analyze the 64-bit Sparc implementation I'm hacking on
at the moment. I'm targeting UltraSPARC-III and Niagara2 for
performance analysis. Simply speaking UltraSPARC-III can dual-issue
integer operations, and Niagara2 is single issue and predicts all
branches not taken (basically this means: minimize use of branches).
davem_strlen:
mov %o0, %o1
andn %o0, 0x7, %o0
ldx [%o0], %o5
and %o1, 0x7, %g1
mov -1, %g5
Save away the original string pointer in %o1. At the end we'll compute
the return value as "%o1 - %o0". Align the buffer pointer and load a word
as quickly as possible. We load the first word early so that we can hide
the memory latency behind all of the constant and mask formation we need to
do before we can make the Mycroft test.
%g5 holds the initial part of the mask computation (-1, which gets expanded
fully to 64-bits by this move instruction) and %g1 will have the shift
factor.
sethi %hi(0x01010101), %o2
sll %g1, 3, %g1
or %o2, %lo(0x01010101), %o2
srlx %g5, %g1, %o3
sllx %o2, 32, %g1
sethi %hi(0x00ff0000), %g5
%o2 is going to hold the "0x01" subtraction magic value expanded to
64-bits. %o3 will first hold the initial word mask, and then
it will hold the "0x80" magic constant. We can compute the
two 64-bit magic constants into registers in 5 instructions.
Pick either of the two constants, we choose the "0x01" here because
we'll need it first. This is loaded first using "sethi", "or".
This gives us the lower 32-bits of the constant, then we shift up
a copy by 32-bits, then or that into the lower 32-bit copy to
compute the final value. "0x80" is "0x01" shifted left by 7 bits
so a simple shift is all we need to load the other 64-bit constant.
The "0x00ff0000" constant will be used while searching for the zero
byte in the final word.
Next, we mask the initial word and fall through into the main loop.
orn %o5, %o3, %o5
or %o2, %g1, %o2
sllx %o2, 7, %o3
Mask in the pad bits using the mask computed in %o3. Finish computation
of 64-bit MAGIC2 (the "0x01" constant) into %o2, and finally put MAGIC1
(the "0x80" constant) into %o3. We're ready for the main loop:
10: add %o0, 8, %o0
andn %o3, %o5, %g1
sub %o5, %o2, %g2
andcc %g1, %g2, %g0
be,a,pt %xcc, 10b
ldx [%o0], %o5
This is a real pain to schedule because there are many dependencies.
But the "andn", "sub", "andcc" sequence is the Mycroft test, and
those first two instructions can execute in one clock cycle on
UltraSPARC-III. The ",a" annul bit on the branch means that we
only execute the load in the branch delay slot if the branch is
taken.
Now we have the code that searches for where exactly the zero byte
is in the final word.
srlx %o5, 32, %g1
sub %o0, 8, %o0
We over advanced the buffer pointer in the main loop, so correct
that by subtracting 8. Prepare a copy of the upper 32-bits of
the word into %g1.
andn %o3, %g1, %o4
sub %g1, %o2, %g2
add %o0, 4, %g3
andcc %o4, %g2, %g0
movne %icc, %g1, %o5
move %icc, %g3, %o0
This is divide and conquer. Instead of doing 8 byte compares, we
first see if the upper 32-bits have the zero byte. We essentially
redo the Mycroft test on the upper 32-bits of the word.
If the upper 32-bits have the zero byte, we use %g1 for the comparisons.
Otherwise we retain %o5 for the subsequent comparisons and advance
the buffer pointer by 4 bytes. This is what the final two conditional
move instructions are doing. Note that these conditional moves use
'%icc', the 32-bit condition codes.
The astute reader may wonder why we can't just use the upper 32-bits
of the Mycroft computation we made in the main loop. This doesn't work
because an underflow borrows from the next byte up and can cause false
positives in upper bytes of the word. For example, consider a value where
bits 39 down to 24 have hex value "0x0100". The Mycroft computation
yields "0x8080" for those two bytes, but the real zero byte is the lower
one, not the upper one.
So we can't merely use the upper 32-bits of the already computed 64-bit
Mycroft mask, we have to recompute it over 32-bits by hand.
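The false positive is easy to reproduce in C. This is just a sketch: the helper names and the example value are invented, with the value chosen so that the byte at bits 39..32 is 0x01 and the byte right below it is 0x00.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t mycroft64(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

static uint32_t mycroft32(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

static void borrow_false_positive_example(void)
{
    /* Only the byte at bits 31..24 is really zero. */
    uint64_t val = 0x4141410100414141ULL;

    /* The borrow out of the zero byte also flags the 0x01 byte above it. */
    assert(mycroft64(val) == 0x0000008080000000ULL);

    /* So the upper half of the 64-bit mask claims a zero byte... */
    assert((uint32_t)(mycroft64(val) >> 32) != 0);

    /* ...while a fresh 32-bit test of the upper half correctly finds none. */
    assert(mycroft32((uint32_t)(val >> 32)) == 0);
}
```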
Now we're left with 32-bits to check for a zero byte, we make extensive
use of conditional moves to avoid branches:
mov 3, %g2
srlx %o5, 8, %g1
andcc %g1, 0xff, %g0
move %icc, 2, %g2
andcc %o5, %g5, %g0
srlx %o5, 24, %o5
move %icc, 1, %g2
andcc %o5, 0xff, %g0
move %icc, 0, %g2
add %o0, %g2, %o0
We check starting at the low byte and work up to the highest byte,
because the highest byte, if zero, takes priority. We add the offset of
the zero byte to the buffer pointer.
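For readers who don't speak SPARC, the whole "find the byte" step can be sketched in C roughly as follows. This mirrors the structure of the assembly (pick a 32-bit half, then let later checks override earlier ones), but it is an illustration with made-up names, not a translation anyone has benchmarked. It assumes a big-endian word, so offset 0 is the most significant byte.

```c
#include <assert.h>
#include <stdint.h>

/* Offset (0..7) of the first zero byte in a big-endian 64-bit word.
 * The caller must already know that the word contains a zero byte. */
static unsigned zero_byte_offset_be64(uint64_t word)
{
    uint32_t upper = (uint32_t)(word >> 32);
    uint32_t half = (uint32_t)word;   /* default: search the lower half */
    unsigned base = 4;
    unsigned idx = 3;                 /* default: lowest byte of the half */

    assert(((word - 0x0101010101010101ULL) & ~word
            & 0x8080808080808080ULL) != 0);

    /* Redo the Mycroft test on the upper 32 bits only. */
    if ((upper - 0x01010101u) & ~upper & 0x80808080u) {
        half = upper;
        base = 0;
    }

    /* Low byte to high byte: a later (higher) hit overrides earlier ones. */
    if (((half >> 8) & 0xffu) == 0)
        idx = 2;
    if ((half & 0x00ff0000u) == 0)
        idx = 1;
    if ((half >> 24) == 0)
        idx = 0;

    return base + idx;
}
```

A compiler is free to turn those ifs into conditional moves or branches as it sees fit, which is exactly the part the hand-written assembly is trying to control.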
Finally:
retl
sub %o0, %o1, %o0
We compute the length and return from the routine.
Many many moons ago, in 1998, Jakub Jelinek and his friend Jan Vondrak
wrote the routines we use now on sparc. And frankly it's very hard to
beat that code especially on multi-issue processors.
The powerpc trick to align the initial word helps us beat the existing
code for all the unaligned cases. But for the aligned case the existing
code holds a slight edge.
So now I've been trimming cycles as much as possible in the new code
trying to reach the state where the aligned case executes at least as
fast as the existing code. I'll check this work into glibc once I
accomplish that.
The Mycroft trick extends to other libc string routines. For example
for 'memchr' you replicate the search character into all bytes of
a word, let's call it 'xor_mask' and in the inner loop you adjust
each word by using:
val ^= xor_mask;
Then use the Mycroft test as in strlen(). Another complication with
memchr, however, is the need to check the given length bounds.
This can be done in one instruction by putting the far bounds into
your base pointer register (called '%top_of_buffer' below), then
using offsets starting at "0 - total_len" (referred to as
'%negative_len' below).
Then your inner loop can do something like:
ldx [%top_of_buffer + %negative_len], %o5
addcc %negative_len, 8, %negative_len
bcs %xcc, len_exceeded
...
We exit the loop when adding 8 bytes to the negative len produces a
carry, i.e. when the offset reaches or crosses zero.
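Put together, a portable C sketch of the memchr idea might look like the following. It is not the sparc code: the function name is made up, words are loaded with memcpy so alignment and endianness don't matter, and once a word tests positive the exact byte is found with a plain byte loop instead of anything clever. The negative offset plays the role of '%negative_len'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

static const void *sketch_memchr(const void *buf, int c, size_t len)
{
    assert(buf != NULL || len == 0);

    /* Point at the far end of the buffer and count up from -len to 0. */
    const unsigned char *top = (const unsigned char *)buf + len;
    ptrdiff_t neg = -(ptrdiff_t)len;

    /* Search character replicated into every byte of a word. */
    uint64_t xor_mask = ONES * (unsigned char)c;

    while (neg <= -8) {
        uint64_t val;
        memcpy(&val, top + neg, sizeof val);
        val ^= xor_mask;               /* matching bytes become zero */
        if ((val - ONES) & ~val & HIGHS)
            break;                     /* some byte in this word matches */
        neg += 8;
    }

    /* Pin down the byte in the flagged word, or scan the short tail. */
    for (; neg < 0; neg++)
        if (top[neg] == (unsigned char)c)
            return top + neg;
    return NULL;
}
```

The Mycroft test is exact about whether a zero byte exists in the word, so breaking out of the word loop always leads to a hit in the byte loop.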
If you're interested in this kind of topic, bit twiddling tricks and
whatnot, you absolutely have to own a copy of "Hacker's Delight" by
Henry S. Warren, Jr.
March 08, 2010 05:09 PM
March 07, 2010
Yep, that's me; and yes, I know what the cue to slow down the horse is -- lean back and use both reins. And yes, you can stop the horse by doing "slow down" three times...
But that's not a way to stop the horse. If you are going full gallop and need to stop, you want full stop now cue, not three slow down cues.
Now, I knew some horses that were actually very good at stopping, and yes, there's a huge difference between stop now and slow down to full stop. The cue those horses were trained to was "whoa"...
So I tried teaching that cue to young stallion here, and it does not really work. Or rather... it works a bit too well.
I know many horses where "whoa" means slow down, so I sometimes utter it when I want to just slow down... and then the horse comes abruptly to a full stop. What is worse, many other words trigger the same response -- I guess they are too similar for a stallion's ears.
There must be some reasonable cue, that is impossible to mistake for the horse, and unlikely to be given accidentally by the rider... unintended full stop is almost "and now climb back to the horse" event... but what is it? For now I know "whoa" is neither :-(.
(And for the record, I probably could teach horse to do full stop on something completely crazy -- like hand touching his tail -- he's learning almost too quick.)
March 07, 2010 07:50 PM
March 05, 2010
I haven't had much time for blogging recently, too much exciting work
going on at OsmocomBB:
- we now have simplistic support for Uplink (transmit) on SDCCH/4
- we have a minimal Layer2 (LAPDm) implementation
- we can send LOCATION UPDATING REQUEST to the network, and receive
the respective response
- there's wireshark integration, i.e. all packets on the L1-L2 interface
can be sent into wireshark for protocol analysis
There are still many limitations, but this is a major milestone in the project:
We have working bi-directional communication from the phone to the network!
The limitations include:
- The cell has to use a combined CCCH (SDCCH/4 on timeslot 0)
- The cell has to use no encryption/authentication
- The layer2 is not finished, especially re-transmissions will not work yet
- There's no power control loop yet
- There's no timing advance correction
However, most of those are more or less simple "we know what needs to be
done, it's just a matter of getting it done" kind of tasks. There are no big
unknowns involved, and particularly no further reverse-engineering of the hardware
is required.
Also, the existence of a stable bi-directional communications channel between
the network and the phone means that anyone interested in working on the higher
layers can now actually do so. Completing and testing layer2 as well as
RR/MM/CC on layer3 is a major task in itself, and it definitely requires
the lower layers to be there.
The other good part is that development of layer2 and layer3 can happen
entirely on the host PC, where debugging is much easier and there's no need for
cross-compilation and we can use all the usual debugging options (gdb,
valgrind, ...)
I'm now almost heading off for holidays (starting March 10), so don't expect
any major progress from me anytime soon. I hope other interested developers
will be able to take it from here and fill in some missing gaps until I get
back.
March 05, 2010 01:00 AM
March 04, 2010
Folks,
Sorry for the delay. I should have updates out before the end of the week. Thanks. Remember, this is a spare time project and takes a lot of effort to do properly.
Jon.
March 04, 2010 08:50 AM
They are quite dramatic, but still very small - I committed search protocol changes. Now a node stores transactions with IDs greater than or equal to the node's ID (previously it stored smaller or equal IDs), which is incompatible with the current node searching, but allows human-readable and logical (for humans) ID generation.
So when a node has ID, say, 0100..., it will host data transactions which start from 01 (the highest byte). It is much more convenient to configure nodes with this in mind than to calculate what is less than 01, namely FF... IDs.
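A tiny sketch of the new rule, under loud assumptions: IDs are shortened to 4 bytes for readability, the node list is kept sorted, and a transaction whose ID is smaller than every node ID is assumed to wrap around the ring to the last node. None of these names come from the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ID_SIZE 4  /* hypothetical; shortened for the example */

/* Index of the node hosting tx_id: the node with the largest ID that is
 * still less than or equal to tx_id. node_ids must be sorted ascending.
 * Wrapping to the last node is an assumption of this sketch. */
static size_t owner_node(const unsigned char node_ids[][ID_SIZE], size_t n,
                         const unsigned char tx_id[ID_SIZE])
{
    assert(n > 0);

    size_t owner = n - 1;  /* assumed wrap-around owner */
    for (size_t i = 0; i < n; i++) {
        if (memcmp(node_ids[i], tx_id, ID_SIZE) > 0)
            break;         /* this node and all later ones start above tx_id */
        owner = i;
    }
    return owner;
}
```

With nodes at 01000000, 80000000 and c0000000, a transaction 01234567 lands on the first node, which is the "starts from 01" behaviour described above.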
I also committed initial metadata support, but none of the low-level IO backends support it yet. I will leave only the Tokyo Cabinet DB and file backends; BerkeleyDB support will be dropped because of its slowness. Metadata is still in a development stage, since there is no clear vision of where this functionality should live - client or server.
I.e. it is possible that the client will tell the server that it wants to insert metadata X into a given object, and the server will read/modify/write the metadata blob itself, or it is possible that the client will download the whole metadata blob, update it locally and then write it back to the server, which will replace the old one with the new data. Likely I will use the former, since it simplifies client development, which should be a higher priority than server simplification.
We also found an interesting bug or feature of the storage - in some cases it is not possible to remove an object; it will come back from the dead. Let's say we have two object copies and one node was turned off. Automatic recovery (not present yet though) will create another copy from the first one on the live nodes. Subsequent object removal will kill both copies on the running nodes. When the turned-off node comes online again, the automatic recovery tool will resurrect the removed object from the copy present on this node.
To date it is all a pure theory, since there is no separate metadata in the storage, thus no automatic recovery (admin should run special tool with properly crafted log file currently) and it does not remove objects from the storage. But still, described problem will hit us badly when we will actively use it.
And while there is no merge implemented either (it is kind of being materialized in my mind while we talk), solution will involve new history entry creation instead of actual data removal. Thus transaction log will contain a note that given object was removed. In case of network split and parallel object removal and update in different parts (which can not contact each other during this event) of the storage, this will also allow to implement correct and complete transaction history log by synchronization daemon.
Thus an object will never be deleted from the storage; instead its history will be updated to store a note about its status. The file system checker will be extended to support a mode in which it actually removes objects from the storage once they have been marked as deleted (and resolved during merge with other logs if needed) and some timeout has passed. The timeout should be big enough to rule out the appearance of such ghost nodes.
And last but not least, the discussed issue concerns storage size and related limitations. Let's say that we reached our current storage capacity and want to add several more machines, which will add 50% of the current volume. We want to spread data equally between all nodes, thus we will need to update every node's ID to shift it a little, so that the new nodes enter the addressing ring and form a fair ID distribution. The amount of transaction copying in this case is quite large - more than half of all data will have to be transferred over the network, which will take a while.
Also, when we add a new empty node into the storage, it will kind of hide the data it is supposed to host (according to the ID distribution) until that data is copied to the new node from the neighbour. Thus there should be a policy which forbids simultaneous update of all servers, since there is a possibility that the added nodes suddenly hide all copies of some objects. It will be recovered of course, but it will take some time, which in some cases is not appropriate.
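Since the merge is not implemented, the following is only a thought experiment in C. It assumes every history entry carries some ordering stamp (a made-up field) and that the newest entry across all logs decides whether the object is alive. With that, the stale copy on a returning node cannot resurrect the object, because the removal note is newer than its write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum history_op { OP_WRITE, OP_REMOVE };

struct history_entry {
    uint64_t        stamp;  /* hypothetical ordering key */
    enum history_op op;
};

/* Newest entry of one log, or NULL for an empty log. */
static const struct history_entry *
newest_entry(const struct history_entry *log, size_t n)
{
    const struct history_entry *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (best == NULL || log[i].stamp >= best->stamp)
            best = &log[i];
    return best;
}

/* Is the object alive once the histories from two nodes are merged? */
static bool live_after_merge(const struct history_entry *a, size_t na,
                             const struct history_entry *b, size_t nb)
{
    const struct history_entry *x = newest_entry(a, na);
    const struct history_entry *y = newest_entry(b, nb);
    const struct history_entry *win = x;

    if (win == NULL || (y != NULL && y->stamp > win->stamp))
        win = y;
    return win != NULL && win->op == OP_WRITE;
}

/* The ghost-node case: one node slept through the removal. */
static void ghost_node_example(void)
{
    const struct history_entry stale[]   = { { 1, OP_WRITE } };
    const struct history_entry current[] = { { 1, OP_WRITE },
                                             { 2, OP_REMOVE } };
    assert(!live_after_merge(stale, 1, current, 2));
}
```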
One of the solutions for the described storage size issue is different storage policy. We can implement multiple virtual datacenters, where each new virtual datacenter corresponds to newly added set of machines. In this case we will extend write application so that it could 'touch' old hash functions (and thus old virtual datacenters) first to determine whether it can store data there and move to the new machines if there is no space in the old ones. Reading can issue a parallel lookup to all virtual datacenters asking for given object ID.
This scheme has latency limitations as well as network traffic growing with new virtual datacenters involved, but it can be a good decision for smaller setups though.
Virtual datacenters (or configurable hash/transformation functions used to generate transaction ID) become one of the most flexible 'tools' to implement different storage setups.
Stay tuned, there will be more news soon!
March 04, 2010 12:08 AM
March 01, 2010
Although IT professionals should take care to avoid engineering envy, it is often useful to learn from the experiences of other engineering disciplines. In this posting, I will compare and contrast construction of a building to implementation of a large software project.
Leaving aside financial engineering, building construction starts with an architect, who lays out the general shape and look of the building. A structural engineer creates a detailed design, with an eye towards ensuring that the building will remain standing despite the best efforts of wind, gravity, and plate tectonics. A construction engineer works out the details of the construction process — for example, it is good if the building can support itself while being built as opposed to doing so only when completed. Other engineering specialties may be required as well, for example, HVAC (heating, ventilating, and air conditioning).
Once the building is built, different skills are needed, including operating engineers, maintenance personnel, and janitors.
A very similar sequence of events can play out for a large software application. Software architects (for better or worse) lay out the general shape of the project, developers design and code it, and others ensure that it is built, tested, and safely ensconced in some source-code management system.
However, once the application is completed, it is likely that its care and feeding will be taken over by application, database, and system administrators. The architects and developers will switch to other projects (possibly version N+1 of this same application), and perhaps even retire or otherwise move on. Of course, if the application runs at multiple sites, there might well be a separate set of administrators for each site. But for simplicity, let's assume that this application runs at only one site.
Now suppose that it is necessary to parallelize this application.
This is tantamount to major structural change to the building, such as adding several new floors. A structural change of this nature is clearly not a job that you would normally entrust to operating engineers, maintenance personnel, or janitors.
But what else can you do if the original architects and developers are gone?
March 01, 2010 01:43 AM
In the Motorola/Compal C155 phone
supported by OsmocomBB, we have found a ringtone
melody chip called SPMA100B from sunplus.
As strange as it might seem, this is the only part used in the phone for which we have
not been able to find any kind of programming information. So if you know anything
about how to program this part from software (register map, programming manual, ...)
please let me know!
And no, we don't need electrical/mechanical data sheets, thanks :)
March 01, 2010 01:00 AM
February 28, 2010
Visited an old mine today... Actually for an orienteering run. And saw some pretty impressive tech...
Mine is actually from 1890 or so, and it was running up to 1997 or so. They had some wonderful hacks -- like steam engine, still powering the elevator up to 1997, but running on compressed gas.
And because they did not use the computers to control the elevator, they had to use two-operators, and blackbox type device recording elevator speeds and communication over single rope. Speedometer used mercury. Impressive.
But... on the other hand they kept things simple. Steel pipe was used for communications 500 meters underground. Single part. In 1980, they'd probably use two analog phones and a battery, about 10 parts total. Today, we'd probably use two computers, running VOIP over ethernet, for about 1000 million parts total. Is not progress wonderful?
February 28, 2010 09:30 PM
I used to hate skiing - I wasted 3 years in the running-ski section at university, while I could have played football or, let's say, chess. Well, there was no chess section, but anything else would have been more interesting than skiing.
And this year I discovered alpine skiing for myself. I last did it about 15-20 years ago when I was in school, on simple small plastic skis. Technology has made significant progress since then, and I got the chance to test real skis.
That's what I did this and previous weekends - two days in Stepanovo ski resort. It was essentially the first time I tried big slope (not that big compared to real resorts in Europe of course, just about a kilometer or less and 100 meters drop) and real snow. And it was fucking incredible - it is fast, it is long enough to feel the speed and ground, it is quite different - there are multiple traces and a lot of small roads from main trace, where one can ride over hummocks and small ski jumps.
I bought myself all equipment except skis itself - want to touch different things first, but I believe I will get my own next time. With the proper equipment it is not cold, warm or wet, it is just ubercool. Getting that I basically have no technique, I open lots of cases for myself all the time. And I believe that I have some progress, maybe not that good, but very pleasant for myself.
I tried the long blue trace previously, but today I started a red one. And it was fucking beautiful - so fast and so strong. No boring places and long waits, just pure pleasure of speed and control. On this trace I found myself moving noticeably more technically than on a simpler trace.
I started to sit lower, put legs closer and change ski edges using mass center and not ass or legs, pipe changing arcs became shorter and with longer radius, which increased speed compared to plain skiing.
Of course it was not always perfect, and frankly I believe it looked like crap and was real crap from a good technique point of view, but it was very pleasant for me, and that's what matters. I want to get another hour or so with a good teacher, who will tell me where the main problems are, since I can not see how I ski down the slope. Sometimes I flew a couple of meters over the trace and then landed in 'different positions', usually already without skis, sliding on my body for another dozen meters. But I like it too - it shows complex cases and sharpens instincts.
Currently I believe there are no somewhat big parts of my body which do not try to scream and ache. Especially shin bones (hard to move or stand for long) and various leg muscles, but it is not a problem - I will be fresh again in a day, and hundred or so of "The Glenrothes" and couple of hours playing piano and trumpet will quickly help me. So the plan is to make another turn next weekend or preferably move to the ski resort a couple of times.
Fucking incredible. Just love it!
February 28, 2010 06:43 PM
Due to the big storm Thursday night, we spent two days without power. After freezing ourselves on Friday, we decided to spend Saturday at a friend's place (thank you Aris, Chris and Sarah). While checking on our house, there were always crews at work trying to clear up the fallen trees, reopen roads and reconnect power and communications lines. A big thank you goes out to the power and telco crews who are working around the clock to clear up the mess and reconnect New England.
February 28, 2010 05:24 PM
February 27, 2010
I've uploaded man-pages-3.24 into the release directory (or view the online pages). The most notable changes in man-pages-3.24 are the following:
- The addition of three pages by David Howells describing the kernel key management facility: add_key(2), request_key(2), and keyctl(2). (These pages were formerly part of the keyutils package.)
- The fcntl(2) manual page adds documentation of F_SETOWN_EX and F_GETOWN_EX, which are new in Linux 2.6.32.
- Minor changes to many other pages.
February 27, 2010 05:00 PM
February 26, 2010
Barnes and Noble released the nook source code last week. This includes the code to busybox, uboot and their kernel. Unfortunately, the uboot and kernel code both appear to be missing swathes of code found statically linked in the binaries that they're distributing. License compliance is hard, let's flail wildly.
February 26, 2010 06:31 PM
Okay I've been busy elsewhere but dragged myself back to try and finish this for upstream
v10 of the patch is up
http://people.freedesktop.org/~airlied/vgaswitcheroo/0001-vga_switcheroo-initial-implementation-v10.patch
changes are mainly that mjg59 was right about keeping ugly things in the drivers.
adding ATRM support to get the ROMs on ATI hybrid for the discrete card was actually a pain with the previous code design,
so I moved lots of it around again, and now the discrete ROM can be retrieved via the ATRM method.
I've tested it on the W500 and it works as well as before, which means still the 3rd or 4th switch fails and locks the machine up,
I need to debug this further.
The refactored code should hopefully make it easier to fill in the nvidia/nvidia and intel/nvidia blanks for mjg59.
Update 1: v11 is now up
http://people.freedesktop.org/~airlied/vgaswitcheroo/0001-vga_switcheroo-initial-implementation-v11.patch
It should fix the failure to switch to IGD the 2nd time hopefully.
Update 2: v13 is now up, it blindly implements nvidia DSM changing, but I've no idea if it works. Hopefully someone can test it and give me some feedback. It's nearly all guesswork from work mjg59 did.
February 26, 2010 05:04 AM
February 25, 2010
Chris asks where OpenSolaris is headed. My reaction: nobody cares anymore. FreeBSD established itself as the alternative to Linux, and that leaves Solaris with no niche. So, whatever. It is much more important what is going to happen to OpenOffice and MySQL. Also, Sun carried a pretty large assortment of lesser projects, such as Lustre.
February 25, 2010 11:13 PM
February 24, 2010
ld gives you "Can not allocate memory".
(turned out to be a corrupt object file)
February 24, 2010 07:21 PM

I'm probably moving my office to be above the garage.
In preparation for that, I did the whole "get CAT6 networking to the new location" thing, which has involved re-acquainting myself with our crawlspace. Spending my days crawling around, hoping I'm not going to encounter any dead mice (or live ones, for that matter).
I obviously already had cable going to various locations in the house, but the way that had happened, I'd done them one at a time, and my current office ended up being the hub for it all. And since I really wasn't going to re-route all the cables and make the new office be another hub of chaos, and I certainly wasn't going to leave the hub in what will become a kids bedroom, the above is the result.
Beautiful it ain't. It's a real media center enclosure, but the networking hubs that are meant for those things are overpriced and generally just pitiful 4-port 100Mbps switches with dubious firewall capabilities, so I'm just installing my own. And some day, I'll actually add the screws that hold the boxes where they are supposed to go, rather than just sitting in a pile on top of each other at the bottom of the box.
I haven't had the energy to fix the telephone wiring. As you can see, I now have the header for getting that particular mess sorted out too, but I'm not the person who created that particular "rat king" of cabling under our house in the first place. So I'm not feeling the need quite acutely enough to spend another few hours crawling around straightening out all that wiring. Same goes for TV cabling. You can kind of tell what part of the house wiring I actually care about...
February 24, 2010 10:38 AM
February 23, 2010
FLOSS weekly podcast has interview with someone from Symbian foundation. Interesting point is, that even Symbian people acknowledge Android as good, but will try to attack it from below, by using less power and running on smaller device. They even have a blog.
What they do not have is working system on real hardware... which is quite interesting. They claim to be using qemu and beagleboard, citing lack of drivers and claiming no open devices exist. I guess someone should show them OpenMoko or HTC Dream (ADP1). Plus they do have their own c++ dialect, with proprietary compiler and
Ouch, and what they do have is design by committee. Actually design by 4 committees :-(.
Anyway, it is great to see more opensource competition in cellphones; and I hope it does not mean death of Maemo platform.
February 23, 2010 09:20 PM
Ok, so I got paper version of Blade Runner... and I enjoyed it, even though I expected a bit more.
But now... my android seems to be downloading electric sheep at 100MB/night rate. And yes, it continues during the day, too, and would probably do more if I had better connection than GPRS.
No, rebooting the phone did not help. According to 'spare parts', component responsible for the traffic is 'media'... which is alias for 'download manager' and pretty much opaque. So I tried plain old tcpdump, to find that it is talking to 1e100.net; I have custom rom but it was still trying to download updates.
Solution is "easy": disable background data. Unfortunately, it also disables market and gtalk. Is there better solution?
Oh and it is now clear. Androids do not dream of electric sheep, they dream of digital donuts.
February 23, 2010 05:16 PM
February 22, 2010
Ran "yum update" today on F12 and the rewrite of BIND configuration produced a fail-to-start again. Only instead of a blatant syntax error with unbalanced braces like when DNSSEC was first enabled, they merely referred to a nonexistent file (/etc/pki/dnssec-keys//named.dnssec.keys). BTW, I looked everywhere, it's not a part of any package we ship in Fedora. What a facepalm, in the middle of stable release too. You know, the anti-Rawhide people always bring it up how Rawhide is "not guaranteed" to work. Well, is F12 "guaranteed"?
For about four recent releases it became noticeable that Fedora folks put a lot of effort into the QA and polish, but once release is out of the door, controls are relaxed and all sorts of dubious code flows freely in the guise of "security" updates. The S-word is some kind of a magic key that trumps any basic quality. The net result is going to be people installing releases and then never updating, once they catch up on what's happening. What's worse, once this folk wisdom gets established, it cannot be easily reversed even if updates become quality checked.
February 22, 2010 08:41 PM
My God, I've been vaguely aware of the HTML5 video train
wreck but I hadn't realised just how much of a fucking
abortion the rest of the HTML5 'standard' is.
I had the misfortune to read the section
on character encodings over the weekend, and it almost
made me lose my lunch.
Not only does it codify the crappy and unreliable practice
of applying heuristics to guess character encodings, it also
requires that a user agent deliberately ignore the
explicitly specified character set in some cases — for
example, text explicitly labelled as US-ASCII or ISO8859-1
MUST be rendered as if it were Windows-1252!
It justifies this idiocy, which it admits is a 'willful
violation', on the basis that it aids compatibility with
legacy content. By which of course it means "broken
content", since this was never actually necessary for anyone
who published content correctly even with older versions of
HTML.
But that doesn't make any sense — surely legacy
content won't be identifying itself as HTML5? It might be
reasonable to do these stupid things for legacy content, but
not HTML5. The complete mess we have with charset labelling
is a prime example of where the RFC1122 §1.2.2 approach of
being lenient in what you accept has
turned out to be massively counter-productive — if
we'd simply refused to make stupid guesses about character
sets in the first place, then people would have actually
started getting the labelling right.
The sensible approach to take with HTML5
would just have been to say "All content which identifies
itself as HTML5 MUST be in the UTF-8 character encoding. A
conforming user agent MUST NOT attempt to interpret content
as if it has any other encoding; any invalid UTF-8 byte
sequences MUST be shown using the Unicode replacement
character U+FFFD (�) or equivalent."
Or, if we really must continue to permit the legacy crap
8-bit character sets, it should have said that the content
MUST be in the character set specified in the HTTP
Content-Type: header or equivalent
<META> tag.
Keep the stupid heuristics for legacy content by all means,
but it should be forbidden to render HTML5
content in a character set other than the one it is labelled
with, and all invalid characters (including the C1
control characters in ISO8859-1 which in Windows-1252 would
map to extra printable characters like the Euro sign)
MUST be shown as U+FFFD (�). And then the people
who publish broken crap would see that they're
publishing broken crap, rather than thinking it's OK because
the browser they use just happens to assume the same
character set as the system they're publishing from.
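The difference between honouring the label and the mandated Windows-1252 reinterpretation is easy to see with Python's built-in codecs. This is only a small sketch: the sample bytes and the helper name render_strict are made up for illustration and come from neither the HTML5 draft nor any browser.

```python
# Byte 0x80 is a C1 control character in ISO8859-1 but the Euro sign in
# Windows-1252, so the same "Latin-1" page decodes differently.
raw = b"price: \x80 5"

as_labelled = raw.decode("iso8859-1")    # what the label says
as_win1252 = raw.decode("windows-1252")  # what the draft requires instead


def render_strict(data: bytes) -> str:
    """Decode as UTF-8 only; every invalid byte sequence becomes U+FFFD."""
    return data.decode("utf-8", errors="replace")


print(repr(as_labelled))
print(repr(as_win1252))
print(repr(render_strict(raw)))
```

With the strict variant the publisher of the broken page sees replacement characters immediately, instead of a Euro sign that only appears because of a guess.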
To me, HTML5 looks less like a standard and more like a set
of broken hackish kludges to work around the fact that
people out there aren't actually capable of
following a standard.
February 22, 2010 12:31 PM
February 21, 2010
I knew the 6230 is a good phone, and yes, it seems to keep coming back. I have already lost it on a bus twice (and good people returned it both times), and lost it from a bicycle and from horseback...
I went to the mountains, and estimated the trip from the bus to Petraska at 4 hours (arriving at circa 23:30). But I selected a shorter way over the ski slope and made it in under two... only to realize that I had lost the 6230 somewhere.
I was told I had no chance of finding it; but on a nice, quiet night a ringing and blinking phone is rather easy to find, so I disagreed and went back for a rescue -- the 6230 still had signal and was ringing somewhere in the mountains.
But I was pretty surprised when I found the 6230 -- it was 5 centimeters under the snow, having taken a direct hit from a snow gun for about 2 hours... I only found it because of the light. The battery was low, but the phone is alive and continues to work.
To whoever designed 6230: thanks!
February 21, 2010 07:26 AM
February 20, 2010
Headed through Germany 26th through 3rd March or so, then Lithuania via Poland. Back via Singapore on 24/25 March.
My email will be intermittent (I hope!) but if you’re around and want to grab a meal or a beer with us, ping me!
February 20, 2010 07:02 AM
I've spent the better part of the day with , renaming files/functions/include paths, Makefiles, autotools and the
like.
The result of this is a new sub-project called libosmocore that
gathers all the shared code between the network-side GSM implementation
OpenBSC and the phone-side implementation OsmocomBB. The library is
portable enough that it can run on a proper OS (like GNU/Linux) but
also be cross-compiled to work on the actual phone without any OS.
On the other hand we now have a master Makefile in OsmocomBB to build
libosmocore for host PC and target (phone), as well as the osmocon
and layer2 host programs and the phone firmware itself.
Let's hope I can now return to writing actual code...
February 20, 2010 01:00 AM
February 19, 2010
As I mentioned, I headed to Pittsburgh last week to give some talks at CMU and find out something about what they're doing there. Despite the dire weather that had closed the airport the day before, I had no trouble getting into town and was soon safely in a hotel room with a heater that seemed oddly enthusiastic about blasting cold air at me for ten seconds every fifteen minutes. Unfortunately, it seems that life wasn't as easy for everyone - ten minutes after I arrived, I got a phone call telling me that the city had asked CMU to cancel classes the next day.
This turned out to be much less of a problem than I'd expected - whether because of their enthusiasm to learn about ACPI or because they simply hadn't noticed the alert telling them about the cancellation, a decent body of students turned up the next morning. After a brief chat with Mark Stehlik, the assistant dean for undergraduate education, I headed off to the lecture hall. The fact that I can now just plug my laptop into a VGA cable and have my desktop automatically extend itself continues to amaze me, as does OpenOffice's seemingly unerring ability to get confused about which screen should have my content and which should be showing me the next slide. Nevertheless, facts were imparted and knowledge dropped on those assembled. I'm even reasonably sure that the contents were factually accurate, which is a shame because the most attractive part of teaching always struck me as being able to lie to students who will then happily regurgitate whatever you tell them in case it turns up on the exam. Perhaps this is why I'm safer out of academia.
Lunch offered an opportunity to visit the Red Hat sponsored lab, which was pleasingly located somewhere other than a basement. The guy on the right of the picture is Greg Kesden, the director of undergraduate laboratories in CS there - it was wonderful to get an opportunity to see the machines getting used, and students seemed genuinely appreciative of the facility.
After lunch I spent a while talking to Satya about the Internet Suspend and Resume project. This is an impressive combination of virtualisation and migration, using a Fedora-based live image to bring up an OS on arbitrary hardware before downloading a machine image and launching it. The majority of the data is pulled in on demand, meaning that initial performance can be slow but ensuring that data is only downloaded if it's needed. When the user is finished, the delta between the original image and the new one can be pushed back to the server while remaining cached on the local machine in case the image is used again.
It's an interesting approach, combining the flexibility of thin clients with the advantages of having actually useful computing power at the local end. There's a few functional awkwardnesses, such as some VMs being unhappy if images are migrated between machines with different CPU features, and it obviously benefits from having significant bandwidth. But the idea of being able to combine the convenience of a floating session with the knowledge that you can still keep copies of your data on you is an attractive one, and I'd love a future where I can move my session between my laptop and a desktop.
After that there was some time to talk to Bill Scherlis and Philip Lehman about the software engineering courses that CMU run. Part of the minor in software engineering includes a course requirement to make a meaningful contribution to an existing software project, from design through to submission and upstream acceptance. I had the opportunity to talk to a couple of the students about this and the differences they found between working with the Mozilla and Chrome communities, which I'll try to write up at some point.
Finally I gave a presentation on Fedora and some of the issues that we face in providing a useful OS when patents and recalcitrant hardware vendors do their best to thwart us. Despite the ice outside and the significantly-below-freezing temperatures, enough people turned up that sorties had to be sent out to find extra chairs. It was great to see how interested people were in learning about what we do, although it's probably the case that the free pizza did help encourage people.
After that it was an early trip back to the airport, where I found that my plane was delayed and the only "restaurant" still open was McDonalds. Even so, I left with the feeling that it had been an interesting and educational visit. Many thanks to David Eckhardt, who runs the OS course I presented to and who looked after me all day - thanks too to Joshua Wise who picked me up when David was running late due to the ground being covered with blocks of ice.
February 19, 2010 09:35 PM
So I was in Costco waiting for a car tire rotation and check yesterday. Wasting time, I blew three bucks on a slice of pizza and a sundae, and looked around for a place to sit down and pig out. The place was packed, and it was the middle of the day.
So I sat down next to this group of people, and realized that one reason it was busy was that apparently people use the Costco foodcourt as a lunch place. Fair enough. A couple of bucks gets you a long way there.
Sitting there, I can't help but overhear that it's apparently some religious discussion going on. Ok, so it's the local God Squad having their lunch meeting, no biggie. They're apparently talking about Africa, and about life and death decisions etc - at least one of them is a missionary.
And that's when it gets strange. One of them starts to seriously talk about praying demons away, and then after the prayer has driven the demon out of the person, you have to support the person so that the demon doesn't come back. And nobody laughs at him.
Seriously? What year is it again? I'm pretty sure they didn't have Costco foodcourts in the middle ages, but maybe there was some time warping going on.
What the hell is wrong with people?
February 19, 2010 11:51 AM
Last, but not least, I am proud to announce the OsmocomBB project publicly. During the last
7 weeks, a small group of skilled developers has been working on this.
It has now reached a point where we can
- scan the spectrum for the strongest signal GSM channels
- lock onto them and perform AFC (automatic frequency control)
- decode the SCH burst to obtain BSIC and GSM frame time
- decode the BCCH of the cell, pass it over to the host PC and feed it into
wireshark
Since this in itself is a valuable and useful milestone of the project,
it was the ideal opportunity to take this project public.
There's still a lot of work to be done in many areas. Most of them are not
even related to the GSM air interface. So if you're familiar with C
development on an ARM7TDMI based microcontroller, know your way around
I2C and SPI, are familiar with the GNU toolchain for ARM and want to
help us out: Please join the baseband-devel mailing
list right away!
February 19, 2010 01:00 AM
February 18, 2010
Its original draft could be read previously, but I believe it became a little bit outdated, so it requires some highlighting.
But first, let's clear up the status of the fsck log checker. I completed its implementation, which is now capable of maintaining a consistent number of copies in the storage. It does not yet allow merging different transaction logs.
To determine which objects to check, it uses a special text log file, which among other info contains the name of the object and the transformation functions to work with. Each transformation function produces a unique ID, which will be checked in the storage. For example, we can put sha1 and md5 transformation functions there, so we will have two IDs equal to the corresponding hash of the input name (and optionally a hash of the transaction content).
When some objects are not present in the storage, the checker will download the first existing copy and try to upload it using the transformation functions corresponding to the missing objects. So, if the object with ID equal to md5(name) is present and sha1(name) isn't, then the checker will download all transactions stored in the existing object and upload them using the sha1 transformation, thus recovering the requested number of copies.
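The recovery step can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the storage is a plain dict from ID to a list of transactions, and the names TRANSFORMS, object_ids and check_object are hypothetical, not the elliptics API or the real checker.

```python
import hashlib

# Hypothetical registry of transformation functions: name -> ID generator.
TRANSFORMS = {
    "sha1": lambda name: hashlib.sha1(name).hexdigest(),
    "md5": lambda name: hashlib.md5(name).hexdigest(),
}


def object_ids(name: bytes, transforms: list) -> dict:
    """One ID per transformation function listed for the object."""
    return {t: TRANSFORMS[t](name) for t in transforms}


def check_object(storage: dict, name: bytes, transforms: list) -> list:
    """Restore missing copies from the first existing one.

    Returns the names of the transformations whose copy was recovered.
    """
    ids = object_ids(name, transforms)
    present = [t for t in transforms if ids[t] in storage]
    if not present:
        return []  # no surviving copy to recover from
    source = storage[ids[present[0]]]  # "download" the first existing copy
    recovered = []
    for t in transforms:
        if ids[t] not in storage:
            storage[ids[t]] = list(source)  # "upload" under the missing ID
            recovered.append(t)
    return recovered


# Only the md5 copy exists, so the sha1 copy gets recreated.
storage = {hashlib.md5(b"photo.jpg").hexdigest(): ["txn-1", "txn-2"]}
print(check_object(storage, b"photo.jpg", ["sha1", "md5"]))
```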
The checker currently requires a log file to get information from, and an admin to start the process.
Background fsck is supposed to eliminate both needs.
The basic idea is to store some metadata with each object, which will tell the origin of the given object and how it was supposed to be stored in the elliptics network. Thus we can periodically or on request parse metadata for all objects on the given node (or only part of them), create a log file and run the existing checker against it.
This is similar to what extended attributes are in existing filesystems. Metadata can contain information not only about what the object is, but also its IO permissions or access policies, owner information and anything else we would like to have there, which will allow us to implement at least a basic security model for the elliptics network as well as simplify the POHMELFS port.
February 18, 2010 08:32 PM
I just read on the xorg mailing list that Paypal stole USD$5k from xorg and another 5k to some Brazilian bankers. I have only used Paypal once and they gave me a USD conversion rate which was half that of legitimate banks and it was all a big drama and felt really unfair. So I really hope that Xorg is able to recover those funds. Maybe one of yous fellas is a lawyer and can help Xorg?
February 18, 2010 04:08 PM
February 17, 2010
Anssi Hannula posted a patch to add Gobi 2000 support to qcserial and provided me with support for gobi_loader. I've added the gobi_loader code here. You'll need Anssi's kernel patch from here, and probably also my followup patch with extra IDs from here. Note that the 2000 devices need an extra firmware file (UQCN.mbn) as well as the apps.mbn and amss.mbn files.
The qcserial driver is currently broken in 2.6.32 and later. It's due to the switch to using kfifo for usb serial, but we haven't been able to work out the actual cause. I'm looking at alternative approaches.
February 17, 2010 09:56 PM
Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100214.mp3
This podcast is brought to you by the colour blue and way too much coffee, together reminding you to check out the awesome power of the BeagleBoard Open Source hardware project at http://www.beagleboard.org/. My new Rev C. board was responsible for the delay getting this issue out…too much fun was had.
For the weekend of February 14th, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.
In this issue: Linux 2.6.33-rc8, x86 bootmem, NFS, OOM, Performance Counters, Relaxation, Stack Sizes, and SysFS mutability.
Linux 2.6.33-rc8. Linus Torvalds announced the release of version 2.6.33-rc8 on Friday February 12th 2010 at 11:49 am Best Coast Time (PST), saying that he hoped it would be the last before 2.6.33 final. He added that, “A number of regressions should be fixed, and while the regression list doesn’t make me _happy_, we didn’t have the kind of nasty things that went on before -rc7 and made me worried”. This kernel includes fixes for the netfilter bugs that I discovered, as well as some KMS regression fixes. In a separate discussion thread started by John Hawley (warthog9), it was debated when kernel.org should move over to using xz (LZMA2) as a replacement for bzip2 compression (remember when bzip2 was trendy and new?). John proposed various migration options before the thread veered off into a discussion around when an eventual 3.0 Linux kernel would come, and what that would actually mean in practical terms – just an arbitrary future release? I expect that LWN will have a typically witty writeup of this discussion sometime this week.
Bootmem. Back in October last year, Ingo Molnar had stated that the kernel may not need the “bootmem” allocator on x86. At the time, he noted that there were 5 different allocators on x86, depending upon the boot stage (to say nothing of the other core allocator options): the generic allocator, the early allocator (bootmem), the very early allocator (reserve_early), the very very early allocator (early brk model), and the very very very early allocator (basically just build time allocation). By initializing the x86 page allocator earlier in the boot process, Yinghai Lu attempts to do just what Ingo had suggested, now in version 6 of his patchset.
NFS. Hirofumi Ogawa noticed (2.6.33-rc6) that recent kernels could not mount remote NFS version 3 shares, because of a userspace visible change in the kernel nfsd server. If he specified “vers=3” at mount time, all was well, but the kernel was not falling back to v3 correctly when v4 fails due to a change in error handling. Bruce Fields noted that this change was actually intentional and that the userspace tools had been updated, but decided to revert the patch that caused this change for the time being – at least until the new versions of the mount tools are much more widespread than right now. Bruce sent a patch entitled (“informingly”) “2.6.33 fix” to Linus.
OOM. David Rientjes posted a patchset re-implementing the OOM killer, in the wake of a number of discussions concerning its brokenness. It includes a complete rewrite of the badness() heuristic, which he then described in some detail within the corresponding patch. Quoting David, “The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ‘allowable’ memory. ‘Allowable,’ in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current’s cpuset, or a memory controller’s limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space.”
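The arithmetic behind that 0 to 1000 scale can be illustrated with a few lines of Python. This is only a sketch of the proportion described in the quote, not the kernel code from the patchset; the function name and its parameters are hypothetical.

```python
def badness_score(rss_pages: int, swap_pages: int, allowable_pages: int) -> int:
    """Share of allowable memory a task uses (RAM plus swap), scaled to 0..1000."""
    if allowable_pages <= 0:
        raise ValueError("allowable_pages must be positive")
    score = (rss_pages + swap_pages) * 1000 // allowable_pages
    # Clamp so the result always stays on the documented 0..1000 scale.
    return max(0, min(score, 1000))


# A task holding 300 pages in RAM and 200 in swap out of 1000 allowable pages
# uses half of what it is allowed, so it lands in the middle of the scale.
print(badness_score(300, 200, 1000))
```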
Performance counters. Christoph Hellwig had complained that a patch had been merged back in September from Arjan van de Ven entitled “perf_core: provide a kernel-internal interface to get to performance counters”. That was intended to facilitate in-kernel use of the performance counters framework, but it was Christoph’s opinion that it had no users and should be reverted. Ingo Molnar countered that there actually were a growing number of users, now including the latest work by Don Zickus to create a generalized NMI watchdog handler.
Relax. Michael Breuer posted an interesting analysis of the implementation of the function cpu_relax on x86 systems. This function is called during spinlock spinning cycles in order to give the CPU a break (power management, etc.). Apparently, that function currently uses a nop, but both the Intel and AMD documentation recommend the PAUSE instruction instead (partly because it can be detected on recent CPUs and used to give special treatment to guest instances running under virtualization that are wasting CPU cycles when multiple vCPUs are allocated and some are spinning away). Arjan van de Ven, and others too, seemed to find this odd, and Artur Skawina wondered if this might be an odd alignment issue. Nonetheless, Michael detects a noticeable performance impact in various tests between these two instructions.
Stack sizes. The kernel contains various task startup code that will create a vma region for its stack use. Existing kernels make this size determination based upon the PAGE_SIZE for the architecture, even though this really is independent of the userspace code that will use the stack, and even given existing rlimits that might see the stack theoretically larger than has been allowed by system limits. Michael Neuling sent a patch to decouple stack sizing from PAGE_SIZE and to default to basing it upon the rlimit.
SysFS. Amerigo Wang posted an RFC patch implementing “mutable sysfs files”. The basic idea is that all potentially “mutable” (that is to say, files that may be yanked out from underneath at any time a hotplug or other operation occurs) files should use a specific API to avoid warnings.
In today’s miscellaneous items: An interesting discussion started by Salman Qazi (Google) centered around a misunderstanding of the ptrace API (and eventual iteration from Oleg Nesterov that the existing API sucks), a January XFS update from Christoph Hellwig (noting new support for netlink provided quota communication, better power saving in XFS kernel threads), Mel Gorman posted version 2 (v2r12) of his “Memory Compaction” patch series that is intended to “defragment” memory by reconciling GFP_MOVABLE pages, and another one of Al Viro’s entertaining rants, this time about pohmelfs and its use of direct access to the current->fs->{root,mnt} entries.
In today’s announcements:
Git version 1.6.6.2. Junio C Hamano announced an update to the 1.6.6 series of the Git SCM tool, releasing version 1.6.6.2. This contains a few fixes.
Git version 1.7.0. Junio C Hamano also announced version 1.7.0 of the Git SCM had been released. This is the latest official version and includes a number of behavioral changes to “git push”, “git send-email”, and other commands as previously noted in this podcast. Users should read the release notes before upgrading if they want to make sure they catch all of the improvements.
Linux 2.6.32.8. Greg Kroah-Hartman, apologizing for the slight delay due to a few crashes that had been reported and a need to verify a security fix, as well as various travel plans, announced the release of 2.6.32.8. It contains a few fixes 2.6.32 users really should have on their systems.
The Linux Storage and Filesystems Summit. James Bottomley announced that the annual Linux Storage and Filesystems summit will take place concurrently with the VM summit on the two days before LinuxCon in Boston (Sunday and Monday), on the 8th and 9th of August. Interested parties can visit either the Linux Foundation website, or email agenda topics to the program committee at lsf10-pc@lists.linuxfoundation.org.
Userspace RCU 0.4.1. Mathieu Desnoyers announced the latest release of his Userspace RCU implementation (remember, patent encumbered, but with a waiver for GPL projects). Version 0.4.1 contains a compilation fix for s390.
As a followup to last weekend’s kerneloops statistics, Arjan van de Ven also posted statistics purely for 2.6.33 at that time. In his statistics, he showed that the most popular oops was in memcpy_toiovecend (found 391 times).
The latest kernel release is 2.6.33-rc8.
Andrew Morton announced an mm-of-the-moment mmotm for 2010-02-11-21-15.
Don’t forget to read my latest blog posting on jonmasters.org for more information on using the Cyclades TS-3000 with kgdb for remote target debugging, and don’t forget to support Jason Wessel’s proposed kgdb and kdb merge for 2.6.34. You know it makes sense to get this out there widely.
That’s a summary of the week’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.
February 17, 2010 01:35 PM
February 16, 2010
For those of you wondering, I’m still working on union mounts, just heads-down on a major rewrite to fix the hairiest problems. Right now I’m perhaps 90% of the way through rewriting the actual lookup code, the dense nutty core of union mounts. This will fix one of the most difficult problems with the current code, massive code duplication between cached, real, and restricted real (“hash”) lookups:
http://lkml.indiana.edu/hypermail/linux/kernel/0910.2/01572.html
I rewrote this to be one function with a loop centered around __lookup_hash() and it’s looking pretty good.
This rewrite is one of the hardest coding problems I’ve ever worked on, and I have a lot of respect for the original union mount authors, Jan Blunck, Bharata Rao, Miklos Szeredi, David Woodhouse, and everyone else who has ever worked on a unioning file system. Not to mention the regular VFS authors – the cost of pathname lookup is one of the most crucial elements of operating system performance and it takes a lot of work to make it go fast.
February 16, 2010 09:04 PM
I'm in KL now struggling to find cheap used electronics. I needed a bunch of microsd and SD cards, didn't matter what size or make, used or new, I just wanted them really cheap (RM5 or less). So I posted on the fleamarket forums, Mudah and Cari but few sellers. It seems like most people here trash the stuff rather than selling it on. I wanted a handycam that can write either mp4 or wmv to SD or microSD for filming demos, so condition didn't matter much and image quality could be average, and was willing to put down around RM150 but no sellers for that either. Where does all the old electronics go? Into the trash? Melted down? What a pity for cheapskates like me. Even stuff like perspex boards only gets sold new which means it is all expensive. Where does all the used stuff go?
February 16, 2010 04:50 PM
My PS/2 adapter seems to have died at last... Or, actually, it still works, but it takes several reboots to have it "grab" and start working. It was getting worse gradually, perhaps a capacitor is dying somewhere or whatnot. So, I hooked up a Belkin keyboard that I obtained many years ago for some kind of USB testing, and what do you know: I worked with computers for 27 years now and this is probably the second or third worst keyboard that I ever touched (the so-called "Cuban Videoton" or "CID" - the terminal made in the Island of Cuba - was the worst, and it had a couple of good competitors, one of which hailed from Yerevan, Armenia). The problem is subtle: keys of the Belkin Scorpius 980 Plus have a random friction in them. To write it in a blog, it sounds like a ridiculously petty complaint, but it's real. Typing anything correctly is a pain, and I have to program in C on it, goddamit.

I was thinking about killing two birds with one stone by getting one with a built-in touchpad in the laptop position. My trusty old ALPS touchpad is great and all, but it developed a peculiar problem: its feet became hard with age and it slides. The common Adesco keyboards get mixed reviews and the listed sizes are contradictory or not credible. Amazon has one SolidTek type that seems like the right size and design. One problem though: $40 price. Isn't it a bit high for what seems like a rather dubious quality? It's not like I am on welfare, it's just... not an Apple or Daimler-Benz product to command a price like that.
So, yeah.
UPDATE: Peter Zijlstra pointed out the Lenovo UltraNav, which is definitely a quality unit, but it has all the (mis)features of a ThinkPad: left Ctrl and Fn are swapped, buttons that go along with the nipple offset the touchpad down, Esc is way far up. I already have a T400 and I hate all of that. Otherwise, it's perfect.
UPDATE 2010/03/01: After some consideration, I went with the IBM keyboard because of (a) quality and (b) 100% key pitch.
True, it has all the disadvantages of the Thinkpad layout, but at least to type on it is not painful. BTW, no Microsoft button.
Oh, and the ALPS touchpad is finally retired after 13 years of service without reproach. It probably is the oldest computer peripheral in the house by far, because usually I recycle ruthlessly.
February 16, 2010 06:11 AM
Mikael noted in my previous post that Con Kolivas’s lrzip is another interesting compressor. In fact, Con has already done a simple 64-bit enhancement of rzip for lrzip, and on our example file it gets 56M vs 55M for xz (lrzip in raw mode, followed by xz, gives 100k worse than just using lrzip: lrzip already uses lzma).
Assuming no bugs in rzip, the takeaway here is simple: rzip should not attempt to find matches within the range that the backend compressor covers (900k for bzip2 in rzip, 32k for gzip, megabytes for LZMA as used by lrzip). The backend compressor will do a better job (as shown by similar results with lrzip when I increase the hash array size so it finds more matches: the resulting file is larger).
The rzip algorithm is good at finding matches over huge distances, and that is what it should stick to. Huge here == size of file (rzip does not stream, for this reason). And this implies only worrying about large matches over huge distances (the current 32 byte minimum is probably too small). The current version of rzip uses an mmap window so it never has to seek, but this window is artificially limited to 900MB (or 60% of mem in lrzip). If we carefully limit the number of comparisons with previous parts of the file, we may be able to reduce them to the point where we don’t confuse the readahead algorithms and thus get nice performance (fadvise may help here too) whether we are mmaped or seeking.
I like the idea that rzip should scale with the size of the file being compressed, not make assumptions about today’s memory sizes. Though some kind of thrash detection using mincore might be necessary to avoid killing our dumb mm systems :(
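A toy Python sketch of that division of labour: only report repeats that lie farther back than the backend compressor could see anyway. This is not rzip's algorithm (which is more sophisticated than fixed, aligned blocks); the function name, the block size and the distance threshold are all assumptions chosen for illustration.

```python
import random


def distant_matches(data: bytes, block: int = 64, min_distance: int = 1 << 20):
    """Yield (offset, earlier_offset) for aligned blocks that repeat an
    earlier block at least min_distance bytes back.

    Nearer repeats are deliberately skipped: those are left to the backend
    compressor. Only block-aligned repeats are detected.
    """
    first_seen = {}
    for off in range(0, len(data) - block + 1, block):
        chunk = data[off:off + block]
        prev = first_seen.setdefault(chunk, off)  # remember earliest position
        if off - prev >= min_distance:
            yield off, prev


# Small demo with a scaled-down distance: 4 KiB of noise, repeated once.
noise = random.Random(0).getrandbits(8 * 4096).to_bytes(4096, "little")
found = list(distant_matches(noise + noise, block=64, min_distance=4096))
print(len(found), found[0])
```

Keeping every block in a dict obviously does not scale to huge files; it is the opposite of the careful, bounded comparison budget argued for above, and is here only to make the idea concrete.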
February 16, 2010 01:21 AM
February 15, 2010
I mostly read the article about Coverity's experience in the trenches as something I would read at The Daily WTF. Which I don't read, let alone daily: it's too far removed from my world. Still, some of that may come handy one day. Like this:
How to handle cluelessness. You cannot often argue with people who are sufficiently confused about technical matters; they think you are the one who doesn't get it. They also tend to get emotional. Arguing reliably kills sales. What to do? One trick is to try to organize a large meeting so their peers do the work for you. The more people in the room, the more likely there is someone very smart and respected and cares (about bugs and about the given code), can diagnose an error (to counter arguments it's a false positive), has been burned by a similar error, loses his/her bonus for errors, or is in another group (another potential sale).
But other than that, bah humbug. My universe is gcc (or maybe LLVM at the most). The heroic tales of fighting people who write C in StudlyCaps mean nothing to me. The only real import of the article is how Sparse needs more attention. If nothing else, Free Software developers need to counter-patent everything in Sparse, so that when Coverity comes for us, we'll be ready.
February 15, 2010 07:53 PM
As the kernel archive debates replacing .bz2 files with .xz, I took a brief glance at xz. My test was to take a tarball of the linux kernel source (made from a recent git tree, but excluding the .git directory):
linux.2.6.tar 395M
For a comparison, bzip2 -9, rzip -9 (which uses bzip2 after finding distant matches), and xz:
linux.2.6.tar.bz2 67M
linux.2.6.tar.rz 65M
linux.2.6.tar.xz 55M
So, I hacked rzip with a -R option to output non-bzip’d blocks:
linux.2.6.tar.rawrz 269M
Xz on this file simulates what would happen if rzip used xz instead of libbz2:
linux.2.6.tar.rawrz.xz 57M
Hmm, it makes xz worse! OK, what if we rev up the conservative rzip to use 1G of memory rather than 128M max? And then xz that?
linux.2.6.tar.rawrz 220M
linux.2.6.tar.rawrz.xz 58M
It actually gets worse as rzip does more work, implying xz is finding quite long-distance matches (bzip2 won’t find matches over more than 900k). So, rzip could only have benefit over xz on really huge files: but note that current rzip is limited on filesize to 4G so it’s a pretty small useful window.
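The match-distance effect is easy to reproduce on synthetic data with Python's standard bz2 and lzma modules. This is a sketch on made-up input, not a rerun of the kernel tarball test above: the 1 MiB block of random bytes is an assumption picked so that the only redundancy lies just beyond bzip2's 900k block.

```python
import bz2
import lzma
import random

# Incompressible 1 MiB block, repeated once: the only redundancy is a match
# at a distance of 1 MiB, which is beyond bzip2's 900k block size.
rng = random.Random(0)
block = rng.getrandbits(8 * (1 << 20)).to_bytes(1 << 20, "little")
data = block + block

bz2_size = len(bz2.compress(data, 9))  # like bzip2 -9
xz_size = len(lzma.compress(data))     # default preset, .xz container

print(f"input {len(data)}  bz2 {bz2_size}  xz {xz_size}")
```

bzip2 cannot see the repeat and stores roughly the full input, while xz's much larger dictionary lets it encode the second copy as references to the first.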
February 15, 2010 07:56 AM
February 14, 2010
It took a while to prepare a new release of the distributed hash table storage elliptics network, but here we go. This is still a minor version bump, although the amount of changes is rather large for a small update.
Likely this will be the last release in the 2.6 release cycle, since in parallel we are cooking up completely new versioning and merge logic as well as data synchronization. Btw, this release breaks that logic to some degree, but there is a tool to fix things up. It will be automated in the next versions.
But let's dig into details and changelog:
- Data integrity checker. Although a little bit undocumented (see example below), it allows checking whether a given object is present in the storage with the requested number of copies. If the number of found objects does not correspond to the config, it will automatically download and upload data with the desired IDs. Later this tool will also be able to upload data into the storage. This checker will be the base for a background FSCK, which will be a simple script that parses metadata and starts the checker with a given log. It also supports an external library call for request merging.
- [FCGI frontend]: cookie, timeouts, tunable headers, variable content types, more and cleaner XML.
- Rewritten network state and reconnection logic. This makes NATed box support trivial (we do support it), client nodes became simpler than ever, less code, fewer bugs, everyone is happy.
- Debian debug package.
- Fair number of bug fixes. This version is used in production; if time permits I will describe this load in detail later.
Modulo possible bugs, the main work is concentrated on the filesystem checker. There are two problems to solve.
The first one is the absence of a transaction log made by the requested transformation function, or in plain words, the absence of a copy of the object in the storage. This happens when some node went offline and returned empty or was replaced. Or did not return at all. In this case the fsck application will check how many copies are present in the storage, automatically download one of them (the first one from the config) and upload it with the given ID.
The second issue to resolve is transaction merge. Elliptics network by default uses transactions for every update, so there is no object as is in the storage; instead the reader will download the transaction log, parse it and select transactions which cover the requested object range. It is hidden in the API of course, but it is possible to manually select needed transactions, for example to support versioning and data snapshots. As a tasty side effect, two fully equal transactions (objects) will not use twice the space, since there are appropriate transaction reference counters.
Currently there are multiple (5) merge strategies, but practice shows that they introduce more harm, or misunderstanding at best, than actual good. So I decided to drop them all in favour of a trivial timestamp-based merge algorithm. Of course it is possible to merge transactions based on a private algorithm, which can be called from the fsck daemon. We have a request to allow external modules to merge objects based on actual data.
This version disables content synchronization during node joining. Instead the admin has to call the fsck application with an externally stored log of the uploaded data to check whether things are ok and fix up what was broken. It will be automated and no external log will be required in the next versions.
Fsck application log file should look like this:
3 0,0,0 sha1,md5 object_name
where '3' is the object creation flags: without transactions, just like objects created by FSCK frontend. This field will be removed in the next version.
'0,0,0' is a placeholder for object parsing information meaning start,end,update_existing. Start and end are the positions of the starting and ending symbols in the object_name used to generate the ID. Zeroes mean automatic detection. Update_existing is not currently supported; in the next version, if set, it will upload the local file named object_name into the storage no matter whether its copies are already present.
sha1,md5 - transformation functions used to generate the ID from object_name. This setup uses two copies, each one created by the appropriate hash.
object_name - name of the uploaded object. Its hash (or actually the transformation of the name using the listed functions; it is allowed to be some function other than a plain hash) will be the object ID.
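For illustration, here is a rough sketch of how a line in that format could be split into its four fields. This is not the elliptics fsck code: the struct, its field names, the size limits and parse_fsck_line() are all hypothetical, and the sketch assumes object names contain no whitespace.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TRANSFORMS 8    /* arbitrary limit for this sketch */
#define TRANSFORM_LEN  64

/* One parsed log line; names are illustrative only. */
struct fsck_entry {
    unsigned flags;                 /* object creation flags, e.g. 3 */
    unsigned start, end;            /* name range used for the ID, 0 = auto */
    unsigned update_existing;       /* not supported yet according to the post */
    char transforms[MAX_TRANSFORMS][TRANSFORM_LEN];
    int ntransforms;                /* one stored copy per transformation */
    char name[256];
};

/* Parse "flags start,end,update transform[,transform...] object_name".
 * Returns 0 on success and -1 on a malformed line. */
int parse_fsck_line(const char *line, struct fsck_entry *e)
{
    char tlist[256];
    const char *p;

    assert(line != NULL && e != NULL);
    memset(e, 0, sizeof(*e));
    if (sscanf(line, "%u %u,%u,%u %255s %255s", &e->flags, &e->start,
               &e->end, &e->update_existing, tlist, e->name) != 6)
        return -1;

    /* Split the comma-separated list of transformation functions. */
    for (p = tlist; *p != '\0'; ) {
        size_t len = strcspn(p, ",");

        if (len == 0 || len >= TRANSFORM_LEN ||
            e->ntransforms == MAX_TRANSFORMS)
            return -1;
        memcpy(e->transforms[e->ntransforms], p, len);
        e->transforms[e->ntransforms][len] = '\0';
        e->ntransforms++;
        p += len;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

For the sample line above this yields flags 3, an all-zero parsing triple, two transformation functions (sha1 and md5, so two copies) and the name object_name.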
Stay tuned, work is boiling and results are very close!
February 14, 2010 05:38 PM
February 13, 2010
Jonathan Schwartz’s resignation via Twitter reminded me of a strange facet of Sun company culture: I’ve never known so many married couples working for the same company. Some of them even worked on the same project together. For the same boss. From home.
Now, the exact percentage of married couples in a company can’t be used to compare companies directly – after all, it depends heavily on things like industry, age, and local marriage laws – but it seems linked to another facet of Sun company culture: Complete, almost embarrassing disconnect from public opinion.
The post-Google standard company perks – free food, on-site exercise classes, company shuttles – make it trivial to speak only to fellow employees in daily life. If you spend all day with your co-workers, socialize only with your co-workers, and then come home and eat dinner with – you guessed it – your co-worker, you might go several years without hearing the words, “Run Solaris on my desktop? Are you f—ing kidding me?“
Schwartz’s “the financial crisis did it” explanation for Sun’s demise is a symptom of an inbred company culture in which employees at all levels voluntarily isolated themselves from the larger Silicon Valley culture. Tech journalists write incessantly about the exchange of expertise and best practice between companies as a major driver of the Bay area’s success. But you have to actually talk to your competition to do that – over a beer, or maybe a pillow.
February 13, 2010 03:22 AM
After six weeks of full-time hacking, with the help of a few friends, we have
made it to receiving actual BCCH data from a GSM cell.
So what does this mean? As I have indicated publicly at the 26C3 conference:
Now that we have managed to create a working GSM network-side implementation
(OpenBSC) during the last year, we will proceed to do the same with the phone side.
Initially we spent quite a bit of time thinking about building our own custom hardware.
But while planning for the first prototype, we realized that it would simply
distract us too much from what we actually wanted to do. We don't want to take
care of component sourcing, prototype generations, quality assurance in
production, production testing, etc. -- All we want is to write a Free Software
GSM protocol implementation for a phone.
Unfortunately (as usual in the industry), the silicon and device makers do
not publish sufficient documentation about their devices to enable third-party
developers to go ahead and write their own software: The never ending
problem of Free Software in many areas beyond more-or-less standardized
hardware like in the PC industry.
So, if you want to write Free Software for such a device, you have two options:
- Reverse engineering the existing hardware and writing your code based on
that information
- Building your own hardware and then writing the software you wanted
to write.
I've been involved in both approaches multiple times while looking only at the
application processor (the PDA side) of mobile phones: OpenEZX and gnufiish are
two more or less abandoned projects aimed at reverse engineering. Openmoko was
the project that had to build its own hardware as a dependency to be fulfilled
before writing software.
If you're not a company and don't want to sell anything, the reverse
engineering approach looks more promising. You can piggy-back on existing
hardware, don't need to take care of sourcing/production/certification/shipping
and other tedious bits.
If you are a company and want to generate revenue, then of course you want
to build the hardware and ship it, as it is what you derive your profits
from.
So, just to be clear on this: Neither OpenEZX, nor gnufiish nor Openmoko were
ever about writing Free Software for the GSM baseband processor, i.e. the beast
that exchanges messages with the actual GSM operator network. But this is what
we're working on right now.
It's about time, don't you agree? After 19 years of only proprietary software
on the baseband chips in billions of phones, it is more than time for bringing
the shining light of Freedom into this area of computing.
To me personally, it is the holy grail of Free Software: Driving it beyond the
PC, beyond operating systems and application programs. Driving it into the
billions of embedded devices where everyone is stuck with proprietary software
without an alternative. Everybody takes it for granted to run megabytes of
proprietary object code, without any memory protection, attached to an
insecure public network (GSM). Who would do that with his PC on the Internet,
without a packet filter, application level gateways and a constant flow
of security updates of the software? Yet billions of people do that with
their phones all the time.
I hope with our work there will be a time where the people who paid for their
phones will be able to actually own and control what it does. If I have paid
for it, I determine what software it runs and when it sends which message or
doesn't.
Oh, getting back to our work: It will be published as soon as it is
sufficiently stable and fit for public consumption. You won't be able
to make phone calls yet, but we'll get there at some later point this
year.
February 13, 2010 01:00 AM
February 12, 2010
An Eminent Reader privately indicated some distaste for the non-technical nature of recent parallel programming posts. Given that many of the obstacles to successful development of parallel software are non-technical, there will be future non-technical posts, but there is no reason not to take a technical break from these issues. And so, just for you, Eminent Reader, I present this parallel programming puzzle.
This puzzle stems from some researchers’ very selective struggles with parallel algorithms. Of course, it should be no surprise that many people, researchers and developers included, will struggle quite happily with their “baby”, but will even more happily bad-mouth competing approaches, even when (or perhaps especially when) those approaches require much less struggling. And yes, some might accuse me of favoring RCU in just this manner, but this is my answer to the likes of them.
Such selective struggling seems to have given rise to an interesting urban legend within the concurrency research community, namely that allowing concurrent access to both ends of a double-ended queue is difficult when using locking.
Can you come up with a lock-based solution that permits the two ends of a double-ended queue to be manipulated concurrently?
February 12, 2010 11:33 PM
libreplace is the SAMBA library (also used in ctdb) to provide working implementations of various standard(ish) functions on platforms where they are missing or flawed. It was initially created in 1996 by Andrew Tridgell based on various existing replacement hacks in utils.c (see commit 3ee9d454).
The basic format of replace.h is:
#ifndef HAVE_STRDUP
#define strdup rep_strdup
char *rep_strdup(const char *s);
#endif
If configure fails to identify the given function X, rep_X is used in its place. replace.h has some such declarations, but most have migrated to the system/ include directory which has loosely grouped functions by categories such as dir.h, select.h, time.h, etc. This works around the “which header(s) do I include” problem as well as guaranteeing specific functions.
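As a concrete illustration of the pattern, here is a minimal sketch of what the fallback behind that declaration might look like. It is not copied from libreplace; the body is simply the obvious strlen/malloc/memcpy implementation, and HAVE_STRDUP is deliberately left undefined to play the part of a platform where configure found no strdup.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as the replace.h excerpt; the #undef is an addition in case
 * the libc headers already provide strdup as a macro. */
#ifndef HAVE_STRDUP
#undef strdup
#define strdup rep_strdup
char *rep_strdup(const char *s);
#endif

#ifndef HAVE_STRDUP
/* Fallback: allocate strlen(s) + 1 bytes and copy the string with its NUL. */
char *rep_strdup(const char *s)
{
    size_t len;
    char *copy;

    assert(s != NULL);
    len = strlen(s) + 1;
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}
#endif
```

Callers keep writing strdup(); the preprocessor quietly routes them to rep_strdup() only where the real function is missing, which is the whole trick.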
Other than reading this code for a sense of Unix-like paleontology (and it’s so hard to tell when to remove any of these helpers that cleanups are rare) we can group replacements into three categories:
- Helper functions or definitions which are missing, eg. strdup or S_IRWXU.
- “Works for me” hacks for platform limitations, which make things compile but are not general, and
- Outright extensions, such as #define ZERO_STRUCT(x) memset((char *)&(x), 0, sizeof(x)) or Linux kernel inspired likely()
Since it’s autoconf-based, it uses the standard #ifdef instead of #if (a potential source of bugs, as I’ve mentioned before). I’ll concentrate on the insufficiently-general issues which can bite users of the library, and a few random asides.
- #ifndef HAVE_VOLATILE ? I can’t believe Samba still compiles on a compiler that doesn’t support volatile (this just defines volatile away altogether). If it did no optimizations whatsoever, volatile might not matter, but I’m suspicious…
- typedef int bool; is a fairly common workaround for lack of bool, but since pointers implicitly cast to bool but can get truncated when passed as an int, it’s a theoretical trap. i.e. (bool)0x1234567800000000 == true, (int)0x1234567800000000 == 0.
- #if !defined(HAVE_VOLATILE) is the same test as above, repeated. It’s still as bad an idea as it was 186 lines before :)
- ZERO_STRUCT, ZERO_ARRAY and ARRAY_SIZE are fairly sane, but could use gcc extensions to check their args where available. I implemented this for ARRAY_SIZE in the Linux kernel and in CCAN. Making sure an arg is a struct is harder, but we could figure something…
- #define PATH_MAX 1024 assumes that systems which don’t define PATH_MAX probably have small path limits. If it’s too short though, it opens up buffer overruns. Similarly for NGROUPS_MAX and PASSWORD_LENGTH.
- The dlopen replacement is cute: it uses shl_load where available (Google says HPUX), but dlerror simply looks like so:
#ifndef HAVE_DLERROR
char *rep_dlerror(void)
{
return "dynamic loading of objects not supported on this platform";
}
#endif
This cute message for runtime failure allows your code to compile, but isn’t helpful if dlopen was a requirement. Also, this should use strerror for shl_load.
- havenone.h is (I assume) a useful header for testing all the replacements at once: it undefines all HAVE_ macros. Unfortunately it hasn’t been updated, and so it isn’t complete (unused code is buggy code).
- inet_pton is credited to Paul Vixie 1996. It’s K&R-style non-prototype, returns an int instead of bool, and doesn’t use strspn ((pch = strchr(digits, ch)) != NULL) or (better) atoi. But it checks for exactly 4 octets, numbers > 255, and carefully doesn’t write to dst unless it succeeds. I would have used sscanf(), which wouldn’t have caught too-long input like “1.2.3.4.5”. OTOH, it would catch “127...1” which this would allow. But making input checks more strict is a bad way to be popular…
- Tridge’s opendir/readdir/telldir/seekdir/closedir replacement in repdir_getdents.c is a replacement for broken telldir/seekdir in the presence of deletions, and a workaround for (older?) BSD’s performance issues. It is in fact never used, because the configure test has had #error _donot_use_getdents_replacement_anymore in it since at least 2006 when the Samba4 changes were merged back into a common library!
- repdir_getdirents.c is the same thing, implemented in terms of getdirents rather than getdents; it’s still used if the telldir/delete/seekdir test fails.
- replace.c shows some of the schizophrenia of approaches to replacement: rep_ftruncate #errors if there’s no chsize or F_FREESP ioctl which can be used instead, but rep_initgroups returns -1/ENOSYS in the similar case. Best would be not to implement replacements if none can be implemented, so compile will fail if they’re used.
- rep_pread and rep_pwrite are classic cases of the limitations of replacement libraries like this. As pread is not supposed to affect the file offset, and file offsets are shared with children or dup’d fds, there’s no sane general way to implement this, and in fact tdb has to test this in tdb_reopen_internal. I would implement a read_seek/write_seek which are documented not to have these guarantees. I remember Tridge ranting about glibc doing the same kind of unsafe implementation of pselect :)
- snprintf only rivals qsort-with-damn-priv-pointer for pain of “if only they’d done the original function right, I wouldn’t have to reimplement the entire thing”. I’ll avoid the glibc-extracted strptime as well.
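To make the inet_pton aside concrete, here is roughly what the sscanf() approach could look like. This is only a sketch of the idea, not the Samba or Vixie code, and sscanf_ipv4() is a made-up name. It shows both sides of the trade-off: trailing junk after the fourth octet slips through, while an empty octet is rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Parse a dotted quad with sscanf(). dst is written only on success.
 * Anything after the fourth number is silently ignored. */
bool sscanf_ipv4(const char *src, unsigned char dst[4])
{
    unsigned int a, b, c, d;

    assert(src != NULL && dst != NULL);
    if (sscanf(src, "%u.%u.%u.%u", &a, &b, &c, &d) != 4)
        return false;               /* fewer than 4 numbers, e.g. empty octet */
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return false;               /* octet out of range */
    dst[0] = (unsigned char)a;
    dst[1] = (unsigned char)b;
    dst[2] = (unsigned char)c;
    dst[3] = (unsigned char)d;
    return true;
}
```

With this version a string with five dotted numbers is accepted (the extra part is never looked at), whereas a string with consecutive dots fails because the second %u has nothing to convert. %u also tolerates leading whitespace, so it is looser than a hand-rolled digit loop in other ways too.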
I’m not sure Samba compiles on as many platforms as it used to; Perl is probably a better place for this kind of library to have maximum obscure-platform testing. But if I were to put this in CCAN, this would make an excellent start.
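Coming back to the rep_pread point, a sketch of the obvious lseek-based emulation shows where the guarantee gets lost. This is an illustration under my own assumptions, not the libreplace source; seek_pread() and seek_pread_demo() are invented names.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Emulate pread() with lseek() + read() + lseek().
 * NOT atomic: between the first and the last lseek() the shared file offset
 * is wrong for anyone else using the same open file (children, dup'd fds). */
ssize_t seek_pread(int fd, void *buf, size_t count, off_t offset)
{
    off_t saved;
    ssize_t n;
    int saved_errno;

    assert(buf != NULL);
    saved = lseek(fd, 0, SEEK_CUR);               /* remember the offset */
    if (saved == (off_t)-1)
        return -1;
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;

    n = read(fd, buf, count);
    saved_errno = errno;

    if (lseek(fd, saved, SEEK_SET) == (off_t)-1)  /* put the offset back */
        return -1;
    errno = saved_errno;
    return n;
}

/* Single-threaded self-check: returns 0 if the right bytes were read and
 * the offset was restored afterwards. */
int seek_pread_demo(void)
{
    char buf[5];
    FILE *f = tmpfile();
    int fd, ok;

    if (f == NULL)
        return -1;
    fd = fileno(f);
    ok = write(fd, "hello world", 11) == 11 &&
         seek_pread(fd, buf, 5, 6) == 5 &&
         memcmp(buf, "world", 5) == 0 &&
         lseek(fd, 0, SEEK_CUR) == 11;
    fclose(f);
    return ok ? 0 : -1;
}
```

In a single process with a single user of the fd this behaves like pread. The moment another process or thread shares the file offset, a read or write on its side can land in the window between the two lseek() calls, and no amount of care inside the replacement can close that window.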
February 12, 2010 08:53 AM
February 10, 2010
I'm "attending" the Red Hat Cloud Thing. The Deltacloud guy is presenting, Jeff Garzik is next with our own Hail. To get this working, I had to add thomson-webcast.net to Flash whitelist, otherwise the site said "No Scripting". It's about time somebody started a company streaming presos in Theora or something...
February 10, 2010 06:44 PM
Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100207.mp3
This podcast is brought to you by the awesome power of Jason Wessel’s kgdb patches, helping to support those who believe in kernel debuggers find hard to reach kernel bugs since 2009. Kernel debuggers: the way of the future.
For the weekend of February 7th, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.
In today’s issue: Linux 2.6.33-rc7, regressions, Google Summer of Code, IMA, OOM, and sys_membarrier.
Linux 2.6.33-rc7. Linus Torvalds announced the 2.6.33-rc7 release of the Linux kernel on Saturday, February 6th, 2010 at 2:44pm (14:44) Best Coast Time (PST). In his announcement, Linus remarked, “I have to admit that I wish we had way fewer regressions listed by this time, so I hereby would like to point every developer to” a link to a recent post to the linux wireless mailing list archive on gmane.org showing a copy of a recent email from Rafael J. Wysocki detailing known kernel regressions between 2.6.32 and 2.6.33-rc6 as posted originally to the LKML. He added, “But we’ve certainly fixed a few things, and it’s been a week, so here’s -rc7”. Most of the changes are in PowerPC defconfigs (default configs), but there are even more i915 updates, radeon KMS updates, and lots of other smaller bits all over the tree. Linus also wondered (in another email) whether it was worth making the .gz files any more given that bzip2 has been around more than long enough by now. Some thought the gzip files were still useful on systems without bzip2 or for some really slow systems that apparently handle gzip files more easily.
Regressions. Rafael J. Wysocki followed up to Linus’ 2.6.33-rc7 announcement (as he had also done with 2.6.33-rc6) with a list of outstanding regressions between 2.6.32 and 2.6.33-rc7. There are currently 20 “unresolved” issues in the list of regressions given. Rafael also noted that Maciej Rutecki has, “generously volunteered to work on the tracking of kernel regressions”. The work done by Rafael (and now, hopefully Maciej also) is very valuable to the community and we really do owe them our gratitude for helping out. Arjan van de Ven also posted a list of oops and warning reports on kerneloops.org from the week, including a very common ext4/quota issue in Fedora.
Google Summer of Code. Luis Rodriguez stated that, “Google has confirmed it will have a Google Summer of Code for 2010”, then mentioned that last year’s effort (4 suggested projects, of which 3 were accepted) resulted in only one success. Witold Sowa followed up saying that he didn’t know he was the only student who completed his project, but that the work to add an AP mode to NetworkManager, “with use of wpa_supplicant’s newly developed AP mode” was relatively easy to accomplish and so he had worked on other things also. Apparently, the initial GSoC work is now available in NetworkManager. Nonetheless, it sounds as if Luis is keen to see a higher than 33% success rate if any entries are accepted this year under the Linux Foundation.
IMA. Mimi Zohar replied to an email from Shi Weihua concerning a NULL pointer dereference bug in the IMA security code (ima_file_free), which Al Viro and others had previously discussed solutions for.
OOM. Lubos Lunak and David Rientjes resurrected the OOM killer discussion again after Lubos posted some analysis of various KDE processes running on his system, and wondered why the OOM killer uses VmSize rather than RSS to determine tasks that should be killed (in other words, why should it not favor tasks actually resident in memory at the time?). This discussion has been had recently, and David Rientjes explained that the kernel favors overall VmSize in its calculations so as to catch memory leakers as a preference (which are often not resident at the time). David did seem to like the suggestion of catching the child with the highest badness calculation before killing its parent, and posted an untested patch. He also suggested that the KDE process tree example was “a textbook case for using /proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is protected from getting selected for oom kill”. Lubos replied with some very good points about how simply setting oom_adj doesn’t scale, and Balbir Singh was amongst those still favoring a switch to RSS-like accounting but with support for shared pages (for example “PSS”) eventually. Rik van Riel noted that he had no strong opinion one way or the other. David posted various patches proposing an alternative fine grained oom_adj mechanism.
sys_membarrier. Mathieu Desnoyers posted a three part patch series implementing sys_membarrier, a new system call that can be used to “distribute the overhead of memory barriers asymmetrically”. In particular, he wants it for his urcu userspace RCU implementation (for use within the synchronize_rcu call). Sensibly, Mathieu proposes incremental additions to each architecture (even though he believes that it “should be portable to other architectures as-is”), reserving the system call numbers now, then implementing gradually.
In today’s miscellaneous items: Matti Aarnio posted to let everyone know that a recently discovered hole in the bayesian filtering system as used by the vger.kernel.org mailing list server to reduce SPAM has been plugged (it had been possible to reach the list using a specific “backend” majordomo domain), Catalin Marinas decided to simply patch the USB HCD driver that had resulted in cache coherency problems when using USB storage (and noted that a followup posting to linux-arch would call for a flush_dcache_range function), some miscellaneous rewrites of obsolete syscall handlers to use generic versions from Christoph Hellwig, a request for an opinion on merging the kFIFO rewrite in 2.6.34 from Stefani Seibold, a potential issue with the kernel implementation of LZO compression reported by Nigel Cunningham (for which he will switch back to LZF in TuxOnIce again for the moment), Stephen Rothwell wondered aloud whether Linus would really be interested in taking the percpu changes currently sitting in percpu “next”, and Mathieu Desnoyers announced he is switching email from his academic address in Montreal (where he recently completed his PhD around LTTng) to a consulting firm he is involved with at http://efficios.com.
In today’s announcements: Greg Kroah-Hartman posted review patches for the 2.6.32.8 stable series kernel.
Scott James Remnant announced the release of upstart version 0.6.5. It includes a large number of fixes, amongst which is the completion of the splitting out of libnih into its own project. There is a new /sbin/reload command for reloading upstart daemons, a restored sync() before reboot, improved documentation, and more goodies.
Junio C Hamano announced version 1.7.0.rc2 of the Git SCM, which includes a number of forthcoming behavior changes as mentioned in this podcast when discussing the rc1 release from the previous week.
Subrata Modak announced that the Linux Test Project (LTP) for January 2010 has been released. It now contains over 3000 tests. Separately, Garrett Cooper noted a rather severe bug in the top level LTP Makefile that could result in an “rm -rf /” in the wrong circumstances, suggesting that all LTP users comment out three lines from that file.
Willy Tarreau (re-)announced the release of 2.4.37.9. The previous 2.4.37.8 hadn’t actually contained the required e1000 backport with a CVE fix that had triggered the previous release. Willy noted, “I don’t know how I managed to do that because it once was OK and I could successfully build it. Well, whatever I did, the result is wrong and the issue it was supposed to fix is still present in 2.4.37.8. So here comes 2.4.37.9 with the real fix this time”.
The latest kernel release is 2.6.33-rc7.
Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-02-03-20-09.
That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.
February 10, 2010 05:05 PM
Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100131.mp3
This podcast is brought to you by the power of Al Viro’s ima_file_free fix, saving in-progress crashed podcast recordings since February 2010, and now powering the all new 2010 2.6.33 series Linux kernel with all wheel drive.
For January 31st, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.
In this week’s issue: Linux 2.6.33-rc6, ide2libata, kFIFO, lock types, netfilter connection tracking, netperf regressions, sparse, and USB storage.
Linux 2.6.33-rc6. Linus Torvalds announced Linux 2.6.33-rc6 on Friday January 29th 2010 at 2:20pm (14:20) Best Coast Time (PST), again describing it as containing “nothing earth-shattering”. About 50% of the changes were architecture updates, and 40% were drivers, with the remaining being mostly filesystem and networking updates. He called for people seeing regressions to begin making “loud noises”, since ‘things mostly should “just work”’.
ide2libata. Bartlomiej Zolnierkiewicz posted a 68 part patch series entitled “ide2libata” that does roughly what it sounds like – it facilitates a conversion of sorts such that legacy IDE driver code can use a small “translation” layer to share source with the libata codebase. It doesn’t remove IDE but it does (allegedly) make it far easier to maintain both until IDE finally does go away. Alan Cox and others weren’t convinced. Alan thought that, “it will be a nightmare for maintenance with all the includes and the like plus the ifdefs making it very hard to read the drivers and maintain them”. He saw value in the effort, but more as a means to find subtle differences between drivers, and thought IDE was “drifting” a little too much to truly be described as in “maintainance mode” at this point.
kFIFO. Stefani Seibold posted an “enhanced reimplementation of the kfifo API”, which is apparently the last in the series of RFC patches intended to rework the kFIFO implementation (to be generic) without changing the existing API. Stefani included some analysis of the impact of the patch upon text section usage and found that it wasn’t much larger, but that the “hand optimized” inline code was substantially faster than the previous implementation.
lock types. Mitake Hitoshi posted an RFC patch (mostly for the review of Peter Zijlstra) that adds lock type information to the output of lockdep, as used by tools such as perf. As he points out, “Of course, as you told, type of lock dealing with is clear for human. But it is not clear for programs like perf lock”. On a related note, Frederic Weisbecker stated that he really liked the perf lock report layout, but would love to see a tree view that “can tell you which lock is delaying another one”. He gave various examples of how this might be visualized as well as describing the benefits.
netfilter connection tracking. I discovered that one of my test systems was falling over on all recent 2.6 series kernels, when using KVM. I wasn’t alone (as I would find out later, looking at Fedora bug reports). The backtrace was variable, but typically involved some kind of IPv6 packet. After mailing the netfilter guys (“PROBLEM: reproducible crash KVM+nf_conntrack all recent 2.6”) and getting some general advice, I spent the entire weekend solid debugging the issue with the aid of Jason Wessel’s kgdb-next tree. The problem was that libvirt (the KVM server management daemon) would attempt to create a second network namespace (netns) on startup – just to see if it would be possible to also support containers – and autostart KVM guests started at that moment would crash because conntrack was missing various chunks of support code for dealing with multiple namespaces. This resulted in hash corruption, kmem caches that would get corrupted, and eventual panics.
netperf regressions in 2.6.33-rc1. Lin Ming performed a bisect analysis and determined that a “sched: Rate-limit newidle” commit had once again introduced a loopback regression (on the order of 50%) in the netperf benchmark, when run on an Intel Nehalem system. Lin assumed that this was due to a large amount of rescheduling IPI (inter-processor interrupt) traffic, as evidenced by the perf top data, and /proc/interrupts output. Others could not reproduce this issue.
sparse. Tejun Heo posted a series of percpu patches intended to instrument modular use of percpu data, for the benefit of the sparse source checker utility recognizing that such data lives in a separate data section. Tejun included various descriptions within the individual patches, which only affect building when using the sparse checking tool.
USB mass storage. Catalin Marinas posted a message (mostly aimed at Matthew Dharm) concerning cache coherency of the kernel’s USB mass storage driver. In the case of Harvard Architecture (split I/D caches) ARM processor cores, when using PIO based USB host controllers, root mounted filesystems generating a page fault will only fault the requested page into the data cache, but the USB storage driver fails to call flush_dcache_page to ensure I-cache visibility and results in incoherency between the two. Catalin asked Matthew if he might add support for explicit flushes when doing PIO rather than DMA for IO. Oliver Neukum thought that this belonged in the HCD driver rather than USB storage, due to the wide range of possible underlying layers beneath USB storage, and Matthew Dharm agreed, “Given that an HCD can choose, on the fly, if it’s using DMA or PIO, the HCD driver is the only place to reasonably put any cache-synchronization code. That said, what do other SCSI HCDs do?”.
In today’s miscellaneous items: Chinang Ma posted a comparative performance analysis between RHEL5.4 kernel 2.6.18 and upstream 2.6.33-rc4 in which he found a 0.8% OLTP performance regression, Simon Kagstrom sent a “provoke crash” mail in which he described a module to force crashes for testing, Mark Lord wondered why he was seeing a large number of “page allocation failure” messages on upgrade from 2.6.31.5 to 2.6.32.5, a continuation of previous style discussions concerning 80 character line length “limits” in the kernel, a question from Andi Kleen as to whether the PnP probe code (for PS/2 mice in this particular instance) is racy as he experiences variable probe behavior, Christoph Lameter posted version 15 of “one of these year long projects to address fundamental issues in the Linux VM”, aka “SLAB fragmentation reduction”, Alex Chiang posted a patch to increase the maximum number of Infiniband HCAs per system from 32 to 64 in a “backwards-compatible manner” (hence only raising the limit to 64), and Al Viro posted an informative message entitled “Open Intents, lookup_instantiate_filp() And All That Shit(tm)” on his plans for handling atomic file open+possible create for NFS in the grand future.
In today’s announcements: Greg Kroah-Hartman announced the release of the 2.6.32.7 kernel (having previously announced the 2.6.32.6 earlier in the week and posting a series of review patches for 2.6.32.7). He also announced the 2.6.27.45 “long term release” kernel.
Clark Williams announced the latest version 0.63 of the rt-tests package is now available. This includes various utilities used to verify and experiment with the RT patchset that Thomas Gleixner and others maintain.
Mathieu Desnoyers announced the release of version 0.4.0 of his Userspace RCU library, which includes a few “minor API changes” as previously described. urcu is available for download at http://lttng.org/urcu.
Junio C Hamano announced version 1.7.0-rc1 of the Git SCM. The forthcoming release has a number of items in the draft release notes, including some behavior changes to “git push”, “git send-email” (no deep threads by default), “git status”, “git diff”, and various other goodies.
The latest kernel release was 2.6.33-rc6.
Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-01-28-01-36.
Willy Tarreau announced version 2.4.37.8 of the 2.4 series kernel. It mainly includes fixes for a recently discovered vulnerability in the e1000 network driver that could allow a carefully crafted frame to skip over filtering.
That’s a summary of today’s Linux Kernel Mailing List traffic; for further information visit www.kernel.org. I’m Jon Masters.
February 10, 2010 01:15 PM
Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100124.mp3
For the weekend of January 24th, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.
Linux 2.6.33-rc5. Linus Torvalds announced the release of the 2.6.33-rc5 kernel, noting that he didn’t “think there is anything earth-shaking here”. Mostly, the only new stuff was in the i915 and (new) DVB “Mantis” driver. Rafael J. Wysocki followed up with his usual list of regressions since the release of 2.6.32 for which there were no known fixes yet in Linus’ tree. The number has fallen a little, but there were still 23 unresolved.
devtmpfs. The devtmpfs filesystem is a shared memory filesystem used to mount /dev nodes that are needed even before udev starts on modern Linux systems (or for those systems that do not use udev, to provide a minimum environment). The suggestion had been made to remove the EXPERIMENTAL flag on its configuration option and enable it by default. The latter received complaints as a change in behavior that would be visible to users, even if many of them would need to have devtmpfs enabled for the most recent Linux distributions.
Interruptions. Steven Rostedt and Peter Zijlstra did some analysis of the kernel source tree, looking for inappropriate setting of TASK_*INTERRUPTIBLE (which should never be done by assigning to the task state explicitly; in general one should always use the set_current_state macro). They found a fairly large number of incorrect code paths and posted a list of “examples of likely bugs”. David Daney replied, asking what kind of barrier should be implied in using set_current_state, as pertains to the visibility of this assignment to other CPUs.
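The ordering concern behind that audit can be sketched in userspace. This is not kernel code: every name below (sketch_set_current_state, sketch_prepare_to_sleep, sketch_wake_up, and the two globals) is a hypothetical stand-in, and a sequentially consistent C11 atomic store is assumed here as an analogue of the barrier that set_current_state implies. The point is only that the sleeper must publish its state before it tests the wake-up condition, so that a waker which sets the condition and then inspects the state cannot miss it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for TASK_RUNNING and TASK_INTERRUPTIBLE. */
enum { SKETCH_RUNNING = 0, SKETCH_INTERRUPTIBLE = 1 };

atomic_int task_state = SKETCH_RUNNING;   /* models current->state */
atomic_bool wake_condition = false;       /* the event being waited for */

/* Analogue of set_current_state(): a store with full ordering, so it
 * cannot be reordered after the later load of wake_condition. */
void sketch_set_current_state(int state)
{
    assert(state == SKETCH_RUNNING || state == SKETCH_INTERRUPTIBLE);
    atomic_store_explicit(&task_state, state, memory_order_seq_cst);
}

/* Sleeper side: publish the state first, then test the condition.
 * Returns true if the caller would now go on to call schedule(). */
bool sketch_prepare_to_sleep(void)
{
    sketch_set_current_state(SKETCH_INTERRUPTIBLE);
    if (atomic_load_explicit(&wake_condition, memory_order_seq_cst)) {
        /* The event already happened: do not sleep. */
        sketch_set_current_state(SKETCH_RUNNING);
        return false;
    }
    return true;
}

/* Waker side: set the condition first, then look at the state.
 * Returns true if a sleeper was found and marked runnable. */
bool sketch_wake_up(void)
{
    int expected = SKETCH_INTERRUPTIBLE;

    atomic_store_explicit(&wake_condition, true, memory_order_seq_cst);
    return atomic_compare_exchange_strong(&task_state, &expected,
                                          SKETCH_RUNNING);
}
```

If the state store in the sleeper were a plain assignment with no ordering, the load of wake_condition could in principle be satisfied first, and a wake-up arriving in between would see the old state and do nothing, leaving the task to sleep with its event already delivered.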
IO error semantics. Nick Piggin started a thread entitled “IO error semantics”, in which he raised the ugly issue of kernel IO error handling behavior once again, as he said he had done during Andi Kleen’s posting of HWPOISON patches. Nick sought to clearly define specific anticipated behaviors in response to “read IOs”, “write IOs”, and so forth – how many retries? etc. He also made the point that write IO errors should not invalidate the data before an IO error is returned to “somebody” (fsync or synchronous write syscall).
NOIO. Rafael J. Wysocki posted an initial PM patch implementing forced GFP_NOIO during suspend operations (preventing the kernel from attempting to allocate memory by going to e.g. disk to offload some existing unused pages). This was largely in reaction to specific issues with the Nvidia closed source binary driver, but was something that had apparently been on the cards for some time. The problem with the patch was that it changed the behavior of the VM according to the state of the system, rather than relying upon drivers to do the right thing by using explicit GFP_NOIO allocations in their suspend and resume routines.
In the week’s miscellaneous items: Tejun Heo posted version 3 of his concurrency managed workqueue patches; Peter Anvin proposed the rapid removal of CONFIG_X86_CPU_DEBUG (since all such information is already exposed elsewhere); Jiri Kosina added “nopat” boot option documentation to Documentation/kernel-parameters; there was ongoing discussion of generalizing certain PCI functions in the wake of an intention to merge various Xilinx PCI support bits; Anfei Zhou posted about a cache coherency problem with mmapped writes on ARM systems; a patch corrected priority inheritance deboosting in the RT kernel patchset to be POSIX compliant; Dimitry Golubovsky inquired as to the current state of UML (User Mode Linux, not the silly and pointless modelling technique) development; Mark Allyn posted some Restricted Access Register (Intel MID platform) patches; and Joe Perches posted a large number of floppy (yes, floppy) cleanups.
In the week’s announcements: Linux 2.6.31.12 and 2.6.32.5 (preceded by the 2.6.32.4 kernel earlier in the week) were released by Greg Kroah-Hartman. Greg stated that he no longer intended to update the .31 stable kernel short of “something really odd happening”. Greg repeated his previous assertions that the .27 kernel would live on as a “long term” stable release (but probably only for 6 more months of viability), and that the .32 kernel would also be a “long term release” because a number of distributions were apparently basing their releases on it. His efforts depend upon help from the engineers working on those distributions.
Len Brown announced that the Linux Power Management Mini-Summit would be held in Boston on Monday, August 9th 2010, the day before LinuxCon 2010. For further information, refer to http://events.linuxfoundation.org/.
Mathieu Desnoyers (whose excellent PhD thesis was published recently and covered by LWN) announced an updated LTTng 0.187 for the 2.6.32.4 kernel.
Junio C Hamano announced Git 1.6.6.1 is now available from the kernel.org site at http://www.kernel.org/pub/software/scm/git/. The latest version contains fixes for issues such as “git blame” not working when a commit lacked an author name, “git count-objects” not handling packfiles larger than 4G on platforms with a 32-bit off_t, “git rebase -i” not aborting cleanly if it failed to start the user’s EDITOR, some issues with the GIT_WORK_TREE environment variable, and more besides.
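To see why a 32-bit off_t is a hazard for files over 4G, here is a minimal sketch of the arithmetic. It is not taken from the Git source or its fix; the helper names are hypothetical and exist only to show what a narrow offset type can and cannot represent, and why sizes are safer when accumulated in a 64-bit counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if a size is representable in a signed
 * 32-bit offset, which is what a 32-bit off_t provides. */
bool sketch_fits_in_off32(uint64_t size)
{
    return size <= (uint64_t)INT32_MAX;
}

/* Hypothetical helper: the low 32 bits that survive when a 64-bit
 * size is narrowed. The upper bits are silently discarded. */
uint32_t sketch_low32(uint64_t size)
{
    return (uint32_t)size;
}

/* Hypothetical helper: sum file sizes in a 64-bit counter so that the
 * total is correct regardless of the platform's off_t width. */
uint64_t sketch_total_size(const uint64_t *sizes, size_t n)
{
    uint64_t total = 0;

    assert(sizes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += sizes[i];
    return total;
}
```

For example, a 5 GiB size (5ULL << 30) does not fit in a signed 32-bit offset, and narrowing it leaves only 1 GiB (1u << 30), so a tool that stored it in a 32-bit type would under-report by 4 GiB.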
Thomas Gleixner announced the release of the 2.6.31.12-rt20 RT patchset. This was a forward port to 2.6.31.12, which included a number of RCU assumption fixes, the aforementioned PI POSIX compliance fix, and so forth. Thomas acknowledged the delay in releasing a new version of the patch, but pointed out that various locking infrastructure changes had gone upstream (advancing the cause of mainlining various bits of RT). There will be no 2.6.32-rt; the patchset will skip directly to 2.6.33. He also let us know about a new “housemate” of his: http://tglx.de/~tglx/housemate.png.
Sorry for the delay in getting this episode released.
February 10, 2010 08:36 AM
February 09, 2010
An earlier post noted that parallel programming suffers more potential failures in planning than does sequential programming due to the usual suspects: deadlocks, memory misordering, race conditions, and performance/scalability issues. This should lead us to suspect that parallel programs might need better quality assurance (Q/A) than do sequential programs. Q/A activities include validation, verification, inspection, review, and of course testing.
Traditionally, Q/A groups serve many roles:
- Run tests and find bugs.
- Break in new hires, who, strangely enough, are sometimes reluctant to irritate developers.
- Distract developers who are already behind schedule with pesky bugs.
- Act as scapegoat for schedule slips.
- Act as a target of complaints from developers who are tired of debugging either their new features or any bugs located by the Q/A group.
Although there are many highly effective Q/A groups in many software development organizations, it is not hard to find Q/A groups that find bugs, but that either cannot or will not get developers to pay attention to them. It is also not hard to find Q/A groups that are overridden whenever they point out problems that might cause a schedule slip. One way to avoid these problems is via enlightened management based on (for example) bug trends over time, and another way is for the Q/A organization to report high up into the organization. Of course, with this latter approach, one wonders just how often the Q/A organization can get away with yanking on the silver chain connecting to their executive sponsor.
Of course, FOSS communities have their own Q/A challenges, but the fact that the maintainers are usually responsible for the quality of their code adds a breath of fresh air to the process. Not least, their gatekeeper role enables them to vigorously enforce any design and coding guidelines that their FOSS community might have.
But what are the technical effects of parallel software on Q/A?
February 09, 2010 07:44 PM
Content copyright by their respective authors.