Kernel Planet

April 30, 2017

Matthew Garrett: Looking at the Netgear Arlo home IP camera

Another in the series of looking at the security of IoT type objects. This time I've gone for the Arlo network connected cameras produced by Netgear, specifically the stock Arlo base system with a single camera. The base station is based on a Broadcom 5358 SoC with an 802.11n radio, along with a single Broadcom gigabit ethernet interface. Other than it only having a single ethernet port, this looks pretty much like a standard Netgear router. There's a convenient unpopulated header on the board that turns out to be a serial console, so getting a shell is only a few minutes work.

Normal setup is straightforward. You plug the base station into a router, wait for all the lights to come on and then you visit the setup site and follow the instructions - by this point the base station has connected to Netgear's cloud service and you're just associating it to your account. Security here is straightforward: you need to be coming from the same IP address as the Arlo. For most home users behind NAT this works fine. I sat frustrated as it repeatedly failed to find any devices, before finally moving everything behind a backup router (my main network isn't NATted) for initial setup. Once you and the Arlo are on the same IP address, the site shows you the base station's serial number for confirmation and then you attach it to your account. The next step is adding cameras. Each base station broadcasts an 802.11 network in the 2.4GHz spectrum. You connect a camera by pressing the sync button on the base station and then the sync button on the camera. The camera associates with the base station via WDS and now you're up and running.

This is the point where I get bored and stop following instructions, but if you're using a desktop browser (rather than using the mobile app) you appear to need Flash in order to actually see any of the camera footage. Bleah.

But back to the device itself. The first thing I traced was the initial device association. What I found was that once the device is associated with an account, it can't be attached to another account. This is good - I can't simply request that devices be rebound to my account from someone else's. Further, while the serial number is displayed to the user to disambiguate between devices, it doesn't seem to be what's used internally. Tracing the logon traffic from the base station shows it sending a long random device ID along with an authentication token. If you perform a factory reset, these values are regenerated. The device to account mapping seems to be based on this random device ID, which means that once the device is reset and bound to another account there's no way for the initial account owner to regain access (other than resetting it again and binding it back to their account). This is far better than many devices I've looked at.

Performing a factory reset also changes the WPA PSK for the camera network. Newsky Security discovered that doing so originally reset it to 12345678, which is, uh, suboptimal? That's been fixed in newer firmware, along with their discovery that the original random password choice was not terribly random.

All communication from the base station to the cloud seems to be over SSL, and everything validates certificates properly. This also seems to be true for client communication with the cloud service - camera footage is streamed back over port 443 as well.

Most of the functionality of the base station is provided by two daemons, xagent and vzdaemon. xagent appears to be responsible for registering the device with the cloud service, while vzdaemon handles the camera side of things (including motion detection). All of this is running as root, so in the event of any kind of vulnerability the entire platform is owned. For such a single purpose device this isn't really a big deal (the only sensitive data it has is the camera feed - if someone has access to that then root doesn't really buy them anything else). They're statically linked and stripped so I couldn't be bothered spending any significant amount of time digging into them. In any case, they don't expose any remotely accessible ports and only connect to services with verified SSL certificates. They're probably not a big risk.

Other than the dependence on Flash, there's nothing immediately concerning here. What is a little worrying is a family of daemons running on the device and listening on various high numbered UDP ports. These appear to be provided by Broadcom and to be a standard part of all their router platforms - they're intended for handling various bits of wireless authentication. It's not clear why they're listening on all interfaces rather than only on loopback, and it's not obvious whether they're vulnerable (they mostly appear to receive packets from the driver itself, process them and then stick packets back into the kernel, so who knows what's actually going on), but since you can't set one of these devices up in the first place without it being behind a NAT gateway it's unlikely to be of real concern to most users. On the other hand, the same daemons seem to be present on several Broadcom-based router platforms where they may end up being visible to the outside world. That's probably investigation for another day, though.

Overall: pretty solid, frustrating to set up if your network doesn't match their expectations, wouldn't have grave concerns over having it on an appropriately firewalled network.


April 30, 2017 05:09 AM

April 27, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/27


In this week’s edition: Linux 4.11-rc8, updating cross compilers, Intel 5-level paging, v3 namespaced file capabilities, and ongoing development.

Editorial Notes

Apologies for the delay to this week’s podcast. I got flu around the time I was preparing last week’s podcast, limped along to the weekend, and then had to stay in bed for a long time. On the other hand, it let me play with a bunch of new SDRs (HackRF, RTL-SDR, and friends, for the curious) on Sunday when I skipped the 5K I was supposed to run 🙂

I would also like to note my thanks for the first 10,000 downloads of the new series of this podcast. It’s a work in progress. I am going to make (positive!) changes over the coming months, including a web interface that will track all LKML posts and allow for community-directed collaboration on creating this (and hopefully other) podcasts. I will include automatic patch tracking (showing when patches have landed in upstream trees, and so on), info on post authors, and the ability to edit personal bios, links, and employer info. After some discussions around the best way to handle author employer attribution (to make sure everyone is treated fairly), I’ve decided to take a little time away from including employer names until I have a populated database of mappings. Jon Corbet from LWN has something similar already, which I believe is also stored in git, but there’s more to be done here (thanks to Alex and others for the G+ feedback and discussion on this).

Linux 4.11-rc8

Linus Torvalds announced Linux 4.11-rc8, saying “So originally I was just planning on releasing the final 4.11 today, but while we didn’t have a *lot* of changes the last week, we had a couple of really annoying ones, so I’m doing another rc release instead”. As he also notes, “The most noticeable of the issues is that we’ve quirked off some NVMe power management that apparently causes problems on some machines. It’s not entirely clear what caused the issue (it wasn’t just limited to some NVMe hardware, but also particular platforms), but let’s test it”.

With the release of Linux 4.11-rc8 comes that impending moment of both elation and dread that is a final kernel. It’ll be great to see 4.11 out there. It’s an awesome kernel, with lots of new features, and it will be well summarized in kernelnewbies and elsewhere. But upon its release comes the opening of the merge window for 4.12. Tracking that was exciting for 4.11. Hopefully it doesn’t finish me off trying to do that for 4.12 😉

Geert Uytterhoeven posted “Build regressions/improvements in v4.11-rc8”, in which he noted that (compared with v4.10) an additional build error and several hundred more warnings had recently been added to the kernel. The error he points to is in the AVR32 architecture when applying a relocation in the linker, probably due to an unsupported offset.


Greg K-H (Kroah-Hartman) announced Linux 4.4.64, 4.9.25, and 4.10.13

Junio C Hamano announced Git v2.13.0-rc1

Alex Williams posted “Generic DMA-capable streaming device driver looking for home” in which he describes some generic features of his device (the ability to “carry generic data to/from userspace”) and inquired as to where it should live in the kernel. It could do with some followup.

Updating cross compilers

Andre Przywara inquired as to the state of the kernel.org cross compilers. This was a project, initiated by Tony Breeds, to maintain current Intel x86 Architecture builds of cross compiler toolchains for various architecture targets (a cross compiler is one that runs on one architecture while targeting another, which is incidentally different from a “Canadian cross” compiler – look it up if you’re ever bored or want to bootstrap compilers for fun). It was a great project, but like so many others, one day (three years ago) there were no more updates. That is something Andre would like to see changed. He posted, noting that many people still use the compilers hosted on kernel.org (including yours truly, in a pinch) and that “The latest compiler I find there is 4.9.0, which celebrated its third birthday at the weekend, also has been superseded by 4.9.4 meanwhile”.

Andre used build scripts from Segher Boessenkool to build binutils (the GNU binary utilities, including the assembler and linker) 2.28 and GCC (the GNU Compiler Collection) 6.3.0. With some tweaks, he was able to build for “all architectures except arc, m68k, tilegx and tilepro”. He wondered “what the process is to get these [the compilers linked from the kernel website] updated?”. It seems he is keen to clean this up, which is to be commended and encouraged. And hopefully (since he works for ARM) that will eventually also include cross compiler targets for x86 that run on ARMv8 server systems.

Intel 5-level paging

Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he provides an “updated version [of] the fourth and the last bunch of [] patches that brings initial 5-level paging enabling.” This is in support of Intel’s “la57” feature of future microprocessors, which allows them to exceed the traditional 48-bit “Canonical Addressing” in order to address up to 56 bits of Virtual Address space (a big benefit to those who want to map large non-volatile storage devices and accelerators into virtual memory). His latest patch series includes a fix for a “KASLR [Kernel Address Space Layout Randomization] bug due to rewriting [] startup_64() in C”.

Separately, John Paul Adrian Glaubitz inquired about Kirill’s patch series, saying, “I recently read the LWN article on your and your colleagues work to add five-level page table support for x86 to the Linux kernel. Since this extends the address space beyond 48-bits, as you know, it will cause potential headaches with Javascript engines which use tagged pointers. On SPARC, the virtual address space already extends to 52 bits and we are running into these very issues with Javascript engines on SPARC”.

He goes on to discuss passing the “hint” parameter to mmap() “in order to tell the kernel not to allocate memory beyond the 48 bits address space. Unfortunately, on Linux this will only work when the area pointed to by “hint” is unallocated which means one cannot simply use a hardcoded “hint” to mitigate this problem”. What he means here is that the mmap call to map a virtual memory area into a userspace process allows an application to specify where it would like that mapping to occur, but Linux isn’t required to respect this. Contemporary Linux implements “MAP_FIXED” as an option to mmap, which will either map a region where requested or explicitly fail (as Andy Lutomirski pointed out). This is different from a legacy behavior where Linux used to take a hint and might just not respect placement (as Andi Kleen alluded to in followup).

This whole discussion is actually the reason that Kirill had (thoughtfully) already included a feature bit setting in his patches that allows an application to effectively override the existing kernel logic and always allocate below 48 bits (preserving as close to existing behavior as possible on a per application basis while allowing a larger VA elsewhere). The thread resulted in this being pointed out, but it’s a timely reminder of the problems faced as the pressure continues upon architectures to grow their VA (Virtual Address) space size.

Often, efforts at growing virtual memory address spaces run up against uses of the higher order bits that were never sanctioned but are in widespread use. Many people strongly dislike pointer tagging of this kind (your author included), but it is not going away. It is great that Kirill’s patches have a form of solution that can be used for the time being by applications that want to retain a smaller address space, but that’s framed in the context of legacy support, not to enable runtimes to continue to use high order bits forevermore.

Introduce v3 namespaced file capabilities

Serge E. Hallyn posted “Introduce v3 namespaced file capabilities”. Linux includes a comprehensive capability mechanism that allows applications to limit what privileged operations may be performed by them. In the “good old days” when Unix hacker beards were more likely than today’s scruffy look, root was root and nobody really cared about remote compromise because they were still fighting having to have login passwords at all. But in today’s wonderful world of awesome, in which anything not bolted down is often not long for this world, “root” can mean very little. The traditionally privileged users can be extremely restricted by security policy frameworks, such as SELinux, but even more fundamentally can be subject to restrictions imposed by the growth in use of “capabilities”.

A classic example of a capability is CAP_NET_RAW, which the “ping” utility needs in order to create a raw socket. Traditionally, such utilities were created on Unix and Linux filesystems as “setuid root”, which means that they had the “s” bit set in their permissions so that they would “run as root” when executed by regular users. This allowed the utility to operate, but it also allowed any user who could trick the utility into providing a shell to conveniently gain a root login. Many security exploits over the years later, we have filesystem capabilities, which allow binaries to exist on disk tagged with just those extra capabilities they require to get the job done, through the filesystem “xattr” extended attributes. “ping” has CAP_NET_RAW, so it can create raw sockets, but it doesn’t need to run as root, so it isn’t marked as “setuid root” on modern distros.

Fast forward still further into the modern era of containers and namespaces, and things get more complex. As Serge notes in his patch, “Root in a non-initial user ns [namespace] cannot be trusted to write a traditional security.capability xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the xattr, and execute the file with privilege on the host”. However, as he also notes, “supporting file capabilities in a user namespace is very desirable. Not doing so means that programs designed to run with limited privilege must continue to support other methods of gaining and dropping privilege. For instance a program installer must detect whether file capabilities can be assigned, and assign them if so but set setuid-root otherwise. The program in turn must know how to drop partial capabilities [which is a mess to get right], and do so only if setuid-root”. This is, of course, far from desirable.

In the patch series, Serge “builds a vfs_ns_cap_data struct by appending a uid_t [user ID] rootid to struct vfs_cap_data. This is the absolute uid_t (that is, the uid_t in the user namespace which mounted the filesystem, usually init_user_ns [the global default]) of the root id in whose namespace the file capabilities may take effect”. He then rewrites xattrs within the namespace for unprivileged “root” users with the appropriate notion of capabilities for that environment (in a “v3” xattr that is transparently converted to/from the conventional “v2” security.capability xattr), in accordance with capabilities that have been granted to the namespace from outside via CAP_SETFCAP. This allows capability use without undermining host system security and seems like a nice solution.

Ongoing Development

Ashish Kalra posted “Fix BSS corruption/overwrite issue in early x86 kernel setup”. The BSS (Block Started by Symbol) is the longstanding name used to refer to statically allocated (and pre-zeroed) variables that have memory set aside at compile time. It’s a common feature of almost every ELF (Executable and Linking Format) Linux binary you will come across, the kernel not being much different. Linux also uses stacks for small runtime allocations, by having a page (or several) of memory with a pointer that descends in address (it’s actually called a “fully descending” type of stack) as more (small) items are allocated within it. At boot time, the kernel typically expects that the bootloader will have set up a stack that can be used for very early code, but Linux is willing to handle its own setup if the bootloader isn’t sophisticated enough. The latter code isn’t well exercised, and it turns out it doesn’t reserve quite enough space, which causes the stack to descend into (run into) the BSS segment, resulting in corruption. Ashish fixes this by increasing the fallback stack allocation size from 512 to 1024 bytes in arch/x86/boot/boot.h.

Vladimir Murzin posted “ARM: Fix dma_alloc_coherent() and friends for NOMMU”, noting “It seem that addition of cache support for M-class CPUs uncovered [a] latent bug in DMA usage. NOMMU memory model has been treated as being always consistent; however, for R/M [Real Time and Microcontroller] classes [of ARM cores] memory can be covered by MPU [Memory Protection Unit] which in turn might configure RAM as Normal i.e. bufferable and cacheable. It breaks dma_alloc_coherent() and friends, since data can stuck in caches”.

Andrew Pinski posted “arm64/vdso: Rewrite gettimeofday into C”, which improves performance by up to 32% when compared to the existing in-kernel implementation on a Cavium ThunderX system (because there are division operations that the compiler can optimize). On their next generation, it apparently improves performance by 18% while also benefitting other ARM platforms that were tested. This is a significant improvement since that function is often called by userspace applications many times per second.

Baoquan He posted “x86/KASLR: Use old ident map page table if physical randomization failed”. Dave Young discovered a problem with the physical memory map setup of kexec/kdump kernels when KASLR (Kernel Address Space Layout Randomization) is enabled. KASLR does what it says on the tin. It applies a level of randomization to the placement of (most) physical pages of the kernel such that it is harder for an attacker to guess where in memory the kernel is located. This reduces the ability for “off the shelf” buffer overflow/ROP/similar attacks to leverage known kernel layout. But when the kernel kexec’s into a kdump kernel upon a crash, it’s loading a second kernel while attempting to leave physical memory not allocated to the crash kernel alone (so that it can be dumped). This can lead to KASLR allocation failures in the crash kernel, which (until this patch) would result in the crash kernel not correctly setting up an identity mapping for the original (older) kernel, resulting in immediately resetting the machine. With the patch, the crash kernel will fallback to the original kernel’s identity mapping page tables when KASLR setup fails.

On a separate, but related, note, Xunlei Pang posted “x86_64/kexec: Use PUD level 1GB page for identity mapping if available” which seeks to change how the kexec identity mapping is established, favoring a new top-level 1GB PUD (Page Upper Directory) allocation for the identity mappings needed prior to booting into the new kernel. This can save considerable memory (128MB “On one 32TB machine”…) vs using the current approach of many 2MB PTEs (Page Table Entries) for the region. Rather than many PTEs, an effective huge page can be mapped. PTEs are grouped into “directories” in memory that the microprocessor’s walker engines can navigate when handling a “page fault” (the process of loading the TLB – Translation Lookaside Buffer – and microTLB caches). Middle Directories are collections of PTEs, and these are then grouped into even larger collections at upper levels, depending upon nesting depth. For more about how paging works, see Mel Gorman’s “Linux Memory Management”, a classic text that is still very much relevant for the fundamentals.

Janakarajan Natarajan posted “Prevent timer value 0 for MWAITX”, which prevents the kernel from passing a value of zero to the x86 “MWAITX” instruction. MWAIT (Monitor Wait) is one of a pair of instructions on contemporary x86 systems that allow the kernel to temporarily block execution (in place of a spinloop, or other solution) until a memory location has been updated. Various trickery at the micro-architectural level (a dedicated engine in the core that snoops for updates to that memory address) then handles resuming execution later. This is intended for waiting relatively small amounts of time in an energy efficient and high performance (low wakeup latency) manner. The MWAITX variant accepts a timeout period after which a wakeup will happen regardless, but it can also accept a zero parameter, which is supposed to mean “never timeout” (i.e. always wait for the memory update). It turns out that existing Linux kernels do incorrectly use zero on some occasions, and that this wasn’t noticed on older microprocessors because other events eventually triggered a wakeup regardless. On the new AMD Zen core, which behaves correctly, MWAITX may never wake up with a zero parameter, and this was causing NMI soft lockup warnings. The patch corrects Linux to do the right thing, removing the zero option.

Paul E. McKenney posted “Make SRCU be built by default”. SRCU (Sleepable RCU, where RCU is Read Copy Update) is an optional feature of the Linux kernel that provides an implementation of RCU which can sleep. Conventionally, RCU had spinlock semantics (it could not sleep). By definition, its purpose was to provide a cunning lockless update mechanism for data structures, relying upon the passage of a “grace period” defined by every processor having gone through the scheduler once (a gross simplification of RCU). But under some circumstances (for example, in a Real Time kernel) there is a need for a sleepable (and preemptible, but that’s another issue) RCU, and so SRCU was created more than 8 years ago. It has a companion in “Tiny SRCU” for embedded systems. A “surprisingly common case” now exists where parts of the kernel include srcu.h, so Paul’s patch builds it by default.

Laurent Dufour posted “BUG raised when onlining HWPoisoned page” in which he noted that the (being onlined) page “has already the mem_cgroup field set” (this is shown in the stack trace he posts with “page dumped because: page still charged to cgroup”). He cleans this up by clearing the mem_cgroup when a page is poisoned. His second patch skips poisoned pages altogether when performing a memory block onlining operation.

Laurent also posted an RFC (Request For Comment) patch series entitled “Replace mmap_sem by a range lock” which “implements the first step of the attempt to replace the mmap_sem by a range lock”. We will summarize this patch series in more detail the next time it is posted upstream.

Christian König posted version 4 of his “Resizable PCI BAR support” patches. PCI (and its derivatives, such as PCI Express) use BARs (Base Address Registers) to convey regions of the host physical memory map that the device will use to map in its memory. BARs themselves are just registers, but the memory they refer to must be linearly placed into the physical map (or interim IOVA map in the case that the BAR is within a virtual machine). Fitting large, multi GB windows can be a challenge, sometimes resulting in failure, but many devices can also manage with smaller memory windows. Christian’s patches attempt to provide for the best of both by adding support for a contemporary feature of PCI (Express) that allows devices with such an ability to convey a minimal BAR size and then increase the allocation if that is available. His changes since version 3 include “Fail if any BAR is still in use…”.

Ying Huang posted version 10 of his “THP swap: Delay splitting THP during swapping out” which allows for swapping of Transparent Huge Pages directly. We have previously covered iterations of this patch series. The latest changes are minimal, suggesting this is close to being merged.

Jérôme Glisse posted version 21 of his “Heterogeneous Memory Management” (HMM) patch series. This is very similar to the version we covered last week. As a reminder, HMM provides an API through which the kernel can manage devices that want to share memory with a host processing environment in a more seamless fashion, using shared address spaces and regular pointers. His latest version changes the concept of “device unaddressable” memory to “device private” (MEMORY_DEVICE_PRIVATE vs MEMORY_DEVICE_PUBLIC) memory, following the feedback from Dan Nellans that devices are changing over time such that “memory may not remain CPU-unaddressable in the future” and that, even though this would likely result in subsequent changes to HMM, it was worthwhile starting out with nomenclature correctly referring to memory that is considered private to a device and will not be managed by HMM.

Intel’s kernel test robot noticed a 12.8% performance improvement in one of their scalability benchmarks when running with a recent linux-next tree containing Al Viro’s “amd64: get rid of zeroing” patch. This is part of his larger “uaccess unification” patch series, which aims to simplify and clean up the process of copying data to/from kernel and userspace. In particular, when asking the kernel to copy data from one userspace virtual address to another, there is no need to apply the level of data zeroing that typically applies to buffers the kernel copies (for security purposes – preventing leakage of extra data beyond structures returned from kernel calls, as an example). When both source and destination are already in userspace, there is no security issue, but there was a performance degradation that Viro had noticed and fixed.

Julien Grall posted “Xen: Implement EFI reset_system callback”, which provides a means to correctly reboot and power off Dom0 host Xen Hypervisors when running on EFI systems for which reset_system is used by reference (ARM).


April 27, 2017 03:50 PM

April 26, 2017

Michael Kerrisk (manpages): Linux Security and Isolation APIs course in Munich (17-19 July 2017)

I've scheduled the first public instance of my "Linux Security and Isolation APIs" course to take place in Munich, Germany on 17-19 July 2017. (I've already run the course a few times very successfully in non-public settings.) This three-day course provides a deep understanding of the low-level Linux features (set-UID/set-GID programs, capabilities, namespaces, cgroups, and seccomp) used to build container, virtualization, and sandboxing technologies. The course format is a mixture of theory and practical sessions.

The course is aimed at designers and programmers building privileged applications, container applications, and sandboxing applications. Systems administrators who are managing such applications are also likely to find the course of benefit.

Further information about the course (such as expected background and course pricing) and a detailed course outline are available on the course web page.

April 26, 2017 07:38 PM

April 23, 2017

Pete Zaitcev: SDSC Petabyte scale Swift cluster

It has been almost two years since the last Swift numbers, but here are a few numbers from San Diego (the whole presentation is available on GitHub):

> 5+ PB data
> 42 servers and 1000+ disks
> 3, 4, 6 TB SAS drives for objects, SATA SSD drives for Account/Container
> 10 GbE network

The 5 PB size is about quarter scale of the largest known Swift cluster. V.impressive. The 100 PB installation that RAX runs consists of 6 federated clusters. Number of objects and request rate are unknown. 1000/42 comes to about 25-30 disks per server, but they mention 45-disk JBODs later, with plans to move to 90-disk JBODs. Nodes of large clusters continue getting fatter.

The cluster has been in operation since 2011 (it started with the Diablo release). They still use Pound for load-balancing.

April 23, 2017 03:04 AM

April 20, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/19


[ Apologies for the delay – I have been a little sick for the past day or so and was out on Monday volunteering at the Boston Marathon, so my evenings have been in scarce supply to get this week’s issue completed ]

In this week’s edition: Linus Torvalds announces Linux 4.11-rc7, a kernel security update bonanza, the end of Kconfig maintenance, automatic NUMA balancing, movable memory, a bug in synchronize_rcu_tasks, and ongoing development. The Linux 4.12 merge window should open before next week.

Linus Torvalds announced Linux 4.11-rc7, noting that “You all know the drill by now. We’re in the late rc phase, and this may be the last rc if nothing surprising happens”. He also pointed out how things had been calm, and then, “as usual Friday happened”, leading to a number of reverts for “things that didn’t work out and aren’t worth trying to fix at this point”. In anticipation of the imminent opening of the 4.12 merge window (period of time during which disruptive changes are allowed) Linux Weekly News posted their usual excellent summary of the 4.11 development cycle. If you want to support quality Linux journalism, you should subscribe to LWN today.

Ted (Theodore) Ts’o posted “[REGRESSION] 4.11-rc: systemd doesn’t see most devices”, in which he noted that “[t]here is a frustrating regression in 4.11 that I’ve been trying to track down. The symptoms are that a large number of systemd devices don’t show up.” (which was affecting the encrypted device mapper target backing his filesystem). He had a back and forth with Greg K-H (Kroah-Hartman) about it, with Greg suggesting Ted watch with udevadm and Ted pointing out that this happens at boot and is hard to trace. Ted’s final comment was interesting: “I’d do more debugging, but there’s a lot of magic these days in the kernel to udev/systemd communications that I’m quite ignorant about. Is this a good place I can learn more about how this all works, other than diving into the udev and systemd sources?”. Indeed. In somewhat interesting timing, Enric Balletbo i Serra later posted a 5 part patch series entitled “dm: boot a mapped device without an initramfs”.

Rafael J. Wysocki posted some late breaking 4.11-rc7 fixes for ACPI, including one patch reverting a “recent ACPICA commit [to the ACPI – Advanced Configuration and Power Interface – Component Architecture, the reference code upon which the kernel’s runtime interpreter is based] targeted at catching firmware bugs” that did do so, but also caused “functional problems”.


Jiri Slaby announced Linux 3.12.73.

Greg KH (Kroah-Hartman) announced Linux 3.18.49, 4.4.62, 4.9.23, and 4.10.11. As he noted in his review posting prior to announcing the latest 3.18 kernel, 3.18 was indeed “dead and forgotten and left to rot on the side of the road” but “unfortunately, there’s a few million or so devices out there in the wild that still rely on this kernel”. Important security fixes are included in all of these updates. Greg doesn’t commit to bringing 3.18 out of retirement for very long, but he does note that Google is assisting a little for the moment to make sure 3.18 based devices get some updates.

Steven Rostedt announced “Real Time” (preempt-rt) kernels 3.2.88-rt126 (“just an update to the new stable 3.2.88 version”), 3.12.72-rt97, and 4.4.60-rt73. Separately, Paul E. McKenney noted that Hannes Weisbach of TU Dresden had published a master’s thesis on quasi-real-time scheduling.

Rafael J. Wysocki announced a CFP (Call For Papers, or rather a “Call for topics”) for the upcoming LPC (Linux Plumbers Conference) Power Management and Energy-Awareness microconference. Registration for LPC has just opened.

Yann E. MORIN posted “MAINTAINERS: relinquish kconfig” in which he apologized for not having had enough time to maintain Kconfig: “I’ve been almost entirely absent, which totally sucks, and there is no excuse for my behavior and for not having relinquished this earlier”. With such harsh friends as yourself, who needs enemies? Joking aside, this is sad news, since Kconfig is the core infrastructure used to configure the kernel. It wasn’t long before someone else (Randy Dunlap) posted a patch against the now-unmaintained Kconfig (Randy’s patch implements a sort method for config options).

[as an aside, as usual, I have pinged folks who might be looking for an opportunity to encourage them to consider stepping up to take this on].

Automatic NUMA balancing, movable memory, and more!

Mel Gorman posted “mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa”. Modern Linux kernels include a feature known as automatic NUMA balancing which relies upon marking regions of virtual memory as inaccessible via their page table entries (PTEs) and setting a special prot_numa protection hinting bit. The idea is that a later “NUMA hinting fault” on access to the page will allow the Operating System to determine whether it should migrate the page to another NUMA node. Pages are simply small granular units of system memory that are managed by the kernel in setting up translations from virtual to physical memory. When an access to a virtual address occurs, hardware (or, on some architectures, special software) “walkers” navigate the “page tables” pointed to by a special system register. The walker will traverse various “directories” formed from collections of pages in a hierarchical fashion, intended to require less space to store page tables than if entries were required for every possible virtual address in a 32 or 64-bit space.

Contemporary microprocessors also support multiple page (granule) sizes, with a fundamental size (commonly 4K or 64K) being supplemented by the ability for larger pages (aka “hugepages”) to be used for very large regions of contiguous virtual memory at less overhead. Common sizes of huge pages are 2MB, 4MB, 512MB, and even multi-GB, with “contiguous hint bits” on some modern architectures allowing for even greater flexibility in the footprint of page table and TLB (Translation Lookaside Buffer) entries by only requiring physical entries for a fraction of a contiguous region. On Intel x86 Architecture, huge pages are implemented using the Page Size Extensions (PSE), which allow a PMD (Page Middle Directory) entry to be replaced by one that maps the entire range as a single large page. When a hardware walker sees this, a single TLB entry can be used for an entire range of a few MB instead of many 4K entries.

A bug known as a “race condition” existed in the automatic NUMA hinting code, in which change_pmd_range would perform a number of checks without a lock being held to protect against a racing parallel protection update (which does happen under a lock) that would clear the PMD and fill it with a prot_numa entry. Mel adds a new pmd_none_or_trans_huge_or_clear_bad function that correctly handles this rare corner case sequence, and documents it (in mm/mprotect.c). Michal Hocko responded with “you will probably win the_longer_function_name_contest but I do not have [a] much better suggestion”.

Speaking of Michal Hocko, he posted version 2 of a patch series entitled “mm: make movable onlining suck less” in which he described the current status quo of “Movable onlining” as “a real hack with many downsides”. Linux divides memory into regions describing zones with names like ZONE_NORMAL (for regular system memory) and ZONE_MOVABLE (for memory whose contents consist entirely of pages that don’t contain unmovable system data or firmware data, and that can therefore be trivially moved/offlined/etc.).

The existing implementation has a number of constraints around which pages can be onlined. In particular, around the relative placement of the memory being onlined vs the ZONE_NORMAL memory. This, Michal described as “mainly reintroduction of lowmem/highmem issues we used to have on 32b systems – but it is the only way to make the memory hotremove more reliable which is something that people are asking for”. His patch series aims to make “the onlining semantic more usable [especially when driven by udev]…it allows to online memory movable as long as it doesn’t clash with the existing ZONE_NORMAL. That means that ZONE_NORMAL and ZONE_MOVABLE cannot overlap”. He noted that he had discussed this patch series with Jérôme Glisse (author of the HMM – Heterogenous Memory Management – patches) which were to be rebased on top of this patch series. Michal said he would assist with resolving any conflicts.

Igor Mammedov (Red Hat) noted that he had “given [the movable onlining] series some dumb testing” and had found three issues with it, which he described fully. In summary, these were “unable to online memblock as NORMAL adjacent to onlined MOVABLE”, “dimm1 assigned to node 1 on qemu CLI memblock is onlined as movable by default”, and “removable flag flipped to non-removable state”. Michal wasn’t initially able to reproduce the second issue (because he didn’t have ACPI_HOTPLUG_MEMORY enabled in his kernel) but was then able to follow up, noting that it was similar to another bug he had already fixed. Jérôme subsequently followed up with an updated HMM patchset as well.

Joonsoo Kim (LGE) posted version 7 of a patch series entitled “Introduce ZONE_CMA” in which he reworks the CMA (Contiguous Memory Allocator) used by Linux to manage large regions of physically contiguous memory that must be allocated (for device DMA buffers in cases where scatter-gather DMA or an IOMMU is not available to manage translations). In the existing CMA implementation, physically contiguous pages are reserved at boot time, but they operate much as reserved memory that happens to fall within ZONE_NORMAL (but with a special “migratetype”, MIGRATE_CMA), and will not generally be used by the system for regular memory allocations unless there are no movable freepages available. In other words, only as a last possible resort.

This means that on a system with 1024MB of memory, kswapd “is mostly woke[n] up when roughly 512MB free memory is left”. The new patches instead create a distinct ZONE_CMA which has some special properties intended to address utilization issues with the existing implementation. As he notes, he had a lengthy discussion with Mel Gorman after the LSF/MM 2016 conference last year, in which Mel stated “I’m not going to outright NAK your series but I won’t ACK it either”. A lot of further discussion is anticipated. Michal Hocko might have summarized it best with, “the cover letter didn’t really help me to understand the basic concepts to have a good starting point before diving into the implementation details [to review the patches]”. Joonsoo followed up with an even longer set of answers to Michal.

A bug in synchronize_rcu_tasks()

Paul E. McKenney posted “There is a Tasks RCU stall warning” in which he noted that he and Steven Rostedt were seeing a stall that didn’t report until it had waited 10 minutes (and recommended that Steven try setting the kernel rcupdate.rcu_task_stall_timeout boot parameter). RCU (Read Copy Update) is a clever mechanism used by Linux (under a GPL license from IBM, who own a patent on the underlying technology) to perform lockless updates to certain types of data structure, by tracking versions of the structure and freeing an older version only after a grace period has elapsed, that is, once every CPU has passed through a quiescent state (such as a context switch); synchronize_rcu() is the call that waits for such a grace period to complete.

Steven noted that for the issue under discussion there was a thread that “never goes to sleep, but will call cond_resched() periodically [a function that is intended to possibly call into the scheduler if there is work to be done there]”. On the RT (Real Time, “preempt-rt”) kernel, Steven noted that cond_resched() is a nop and that the code he had been working on should have made a call directly to the schedule() function. Which led to him suggesting he had “found a bug in synchronize_rcu_tasks()” in the case that a task frequently calls schedule() but never actually performs a context switch. In that case, per Paul’s subsequent patch, the kernel is patched to specially handle calls to schedule() not due to regular preemption.

Ongoing Development

Anshuman Khandual posted “mm/madvise: Clean up MADV_SOFT_OFFLINE and MADV_HWPOISON” noting that “madvise_memory_failure() was misleading to accommodate handling of both memory_failure() as well as soft_offline_page() functions. Basically it handles memory error injection from user space which can go either way as memory failure or soft offline. Renamed as madvise_inject_error() instead.” The madvise infrastructure allows for coordination between kernel and userspace about how the latter intends to use regions of its virtual memory address space. Using this interface, it is possible for applications to provide hints as to their future usage patterns, relinquish memory that they no longer require, inject errors, and much more. This is particularly useful to KVM virtual machines, which appear as regular processes and can use madvise() to control their “RAM”.

Sricharan R (Codeaurora) posted version 11 of a patch series entitled “IOMMU probe deferral support”, which “calls the dma ops configuration for the devices at a generic place so that it works for all busses”.

Kishon Vijay Abraham sent a pull request to Greg K-H (Kroah-Hartman) for Linux 4.12 that included individual patches in addition to the pull itself. This resulted in an interesting side discussion between Kishon and Lee Jones (Linaro) about how this was “a strange practice” Lee hadn’t seen before.

Thomas Garnier (Google) posted version 7 of a patch series entitled “syscalls: Restore address limit after a syscall” which “ensures a syscall does not return to user-mode with a kernel address limit. If that happened, a process can corrupt kernel-mode memory and elevate privileges”. Once again, he cites how this would have preemptively mitigated a Google Project Zero security bug.

Christopher Bostic posted version 6 of a patch series enabling support for the “Flexible Service Interface” (FSI) high fan-out bus on IBM POWER systems.

Dan Williams (Intel) posted “x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions” in which he says “Before we rework the “pmem api” to stop abusing __copy_user_nocache() for memcpy_to_pmem() we need to fix cases where we may strand dirty data in the cpu cache.”

Leo Yan (Linaro) posted an RFC (Request For Comments) patch series entitled “coresight: support dump ETB RAM” which enables support for the Embedded Trace Buffer (ETB) on-chip storage of trace data. This is a small buffer (usually 2KB to 8KB) containing profiling data used for postmortem debug.

Thierry Escande posted “Google VPD sysfs driver”, which provides support for “accessing Google Vital Product Data (VPD) through the sysfs”.

Alex(ander) Graf posted version 6 of “kvm: better MWAIT emulation for guests”, which provides new capability information to user space in order for it to inform a KVM guest of the availability of native MWAIT instruction support. MWAIT allows a (guest) kernel to wake up a remote (v)CPU without an IPI – InterProcessor Interrupt – and the associated vmexit that would then occur to schedule the remote vCPU for execution. The availability of MWAIT is deliberately not provided in the normal CPUID bitmap since “most people will want to benefit from sleeping vCPUs to allow for over commit” (in other words with MWAIT support, one can arrange to keep virtual CPUs runnable for longer and this might impact the latency of hosting many tenants on the same machine).

David Woodhouse posted version 2 of his patch series entitled “PCI resource mmap cleanup” which “pursues my previous patch set all the way to its logical conclusion”, killing off “the legacy arch-provided pci_mmap_page_range() completely, along with its vile ‘address converted by pci_resource_to_user()’ API and the various bugs and other strange behavior that various architectures had”. He noted that to “accommodate the ARM64 maintainers’ desire *not* to support [the legacy] mmap through /proc/bus/pci I have separated HAVE_PCI_MMAP from the sysfs implementation”. This had previously been called out since older versions of DPDK were looking for the legacy API and failing as a result on newer ARM server platforms.

Darren Hart posted an RFC (Request For Comments) patch series entitled “WMI Enhancements” that seeks to clean up the “parallel efforts involving the Windows Management Instrumentation (WMI) and dependent/related drivers”. He wanted to have a “round of discussion among those of you that have been involved in this space before we decide on a direction”. The proposed direction is to “convert[] wmi into a platform device and a proper bus, providing devices for dependent drivers to bind to, and a mechanism for sibling devices to communicate with each other”. In particular, it includes a capability to expose WMI devices directly to userspace, which resulted in some pushback (from Pali Rohár) and a suggestion that some form of explicit whitelisting of wmi identifiers (GUIDS) should be used instead. Mario Limonciello (Dell) had many useful suggestions.

Wei Wang (Intel) posted version 9 of a patch series entitled “Extend virtio-balloon for fast (de)inflating & fast live migration” in which he “implements two optimizations”. The first “tranfer[s] pages in chunks between the guest and host”. The second “transfer[s] the guest unused pages to the host so that they can be skipped in live migration”.

Dmitry Safonov posted “ARM32: Support mremap() for sigpage/vDSO” which allows CRIU (Checkpoint/Restore In Userspace) to complete its process of restoring all application VMA (Virtual Memory Area) mappings on restart by adding the ability to move the vDSO (Virtual Dynamic Shared Object) and sigpage kernel pages (data explicitly mapped into every process by the kernel to accelerate certain operations) into “the same place where they were before C/R”.

Matias Bjørling (Cnex Labs) prepared a git pull request for “LightNVM” targeting Linux 4.12. This is “a new host-side translation layer that implements support for exposing Open-Channel SSDs as block devices”.

Greg Thelen (Google) posted “slab: avoid IPIs when creating kmem caches”. Linux’s SLAB memory allocator (see also the paper on the original Solaris memory allocator) can be used to pre-allocate small caches of objects that can then be efficiently used by various kernel code. When these are allocated, per-cpu array caches are created, and a call is made to kick_all_cpus_sync(), which schedules all processors to run code ensuring that there are no stale references to the old array caches. This global call is performed using an IPI (InterProcessor Interrupt), which is relatively expensive, and is wasteful in the case that a new cache is being created (rather than replacing an old one): in the example given, 47,741 such IPIs were generated versus 1,170 in a patched kernel.

April 20, 2017 08:32 AM

April 19, 2017

Kernel Podcast: One Day Delay Due to Boston Marathon

The Podcast is delayed until Wednesday evening this week. Usually, I try to get it out on a Monday night (or at least write it up then and actually post on Tuesday), but when holidays or other events fall on a Monday, I will generally delay the podcast by a day. This week, I was volunteering at the Marathon all of Monday, which means the prep is taking place Tuesday night instead.

April 19, 2017 04:15 AM

April 16, 2017

Paul E. McKenney: Book review: "Fooled by Randomness" and "The Black Swan"

I avoided “The Black Swan” for some years because I was completely unimpressed with the reviews. However, I was sufficiently impressed by a recent Nassim Taleb essay to purchase his “Incerto” series. I have read the first two books so far (“Fooled by Randomness” and “The Black Swan”), and highly recommend both of them.

The key point of these two books is that in real life, extremely low-probability events can have extreme effects, and such events are the black swans of the second book's title. This should be well in the realm of common sense: Things like earthquakes, volcanoes, tidal waves, and asteroid strikes should illustrate this point. A follow-on point is that low-probability events are inherently difficult to predict. This also should be non-controversial: The lower the probability, the less the frequency, and thus the less the experience with that event. And of my four examples, we are getting semi-OK at predicting volcanic eruptions (Mt. St. Helens being perhaps the best example of a predicted eruption), not bad at tidal waves (getting this information to those who need it still being a challenge), and hopeless at earthquakes and asteroid strikes.

Taleb then argues that the increasing winner-takes-all nature of our economy increases the frequency and severity of economic black-swan events, in part by rendering normal-distribution-based statistics impotent. If you doubt this point, feel free to review the economic events of the year 2008. He further argues that this process began with the invention of writing, which allowed one person to have an outsized effect on contemporaries and on history. I grant that modern transportation and communication systems can amplify black-swan events in ways that weren't possible in prehistoric times, but would argue that individual prehistoric people had just as much fun with the black swans of the time, including plague, animal attacks, raids by neighboring tribes, changes in the habits of prey, and so on. Nevertheless, I grant Taleb's point that most prehistoric black swans didn't threaten the human race as a whole, at least with the exception of asteroid strikes.

My favorite quote of the book is “As individuals, we should love free markets because operators in them can be as incompetent as they wish.” My favorite question is implied by his surprise that so few people embrace both sexual and economic freedom. Well, ask a stupid question around me and you are likely to get a stupid answer. Here goes: Contraceptives have not been in widespread use for long enough for human natural selection to have taken much account of their existence. Therefore, one should expect the deep subconscious to assume that sexual freedom will produce lots of babies, and that these babies will need care and feeding. Who will pay for this? The usual answer is “everyone” with consequent restrictions on economic freedom. If you don't like this answer, fine, but please consider that it is worth at least what you are paying for it. ;–)

So what does all of this have to do with parallel programming???

As it turns out, quite a lot.

But first, I will also point out my favorite misconception in the book, which is that NP has all that much to do with incomputability. On the other hand, the real surprise is that the trader-philosopher author would say anything at all about them. Furthermore, Taleb would likely point out that in the real world, the distinction between “infeasible to compute” and “impossible to compute” is a distinction without a difference.

The biggest surprise for me personally from these books is that one of the most feared categories of bugs, race conditions, consists not of black-swan bugs, but of white-swan bugs. They are quite random, and very amenable to the Gaussian statistical tools that Taleb so rightly denigrates for black-swan situations. You can even do finite amounts of testing and derive good confidence bounds for the reliability of your software—but only with respect to white-swan bugs such as race conditions. So I once again feel lucky to have the privilege of working primarily on race conditions in concurrent code!

What is a black-swan bug? One class of such bugs caused me considerable pain at Sequent in the 1990s. You see, we didn't have many single-CPU systems, and we not infrequently produced software that worked only on systems with at least two CPUs. Arbitrarily large amounts of testing on multi-CPU systems would fail to spot such bugs. And perhaps you have encountered bugs that happened only at specific times in specific states, or as they are sometimes called, “new moon on Tuesdays” bugs.

Taleb talks about using mathematics from fractals to turn some classes of black-swan events into grey-swan events, and something roughly similar can be done with validation. We have an ever-increasing suite of bugs that people have injected in the past, and we can make some statements about how likely someone is to make that same error again. We can then use this experience to guide our testing efforts, as I try to do with the rcutorture test suite. That said, I expect to continue pursuing additional bug-spotting methods, including formal verification. After all, the fact that race conditions are not black swans does not necessarily make them easy, particularly in cases, such as the Linux kernel, where there are billions of users.

In short, ignore the reviews of “Fooled by Randomness” and “The Black Swan”, including this one, and go read the actual books. If you only have time to read one of them, you should of course pick one at random. ;–)

April 16, 2017 09:13 PM

April 13, 2017

Pete Zaitcev: Amazon Snowmobile

I don't know how I missed this, it should've been hyped. But here it is: an Amazon truck trailer, which is basically a giant Snowball, used to overcome the data inertia. Apparently, its capacity is 100 PB (the article is not clear about it, but it mentions that 10 Snowmobiles transfer an EB). The service apparently works one way only: you cannot use a Snowmobile to download your data from Amazon.

P.S. Amazon's official page for Snowmobile confirms the 100 PB capacity.

April 13, 2017 04:59 AM

April 12, 2017

Matthew Garrett: Disabling SSL validation in binary apps

Reverse engineering protocols is a great deal easier when they're not encrypted. Thankfully most apps I've dealt with have been doing something convenient like using AES with a key embedded in the app, but others use remote protocols over HTTPS and that makes things much less straightforward. MITMProxy will solve this, as long as you're able to get the app to trust its certificate, but if there's a built-in pinned certificate that's going to be a pain. So, given an app written in C running on an embedded device, and without an easy way to inject new certificates into that device, what do you do?

First: The app is probably using libcurl, because it's free, works and is under a license that allows you to link it into proprietary apps. This is also bad news, because libcurl defaults to having sensible security settings. In the worst case we've got a statically linked binary with all the symbols stripped out, so we're left with the problem of (a) finding the relevant code and (b) replacing it with modified code. Fortunately, this is much less difficult than you might imagine.

First, let's find where curl sets up its defaults. Curl_init_userdefined() in curl/lib/url.c has the following code:
set->ssl.primary.verifypeer = TRUE;
set->ssl.primary.verifyhost = TRUE;
#ifdef USE_TLS_SRP
set->ssl.authtype = CURL_TLSAUTH_NONE;
#endif
set->ssh_auth_types = CURLSSH_AUTH_DEFAULT; /* defaults to any auth type */
set->general_ssl.sessionid = TRUE; /* session ID caching enabled by default */
set->proxy_ssl = set->ssl;

set->new_file_perms = 0644; /* Default permissions */
set->new_directory_perms = 0755; /* Default permissions */

TRUE is defined as 1, so we want to change the code that currently sets verifypeer and verifyhost to 1 to instead set them to 0. How to find it? Look further down - new_file_perms is set to 0644 and new_directory_perms is set to 0755. The leading 0 indicates octal, so these correspond to decimal 420 and 493. Passing the file to objdump -d (assuming a build of objdump that supports this architecture) will give us a disassembled version of the code, so time to fix our problems with grep:
objdump -d target | grep --after=20 ,420 | grep ,493

This gives us the disassembly of target, searches for any occurrence of ",420" (indicating that 420 is being used as an argument in an instruction), prints the following 20 lines and then searches for a reference to 493. It spits out a single hit:
43e864: 240301ed li v1,493
Which is promising. Looking at the surrounding code gives:
43e820: 24030001 li v1,1
43e824: a0430138 sb v1,312(v0)
43e828: 8fc20018 lw v0,24(s8)
43e82c: 24030001 li v1,1
43e830: a0430139 sb v1,313(v0)
43e834: 8fc20018 lw v0,24(s8)
43e838: ac400170 sw zero,368(v0)
43e83c: 8fc20018 lw v0,24(s8)
43e840: 2403ffff li v1,-1
43e844: ac4301dc sw v1,476(v0)
43e848: 8fc20018 lw v0,24(s8)
43e84c: 24030001 li v1,1
43e850: a0430164 sb v1,356(v0)
43e854: 8fc20018 lw v0,24(s8)
43e858: 240301a4 li v1,420
43e85c: ac4301e4 sw v1,484(v0)
43e860: 8fc20018 lw v0,24(s8)
43e864: 240301ed li v1,493
43e868: ac4301e8 sw v1,488(v0)

Towards the end we can see 493 being loaded into v1, and v1 then being copied into an offset from v0. This looks like a structure member being set to 493, which is what we expected. Above that we see the same thing being done to 420. Further up we have some more stuff being set, including a -1 - that corresponds to CURLSSH_AUTH_DEFAULT, so we seem to be in the right place. There's a zero above that, which corresponds to CURL_TLSAUTH_NONE. That means that the two 1 operations above the -1 are the code we want, and simply changing 43e820 and 43e82c to 24030000 instead of 24030001 means that our targets will be set to 0 (ie, FALSE) rather than 1 (ie, TRUE). Copy the modified binary back to the device, run it and now it happily talks to MITMProxy. Huge success.

(If the app calls Curl_setopt() to reconfigure the state of these values, you'll need to stub those out as well - thankfully, recent versions of curl include a convenient string "CURLOPT_SSL_VERIFYHOST no longer supports 1 as value!" in this function, so if the code in question is using semi-recent curl it's easy to find. Then it's just a matter of looking for the constants that CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER are set to, following the jumps and hacking the code to always set them to 0 regardless of the argument)


April 12, 2017 06:10 PM

April 11, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/11


In this week’s edition: Linus Torvalds announces Linux 4.11-rc6, Intel Memory Bandwidth Allocation (MBA), Coherent Device Memory (CDM), Paravirtualized Remote TLB Flushing, kernel lockdown, the latest on Intel 5-level paging, and other assorted ongoing development activities.

Linus Torvalds announced Linux 4.11-rc6. In his mail, Linus notes that “Things are looking fairly normal [for this point in the development cycle]…The only slightly unusual thing is how the patches are spread out, with almost equal parts of arch updates, drivers, filesystems, networking and “misc”.” He ends “Go and get it”. Thorsten Leemhuis followed up with “Linux 4.11: Reported regressions as of Sunday, 2017-04-09”, his third regression report for 4.11. Which “lists 15 regressions I’m currently aware of. 5 regressions mentioned in last week[‘]s report got fixed”. Most appear to be driver problems, but there is one relating to audit, and one in inet6_fill_ifaddr that is stalled waiting for “feedback from reporter”.

Stable kernels

Greg K-H (Kroah-Hartman) announced Linux kernels 4.4.60, 4.9.21, and 4.10.9

Ben Hutchings announced Linux 3.2.88 and 3.16.43

Jason A. Donenfeld pointed out that Linux 3.10 “is inexplicably missing crypto_memneq, making all crypto mac [Message Authentication Code] comparisons use non constant-time comparisons. Bad news bears” [presumably a concern about side-channel attacks]. Willy followed up noting that he would “check if the 3.12 patches…can be safely backported”.

Memory Bandwidth Allocation (Intel Resource Director Technology, RDT)

Vikas Shivappa (Intel) posted version 4 of a patch series entitled “x86/intel_rdt: Intel Memory bandwidth allocation”, addressing feedback from the previous iteration that he had received from Thomas Gleixner. The MBA (Memory Bandwidth Allocation) technology is described both in the kernel Documentation patch (provided) as well as in various Intel papers and materials available online. Intel provide a construct known as a “Class of Service” (CLOS) on certain contemporary Xeon processors, as part of their CAT (Cache Allocation Technology) feature, which is itself part of a larger family of technologies known as “Intel Resource Director Technology” (RDT). These CLOSes “act as a resource control tag into which a thread/app/VM/container can be grouped”.

It appears that a feature of Intel’s L3 cache (LLC in Intel-speak) in these parts is that they can not only assign specific proportions of the L3 cache slices on the Xeon’s ring interconnect to specific resources (e.g. “tasks” – otherwise known as processes, or applications) but also can control the amount of memory bandwidth granted to these. This is easier than it sounds. From a technical perspective, Intel integrate their memory controller onto their dies, and contemporary memory controllers already perform fine grained scheduling (this is how they bias memory reads for speculative loads of the instruction stream in among the other traffic, as just one simple example). Therefore, exposing memory bandwidth control to the cache slices isn’t all that more complex. But it is cute, and looks great in marketing materials.
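On kernels and parts with RDT support, these controls are exposed through the resctrl filesystem. The following is an illustrative configuration sketch only (the group name and percentage are invented, and the exact schemata format depends on kernel version and CPU support; see Documentation/x86/intel_rdt_ui.txt):

```shell
# Mount the resctrl filesystem (requires RDT-capable hardware and
# kernel support; all of this needs root).
mount -t resctrl resctrl /sys/fs/resctrl

# Create a resource group and cap its memory bandwidth on socket 0.
mkdir /sys/fs/resctrl/low_prio
echo "MB:0=20" > /sys/fs/resctrl/low_prio/schemata

# Move a task (by PID) into the throttled group.
echo 1234 > /sys/fs/resctrl/low_prio/tasks
```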

Coherent Device Memory (CDM) on top of HMM

Jérôme Glisse posted an RFC [Request for Comments] patch series entitled “Coherent Device Memory (CDM) on top of HMM”. His previous HMM (Heterogenous Memory Management) patch series, now in version 19, implemented support for (non-coherent) device memory to be mapped into regular process address space, by leveraging the ability of certain contemporary devices to fault on access to untranslated addresses managed in device page tables, thus allowing for a kind of pageable device memory and transparent management of ownership of the memory pages between application processor cores and (e.g.) a GPU or other acceleration device. The latest patch series builds upon HMM to also support coherent device memory (via a new ZONE_DEVICE memory type – see also the recent postings from IBM in this area). As Jérôme notes, “Unlike the unaddressable memory type added with HMM patchset, the CDM [Coherent Device Memory] type can be access[ed] by [the] CPU.” He notes that he wanted to kick off this RFC more for the conversation it might provoke.

In his mail, Jérôme says, “My personal belief is that the hierarchy of memory is getting deeper (DDR, HBM stack memory, persistent memory, device memory, …) and it may make sense to try to mirror this complexity within mm concept. Generalizing the NUMA abstraction is probably the best starting point for this. I know there are strong feelings against changing NUMA so i believe now is the time to pick a direction”. He’s right of course. There have been a number of patch series recently also targeting accelerators (such as FPGAs), and more can be anticipated for coherently attached devices in the future. [This author is personally involved in CCIX]

Hyper-V: Paravirtualized Remote TLB Flushing and Hypercall Improvements

Vitaly Kuznetsov (Red Hat) posted “Hyper-V: paravirtualized remote TLB flushing and hypercall improvements”. It turns out that Microsoft’s Hyper-V hypervisor supports hypercalls (calls into the hypervisor from the guest OS) for “doing local and remote TLB [Translation Lookaside Buffer] flushing”. Translation Lookaside Buffers are caches built into microprocessors that store the translation of a CPU virtual address to a “physical” (or, for a virtual machine, intermediate hypervisor) address. They save an unnecessary page table walk (of the software-managed or hardware/software structure – depending upon architecture – that “walkers” navigate to perform a translation during a “page fault”, or unhandled memory access, such as happens constantly when demand loading/faulting in application code and data, or sharing read-only data provided by shared libraries). TLBs are generally transparent to the OS, except that they must be explicitly managed under certain conditions – such as when invalidating regions of virtual memory or performing certain context switches (depending upon the provisioning of address space and virtual memory tag IDs in the architecture).

TLB invalidates on local processor cores normally use special CPU instructions, and this is certainly also true under virtualization. But virtual addresses used by a particular process (known as a task within the kernel) might also be in use by other cores that have touched the same virtual memory space, and those translations need to be invalidated too. Some architectures include sophisticated hardware broadcast invalidation of TLBs, but other, legacy architectures don’t provide these kinds of capabilities. On architectures without hardware broadcast, it is typically necessary to use a construct known as an IPI (Inter-Processor Interrupt) to cause an interrupt message to be delivered to the remote interrupt controller CPU interface (e.g. the LAPIC on Intel x86) of the destination core, which runs an IPI handler in response that performs the TLB teardown.

As Vitaly notes, nobody is recommending doing a local TLB flush using a hypercall, but there can be significant performance improvement in using a hypercall for the remote invalidates. In the example cited, which uses “a special ‘TLB trasher'”, he demonstrates how a 16 vCPU guest experienced a greater than 25% performance improvement using the hypercall approach.

Ongoing Development

David Howells posted a magnum opus entitled “Kernel lockdown”, which aims to “provide a facility by which a variety of avenues by which userspace can feasibly modify the running kernel image can be locked down”. As he says, “The lock-down can be configured to be triggered by the EFI secure boot status, provided the shim isn’t insecure. The lock-down can be lifted by typing SysRq+x on a keyboard attached to the system” [physical presence]. Among many other things, these patches (versions of which have been in distribution kernels for a while) change kernel behavior to include “No unsigned modules and no modules for which [we] can’t validate the signature”, disable many hardware access functions, turn off hibernation, prevent kexec_load(), and limit some debugging features. Justin Forbes of the Fedora Project noted that he had (obviously) tested these. One of the many interesting sets of patches included a feature to “Annotate hardware config module parameters”, which allows modules to mark unsafe options. Following some pushback, David also followed up with a rationale for doing kernel lockdown, entitled “Why kernel lockdown?”. Worth reading.

Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he (bravely) took Ingo’s request to “rewrite assembly parts of boot process into C before bringing 5-level paging support”. He says, “The only part where I succeed is startup_64 in arch/x86/kernel/head_64.S. Most of the logic is now in C.” He also renames the level 4 page tables “init_level4_pgt” and “early_level4_pgt” to “init_top_pgt” and “early_top_pgt”. There was another lengthy discussion around his “Allow to have userspace mappings above 47-bits” patch, which tells the kernel to prefer to do memory allocations below 47 bits (the previous “canonical addressing” limit of Intel x86 processors, which some JITs and other code exploit by abusing the top bits of pointers for illegal tags, breaking compatibility with an extended virtual address space). The patch allows mmap calls with MAP_FIXED hints to cause larger allocations. There was some concern that the larger VM space is ABI and must be handled with care. A footnote here is that (apparently, from the patch) Intel MPX (Memory Protection Extension) doesn’t yet work with LA57 (the larger address space feature), and so Kirill avoids enabling both in the same process.

Christopher Bostic posted version 5 of a patch series entitled “FSI driver implementation”. This is support for POWER’s [Performance Optimization With Enhanced RISC, for those who ever wondered – this author used to have a lot of interest in PowerPC back in the day] “Flexible Support Interface” (FSI), a “high fan out serial bus” whose specification seems to have appeared on the OpenPower Foundation website recently also.

Kishon Vijay Abraham posted “PCI: Support for configurable PCI endpoint”, which Bjorn finally pulled into his tree in anticipation of the upcoming 4.12 merge cycle. For those who haven’t seen Kishon’s awesome presentation “Overview of PCI(e) Subsystem” for Embedded Linux Conference Europe, you are encouraged to watch it at least several times. He really knows his stuff, and has done an excellent job producing a high quality generic PCIe endpoint driver for Linux.

Ard Biesheuvel posted “EFI fixes for v4.11”, which among other goodies includes a fix for EFI GOP (Graphics Output Protocol) support on systems built using the 64-bit ARM Architecture, which uses firmware assignment of PCIe BAR resources. Ard and Alex Graf have done some really fun work with graphics cards on 64-bit ARM lately – including emulating x86 option ROMs. Ard also had some fixes prepared for v4.12 that he announced, including a bunch of cleanup to the handling of FDT (Flattened Device Tree) memory allocation. Finally, he added support for the kernel’s “quiet” command line option, to remove extraneous output from the EFI stub on boot.

Srikar Dronamraju and Michal Hocko had a back and forth on the former’s “sched: Fix numabalancing to work with isolated cpus” patch, which does what it says on the tin. Michal was a little concerned that NUMA balancing wasn’t automatically applied even to isolated CPUs, but others (including Peter Zijlstra) noted that this absolutely is the intended behavior.

Ying Huang (Intel) posted version 8 of his “THP swap: Delay splitting THP during swapping out”, which essentially allows paging of (certain) huge pages. He also posted version 2 of “mm, swap: Sort swap entries before free”, which sorts consecutive swap entries in a per-CPU buffer into order according to their backing swap device before freeing those entries. This reduces needless acquiring/releasing of locks and improves performance.

Will Deacon posted version 2 of a patch series entitled “drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension”. The “SPE” (Statistical Profiling Extension) “can be used to profile a population of operations in the CPU pipeline after instruction decode. These are either architected instructions (i.e. a dynamic instruction trace) or CPU-specific uops and the choice is fixed statically in the hardware and advertised to userspace via caps. Sampling is controlled using a sampling interval, similar to a regular PMU counter, but also with an optional random perturbation”. He notes that the “in-memory buffer is linear and virtually addressed, raising an interrupt when it fills up” [which makes it nice for software folks to use].

Binoy Jayan posted “IV [Initial Vector] Generation algorithms for dm-crypt”, the goal of which “is to move these algorithms from the dm layer to the kernel crypto layer by implementing them as template ciphers”.

Joerg Roedel posted “PCI: Add ATS-disable quirk for AMD Stoney GPUs”. Then, he posted a followup with a minor fix based upon feedback. This should close the issue of certain bug reports posted by those using an IOMMU on a Stoney platform and seeing lockups under high TLB invalidation.

Bjorn Helgaas posted “PCI fixes for v4.11”, which includes “fix ThunderX legacy firmware resources”, a PCI quirk for certain ARM server platforms.

Paul Menzel reported “`pci_apply_final_quirks()` taking half a second”, which David Woodhouse (who wrote the code to match PCIe devices against the quirk list “back in the mists of time”) posited was perhaps down to “spending a fair amount of time just attempting to match each device against the list”. He wondered “if it’s worth sorting the list by vendor ID or something, at least for the common case of the quirks which match on vendor/device”. There was a general consensus that cleanup would be nice, if only someone had the time and the inclination to take a poke at it.

Seth Forshee (Canonical) posted “audit regressions in 4.11”, in which he noted that ever since the merging of “audit: fix auditd/kernel connection state tracking”, the kernel will now queue up audit messages indefinitely for delivery to the (userspace) audit daemon if it is not running – ultimately crashing the machine. Paul Moore thanked him for the report and there was a back and forth on the best way to handle the case of no audit daemon running.

Neil Brown posted a patch entitled “NFS: fix usage of mempools”. As he notes in his patch, “When passed GFP [Get Free Page] flags that allow sleeping (such as GFP_NOIO), mempool_alloc() will never return NULL, it will wait until memory is available…This means that we don’t need to handle failure, but that we do need to ensure one thread doesn’t call mempool_alloc twice on the one pool without queuing or freeing the first allocation”. He then cites “pnfs_generic_alloc_ds_commits” as an unsafe function and provides a fix.

Finally, Kees Cook followed up (as he had promised) on a discussion from last week, with an RFC (Request for Comments) patch series entitled “mm: Tighten x86 /dev/mem with zeroing”, including the suggestion from Linus that reads from /dev/mem that aren’t permitted simply return zero data. This was just one of many security discussions he was involved in (as usual). Another involved a patch posted by Eddie Kovsky entitled “module: verify address is read-only”, which modifies kernel functions that use modules to verify that they are in the correct kernel ro_after_init memory area and “reject structures not marked ro_after_init”.

April 11, 2017 03:15 PM

April 10, 2017

Daniel Vetter: Review, not Rocket Science

About a week ago there were two articles on LWN, the first covering memory management patch review and the second covering the trouble with making review happen. The take-away from these two articles seems to be that review is hard, there’s a constant lack of capable and willing reviewers, and this has been the state of review since forever. I’d like to counterpose this with our experience in the graphics subsystem, where we’ve rolled out a well-working review process for the Intel driver, core subsystem and now the co-maintained small driver efforts, with success and not all that much pain.

tl;dr: require review, no exceptions, but document your expectations

Aside: This is written with a kernel focus, from the point of view of a maintainer or group of maintainers trying to establish review within their subsystem. But the principles really work anywhere.

Require Review

When review doesn’t happen, that generally means no one regards it as important enough. You can try to improve the situation by highlighting review work more, and giving less focus to top committer stats. But that only goes so far; in the end, when you want to make review happen, the one way to get there is to require it.

Of course if that then results in massive screaming, then maybe you need to start with improving the recognition of review and valuing it more. Trying to put a new process into place against persistent resistance is not going to work. But as long as there’s general agreement that review is good, this is the easy part.

No Exceptions

The trouble is that there are a few really easy ways to torpedo review before you’ve even started, and they’re all around special privileges and exceptions. From one of the LWN articles:

… requiring reviews might be fair, but there should be one exception: when developers modify their own code.

Another similar exception is often demanded by maintainers for applying their own patches to code they maintain - in the Linux kernel only about 25% of all maintainer patches have any kind of review tag attached when they land. This is in contrast to other contributors, who always have to get past at least their direct maintainer to get a patch applied.

There’s a few reasons why having exceptions for the original developer of some code, or a maintainer of a subsystem, is a really bad idea:

On the flip side, requiring review from all your main contributors is a really easy way to kickstart a working review economy: instantly you have both a big demand for review, and capable reviewers who are very much willing to trade a bit of review for getting reviews on their own patches.

Another easy pitfall is maintainers who demand unconditional NAck rights for the code they maintain, sometimes spiced up by claiming they don’t even need to provide reasons for the rejection. Of course more experienced people know more about the pitfalls of a code base, and hence are more likely to find serious defects in a change. But most often these rejections aren’t about clear bugs, but about design dogmas once established (and perhaps no longer valid), or just plain personal style preferences. Again, this is a great way to prevent review from happening:

And again, I haven’t seen unicorns who write perfect code yet, neither have I seen someone whose review feedback was consistently impeccable.

But Document your Expectations

Training reviews through direct mentoring is great, but it doesn’t scale. Document what you expect from a review as much as possible. This includes everything from coding style, to how much and in which detail code correctness should be checked. But also related things like documentation, test-cases, and process details on how exactly, when and where review is happening.

And like always, executable documentation is much better, hence try to script as much as possible. That’s why build-bots, CI bots, coding style bots, and all these things are so valuable - they free the carbon-based reviewers from wasting time on the easy things and let them instead concentrate on the harder parts of review like code design and overall architecture, and how to best get there from the current code base. But please make sure your scripting and automated testing is of high quality, because if the results need interpretation by someone experienced you haven’t gained anything. The kernel’s coding style checker is a pretty bad example here, since it’s widely accepted that it’s too opinionated and its suggestions can’t be blindly followed.

As some examples we have the dim inglorious maintainer scripts and some fairly extensive documentation on what is expected from reviewers for Intel graphics driver patches. Contrast that to the comparatively lax review guidelines for small drivers in drm-misc. At Intel we’ve also done internal trainings on review best practices and guidelines. Another big thing we’re working on on the automation front is CI crunching through patchwork series to properly regression-test new patches before they land.

April 10, 2017 12:00 AM

April 09, 2017

Matthew Garrett: A quick look at the Ikea Trådfri lighting platform

Ikea recently launched their Trådfri smart lighting platform in the US. The idea of Ikea plus internet security together at last seems like a pretty terrible one, but having taken a look it's surprisingly competent. Hardware-wise, the device is pretty minimal - it seems to be based on the Cypress[1] WICED IoT platform, with 100MBit ethernet and a Silicon Labs Zigbee chipset. It's running the Express Logic ThreadX RTOS, has no running services on any TCP ports and appears to listen on two UDP ports. As IoT devices go, it's pleasingly minimal.

The first of those ports hosts a COAP server running with DTLS and a pre-shared key that's printed on the bottom of the device. When you start the app for the first time it prompts you to scan a QR code that's just a machine-readable version of that key. The Android app has code for using the insecure COAP port rather than the encrypted one, but the device doesn't respond to queries there so it's presumably disabled in release builds. It's also local only, with no cloud support. You can program timers, but they run on the device. The only other service it seems to run is an mdns responder, which responds to the _coap._udp.local query to allow for discovery.

From a security perspective, this is pretty close to ideal. Having no remote APIs means that security is limited to what's exposed locally. The local traffic is all encrypted. You can only authenticate with the device if you have physical access to read the (decently long) key off the bottom. I haven't checked whether the DTLS server is actually well-implemented, but it doesn't seem to respond unless you authenticate first which probably covers off a lot of potential risks. The SoC has wireless support, but it seems to be disabled - there's no antenna on board and no mechanism for configuring it.

However, there's one minor issue. On boot the device grabs the current time from (fine) but also hits . That file contains a bunch of links to firmware updates, all of which are also downloaded over http (and not https). The firmware images themselves appear to be signed, but downloading untrusted objects and then parsing them isn't ideal. Realistically, this is only a problem if someone already has enough control over your network to mess with your DNS, and being wired-only makes this pretty unlikely. I'd be surprised if it's ever used as a real avenue of attack.

Overall: as far as design goes, this is one of the most secure IoT-style devices I've looked at. I haven't examined the COAP stack in detail to figure out whether it has any exploitable bugs, but the attack surface is pretty much as minimal as it could be while still retaining any functionality at all. I'm impressed.

[1] Formerly Broadcom


April 09, 2017 12:16 AM

April 07, 2017

Andi Kleen: Cheat sheet for Intel Processor Trace with Linux perf and gdb

What is Processor Trace

Intel Processor Trace (PT) traces program execution (every branch) with low overhead.

This is a cheat sheet of how to use PT with perf for common tasks

It is not a full introduction to PT. Please read Adding PT to Linux perf or the links from the general PT reference page.

PT support in hardware

CPU Support
Broadwell (5th generation Core, Xeon v4) More overhead. No fine grained timing.
Skylake (6th generation Core, Xeon v5) Fine grained timing. Address filtering.
Goldmont (Apollo Lake, Denverton) Fine grained timing. Address filtering.

PT support in Linux

PT is supported in Linux perf, which is integrated in the Linux kernel.
It can be used through the “perf” command or through gdb.

There are also other tools that support PT: VTune, simple-pt, gdb, JTAG debuggers.

In general it is best to use the newest kernel and the newest Linux perf tools. If that is not possible, older tools and kernels can be used. Newer tools can be used on an older kernel, but may not support all features.

Linux version Support
Linux 4.1 Initial PT driver
Linux 4.2 Support for Skylake and Goldmont
Linux 4.3 Initial user tools support in Linux perf
Linux 4.5 Support for JIT decoding using agent
Linux 4.6 Bug fixes. Support address filtering.
Linux 4.8 Bug fixes.
Linux 4.10 Bug fixes. Support for PTWRITE and power tracing

Many commands require recent perf tools; you may need to update them from a recent kernel tree.

This article covers mainly Linux perf and briefly gdb.


Only needed once.

Allow seeing kernel symbols (as root)

echo 'kernel.kptr_restrict=0' >> /etc/sysctl.conf
sysctl -p

Basic perf command lines for recording PT

ls /sys/devices/intel_pt/format

Check if PT is supported and what capabilities.

perf record -e intel_pt// program

Trace program

perf record -e intel_pt// -a sleep 1

Trace whole system for 1 second

perf record -C 0 -e intel_pt// -a sleep 1

Trace CPU 0 for 1 second

perf record --pid $(pidof program) -e intel_pt//

Trace already running program.

perf has to save the data to disk. The CPU can execute branches much faster than the disk can keep up, so there will be some data loss for code that executes
many instructions. perf has no way to slow down the CPU, so when trace bandwidth > disk bandwidth there will be gaps in the trace. Because of this it is usually not a good idea
to try to save a long trace; work with shorter traces instead. Long traces also take a lot of time to decode.

When decoding kernel data the decoder usually has to run as root.
An alternative is to use the script included with perf

perf script --ns --itrace=cr

Record program execution and display function call graph.

perf script by default “samples” the data (only dumps a sample every 100us).
This can be configured using the –itrace option (see reference below)

Install xed first.

perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64

Show every assembly instruction executed with disassembler.

For this it is also useful to get more accurate time stamps (see below)

perf script --itrace=i0ns --ns -F time,sym,srcline,ip

Show source lines executed (requires debug information)

perf script --itrace=s1Mi0ns ....

Skip the initial 1M instructions while decoding. Often initialization code is not interesting.

perf script --time 1.000,2.000 ...

Slice the trace into different time regions. Generally the time stamps need to be looked up in the trace first, as they are absolute.

perf report --itrace=g32l64i100us --branch-history

Print hot paths every 100us as call graph histograms

Install Flame graph tools first.

perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workload.folded > workload.svg
google-chrome workload.svg

Generate flame graph from execution, sampled every 100us

Other ways to record data

perf record -a -e intel_pt// sleep 1

Capture whole system for 1 second

Use snapshot mode

This collects data, but does not continuously save it all to disk. When an event of interest happens a data dump of the current buffer can be triggered by sending a SIGUSR2 signal to the perf process.

perf record --snapshot -a -e intel_pt//
*execute workload*

*event happens*
kill -USR2 $PERF_PID

*end of recording*
kill $PERF_PID

Record kernel only, complete system

perf record -a -e intel_pt//k sleep 1

Record user space only, complete system

perf record -a -e intel_pt//u

Enable fine grained timing (needs Skylake/Goldmont, adds more overhead)

perf record -a -e intel_pt/cyc=1,cyc_thresh=2/ ...

echo $[100*1024*1024] > /proc/sys/kernel/perf_event_mlock_kb
perf record -m 512,100000 -e intel_pt// ...

Increase perf buffer to limit data loss

perf record -e intel_pt// --filter 'filter main @ /path/to/program' ...

Only record main function in program

perf record -e intel_pt// -a --filter 'filter sys_write' program

Filter kernel code (needs 4.11+ kernel)

perf record -e intel_pt// -a --filter 'start func1 @ program' --filter 'stop func2 @ program' program

Start tracing in program at main and stop tracing at func2.

perf archive
rsync -r ~/.debug other-system:

Transfer data to a trace on another system. May also require using if decoding

Using gdb

Requires a new enough gdb built with libipt. For user space only.

gdb program
record btrace pt

record instruction-history /m # show instructions
record function-call-history # show functions executed
reverse-stepi # step backwards in time

For more information on gdb pt see the gdb documentation


The perf PT documentation

Reference for –itrace option (from perf documentation)

i synthesize "instructions" events
b synthesize "branches" events
x synthesize "transactions" events
c synthesize branches events (calls only)
r synthesize branches events (returns only)
e synthesize tracing error events
d create a debug log
g synthesize a call chain (use with i or x)
l synthesize last branch entries (use with i or x)
s skip initial number of events

Reference for –filter option (from perf documentation)

A hardware trace PMU advertises its ability to accept a number of
address filters by specifying a non-zero value in
/sys/bus/event_source/devices/<pmu>/nr_addr_filters.

Address filters have the format:

filter|start|stop|tracestop <start>[/<size>][@<file name>]

- 'filter': defines a region that will be traced.
- 'start': defines an address at which tracing will begin.
- 'stop': defines an address at which tracing will stop.
- 'tracestop': defines a region in which tracing will stop.

<file name> is the name of the object file, <start> is the offset to the
code to trace in that file, and <size> is the size of the region to
trace. 'start' and 'stop' filters need not specify a <size>.

If no object file is specified then the kernel is assumed, in which case
the start address must be a current kernel memory address.

<start> can also be specified by providing the name of a symbol. If the
symbol name is not unique, it can be disambiguated by inserting #n where
'n' selects the n'th symbol in address order. Alternately #0, #g or #G
select only a global symbol. <size> can also be specified by providing
the name of a symbol, in which case the size is calculated to the end
of that symbol. For 'filter' and 'tracestop' filters, if <size> is
omitted and <start> is a symbol, then the size is calculated to the end
of that symbol.

If <size> is omitted and <start> is '*', then the start and size will
be calculated from the first and last symbols, i.e. to trace the whole
file.

If symbol names (or '*') are provided, they must be surrounded by white
space.

The filter passed to the kernel is not necessarily the same as entered.
To see the filter that is passed, use the -v option.

The kernel may not be able to configure a trace region if it is not
within a single mapping. MMAP events (or /proc/<pid>/maps) can be
examined to determine if that is a possibility.

Multiple filters can be separated with space or comma.

v2: Fix some typos/broken links

April 07, 2017 08:55 PM

April 05, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/04


Linus Torvalds announces Linux 4.11-rc5, Donald Drumpf drains the maintainer swamp in April, Intel FPGA Device Drivers, FPU state cacheing, /dev/mem access crashing machines, and assorted ongoing development.

Linus Torvalds announced Linux 4.11-rc5. In his announcement mail, Linus notes that “things have definitely started to calm down, let’s hope it stays this way and it wasn’t just a fluke this week”. He calls out the oddity that “half the arch updates are to parisc” due to parisc user copy fixes.

It’s worth noting that rc5 includes a fix for virtio_pci which removes an “out of bounds access for msix_names” (the “name strings for interrupts” provided in the virtio_pci_device structure). According to Jason Wang (Red Hat), “Fedora has received multiple reports of crashes when running 4.11 as a guest” (in fact, your author has seen this one too). Quoting Jason, “The crashes are not always consistent but they are generally some flavor of oops or GPF [General Protection Fault – an Intel x86 term referring to the general case of an access violation into memory by an offending instruction; various other ISAs – Instruction Set Architectures – have equivalents] in virtio related code. Multiple people have done bisections (Thank you Thorsten Leemhuis and Richard W.M. Jones)”. An example rediscovery of this issue came from a Mellanox engineer who reported that their test and regression VMs were crashing occasionally with 4.11 kernels.


Sebastian Andrzej Siewior announced preempt-rt Linux version 4.9.20-rt16. This includes a “Re-write of the R/W semaphores code. In RT we did not allow multiple readers because a writer blocking on the semaphore would have [to] deal with all the readers in terms of priority or budget inheritance [by which he is referring to the Priority Inheritance or “PI” feature common to “real time” kernels]. It’s obvious that the single reader restriction has severe performance problems for situations with heavy reader contention.” He notes that CPU hotplug got “better but can deadlock”.

Greg Kroah-Hartman posted Linux stable kernels 4.4.59, 4.9.20, and 4.10.8.

Draining the Swamp (in April)

Donald Drumpf posted “MAINTAINERS: Drain the swamp”, an inspired patch aiming to finally address the problem of having “a small group of elites listed in the corrupt MAINTAINERS file” who, “For too long”, have “reaped the rewards of maintainership”. He notes that over the past year the world has seen a great Linux Exit (“Lexit”) movement, in which “People all over the Internet have come together and demanded that power be restored to the developers”, creating “a historic fork based on Linux 2.4, back to a better time, before Linux was controlled by corporate interests”. He notes that the “FAKE NEWS site said it wouldn’t happen, but we knew better”.

Donald says that all of the groundwork laid over the past year was just an “important first step”. And that “now, we are taking back what’s rightfully ours. We are transferring power from “Lyin’ Linus” and giving it back to you, the people. With the below patch, the job-killing MAINTAINERS file is finally being ROLLED BACK.” He also notes his intention to return “LAW and ORDER” to the Linux kernel repository by building a wall around and “THE LINUX FOUNDATION IS GOING TO PAY FOR IT”. Additional changes will include the repeal and replacement of the “bloated merge window”, the introduction of a distribution import tax, and other key innovations that will serve to improve the world and to MAKE LINUX GREAT AGAIN!

Everyone around the world immediately and enthusiastically leaped upon this inspired and life altering patch, which was of course perfect from the moment of its inception. It was then immediately merged without so much as a dissenting voice (or any review). The private email servers used to host Linus’s deleted patch emails were investigated and a special administrator appointed to investigate the investigators.

Intel FPGA Device Drivers

Wu Hao (Intel) posted a sixteen part patch series entitled “Intel FPGA Drivers”, which “provides interfaces for userspace applications to configure, enumerate, open, and access FPGA [Field Programmable Gate Arrays, flexible logic fabrics containing millions of gates that can be connected programmatically by bitstreams describing the intended configuration] accelerators on platforms equipped with Intel(R) FPGA solutions and enables system level management functions such as FPGA partial reconfiguration [the dynamic updating of partial regions of the FPGA fabric with new logic], power management, and virtualization”. This support differs from the existing in-kernel fpga-mgr from Alan Tull in that it seems to relate to the so-called Xeon-FPGA hybrid designs that Intel has presented on in various forums.

The first patch (01/16) provides a lengthy summary of the proposed design in the form of documentation added to the kernel’s Documentation directory, specifically in the file Documentation/fpga/intel-fpga.txt. It notes that “From the OS’s point of view, the FPGA hardware appears as a regular PCIe device. The FPGA device memory is organized using a predefined structure (Device Feature List). Features supported by the particular FPGA device are exposed through these data structures”. An FME (FPGA Management Engine) is provided which “performs power and thermal management, error reporting, reconfiguration, performance reporting, and other infrastructure functions. Each FPGA has one FME, which is always accessed through the physical function (PF)”. The FPGA also provides a series of Virtual Functions that can be individually mapped into virtual machines using SR-IOV.

This design allows a CPU attached using PCIe to communicate with various Accelerated Function Units (AFUs) contained within the FPGA, which can be individually assigned into VMs or used in aggregate by the host CPU. One presumes that a series of userspace management utilities will follow this posting. It’s actually quite nice to see how they implemented the discovery of individual AFU features, since this is very close to something a certain author has proposed for use elsewhere for similar purposes. It’s always nicely validating to see different groups having similar thoughts.

Copy Offload with Peer-to-Peer PCI Memory

Logan Gunthorpe posted an RFC (Request for Comments) patch series entitled “Copy Offload with Peer-to-Peer PCI Memory” which relates to work discussed at the recent LSF/MM (Linux Storage Filesystem and Memory Management) conference in Cambridge MA (side note: I did find some of you haha!). To quote Logan, “The concept here is to use memory that’s exposed on a PCI BAR [Base Address Register – a configuration register that tells the device where in the physical memory map of a system to place memory owned by the device, under the control of the Operating System or the platform firmware, or both] as data buffers in the NVMe target code such that data can be transferred from an RDMA NIC to the special memory and then directly to an NVMe device avoiding system memory entirely”. He notes a number of positives from this, including better QoS (Quality of Service) and a reduced need for (still relatively precious, even in 2017) PCIe lanes from the CPU: peer-to-peer devices sitting below a PCIe switch downstream of the Root Complex can talk to one another directly, without the intervening hop up through the Root Complex and into system memory via the CPU. As a consequence, Logan has focused his work on “cases where the NIC, NVMe devices and memory are all behind the same PCI switch”.

To facilitate this new feature, Logan has a second patch in the series, entitled “Introduce Peer-to-Peer memory (p2mem) device”, which supports partitioning and management of memory used in direct peer-to-peer transfers between two PCIe devices (endpoints, or “cards”) with a BAR that “points to regular memory”. As Logan notes, “Depending on hardware, this may reduce the bandwidth of the transfer but could significantly reduce pressure on system memory” (again by not hopping up through the PCIe topology). In his patch, Logan had also noted that “older PCI root complexes” might have problems with peer-to-peer memory operations, so he had decided to limit the feature to devices behind the same PCIe switch. This led to a back and forth with Sinan Kaya, who asked (rhetorically) “What is so special about being connected to the same switch?”. Sinan noted that there are plenty of ways in Linux to handle blacklisting known older bad hardware and platforms, such as requiring that the DMI/SMBIOS-provided BIOS date of manufacture of the system be greater than a certain date in combination with all devices exposing the p2p capability, plus a fallback blacklist. Ultimately, however, it was discovered that the peer-to-peer feature isn’t enabled by default, leading Sinan to suggest “Push the decision all the way to the user. Let them decide whether they want this feature to work on a root port connected port or under the switch”.

FPU state caching

Kees Cook (Google) posted a patch entitled “x86/fpu: move FPU state into separate cache”, which aims to remove the dependency within the Intel x86 Architecture port upon an internal kernel config setting known as ARCH_WANTS_DYNAMIC_TASK_STRUCT. This configuration setting (set by each architecture’s code automatically, not by the person building the kernel in the configuration file) says that the true size of the task_struct cannot be known in advance on Intel x86 Architecture because it contains a variable sized array (VSA) within the thread_struct that is at the end of the task_struct, used for context save/restore of the CPU’s FPU (Floating Point Unit) co-processor. Indeed, the kernel definition of task_struct (see include/linux/sched.h) includes a scary and ominous warning: “on x86, ‘thread_struct’ contains a variable-sized structure. It *MUST* be at the end of ‘task_struct'”. Which is fairly explicit.

The reason to remove the dependency upon dynamic task_struct sizing is because this “support[s] future structure layout randomization of the task_struct”, which requires that “none of the structure fields are allowed to have a specific position or a dynamic size”. The idea is to leverage a GCC (GNU Compiler Collection) plugin that will change the ordering of C structure members (such as task_struct) randomly at compile time, in order to reduce the ability for an attacker to guess the layout of the structure (highly useful in various exploits). In the case of distribution kernels of course, an attacker has access to the same kernel binaries that may be running on a system, and could use those to calculate likely structure layout for use in a compromise. But the same is not true of the big hyperscale service providers like Google and Facebook. They don’t have to publish the binaries for their own internal kernels running on their public infrastructure servers.

This patch led to a back and forth with Linus, who was concerned about why the task_struct would need changing in order to prevent the GCC struct layout randomization plugin from blowing up. In particular, he was worried that it sounded like the plugin was moving variable sized arrays away from the last member of structures (which C does not permit). Kees, Linus, and Andy Lutomirski went through the fact that, yes, the plugin can handle trailing VSAs and so forth. In the end, it was suggested that Kees look at making task_struct “be something that contains a fixed beginning and end, and just have an unnamed randomized part in the middle”. Kees said “That could work. I’ll play around with it”.

/dev/mem access crashing machines

Dave Jones (x86info maintainer) had a back and forth with Kees Cook, Linus, and Tommi Rantala about the latter’s discovery that running Dave’s “x86info” tool crashed his machine with an illegal memory access. It turns out that x86info reads /dev/mem (a requirement to get the data it needs), which is a special file representing the contents of physical memory. Normally, access to this file is restricted to the root user, and even then only to certain parts of memory, as determined by STRICT_DEVMEM. The latter is intended only to allow reads of “reserved RAM” (normal system memory reserved for specific device purposes, not that allocated for use by programs). But in Tommi’s case, he was running a kernel without STRICT_DEVMEM set, on a system booting with EFI, for which the legacy “EBDA” (Extended BIOS Data Area) that normally lives at a fixed location in the sub-1MB memory window on x86 was not provided by the platform. This meant that the x86info tool was trying to read memory at a legal address that wasn’t reserved in the EFI System Table (memory map), and was mapped for use elsewhere.

All of this led Linus to point out that simply doing a “dd” read on the first MB of the memory on the offending system would be enough to crash it. He noted that (on x86 systems) the kernel allows access to the sub-1MB region of physical memory unconditionally (regardless of the setting of the kernel STRICT_DEVMEM option) because of the wealth of platform data that lives there and which is expected to be read by various tools. He proposed effectively changing the logic for this region such that memory not explicitly marked as reserved would simply “just read zero” rather than trying to read random kernel data in the case that the memory is used for other purposes.

This author certainly welcomes a day when /dev/mem dies a death. We’ve gone to great lengths on 64-bit ARM systems to kill it, partly because it is so legacy, and partly because there are two possible ways we might trap a bad access – one as in this case (a synchronous exception), but another in which the access might manifest later as a System Error, due to the errant access hitting in the memory controller or other SoC logic.

Ongoing Development

Steve Longerbeam posted version 6 of a patch series entitled “i.MX Media Driver”, which implements a V4L2 (Video for Linux 2) driver for i.MX6.

David Gstir (on behalf of Daniel Walter) posted “fscrypt: Add support for AES-128-CBC” which “adds support for using AES-128-CBC for file contents and AES-128-CBC-CTS for file name encryption. To mitigate watermarking attacks, IVs [Initialization Vectors] are generated using the ESSIV algorithm.”

Djalal Harouni posted an RFC (Request for Comments) patch entitled “proc: support multiple separate proc instances per pidnamespace”. In his patch, Djalal notes that “Historically procfs was tied to pid namespaces, and mount options were propagated to all other procfs instances in the same pid namespace. This solved several use cases in that time. However today we face new problems, there are multiple container implementations there, some of them want to hide pid entries, others want to hide non-pid entries, others want to have sysctlfs, others want to share pid namespace with private procfs mounts. All these with current implementation won’t work since all options will be propagated to all procfs mounts. This series allow to have new instances of procfs per pid namespace where each instance can have its own mount options”.

Zhou Chengming (Huawei) posted “reduce the time of finding symbols for module” which aims to reduce the time taken for a Kernel Live Patch (klp) module to be loaded on a system in which the module uses many static local variables. The patch replaces the use of kallsyms_on_each_symbol with a variant that limits the search to the symbols needed by the module (rather than every symbol in the kernel). As Jessica Yu notes, “it means that you have a lot of relocation records which reference your out-of-tree module. Then for each such entry klp_resolve_symbol() is called and then klp_find_object_symbol() to actually resolve it. So if you have 20k entries, you walk through vmlinux kallsyms table 20k times…But if there were 20k modules loaded, the problem would still be there”. She would like to see a more generic fix, but was also interested to see that the Huawei report referenced live patching support for AArch64 (64-bit ARM Architecture), which isn’t in upstream. She had a number of questions about whether this code was public, and in what form, to which links to works in progress from several years ago were posted. It appears that Huawei have been maintaining an internal version of these patches in their kernels ever since.

Ying Huang (Intel) posted version 7 of “THP swap: Delay splitting THP during swapping out”, which as we previously noted aims to swap out actual whole “huge” (within certain limits) pages rather than splitting them down to the smallest atom of size supported by the architecture during swap. There was a specific request to various maintainers that they review the patch.

Andi Kleen posted a patch removing the printing of MCEs to the kernel log when the “mcelog” daemon is running (and hopefully logging these events).

Laura Abbott posted a RESEND of “config: Add Fedora config fragments”, which does what it says on the tin. Quoting her mail, “Fedora is a popular distribution for people who like to build their own kernels. To make this easier, add a set of reasonable common config options for Fedora”. She adds files in kernel/configs for “fedora-core.config”, “fedora-fs.config” and “fedora-networking.config” which should prove very useful next time someone complains at me that “building kernels for Red Hat distributions is hard”.

Eric Biggers posted “KEYS: encrypted: avoid encrypting/decrypting stack buffers”, which notes that “Since [Linux] v4.9, the crypto API cannot (normally) be used to encrypt/decrypt stack buffers because the stack may be virtually mapped. Fix this for the padding buffers in encrypted-keys by using ZERO_PAGE for the encryption padding and by allocating a temporary heap buffer for the decryption padding”. Eric is referring to the virtually mapped stack support introduced by Andy Lutomirski, which has the side effect of incidentally flagging up various previous misuses of stacks.

Mark Rutland posted an RFC (Request For Comments) patch series entitled “ARMv8.3 pointer authentication userspace support”. ARMv8.3 includes a new architectural extension that “adds functionality to detect modification of pointer values, mitigating certain classes of attack such as stack smashing, and making return oriented programming [ROP] attacks harder”. [aside: If you’re bored, and want some really interesting (well, I think so) bedtime reading, and you haven’t already read all about ROP, you really should do so]. Continuing to quote Mark, the “extension introduces the concept of a pointer authentication code (PAC), which is stored in some upper bits of pointers. Each PAC is derived from the original pointer, another 64-bit value (e.g. the stack pointer), and a secret 128-bit key”. The extension includes new instructions to “insert a PAC into a pointer”, to “strip a PAC from a pointer”, and to “authenticate and strip a PAC from a pointer” (the latter poisoning the pointer and causing a later fault if the authentication fails – allowing for detection of malicious intent).

Mark’s patch makes for great reading and summarizes this feature well. It notes that it has various counterparts in userspace to add ELF (Executable and Linking Format, the executable container used on modern Linux and Unix systems) notes sections to programs to provide the necessary annotations and presumably other data necessary to implement pointer authentication in application programs. It will be great to see those posted too.

Joerg Roedel followed up to a posting from Samuel Sieb entitled “AMD IOMMU causing filesystem corruption” to note that it has recently been discovered (and was documented in another thread this past week entitled “PCI: Blacklist AMD Stoney GPU devices for ATS”) that the AMD “Stoney” platform features a GPU for which PCI-ATS is known to be broken. ATS (Address Translation Services) is the mechanism by which PCIe endpoint devices (such as plugin adapter cards, including AMD GPUs) may obtain virtual to physical address translations for use in inbound DMA operations initiated by a PCIe device into a virtual machine’s (VM’s) memory (the VM talks the other way through the CPU MMU).

In ATS, the device utilizes an Address Translation Cache (ATC) which is essentially a TLB (Translation Lookaside Buffer) but not called that because of handwavy reasons intended not to confuse CPU and non-CPU TLBs. When a device sitting behind an IOMMU needs to perform an address translation, it asks a Translation Agent (TA) typically contained within the PCIe Root Complex to which it is ultimately attached. In the case of AMD’s Stoney Platform, this blows up under address invalidation load: “the GPU does not reply to invalidations anymore, causing Completion-wait loop timeouts on the AMD IOMMU driver side”. Somehow (but this isn’t clear) this is suspected as the possible cause of the filesystem corruption seen by Samuel, who is waiting to rebuild a system that ate its disk testing this.

Calvin Owens (Facebook) posted “printk: Introduce per-console filtering of messages by loglevel”, which notes that “Not all consoles are created equal”. It essentially allows the user to set a different loglevel for consoles that might each be capable of very different performance. For example, a serial console might be severely limited in its baud rate (115,200 in many cases, but perhaps as low as 9,600 or lower is still commonplace in 2017), while a graphics console might be capable of much higher. Calvin mentions netconsole as the preferred (higher speed) console that Facebook use to “monitor our fleet” but that “we still have serial consoles attached on each host for live debugging, and the latter has caused problems”. He doesn’t specifically mention USB debug consoles, or the EFI console, but one assumes that listeners are possibly aware of the many console types.

Christopher Bostic (IBM) posted version 5 of a patch series entitled “FSI device driver implementation”. FSI stands for “Flexible Support Interface”, a “high fan out [a term referring to splitting of digital signals into many additional outputs] serial bus consisting of a clock and a serial data line capable of running at speeds up to 166MHz”. His patches add core support to the Linux bus and device models (including “probing and discovery of slaves and slave engines”), along with additional handling for CFAM (Common Field Replaceable Unit Access Macro) – an ASIC (chip) “residing in any device requiring FSI communications” that provides these various “engines” – and an FSI engine driver that manages devices on the FSI bus.

Finally, Adam Borowski posted “n_tty: don’t mangle tty codes in OLCUC mode” which aims to correct a bug that is “reproducible as of Linux 0.11” and present all the way back to 0.01. OLCUC is not part of POSIX, but this termios structure flag tells Linux to map lowercase characters to uppercase ones. The posting cites an obvious desire by Linus to support “Great Runes” (archaic Operating Systems in which everything was uppercase), to which Linus (obviously in jest, and in keeping with the April 1 date) asked Adam why he “didn’t make this the default state of a tty?”.

April 05, 2017 07:31 AM

April 03, 2017

Arnaldo Carvalho de Melo: Looking for users of new syscalls

Recently Linux got a new syscall to get extended information about files, a super ‘stat’, if you will, read more about it at LWN.

So I grabbed the headers with the definitions for the statx arguments to tools/include/ so that ‘perf trace’ can use them to beautify, i.e. to make the arguments appear as a bitmap of strings, as described in this cset.

To test it I used one of things ‘perf trace’ can do and that ‘strace’ does not: system wide stracing. To look if any of the programs running on my machine was using the new syscall I simply did, using strace-like syntax:

# perf trace -e statx

After a few minutes, nothing… So this Fedora 25 system isn’t using it in any of the utilities I used in those moments; not surprising, as glibc still needs statx wired up.

So I found out about samples/statx/test-statx.c, and after installing the kernel headers and pointing the compiler to where those files were installed, I restarted that system wide ‘perf trace’ session and ran the test program, much better:

# trace -e statx
16612.967 ( 0.028 ms): statx/562 statx(dfd: CWD, filename: /etc/passwd, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffef195d660) = 0
33064.447 ( 0.011 ms): statx/569 statx(dfd: CWD, filename: /tmp/statx, flags: SYMLINK_NOFOLLOW|STATX_FORCE_SYNC, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffc5484c790) = 0
36050.891 ( 0.023 ms): statx/576 statx(dfd: CWD, filename: /etc/motd, flags: SYMLINK_NOFOLLOW, mask: BTIME, buffer: 0x7ffeb18b66e0) = 0
38039.889 ( 0.023 ms): statx/584 statx(dfd: CWD, filename: /home/acme/.bashrc, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7fff1db0ea90) = 0

Ah, to get filenames fetched we need to put in place a special probe, one that will collect filenames passed to the kernel right after the kernel copies them from user memory:

[root@jouet ~]# perf probe 'vfs_getname=getname_flags:72 pathname=result->name:string'
Added new event:
probe:vfs_getname    (on getname_flags:72 with pathname=result->name:string)

You can now use it in all perf tools, such as:

perf record -e probe:vfs_getname -aR sleep 1

[root@jouet ~]# trace -e open touch /etc/passwd
0.024 ( 0.011 ms): touch/649 open(filename: /etc/, flags: CLOEXEC) = 3
0.056 ( 0.018 ms): touch/649 open(filename: /lib64/, flags: CLOEXEC) = 3
0.481 ( 0.014 ms): touch/649 open(filename: /usr/lib/locale/locale-archive, flags: CLOEXEC) = 3
0.553 ( 0.012 ms): touch/6649 open(filename: /etc/passwd, flags: CREAT|NOCTTY|NONBLOCK|WRONLY, mode: IRUGO|IWUGO) = 3
[root@jouet ~]#

Make sure you have CONFIG_DEBUG_INFO set in your kernel build or that the matching debuginfo packages are installed. This needs to be done just once per boot, ‘perf trace’ will find it in place and use it.

Lastly, if ‘perf’ is hardlinked to ‘trace’, then the latter will be the same as ‘perf trace’.

April 03, 2017 03:23 PM

March 31, 2017

Daniel Vetter: Foundation Election - Vote Now!

It is election season again for the Foundation. Besides electing half of the board seats we again have some paperwork changes - after updating the bylaws last year we realized that the membership agreement hasn’t been changed in over 10 years. It talks about the previous-previous legal org, has old addresses and a bunch of other things that just don’t fit anymore. In the board we’ve updated it to reflect our latest bylaws (thanks a lot to Rob Clark for doing the editing), with no material changes intended.

Like bylaw changes any change to the membership agreement needs a qualified supermajority of all members, every vote counts and not voting essentially means voting no.

To vote, please go to, log in and hit the “Cast” button on the listed ballot.

Voting closes by 23:59 UTC on 11 April 2017, but please don’t cut it short, it’s a computer that decides when it’s over …

March 31, 2017 12:00 AM

March 28, 2017

Arnaldo Carvalho de Melo: Getting backtraces from arbitrary places

Needs debuginfo, either in a package-debuginfo rpm or equivalent or by building with ‘cc -g’:

[root@jouet ~]# perf probe -L icmp_rcv:52 | head -15

  52  	if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) {
      		 * RFC 1122: An ICMP_ECHO to broadcast MAY be
      		 *  silently ignored (we let user decide with a sysctl).
      		 * RFC 1122: An ICMP_TIMESTAMP MAY be silently
      		 *  discarded if to broadcast/multicast.
  59  		if ((icmph->type == ICMP_ECHO ||
  60  		     icmph->type == ICMP_TIMESTAMP) &&
      		    net->ipv4.sysctl_icmp_echo_ignore_broadcasts) {
      			goto error;
      		if (icmph->type != ICMP_ECHO &&
      		    icmph->type != ICMP_TIMESTAMP &&
[root@jouet ~]# perf probe icmp_rcv:59
Added new event:
  probe:icmp_rcv       (on icmp_rcv:59)

You can now use it in all perf tools, such as:

	perf record -e probe:icmp_rcv -aR sleep 1

[root@jouet ~]# perf trace --no-syscalls --event probe:icmp_rcv/max-stack=5/
     0.000 probe:icmp_rcv:(ffffffffb47b7f9b))
                          icmp_rcv ([kernel.kallsyms])
                          ip_local_deliver_finish ([kernel.kallsyms])
                          ip_local_deliver ([kernel.kallsyms])
                          ip_rcv_finish ([kernel.kallsyms])
                          ip_rcv ([kernel.kallsyms])
  1025.876 probe:icmp_rcv:(ffffffffb47b7f9b))
                          icmp_rcv ([kernel.kallsyms])
                          ip_local_deliver_finish ([kernel.kallsyms])
                          ip_local_deliver ([kernel.kallsyms])
                          ip_rcv_finish ([kernel.kallsyms])
                          ip_rcv ([kernel.kallsyms])
^C[root@jouet ~]#

Humm, lots of redundant info, guess we could do away with those ([kernel.kallsyms]) in all the callchain lines…

March 28, 2017 08:23 PM

Kernel Podcast: Linux Kernel Podcast for 2017/03/28


Author’s Note: Apologies to Ulrich Drepper for incorrectly attributing his paper “Futexes are Tricky” to Rusty. Oops. In any case, everyone should probably read Uli’s paper:

In this week’s edition: Linus Torvalds announces Linux 4.11-rc4, early debug with USB3 earlycon, upcoming support for USB-C in 4.12, and ongoing development including various work on boot time speed ups, logging, futexes, and IOMMUs.

Linus Torvalds announced Linux 4.11-rc4, noting that “So last week, I said that I was hoping that rc3 was the point where we’d start to shrink the rc’s, and yes, rc4 is smaller than rc3. By a tiny tiny smidgen. It does touch a few more files, but it has a couple fewer commits, and fewer lines changed overall. But on the whole the two are almost identical in size. Which isn’t actually all that bad, considering that rc4 has both a networking merge and the usual driver suspects from Greg [Kroah Hartman], _and_ some drm fixes”.


Junio C Hamano announced Git v2.12.2.

Greg Kroah-Hartman announced Linux 4.4.57, 4.9.18, and 4.10.6.

Sebastian Andrzej Siewior announced Linux v4.9.18-rt14, which includes a “larger rework of the futex / rtmutex code. In v4.8-rt1 we added a workaround so we don’t de-boost too early in the unlock path. A small window remained in which the locking thread could de-boost the unlocking thread. This rework by Peter Zijlstra fixes the issue.”

Upcoming features

Greg K-H finally accepted the latest “USB Type-C Connector class” patch series from Heikki Krogerus (Intel). This patch series aims to provide various control over the capability for USB-C to be used both as a power source and as a delivery interface to supply to power to external devices (enabling the oft-cited use case of selecting between charging your cellphone/mobile device or using said device to charge your laptop). This will land a new generic management framework exposed to userspace in Linux 4.12, including a driver for “Intel Whiskey Cove PMIC [Power Management IC] USB Type-C PHY”. Your author looks forward to playing. Greg thanked Heikki for the 18(!) iterations this patch went through prior to being merged – not quite a record, but a lot of effort!

Kishon Vijay Abraham (TI) posted “PCI: Support for configurable PCI endpoint”, which provides generic infrastructure to handle PCI endpoint devices (Linux operating as a PCI endpoint “device”), such as those based upon IP blocks from DesignWare (DW). He’s only tested the design on his “dra7xx” boards and requires “the help of others to test the platforms they have access to”. The driver adds a configfs interface including an entry to which userspace should write “start” to bring up an endpoint device. He adds himself as the maintainer for this new kernel feature.

Rob Herring posted “dtc updates for 4.12”, which “syncs dtc [Device Tree Compiler] with current mainline [dtc]”. His “primary motivation is to pull in the new checks [he’s] worked on. This gives lots of new warnings which are turned off by default”.

60Hz vs 59.94Hz (Handling of reduced FPS in V4L2)

Jose Abreu (Synopsys) posted a patch series entitled “Handling of reduced FPS in V4L2”, which aims to provide a mechanism for the kernel to measure (in a generic way) the actual Frames Per Second for a Video For Linux (V4L) video device. The patches rely upon hardware drivers being able to signal that they can distinguish “between regular fps and 1000/1001 fps”.

This took your author on a journey of discovery. It turns out that (most of the time), when a video device claims to be “60fps” it’s actually running at 59.94fps, but not always. The latter frame rate is an artifact of the NTSC (National Television System Committee) color television standard in the United States. Early televisions used the 60Hz frequency (which is nationally synchronized, at least in each of the traditional three independent grids operated in the US, which are now interconnected using HVDC interconnects but presumably are still not directly in phase with one another – feel free to educate me!) of the AC supply to lock individual frame scan times. When color TV was introduced, a small frequency offset was used to make room in each frame for a color sub-carrier signal while retaining backward compatibility for black and white transmissions. This is where frequencies of 29.97 and 59.94 frames per second originate. In case you always wondered.

Jose and Hans Verkuil had a back and forth discussion about various real-world measured pixelclock frequencies that they had obtained using a variety of equipment (signal analyzers, a certified HDMI analyzer, and the Synopsys IP supported by the patch series under discussion) to see whether it was in reality possible to reliably distinguish frame rates.

Early Debug with USB3 earlycon (early printk)

Lu Baolu (Intel) posted version 8 of a patch series entitled “usb: early: add support for early printk through USB3 debug port”. Contemporary (especially x86) desktop and server class systems don’t expose low level hardware debug interfaces, such as JTAG debug chains, which are used during chip bringup and early firmware and OS enablement activities, and which allow developers with suitable tools to directly control and interrogate hardware state. Or just dump out the kernel ringbuffer (the dmesg “log”).

Actually, all such systems do have low level debug capabilities, they’re just fused out during the production process (by blowing efuses embedded into the processor) and either not exposed on the external pins of the chip at all, or are simply disabled in the chip logic. Probably most of these can be re-enabled by writing the magic cryptographically signed hashes to undocumented memory regions in on-chip coprocessor spaces. In any case, vendors such as Intel aren’t going to tell you how.

Yet it is often desirable to have certain low level debug functionality for systems that are deployed into field settings, even if only to reliably dump out the kernel console log DEBUG level messages somewhere. Traditionally this was done using PC serial ports, but most desktop (and all laptop) systems no longer ship with those exposed on the rear panel. If you’re lucky you’ll see an IDC10 connector on your motherboard to which you can attach a DB9 breakout cable. Consumers and end users have no idea what any of this means, and probably shouldn’t be encouraged to open the machine up and poke at things. Yet even in the case that IDC10 connectors exist and can be hooked up, this is still a cumbersome interface that cannot be relied upon today.

Microsoft (who are often criticized but actually are full of many good ideas and usually help to drive industry standardization for the broader market) instituted sanity years ago by working with the USB Implementers Forum (USB-IF) to ensure that the USB3 specification included a standardized feature known as xHCI Debug Capability (DbC), an “optional but standalone functionality by an xHCI host controller”. This suited Windows, which traditionally requires two UARTs (serial ports) for kernel development, and uses one of them for simple direct control of the running kernel without going through complex driver frameworks. Debug port (which also existed on USB2) traditionally required a special external partner hardware dongle, but is cleaner in USB3, requiring only a USB A-to-A crossover cable connecting the USB3.0 data lines.

As Lu Baolu notes in his patch, “With DbC hardware initialized, the system will present a debug device through the USB3 debug port (normally the first USB3 port)”. The patch series enables this as a high speed console log target on Linux, but it could be used for much more interesting purposes via KDB.

[Separately, but only really related to console drivers and not debugging, Thierry Escande posted “firmware: google memconsole” which adds support for importing the boot time BIOS memory based console into the kernel ringbuffer on Google Coreboot systems].

Ongoing Development

Pavel Tatashin (Oracle) posted “parallelized “struct page” zeroing”, which improves boot time performance significantly in the case that the “deferred struct page initialization” feature is enabled. In this case, zeroing out of the kernel’s vmemmap (Virtual Memory Map) is delayed until after the secondary CPU cores on a machine have been started. When this is done, those cores can be used to run zeroing threads that write to memory in parallel, taking one SPARC system’s boot time down from 97.89 seconds to 46.91. Pavel notes that the savings are considerable on x86 systems too.

Thomas Gleixner had a lengthy back and forth with Pavel (Pasha) Tatashin (Oracle) over the latter’s posting of “Early boot time stamps for x86”, which uses the TSC (Time Stamp Counter) on the Intel x86 Architecture. The goal is to log how long the machine actually took to boot, including firmware, rather than just how long Linux took to boot from the time it was started. Peter Zijlstra responded (to Pasha), “Lol, how cute. You assume TSC starts at 0 on reset” (alluding to the fact that firmware often does crazy things, playing with the TSC offset or directly writing to it). Thomas was unimpressed with Pavel’s posting of a v2 patch series, noting “Did you actually read my last reply on V1 of this? I made it clear that the way this is done, i.e. hacking it into the earliest boot stage is not going to happen…I don’t care about you wasting your time, but I very much care about my time”. He provided a further, more lengthy response, including various commentary on the best ways to handle feedback.

Peter Zijlstra posted version 6 of a patch series entitled “The arduous story of FUTEX_UNLOCK_PI”, in which he adds “Another installment of the futex patches that give you nightmares”. Futexes (Fast User-space Mutexes) are a mechanism provided by the Linux kernel which leverages shared memory to provide a low overhead mutex (mutual exclusion primitive) to userspace in the uncontended case (no two processes – tasks, within the kernel – are competing to acquire the same resource), with a “slow path” through the kernel in the case of contention. They are used by many userspace applications, including extensively in the C library (see the famous paper by Ulrich Drepper entitled “Futexes Are Tricky”). Peter is working on solving problems introduced by having to have Priority Inheritance (PI) aware futexes in Real Time kernels. These temporarily adjust the priority of the tasks holding mutexes in order to prevent Priority Inversion (see the Mars Pathfinder study papers), in which a low priority task holds a mutex that a high priority task wants to acquire. Peter’s patches “rework[] and document[] the locking” of existing code.
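
The fast path/slow path split can be sketched in userspace. The following is a hypothetical illustration (names mine, not Peter's code) of the classic three-state futex mutex from Drepper's paper: an uncontended lock or unlock is a single atomic instruction, and the futex() system call is only made when tasks actually collide:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// State: 0 = unlocked, 1 = locked, 2 = locked with possible waiters.
class FutexMutex {
    std::atomic<int> state{0};

    long futex(int op, int val) {
        // FUTEX_WAIT sleeps only if the futex word still equals 'val';
        // FUTEX_WAKE wakes up to 'val' sleepers. Kernel entered on contention only.
        return syscall(SYS_futex, reinterpret_cast<int *>(&state), op, val,
                       nullptr, nullptr, 0);
    }

public:
    void lock() {
        int c = 0;
        // Fast path: a single compare-and-swap in shared memory, no syscall.
        if (state.compare_exchange_strong(c, 1))
            return;
        // Slow path: advertise contention (state = 2), then sleep in the kernel.
        if (c != 2)
            c = state.exchange(2);
        while (c != 0) {
            futex(FUTEX_WAIT, 2);
            c = state.exchange(2);
        }
    }

    void unlock() {
        // Only pay for a FUTEX_WAKE syscall if someone may be sleeping.
        if (state.exchange(0) == 2)
            futex(FUTEX_WAKE, 1);
    }
};
```

This sketch is Linux-specific and deliberately omits the PI machinery the patch series is actually about; with priority inheritance the kernel additionally boosts the lock holder's priority while a higher-priority task waits.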

Separately, Waiman Long (Red Hat) posted version 6 of “futex: Introducing throughput-optimized (TP) futexes”, which “introduces a new futex implementation called throughput-optimized (TP) futexes. It is similar to PI futexes in its calling convention, but provides better throughput than the wait-wake (WW) futexes by encouraging lock stealing and optimistic spinning. The new TP futexes [c]an be used in implementing both userspace mutexes and rwlocks. The[y] provide[] better performance while simplifying the userspace locking implementation at the same time. The WW futexes are still needed to implement other synchronization primitives like conditional variables and semaphores that cannot be handled by the TP futexes”.

David Woodhouse posted “PCI resource mmap cleanup”, which aims to clean up the use of various kernel interfaces that provide “user visible” resource addresses through (legacy) proc and (contemporary) sysfs. The purpose of these interfaces is to provide information about regions of PCI address space memory that can be directly mapped by userspace applications such as those linked against the DPDK (Data Plane Development Kit) library. An example of his cleanup included “Only allow WC [Write Combining] mmap on prefetchable resources” for the /proc/bus/pci mmap interface, because this was already the case for the preferred sysfs interface. This led some to debate why the 64-bit ARM Architecture didn’t provide the legacy procfs interface (since there was a little confusion about the dependencies for DPDK), but it was ultimately re-concluded that it shouldn’t.

Tyler Baicar (Codeaurora) posted version 13 of a patch series entitled “Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64”, which aims to introduce support to the 64-bit ARM Architecture for logging of RAS events using the shared “GHES” (Generic Hardware Error Source) memory location “with the proper GHES structures to notify the OS of the error”. This dovetails nicely with platforms performing “firmware first” error handling, in which errors are trapped to secure firmware, which handles them first and subsequently informs the Operating System using this ACPI feature.

Shaohua Li (Facebook) posted a patch entitled “add an option to disable iommu force on” in the case of the (x86) Trusted Boot (TBOOT) feature being enabled. The reason cited was that under a certain 40GBit networking XDP (eXpress Data Path) test load there were high numbers of IOTLB (IO Translation Lookaside Buffer) misses “which kills the performance”. What he is referring to is the mechanism through which an IOMMU (which sits logically between a hardware device, such as a network card, and memory, often as part of an integrated PCI Root Complex) translates underlying memory accesses by the adapter card into real host memory transactions. These are cached by the IOMMU in small caches (known as IOTLBs) after it performs such translations using its “page tables” (similar to how a host CPU’s MMU – Memory Management Unit – performs host memory translations). Badly designed IOMMU implementations or poor utilization can result in large numbers of misses that lead users to disable the feature. Alas, without an IOMMU, there’s little protection during boot from rogue devices that maliciously want to trash host memory. Nobody has noted this in the RFC (Request For Comments) discussion yet.

Bodong Wang (Mellanox) posted a patch entitled “Add an option to probe VFs or not before enabling SR-IOV”, which aims to allow administrators to limit the probing of (PCIe) Virtual Functions (VFs) on adapters that will have those resources passed through to Virtual Machines (VMs) (using VFIO). This “can save host side resource usage by VF instances which would be eventually probed to VMs”. It adds a new sysfs interface to control this.

Viresh Kumar posted a patch entitled “cpufreq: Restore policy min/max limits on CPU online”. Apparently, existing code behavior was that “On CPU online the cpufreq core restores the previous governor [the in kernel logic that determines CPU frequency transitions based upon various metrics, such as saving energy, or prioritizing performance]…but it does not restore min/max limits at the same time”. The patch addresses this shortcoming.

Wanpeng Li posted a patch entitled “KVM: nVMX: Fix nested VPID vmx exec control” that aims to “hide and forbid” Virtual Processor IDentifiers in nested virtualization contexts where the hardware doesn’t support this. Apparently it was unconditionally being enabled (based upon real hardware realities of existing implementation) regardless of feature information (INVVPID) provided in the “vmx” capabilities.

Joerg Roedel posted a patch entitled “ACPI: Don’t create a platform_device for IOAPIC/IOxAPIC” since this was causing problems during hot remove (of CPUs). Rafael J. Wysocki noted that “it’s better to avoid using platform_device for hot-removable stuff” since it is “inherently fragile”.

Kees Cook (Google) posted a patch disabling hibernation support on 32-bit systems in the case that KASLR (Kernel Address Space Layout Randomization) was enabled at boot time, but allowing for “nokaslr” on the kernel command line to change this. Evgenii Shatokhin initially noted that “nokaslr” didn’t re-enable hibernation support correctly, but eventually determined that the ordering and placement of “nokaslr” on the command line was to blame, which led to Kees saying he would look into the command line parsing sequence and its interaction with other options, such as “resume=”.

Separately, Baoquan He (Red Hat) noted that with KASLR an implicit assumption that EFI_VA_START < EFI_VA_END existed, while “In fact [the] EFI [(Unified) Extensible Firmware Interface] region reserved for runtime services [these are callbacks into firmware from Linux] virtual mapping will be allocated using a top-down schema”. His patches addressed this problem, and being “RESEND”s, he was keen to see that they get taken up soon.

Also separately, Kees posted “syscalls: Restore address limit after a syscall” which “ensures a syscall does not return to user-mode with a kernel address limit. If that happened, a process can corrupt kernel-mode memory and elevate privileges”. He cites a bug it would have prevented.

Kan Liang (Intel) posted “measure SMI cost”. This patch series aims to leverage hardware counters to inform perf of the amount of time spent (on Intel x86 Architecture systems) inside System Management Mode (SMM). SMIs (System Management Interrupts) are events that are generated (usually) by the Intel Platform Controller Hub and similar chipset logic, which can be programmed by firmware to generate regular interrupts that target a secure execution context known as SMM (System Management Mode). It is here that firmware temporarily steals CPU cycles from the Operating System (without its knowledge) to perform such things as CPU fan control, errata handling, and wholesale VGA graphics emulation in BMC “value add” from OEMs. Over the years, the amount of gunk hidden in SMIs has grown to the point that this author once wrote a latency detector (hwlat) and has a patent on SMI detection without using such dedicated counters, due to the impact of SMIs on system performance. SMM is necessary on x86 due to its lack of a standardized on-SoC platform management controller, but so is accounting for its bloat.

Finally, yes, Kirill A. Shutemov snuck in another iteration of his Intel “5-level paging support” in preparation for a 4.12 merge.


March 28, 2017 05:23 PM

March 27, 2017

Matthew Garrett: Buying a Utah teapot

The Utah teapot was one of the early 3D reference objects. It's canonically a Melitta but hasn't been part of their range in a long time, so I'd been watching Ebay in the hope of one turning up. Until last week, when I discovered that a company called Friesland had apparently bought a chunk of Melitta's range some years ago and sell the original teapot[1]. I've just ordered one, and am utterly unreasonably excited about this.

Update: Friesland have apparently always produced the Utah teapot, but were part of the Melitta group for some time - they didn't buy the range from Melitta.

[1] They have them in 0.35, 0.85 and 1.4 litre sizes. I believe (based on the measurements here) that the 1.4 litre one matches the Utah teapot.

comment count unavailable comments

March 27, 2017 11:45 PM

Pete Zaitcev: It was surprising

So here I was watching ACCA at Crunchyroll, when a commercial comes up... of VMware OpenStack.

I still remember times in 2010 when VMware was going to have their own cloud (possibly called VxCloud), with blackjack and hookers, as they say, or at any rate much better than OpenStack. Looks like things have changed.

Also, what's up with this targeting? How did they link my account at Crunchyroll with OpenStack?

{Update: Thanks, Andreas!}

March 27, 2017 06:33 PM

March 26, 2017

Vegard Nossum: Writing a reverb filter from first principles

WARNING/DISCLAIMER: Audio programming always carries the risk of damaging your speakers and/or your ears if you make a mistake. Therefore, remember to always turn down the volume completely before and after testing your program. And whatever you do, don't use headphones or earphones. I take no responsibility for damage that may occur as a result of this blog post!

Have you ever wondered how a reverb filter works? I have... and here's what I came up with.

Reverb is the sound effect you commonly get when you make sound inside a room or building, as opposed to when you are outdoors. The stairwell in my old apartment building had an excellent reverb. Most live musicians hate reverb because it muddles the sound they're trying to create and can even throw them off while playing. On the other hand, reverb is very often used (and overused) on studio vocals because it also has the effect of smoothing out rough edges and imperfections in a recording.

We typically distinguish reverb from echo in that an echo is a single delayed "replay" of the original sound you made. The delay is also typically rather large (think yelling into a distant hill- or mountainside and hearing your HEY! come back a second or more later). In more detail, the two things that distinguish reverb from an echo are:

  1. The reverb inside a room or a hall has a much shorter delay than an echo. The speed of sound is roughly 340 meters/second, so if you're in the middle of a room that is 20 meters by 20 meters, the sound will come back to you (from one wall) after (20 / 2) / 340 = ~0.029 seconds, which is such a short duration of time that we can hardly notice it (by comparison, a 30 FPS video would display each frame for ~0.033 seconds).
  2. After bouncing off one wall, the sound reflects back and reflects off the other wall. It also reflects off the perpendicular walls and any and all objects that are in the room. Even more, the sound has to travel slightly longer to reach the corners of the room (~14 meters instead of 10). All these echoes themselves go on to combine and echo off all the other surfaces in the room until all the energy of the original sound has dissipated.

Intuitively, it should be possible to use multiple echoes at different delays to simulate reverb.

We can implement a single echo using a very simple ring buffer:

    class FeedbackBuffer {
    public:
        unsigned int nr_samples;
        int16_t *samples;

        unsigned int pos;

        FeedbackBuffer(unsigned int nr_samples):
            nr_samples(nr_samples),
            samples(new int16_t[nr_samples]),
            pos(0)
        {
            /* Start silent: zero the buffer so the first pass has no garbage */
            memset(samples, 0, nr_samples * sizeof(*samples));
        }

        ~FeedbackBuffer()
        {
            delete[] samples;
        }

        int16_t get() const
        {
            return samples[pos];
        }

        void add(int16_t sample)
        {
            samples[pos] = sample;

            /* If we reach the end of the buffer, wrap around */
            if (++pos == nr_samples)
                pos = 0;
        }
    };

The constructor takes one argument: the number of samples in the buffer, which is exactly how much time we will delay the signal by; when we write a sample to the buffer using the add() function, it will come back after a delay of exactly nr_samples using the get() function. Easy, right?
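
This delay behaviour is easy to sanity-check in isolation. Here is a compact std::vector variant of the same ring buffer (a sketch of mine, not the article's exact code): an impulse written with add() re-emerges from get() exactly nr_samples calls later.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Compact ring-buffer delay line, equivalent in behaviour to FeedbackBuffer.
struct Delay {
    std::vector<int16_t> samples;
    unsigned int pos = 0;

    // Buffer starts zeroed, so the first nr_samples outputs are silence.
    explicit Delay(unsigned int nr_samples) : samples(nr_samples, 0) {}

    int16_t get() const { return samples[pos]; }

    void add(int16_t sample) {
        samples[pos] = sample;
        if (++pos == samples.size())   /* wrap around at the end */
            pos = 0;
    }
};
```

Feeding in a single impulse followed by silence shows the impulse coming back out after precisely one buffer length, which is the property the echo effect relies on.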

Since this is an audio filter, we need to be able to read an input signal and write an output signal. For simplicity, I'm going to use stdin and stdout for this -- we will read 8 KiB at a time using read(), process that, and then use write() to output the result. It will look something like this:

    #include <cstdio>
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        while (true) {
            int16_t buf[8192];
            ssize_t in = read(STDIN_FILENO, buf, sizeof(buf));
            if (in == -1) {
                /* Error */
                return 1;
            }
            if (in == 0) {
                /* EOF */
                break;
            }

            for (unsigned int j = 0; j < in / sizeof(*buf); ++j) {
                /* TODO: Apply filter to each sample here */
            }

            write(STDOUT_FILENO, buf, in);
        }

        return 0;
    }

On Linux you can use e.g. 'arecord' to get samples from the microphone and 'aplay' to play samples on the speakers, and you can do the whole thing on the command line:

    $ arecord -t raw -c 1 -f s16 -r 44100 |\
        ./reverb | aplay -t raw -c 1 -f s16 -r 44100

(-c means 1 channel; -f s16 means "signed 16-bit" which corresponds to the int16_t type we've used for our buffers; -r 44100 means a sample rate of 44100 samples per second; and ./reverb is the name of our executable.)

So how do we use class FeedbackBuffer to generate the reverb effect?

Remember how I said that reverb is essentially many echoes? Let's add a few of them at the top of main():

    FeedbackBuffer fb0(1229);
    FeedbackBuffer fb1(1559);
    FeedbackBuffer fb2(1907);
    FeedbackBuffer fb3(4057);
    FeedbackBuffer fb4(8117);
    FeedbackBuffer fb5(8311);
    FeedbackBuffer fb6(9931);

The buffer sizes that I've chosen here are somewhat arbitrary (I played with a bunch of different combinations and this sounded okay to me). But I used this as a rough guideline: simulating the 20m-by-20m room at a sample rate of 44100 samples per second means we would need delays roughly on the order of 44100 × (20 / 340) ≈ 2594 samples.

Another thing to keep in mind is that we generally do not want our feedback buffers to be multiples of each other. The reason for this is that it creates a consonance between them and will cause certain frequencies to be amplified much more than others. As an example, if you count from 1 to 500 (and continue again from 1), and you have a friend who counts from 1 to 1000 (and continues again from 1), then you would start out 1-1, 2-2, 3-3, etc. up to 500-500, then you would go 1-501, 2-502, 3-504, etc. up to 500-1000. But then, as you both wrap around, you start at 1-1 again. And your friend will always be on 1 when you are on 1. This has everything to do with periodicity and -- in fact -- prime numbers! If you want to maximise the combined period of two counters, you have to make sure that they are relatively coprime, i.e. that they don't share any common factors. The easiest way to achieve this is to only pick prime numbers to start with, so that's what I did for my feedback buffers above.

Having created the feedback buffers (which each represent one echo of the original sound), it's time to put them to use. The effect I want to create is not simply overlaying echoes at fixed intervals, but to have the echoes bounce off each other and feed back into each other. The way we do this is by first combining them into the output signal... (since we have 8 signals to combine including the original one, I give each one a 1/8 weight)

    float x = .125 * buf[j];
    x += .125 * fb0.get();
    x += .125 * fb1.get();
    x += .125 * fb2.get();
    x += .125 * fb3.get();
    x += .125 * fb4.get();
    x += .125 * fb5.get();
    x += .125 * fb6.get();
    int16_t out = x;

...then feeding the result back into each of them:

    fb0.add(out);
    fb1.add(out);
    fb2.add(out);
    fb3.add(out);
    fb4.add(out);
    fb5.add(out);
    fb6.add(out);

And finally we also write the result back into the buffer. I found that the original signal loses some of its power, so I use a factor 4 gain to bring it roughly back to its original strength; this number is an arbitrary choice by me, I don't have any specific calculations to support it:

    buf[j] = 4 * out;

That's it! 88 lines of code is enough to write a very basic reverb filter from first principles. Be careful when you run it, though, even the smallest mistake could cause very loud and unpleasant sounds to be played.

If you play with different buffer sizes or a different number of feedback buffers, let me know if you discover anything interesting :-)

March 26, 2017 10:07 AM

Vegard Nossum: Fuzzing the OpenSSH daemon using AFL

(EDIT 2017-03-25: All my patches to make OpenSSH more amenable to fuzzing with AFL are available at This also includes improvements to the patches found in this post.)

American Fuzzy Lop is a great tool. It does take a little bit of extra setup and tweaking if you want to go into advanced usage, but mostly it just works out of the box.

In this post, I’ll detail some of the steps you need to get started with fuzzing the OpenSSH daemon (sshd) and show you some tricks that will help get results more quickly.

The AFL home page already displays 4 OpenSSH bugs in its trophy case; these were found by Hanno Böck who used an approach similar to that outlined by Jonathan Foote on how to fuzz servers with AFL.

I take a slightly different approach, which I think is simpler: instead of intercepting system calls to fake network activity, we just run the daemon in “inetd mode”. The inet daemon is not used very much anymore on modern Linux distributions, but the short story is that it sets up the listening network socket for you and launches a new process to handle each new incoming connection. inetd then passes the network socket to the target program as stdin/stdout. Thus, when sshd is started in inetd mode, it communicates with a single client over stdin/stdout, which is exactly what we need for AFL.
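
The essence of inetd mode is that the "daemon" never opens a socket itself; it simply inherits one as fds 0 and 1. That hand-off can be sketched in a few lines (a hypothetical illustration of mine: a socketpair() stands in for an accepted TCP connection, and /bin/cat stands in for sshd -i):

```cpp
#include <cassert>
#include <string>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork a child, hand it one end of a socket as stdin/stdout (as inetd does
// with an accepted connection), and talk to it from the other end.
std::string inetd_style_echo(const std::string &msg) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return "";

    pid_t pid = fork();
    if (pid == 0) {                   // child: the "daemon"
        close(sv[0]);
        dup2(sv[1], STDIN_FILENO);    // the connection *is* stdin...
        dup2(sv[1], STDOUT_FILENO);   // ...and stdout, just like sshd -i
        close(sv[1]);
        execl("/bin/cat", "cat", static_cast<char *>(nullptr));
        _exit(127);
    }

    close(sv[1]);                     // parent: the "client"
    write(sv[0], msg.data(), msg.size());
    shutdown(sv[0], SHUT_WR);         // EOF for the daemon's stdin

    std::string out;
    char buf[256];
    ssize_t n;
    while ((n = read(sv[0], buf, sizeof(buf))) > 0)
        out.append(buf, n);
    close(sv[0]);
    waitpid(pid, nullptr, 0);
    return out;
}
```

AFL exploits exactly this shape: since the target reads its "network traffic" from stdin, the fuzzer can feed testcases to it with no sockets involved at all.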

Configuring and building AFL

If you are just starting out with AFL, you can probably just type make in the top-level AFL directory to compile everything, and it will just work. However, I want to use some more advanced features, in particular I would like to compile sshd using LLVM-based instrumentation (which is slightly faster than the “assembly transformation by sed” that AFL uses by default). Using LLVM also allows us to move the target program’s “fork point” from just before entering main() to an arbitrary location (known as “deferred forkserver mode” in AFL-speak); this means that we can skip some of the setup operations in OpenSSH, most notably reading/parsing configs and loading private keys.

Most of the steps for using LLVM mode are detailed in AFL’s llvm_mode/README.llvm. On Ubuntu, you should install the clang and llvm packages, then run make -C llvm_mode from the top-level AFL directory, and that’s pretty much it. You should get a binary called afl-clang-fast, which is what we’re going to use to compile sshd.

Configuring and building OpenSSH

I’m on Linux so I use the “portable” flavour of OpenSSH, which conveniently also uses git (as opposed to the OpenBSD version which still uses CVS – WTF!?). Go ahead and clone it from git://

Run autoreconf to generate the configure script. This is how I run the config script:

./configure \
CC="$PWD/afl-2.39b/afl-clang-fast" \
CFLAGS="-g -O3" \
--prefix=$PWD/install \
--with-privsep-path=$PWD/var-empty \
--with-sandbox=no

You obviously need to pass the right path to afl-clang-fast. I’ve also created two directories in the current (top-level OpenSSH directory), install and var-empty. This is so that we can run make install without being root (although var-empty needs to have mode 700 and be owned by root) and without risking clobbering any system files (which would be extremely bad, as we’re later going to disable authentication and encryption!). We really do need to run make install, even though we’re not going to be running sshd from the installation directory. This is because sshd needs some private keys to run, and that is where it will look for them.

(EDIT 2017-03-25: Passing --without-pie to configure may help make the resulting binaries easier to debug since instruction pointers will not be randomised.)

If everything goes well, running make should display the AFL banner as OpenSSH is compiled.

You may need some extra libraries (zlib1g-dev and libssl-dev on Ubuntu) for the build to succeed.

Run make install to install sshd into the install/ subdirectory (and again, please don’t run this as root).

We will have to rebuild OpenSSH a few times as we apply some patches to it, but this gives you the basic ingredients for a build. One particular annoying thing I’ve noticed is that OpenSSH doesn’t always detect source changes when you run make (and so your changes may not actually make it into the binary). For this reason I just adopted the habit of always running make clean before recompiling anything. Just a heads up!

Running sshd

Before we can actually run sshd under AFL, we need to figure out exactly how to invoke it with all the right flags and options. This is what I use:

./sshd -d -e -p 2200 -r -f sshd_config -i

This is what it means:

-d
“Debug mode”. Keeps the daemon from forking, makes it accept only a single connection, and keeps it from putting itself in the background. All useful things that we need.
-e
This makes it log to stderr instead of syslog; this first of all prevents clobbering your system log with debug messages from our fuzzing instance, and also gives a small speed boost.
-p 2200
The TCP port to listen to. This is not really used in inetd mode (-i), but is useful later on when we want to generate our first input testcase.
-r
This option is not documented (or not in my man page, at least), but you can find it in the source code, which should hopefully also explain what it does: preventing sshd from re-execing itself. I think this is a security feature, since it allows the process to isolate itself from the original environment. In our case, it complicates and slows things down unnecessarily, so we disable it by passing -r.
-f sshd_config
Use the sshd_config from the current directory. This just allows us to customise the config later without having to reinstall it or be unsure about which location it’s really loaded from.
-i
“Inetd mode”. As already mentioned, inetd mode will make the server process a single connection on stdin/stdout, which is a perfect fit for AFL (as it will write testcases on the program’s stdin by default).

Go ahead and run it. It should hopefully print something like this:

$ ./sshd -d -e -p 2200 -r -f sshd_config -i
debug1: sshd version OpenSSH_7.4, OpenSSL 1.0.2g 1 Mar 2016
debug1: private host key #0: ssh-rsa SHA256:f9xyp3dC+9jCajEBOdhjVRAhxp4RU0amQoj0QJAI9J0
debug1: private host key #1: ssh-dss SHA256:sGRlJclqfI2z63JzwjNlHtCmT4D1WkfPmW3Zdof7SGw
debug1: private host key #2: ecdsa-sha2-nistp256 SHA256:02NDjij34MUhDnifUDVESUdJ14jbzkusoerBq1ghS0s
debug1: private host key #3: ssh-ed25519 SHA256:RsHu96ANXZ+Rk3KL8VUu1DBzxwfZAPF9AxhVANkekNE
debug1: setgroups() failed: Operation not permitted
debug1: inetd sockets after dupping: 3, 4
Connection from UNKNOWN port 65535 on UNKNOWN port 65535

If you type some garbage and press enter, it will probably give you “Protocol mismatch.” and exit. This is good!

Detecting crashes/disabling privilege separation mode

One of the first obstacles I ran into was the fact that I saw sshd crashing in my system logs, but AFL wasn’t detecting them as crashes:

[726976.333225] sshd[29691]: segfault at 0 ip 000055d3f3139890 sp 00007fff21faa268 error 4 in sshd[55d3f30ca000+bf000]
[726984.822798] sshd[29702]: segfault at 4 ip 00007f503b4f3435 sp 00007fff84c05248 error 4 in[7f503b3a6000+1bf000]

The problem is that OpenSSH comes with a “privilege separation mode” that forks a child process and runs most of the code inside the child. If the child segfaults, the parent still exits normally, so it masks the segfault from AFL (which only observes the parent process directly).

In version 7.4 and earlier, privilege separation mode can easily be disabled by adding “UsePrivilegeSeparation no” to sshd_config or passing -o UsePrivilegeSeparation=no on the command line.

Unfortunately it looks like the OpenSSH developers are removing the ability to disable privilege separation mode in version 7.5 and onwards. This is not a big deal, as OpenSSH maintainer Damien Miller writes on Twitter: “the infrastructure will be there for a while and it’s a 1-line change to turn privsep off”. So you may have to dive into sshd.c to disable it in the future.

(EDIT 2017-03-25: I’ve pushed the source tweak for disabling privilege separation for 7.5 and newer to my OpenSSH GitHub repo. This also obsoletes the need for a config change.)

Reducing randomness

OpenSSH uses random nonces during the handshake to prevent “replay attacks” where you would record somebody’s (encrypted) SSH session and then you feed the same data to the server again to authenticate again. When random numbers are used, the server and the client will calculate a new set of keys and thus thwart the replay attack.

In our case, we explicitly want to be able to replay traffic and obtain the same result two times in a row; otherwise, the fuzzer would not be able to gain any useful data from a single connection attempt (as the testcase it found would not be usable for further fuzzing).

There’s also the possibility that randomness introduces variabilities in other code paths not related to the handshake, but I don’t really know. In any case, we can easily disable the random number generator. On my system, with the configure line above, all or most random numbers seem to come from arc4random_buf() in openbsd-compat/arc4random.c, so to make random numbers very predictable, we can apply this patch:

diff --git openbsd-compat/arc4random.c openbsd-compat/arc4random.c
--- openbsd-compat/arc4random.c
+++ openbsd-compat/arc4random.c
@@ -242,7 +242,7 @@ void
 arc4random_buf(void *buf, size_t n)
 {
-	_rs_random_buf(buf, n);
+	memset(buf, 0, n);
 }
 # endif /* !HAVE_ARC4RANDOM_BUF */

One way to test whether this patch is effective is to try to packet-capture an SSH session and see if it can be replayed successfully. We’re going to have to do that later anyway in order to create our first input testcase, so skip below if you want to see how that’s done. In any case, AFL would also tell us using its “stability” indicator if something was really off with regards to random numbers (>95% stability is generally good, <90% would indicate that something is off and needs to be fixed).

Increasing coverage

Disabling message CRCs

When fuzzing, we really want to disable as many checksums as we can, as Damien Miller also wrote on twitter: “fuzzing usually wants other code changes too, like ignoring MAC/signature failures to make more stuff reachable”. This may sound a little strange at first, but makes perfect sense: In a real attack scenario, we can always1 fix up CRCs and other checksums to match what the program expects.

If we don’t disable checksums (and we don’t try to fix them up), then the fuzzer will make very little progress. A single bit flip in a checksum-protected area will just fail the checksum test and never allow the fuzzer to proceed.

We could of course also fix the checksum up before passing the data to the SSH server, but this is slow and complicated. It’s better to disable the checksum test in the server and then try to fix it up if we do happen to find a testcase which can crash the modified server.

The first thing we can disable is the packet CRC test:

diff --git a/packet.c b/packet.c
--- a/packet.c
+++ b/packet.c
@@ -1635,7 +1635,7 @@ ssh_packet_read_poll1(struct ssh *ssh, u_char *typep)

cp = sshbuf_ptr(state->incoming_packet) + len - 4;
stored_checksum = PEEK_U32(cp);
- if (checksum != stored_checksum) {
+ if (0 && checksum != stored_checksum) {
error("Corrupted check bytes on input");
if ((r = sshpkt_disconnect(ssh, "connection corrupted")) != 0 ||
(r = ssh_packet_write_wait(ssh)) != 0)

As far as I understand, this is a simple (non-cryptographic) integrity check meant just as a sanity check against bit flips or incorrectly encoded data.

Disabling MACs

We can also disable Message Authentication Codes (MACs), which are the cryptographic equivalent of checksums, but which also guarantees that the message came from the expected sender:

diff --git mac.c mac.c
index 5ba7fae1..ced66fe6 100644
--- mac.c
+++ mac.c
@@ -229,8 +229,10 @@ mac_check(struct sshmac *mac, u_int32_t seqno,
 	if ((r = mac_compute(mac, seqno, data, dlen,
 	    ourmac, sizeof(ourmac))) != 0)
 		return r;
+#if 0
 	if (timingsafe_bcmp(ourmac, theirmac, mac->mac_len) != 0)
 		return SSH_ERR_MAC_INVALID;
+#endif
 	return 0;
 }

We do have to be very careful when making these changes. We want to try to preserve the original behaviour of the program as much as we can, in the sense that we have to be very careful not to introduce bugs of our own. For example, we have to be very sure that we don’t accidentally skip the test which checks that the packet is large enough to contain a checksum in the first place. If we had accidentally skipped that, it is possible that the program being fuzzed would try to access memory beyond the end of the buffer, which would be a bug which is not present in the original program.

This is also a good reason to never submit crashing testcases to the developers of a program unless you can show that they also crash a completely unmodified program.

Disabling encryption

The last thing we can do, unless you wish to only fuzz the unencrypted initial protocol handshake and key exchange, is to disable encryption altogether.

The reason for doing this is exactly the same as the reason for disabling checksums and MACs, namely that the fuzzer would have no hope of being able to fuzz the protocol itself if it had to work with the encrypted data (since touching the encrypted data with overwhelming probability will just cause it to decrypt to random and utter garbage).

Making the change is surprisingly simple, as OpenSSH already comes with a pseudo-cipher that just passes data through without actually encrypting/decrypting it. All we have to do is to make it available as a cipher that can be negotiated between the client and the server. We can use this patch:

diff --git a/cipher.c b/cipher.c
index 2def333..64cdadf 100644
--- a/cipher.c
+++ b/cipher.c
@@ -95,7 +95,7 @@ static const struct sshcipher ciphers[] = {
# endif /* OPENSSL_NO_BF */
#endif /* WITH_SSH1 */
- { "none", SSH_CIPHER_NONE, 8, 0, 0, 0, 0, 0, EVP_enc_null },
+ { "none", SSH_CIPHER_SSH2, 8, 0, 0, 0, 0, 0, EVP_enc_null },
{ "3des-cbc", SSH_CIPHER_SSH2, 8, 24, 0, 0, 0, 1, EVP_des_ede3_cbc },
# ifndef OPENSSL_NO_BF
{ "blowfish-cbc",

To use this cipher by default, just put “Ciphers none” in your sshd_config. Of course, the client doesn’t support it out of the box either, so if you make any test connections, you have to use the ssh binary compiled with the patched cipher.c above as well.

You may have to pass -o Ciphers=none from the client as well if it prefers to use a different cipher by default. Use strace or Wireshark to verify that communication beyond the initial protocol setup happens in plaintext.

Making it fast

afl-clang-fast/LLVM “deferred forkserver mode”

I mentioned above that using afl-clang-fast (i.e. AFL’s LLVM deferred forkserver mode) allows us to move the “fork point” to skip some of the sshd initialisation steps which are the same for every single testcase we can throw at it.

To make a long story short, we need to put a call to __AFL_INIT() at the right spot in the program, separating the stuff that doesn’t depend on a specific input to happen before it and the testcase-specific handling to happen after it. I’ve used this patch:

diff --git a/sshd.c b/sshd.c
--- a/sshd.c
+++ b/sshd.c
@@ -1840,6 +1840,8 @@ main(int ac, char **av)
/* ignore SIGPIPE */

+ __AFL_INIT();
/* Get a connection, either from inetd or a listening TCP socket */
if (inetd_flag) {
server_accept_inetd(&sock_in, &sock_out);

AFL should be able to automatically detect that you no longer wish to start the program from the top of main() every time. However, with only the patch above, I got this scary-looking error message:

Hmm, looks like the target binary terminated before we could complete a
handshake with the injected code. Perhaps there is a horrible bug in the
fuzzer. Poke <> for troubleshooting tips.

So there is obviously some AFL magic code here to make the fuzzer and the fuzzed program communicate. After poking around in afl-fuzz.c, I found FORKSRV_FD, which is a file descriptor pointing to a pipe used for this purpose. The value is 198 (and the other end of the pipe is 199).

To try to figure out what was going wrong, I ran afl-fuzz under strace, and it showed that file descriptors 198 and 199 were getting closed by sshd. With some more digging, I found the call to closefrom(), which is a function that closes all inherited (and presumed unused) file descriptors starting at a given number. Again, the reason for this code to exist in the first place is probably in order to reduce the attack surface in case an attacker is able to gain control of the process. Anyway, the solution is to protect these special file descriptors using a patch like this:

diff --git openbsd-compat/bsd-closefrom.c openbsd-compat/bsd-closefrom.c
--- openbsd-compat/bsd-closefrom.c
+++ openbsd-compat/bsd-closefrom.c
@@ -81,7 +81,7 @@ closefrom(int lowfd)
while ((dent = readdir(dirp)) != NULL) {
fd = strtol(dent->d_name, &endp, 10);
if (dent->d_name != endp && *endp == '\0' &&
- fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp))
+ fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp) && fd != 198 && fd != 199)
(void) close((int) fd);
(void) closedir(dirp);
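The effect of the patch can be mirrored in a short Python sketch (a hypothetical helper, with the two AFL descriptors hard-coded just like in the patch above):

```python
FORKSRV_FD = 198  # AFL's forkserver pipe; 199 is the other end

def fds_to_close(open_fds, lowfd, protected=(FORKSRV_FD, FORKSRV_FD + 1)):
    """Return the descriptors a closefrom(lowfd) sweep would close,
    while sparing the ones the AFL forkserver needs."""
    return [fd for fd in open_fds if fd >= lowfd and fd not in protected]

# Descriptors 198 and 199 survive even though they are above lowfd:
assert fds_to_close([0, 1, 2, 5, 198, 199, 200], lowfd=3) == [5, 200]
```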

Skipping expensive DH/curve and key derivation operations

At this point, I still wasn’t happy with the execution speed: Some testcases were as low as 10 execs/second, which is really slow.

I tried compiling sshd with -pg (for gprof) to try to figure out where the time was going, but there are many obstacles to getting this to work properly. First of all, sshd exits using _exit(255) through its cleanup_exit() function. This is not a “normal” exit, and so the gmon.out file (containing the profile data) is not written out at all. After applying a source patch to fix that, sshd will give you a “Permission denied” error as it tries to open the file for writing. The problem now is that sshd does a chdir("/"), meaning that it’s trying to write the profile data in a directory where it doesn’t have access. The solution is again simple: just add another chdir() to a writable location before calling exit(). Even with this in place, the profile came out completely empty for me. Maybe it’s another one of those privilege separation things. In any case, I decided to just use valgrind and its “cachegrind” tool to obtain the profile. It’s much easier and gives me the data I need without the hassle of reconfiguring, patching, and recompiling.

The profile showed one very specific hot spot, coming from two different locations: elliptic curve point multiplication.

I don’t really know too much about elliptic curve cryptography, but apparently it’s pretty expensive to calculate. However, we don’t really need to deal with it; we can assume that the key exchange between the server and the client succeeds. Similar to how we increased coverage above by skipping message CRC checks and replacing the encryption with a dummy cipher, we can simply skip the expensive operations and assume they always succeed. This is a trade-off; we are no longer fuzzing all the verification steps, but it allows the fuzzer to concentrate more on the protocol parsing itself. I applied this patch:

diff --git kexc25519.c kexc25519.c
--- kexc25519.c
+++ kexc25519.c
@@ -68,10 +68,13 @@ kexc25519_shared_key(const u_char key[CURVE25519_SIZE],

/* Check for all-zero public key */
explicit_bzero(shared_key, CURVE25519_SIZE);
+#if 0
if (timingsafe_bcmp(pub, shared_key, CURVE25519_SIZE) == 0)
return SSH_ERR_KEY_INVALID_EC_VALUE;

crypto_scalarmult_curve25519(shared_key, key, pub);
+#endif
dump_digest("shared secret", shared_key, CURVE25519_SIZE);
diff --git kexc25519s.c kexc25519s.c
--- kexc25519s.c
+++ kexc25519s.c
@@ -67,7 +67,12 @@ input_kex_c25519_init(int type, u_int32_t seq, void *ctxt)
int r;

/* generate private key */
+#if 0
kexc25519_keygen(server_key, server_pubkey);
+#else
+explicit_bzero(server_key, sizeof(server_key));
+explicit_bzero(server_pubkey, sizeof(server_pubkey));
+#endif
dump_digest("server private key:", server_key, sizeof(server_key));

With this patch in place, execs/second went to ~2,000 per core, which is a much better speed to be fuzzing at.

(EDIT 2017-03-25: As it turns out, this patch is not very good, because it causes a later key validity check to fail (dh_pub_is_valid() in input_kex_dh_init()). We could perhaps make dh_pub_is_valid() always return true, but then there is a question of whether this in turn makes something else fail down the line.)

Creating the first input testcases

Before we can start fuzzing for real, we have to create the first few input testcases. Actually, a single one is enough to get started, but if you know how to create different ones taking different code paths in the server, that may help jumpstart the fuzzing process. A few possibilities I can think of:

The way I created the first testcase was to record the traffic from the client to the server using strace. Start the server without -i:

./sshd -d -e -p 2200 -r -f sshd_config
Server listening on :: port 2200.

Then start a client (using the ssh binary you’ve just compiled) under strace:

$ strace -e trace=write -o strace.log -f -s 8192 ./ssh -c none -p 2200 localhost

This should hopefully log you in (if not, you may have to fiddle with users, keys, and passwords until you succeed in logging in to the server you just started).

The first few lines of the strace log should read something like this:

2945  write(3, "SSH-2.0-OpenSSH_7.4\r\n", 21) = 21
2945 write(3, "\0\0\4|\5\24\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0010curve25519-sha256,,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,ext-info-c\0\0\1\",,,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,,,ssh-ed25519,rsa-sha2-512,rsa-sha2-256,ssh-rsa\0\0\0\4none\0\0\0\4none\0\0\0\,,,,,,,hmac-sha2-256,hmac-sha2-512,hmac-sha1\0\0\0\,,,,,,,hmac-sha2-256,hmac-sha2-512,hmac-sha1\0\0\0\32none,,zlib\0\0\0\32none,,zlib\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 1152) = 1152

We see here that the client is communicating over file descriptor 3. You will have to delete all the writes happening on other file descriptors. Then take the strings and paste them into a Python script, something like:

import sys
for x in [
    "SSH-2.0-OpenSSH_7.4\r\n",
    # ...paste the remaining write() strings from the strace log here...
]:
    sys.stdout.write(x)  # on Python 3: sys.stdout.buffer.write(x.encode("latin-1"))

When you run this, it will print a byte-perfect copy of everything that the client sent to stdout. Just redirect this to a file. That file will be your first input testcase.

You can do a test run (without AFL) by passing the same data to the server again (this time using -i):

./sshd -d -e -p 2200 -r -f sshd_config -i < testcase 2>&1 > /dev/null

Hopefully it will show that your testcase replay was able to log in successfully.

Before starting the fuzzer you can also double check that the instrumentation works as expected using afl-analyze:

~/afl-2.39b/afl-analyze -i testcase -- ./sshd -d -e -p 2200 -r -f sshd_config -i

This may take a few seconds to run, but should eventually show you a map of the file and what it thinks each byte means. If there is too much red, that’s an indication you were not able to disable checksumming/encryption properly (maybe you have to make clean and rebuild?). You may also see other errors, including that AFL didn’t detect any instrumentation (did you compile sshd with afl-clang-fast?). This is general AFL troubleshooting territory, so I’d recommend checking out the AFL documentation.

Creating an OpenSSH dictionary

I created an AFL “dictionary” for OpenSSH, which is basically just a list of strings with special meaning to the program being fuzzed. I just used a few of the strings found by running ssh -Q cipher, etc. to allow the fuzzer to use these strings without having to discover them all at once (which is pretty unlikely to happen by chance).


Just save it as openssh.dict; to use it, we will pass the filename to the -x option of afl-fuzz.
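For reference, an AFL dictionary is just a text file of name="value" entries, one per line. A hypothetical Python helper to build one might look like this (the token list here is illustrative; in practice you would paste in the output of ssh -Q cipher, ssh -Q mac, ssh -Q kex, and so on):

```python
# Toy generator for an AFL dictionary (one name="value" entry per line).
tokens = [
    "3des-cbc",
    "aes128-ctr",
    "hmac-sha2-256",
    "curve25519-sha256",
    "diffie-hellman-group14-sha1",
]

with open("openssh.dict", "w") as f:
    for i, tok in enumerate(tokens):
        f.write('string_%d="%s"\n' % (i, tok))
```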

Running AFL

Whew, it’s finally time to start the fuzzing!

First, create two directories, input and output. Place your initial testcase in the input directory. Then, for the output directory, we’re going to use a little hack that I’ve found to speed up the fuzzing process and keep AFL from hitting the disk all the time: mount a tmpfs RAM-disk on output with:

sudo mount -t tmpfs none output/

Of course, if you shut down (or crash) your machine without copying the data out of this directory, it will be gone, so you should make a backup of it every once in a while. I personally just use a bash one-liner that just tars it up to the real on-disk filesystem every few hours.

To start a single fuzzer, you can use something like:

~/afl-2.39b/afl-fuzz -x openssh.dict -i input -o output -M 0 -- ./sshd -d -e -p 2100 -r -f sshd_config -i

Again, see the AFL docs on how to do parallel fuzzing. I have a simple bash script that just launches a bunch of the line above (with different values to the -M or -S option) in different screen windows.

Hopefully you should see something like this:

                         american fuzzy lop 2.39b (31)

┌─ process timing ─────────────────────────────────────┬─ overall results ─────┐
│ run time : 0 days, 13 hrs, 22 min, 40 sec │ cycles done : 152 │
│ last new path : 0 days, 0 hrs, 14 min, 57 sec │ total paths : 1577 │
│ last uniq crash : none seen yet │ uniq crashes : 0 │
│ last uniq hang : none seen yet │ uniq hangs : 0 │
├─ cycle progress ────────────────────┬─ map coverage ─┴───────────────────────┤
│ now processing : 717* (45.47%) │ map density : 3.98% / 6.67% │
│ paths timed out : 0 (0.00%) │ count coverage : 3.80 bits/tuple │
├─ stage progress ────────────────────┼─ findings in depth ────────────────────┤
│ now trying : splice 4 │ favored paths : 117 (7.42%) │
│ stage execs : 74/128 (57.81%) │ new edges on : 178 (11.29%) │
│ total execs : 74.3M │ total crashes : 0 (0 unique) │
│ exec speed : 1888/sec │ total hangs : 0 (0 unique) │
├─ fuzzing strategy yields ───────────┴───────────────┬─ path geometry ────────┤
│ bit flips : n/a, n/a, n/a │ levels : 7 │
│ byte flips : n/a, n/a, n/a │ pending : 2 │
│ arithmetics : n/a, n/a, n/a │ pend fav : 0 │
│ known ints : n/a, n/a, n/a │ own finds : 59 │
│ dictionary : n/a, n/a, n/a │ imported : 245 │
│ havoc : 39/25.3M, 20/47.2M │ stability : 97.55% │
│ trim : 2.81%/1.84M, n/a ├────────────────────────┘
└─────────────────────────────────────────────────────┘ [cpu015: 62%]

Crashes found

In about a day of fuzzing (even before disabling encryption), I found a couple of NULL pointer dereferences during key exchange. Fortunately, these crashes are not harmful in practice because of OpenSSH’s privilege separation code, so at most we’re crashing an unprivileged child process and leaving a scary segfault message in the system log. The fix made it in CVS here:


Apart from the two harmless NULL pointer dereferences I found, I haven’t been able to find anything else yet, which seems to indicate that OpenSSH is fairly robust (which is good!).

I hope some of the techniques and patches I used here will help more people get into fuzzing OpenSSH.

Other things to do from here include doing some fuzzing rounds using ASAN or running the corpus through valgrind, although it’s probably easier to do this once you already have a good sized corpus found without them, as both ASAN and valgrind have a performance penalty.

It could also be useful to look into ./configure options to configure the build more like a typical distro build; I haven’t done anything here except to get it to build in a minimal environment.

Please let me know in the comments if you have other ideas on how to expand coverage or make fuzzing OpenSSH faster!


I’d like to thank Oracle (my employer) for providing the hardware on which to run lots of AFL instances in parallel :-)

  1. Well, we can’t fix up signatures we don’t have the private key for, so in those cases we’ll just assume the attacker does have the private key. You can still do damage e.g. in an otherwise locked down environment; as an example, GitHub uses the SSH protocol to allow pushing to your repositories. These SSH accounts are heavily locked down, in that you can’t run arbitrary commands on them. In this case, however, we do have the secret key used to authenticate and sign messages.

March 26, 2017 10:07 AM

March 24, 2017

James Morris: Linux Security Summit 2017: CFP Announcement

LSS logo

The 2017 Linux Security Summit CFP (Call for Participation) is now open!

See the announcement here.

The summit this year will be held in Los Angeles, USA on 14-15 September. It will be co-located with the Open Source Summit (formerly LinuxCon), and the Linux Plumbers Conference. We’ll follow essentially the same format as the 2016 event (you can find the recap here).

The CFP closes on June 5th, 2017.

March 24, 2017 01:10 PM

March 21, 2017

Matthew Garrett: Announcing the Shim review process

Shim has been hugely successful, to the point of being used by the majority of significant Linux distributions and many other third party products (even, apparently, Solaris). The aim was to ensure that it would remain possible to install free operating systems on UEFI Secure Boot platforms while still allowing machine owners to replace their bootloaders and kernels, and it's achieved this goal.

However, a legitimate criticism has been that there's very little transparency in Microsoft's signing process. Some people have waited for significant periods of time before receiving a response. A large part of this is simply that demand has been greater than expected, and Microsoft aren't in the best position to review code that they didn't write in the first place.

To that end, we're adopting a new model. A mailing list has been created at, and members of this list will review submissions and provide a recommendation to Microsoft on whether these should be signed or not. The current set of expectations around binaries to be signed is documented here, and the current process here - it is expected that this will evolve slightly as we get used to the process, and we'll provide a more formal set of documentation once things have settled down.

This is a new initiative and one that will probably take a little while to get working smoothly, but we hope it'll make it much easier to get signed releases of Shim out without compromising security in the process.

comment count unavailable comments

March 21, 2017 08:29 PM

Kernel Podcast: Linux Kernel Podcast for 2017/03/21


In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc3, this week’s exciting installment of “5-level paging weekly”, the 2038 doomsday compliance “statx” systemcall, and heterogenous memory management. Also a summary of all ongoing active kernel development toward 4.12 onwards.

Linus Torvalds announced Linux 4.11-rc3. In his announcement, Linus noted that “rc3 is larger than rc2, but this is hopefully the point where things start to shrink and calm down. We had a late typo in rc2 that affected arm and powerpc (the prep code for the 5-level page tables [on x86 systems]), and hopefully there are no similar brown-paper-bugs in rc3.”


Kent Overstreet announced the latest developments in Bcachefs, in a post entitled “Bcachefs – encryption, fsck, and more”. One of the key new features is that “We now have whole filesystem encryption – and this is modern authenticated encryption”. He notes that they can’t currently encrypt only part of the filesystem (as is the case, for example, with ext4 – as used on Android devices, and of course with Apple’s multi-layered iOS filesystem implementation) but “it’s more of a better dm-crypt” in removing the layers between the filesystem and the underlying hardware. He also notes that there’s a “New inode format”, and many other changes. Further details at:

Hongbo Wang (Intel) announced the 2016-Q4 releases of XenGT and KVMGT. These are both “full GPU virtualization solution[s] with mediated pass-through”…of the hardware graphics resources into guest virtual machines. Further information is available from Intel’s github: (igvtg-xen for the Xen tree, and igvtg-kernel and igvtg-qemu for the pieces needed for KVM support)

Julia Cartwright announced the Linux preempt-rt (Real Time) kernel version 4.1.39-rt47 stable kernel release.

Junio C Hamano announced Git v2.12.1. In his announcement, he noted that the tarballs “are NOT YET found at” the typical URL since “I am having trouble reaching there”. It’s unclear if this is due to recent changes in the architecture of and its mirroring, or a local issue.

Intel 5-level paging

In this week’s episode of “merging Intel 5-level paging support” the fun but unexpected plot twist resulting in a “will it merge or not” cliffhanger comes from Linus. Kirill A. Shutemov (Intel) has been diligently posting this series for some time, and if you recall from last week’s episode, the foundational pieces needed to land this in 4.12 were merged after the closure of the 4.11 merge window following a special request from Linus. Kirill has since posted “x86: 5-level paging enabling for v4.12, Part 1”. In response to a comment from Kirill that “Let’s see if I’m on the right track addressing Ingo’s [Molnar’s] feedback”, Linus stated, “Considering the bug we just had with the HAVE_GENERIC_RCU_GUP code, I’m wondering if people would be willing to look at what it would take to make x86 use the generic version?”, and “The x86 version of __get_user_pages_fast() seems to be quite similar to the generic one. And it would be lovely if all the main architectures shared the same core gup code”.

The Linux kernel implements a set of functions for pinning usermode (userspace) pages whenever they must be shared between userspace and code running within a kernel driver (the Linux kernel does not have pageable memory). A page is the smallest granule upon which contemporary hardware operates via a Memory Management Unit, under the control of software-provided and (co-)maintained “page tables”; it is also the size tracked by the operating system in its page table management code. Userspace memory is dynamically pageable, so pages can come and go as the kernel needs to free up RAM temporarily for other tasks by “paging” those pages out to “swap”. GUP (get_user_pages) handles the pinning operation, which takes a set of pointers to the individual pages that should be present and marked as in use. It has a variant usually referred to as “fast GUP”, which aims to perform this operation without taking an expensive lock in the corresponding userspace process’s “mm” struct (an object that forms part of the metadata of a task, the in-kernel term for a process, and is linked from the corresponding task_struct). Fast GUP doesn’t always work, but when it doesn’t need to fall back to the expensive slow path, it can save considerable time. So Linus was expressing a desire for x86 to share the same generic code as used by other architectures for this operation.

Linus further added three “subtle issues” that he saw with switching over x86 to the generic GUP code:

“(a) we need to make sure that x86 actually matches the required semantics for the generic GUP.

(b) we need to make sure the atomicity of the page table reads is ok.

(c) need to verify the maximum VM address properly”

He said “I _think_ (a) is ok”. But he wanted to see “real work to make sure” that (b) is “ok on 32-bit PAE”. PAE means Physical Address Extension, a mechanism used on certain 32-bit Intel x86 systems to address more than a 32-bit physical address space, by leveraging the fact that while many individual applications don’t need more than a 32-bit address space, an overall system might in aggregate use multiple such 32-bit applications. It was a hack that bought time before the widespread adoption of the 64-bit architecture, and one that others (such as ARM, with “LPAE” and friends) have implemented to similar ends. PAE moved the x86 architecture from 32-bit PTEs (Page Table Entries) to 64-bit hardware entries, which means that on 32-bit systems there are real concerns around the atomicity of updates to these structures without very careful handling. And as this author can attest, you don’t want to have to debug that situation.

This discussion led Kirill to point out that there were some obvious-looking bugs in the existing x86 GUP code that needed fixing for PAE anyway. The thread is ongoing, and Kirill is certain to be enjoying this week’s episode of “so you thought you were only adding 5-level paging?”. Michal Hocko noted that he had pulled the current version of the 5-level paging patch series into the mmotm (mm of the moment) VM (Virtual Memory) subsystem development tree as co-maintained with Andrew Morton and others.

Borislav Petkov posted “x86/mce: Handle broadcasted MCE gracefully with kexec” which (as we covered previously) seeks to handle the unfortunate case of an MCE (Machine Check Exception) on Intel x86 systems arriving during the process of handoff from the crash kernel into “purgatory” prior to the new kernel beginning. At this phase, the old kernel’s MCE handler is running and will never complete a synchronization with other cores in the system that are waiting in a holding spinloop (probably MWAIT one would assume) for the new kernel to take over.


Various subsystems gained support for the new “statx” system call, which is part of the ongoing “Year 2038” doomsday avoidance work to prevent a Y2K style disaster when 32-bit Unix time wraps in 2038 (this being an actual potential “disaster” in the making, unlike the much hyped Y2K nonsense). Many of us have aspirations to be retired and living on boats by then, but this is neither assured, nor a prudent means to guarantee we won’t have to deal with this later (but presumably with at least some kind of lucrative consulting contract to bring us out of our early or late retirements).

The “statx” call adds 64-bit timestamps and replaces “stat”. It also does a lot more than just “make large” (David Howells’ words) the various fields in the previous stat structures. The overall system call was covered much more generally by Linux Weekly News (which you should support as a purveyor of actual in-depth journalism on such topics) as recently as last week. Stafford Horne posted one example of the patches we refer to here, for the “asm-generic” reference includes used by emerging architectures, such as the OpenRISC architecture that he is maintaining. Another statx patch came from David Howells, for the ext4 filesystem, which led to a longer discussion of how to implement various underlying flag changes required to ext4.

Eric Biggers noted that David used the ext4_get_inode_flags function “to sync the generic inode flags (inode->i_flags) to the ext4-specific inode flags (ei->i_flags)” but that a problem can exist when doing this without holding an underlying lock, due to “flag syncs…in both directions concurrently”, which could “cause an update to be lost”. He walked through an example of how this could occur, and then suggested that for ->getattr() it might be easier to skip the call to the offending function and “instead populating the generic attributes like STATX_ATTR_APPEND and STATX_ATTR_IMMUTABLE from the generic inode flags, rather than from the ext4-specific flags?”. Andreas Dilger suggested the other way around, pulling the flags directly from the ext4 flags rather than the generic ones. He also raised the general question of “when/where are the VFS inode flags changed that they need to be propagated into the ext4 disk inode?”.

Jan Kara replied that “you seem to be right. And actually I have checked and XFS does not bother to copy inode->i_flags to its on-disk flags so it seems generally we are not expected to reflect inode->i_flags in on-disk state”. Jan suggested to Andreas that it might be “better…to have ext4_quota_on() and ext4_quota_off() just update the flags as needed and avoid doing it anywhere else…I’ll have a look into it”.

Heterogeneous Memory Management

Jérôme Glisse posted version 18 of his patch series entitled “HMM (Heterogeneous Memory Management)” which aims to serve two generic use cases: “First it allows to use device memory transparently inside any process without modifications to process program code. Second it allows to mirror process address space on a device”. His intro described these summaries as a “Cliff node” (a brand of examination-time study materials often used by students for preparation), which led to an objection from Andrew Morton that “Cliff’s notes” “isn’t appropriate for a large feature such as this. Where’s the long-form description? One which permits readers to fully understand the requirements, design, alternative designs, the implementation, the interface(s), etc?”. He also asked for clarification of what was meant by “device memory” since “That’s very vague. What are the characteristics of this memory? Why is it a requirement that userspace code be unaltered? What are the security implications – does the process need particular permissions to access this memory? What is the proposed interface to set up this access?”

In a followup, Jérôme noted that he had previously given a longer form summary, which he attached, in the earlier revisions of the now version 18 patch series. In his summary, he makes clear his intent is to ease the overall management and programming of hybrid systems involving GPUs and other accelerators by introducing “a new kind of ZONE_DEVICE memory that does allow to allocate a struct page for each page of the device memory. Those page are special because the CPU can not map them. They however allow to migrate main memory to device memory using ex[]isting migration mechanism[s] and everything looks like it page was swap[ped] out to disk from CPU point of view. Using a struct page gives the easiest and cleanest integration with existing mm mechanisms”. He notes that he isn’t trying to solve other problems, and in fact one could summarize HMM using the buzzword du jour: “mediated”.

In an HMM world, devices and host-side application software can share what appears to them as a “unified” memory map: one in which pointer addresses from within an application can be dereferenced by code running on a GPU, and vice versa, through cunning use of page tables and a new underlying system framework for the device drivers touching the hardware. It’s not magic, but it does help to treat device memory “like regular memory” and accommodates “Advance in high level language construct (in C++ but others too) gives opportunities to compiler to leverage GPU transparently without programmer knowledge. But for this to happen we need a share[d] address space”.

This means that, if a host application (processor side of the equation) performs an access to part of a process (known as a “task” within the kernel) address space that is currently under control of a device, then the associated page fault will trigger generic framework code to handle handoff of that page back to the host CPU side. On the flip side, the framework still requires device drivers to use a new framework to manage their access to memory since few devices have generic page fault mechanisms today that can be leveraged to make this more transparent, and a lot of other device specific gunk is needed. It’s not a perfect solution, but it does arguably advance the state of the art, and is useful. Jérôme also states that “I do not wish to compete for the patchset with the highest revision count and i would like a clear cut position on w[h]ether it can be merge[d] or not. If not i would like to know why because i am more than willing to address any issues people might have. I just don’t want to keep submitting it over and over until i end up in hell…So please consider applying for 4.12”.

This author’s own personal opinion is that, while HMM is certainly useful, many such shared device/host memory situations can be greatly simplified by introducing coherent shared virtual memory between device and host. That model allows for direct address space sharing without some of the heavy lifting required in this patch set. Yet, as is noted in the posting, few devices today have such features (and there is no reason to presume that all future devices suddenly will implement shared virtual memory, nor that every device will want to expend the energy required to maintain coherent memory for communication). So the HMM patches provide a means of tracking who owns memory shared between device and “host”, and they exploit split device and “host” system page tables as well as associated faults to ensure pages are handed off as cleanly as can be achieved with technology available in the market today.

Ongoing Development

Michal Hocko posted a patch entitled “rework memory hotplug onlining”, which seeks to rework the semantics for memory hotplug since the current implementation is “awkward and hard/impossible to use from the udev to online memory as movable. The main problem is that only the last memblock or the adjacent to highest movable memblock can be onlined as movable”. He posted a number of examples showing how things fall down today, as well as a patch (“just for x86 now but I will address other arches once there is an agreement this is the right approach”) removing “all the zone specific operations from __add_pages (aka arch_add_memory) path. Instead we do page->zone association from move_pfn_range which is called from online_pages. This criterion for movable/normal zone association is really simple now. We just have to guarantee that zone Normal is always lower than zone Movable”. This led to a lengthy discussion around the ideal longer term approach and is likely to be a topic at the LSF/MM conference this week (one assumes?). [ It’s happening down the street from me…I’ll smile and wave at you 😉 ]

Gustavo Padovan posted “V4L2 explicit synchronization support”, an RFC (Request For Comments) that “adds support for Explicit Synchronization of shared buffers in V4L2” (Video For Linux 2, the general purpose video framework API used on Linux machines for certain multimedia purposes). This new RFC leverages the “Sync File Framework” as a means to “communicate the fences between kernel and userspace”. In English, what this means is that it’s often necessary to communicate using shared buffers between userspace, kernel, and hardware. And some (most) hardware might not guarantee that these buffers are fully coherent (observed identically between multiple concurrently operating agents that are manipulating it). The use of “fences” (barriers) enables explicit communication of certain points in time during which the state of a buffer is consistent and ready for access to be handed off between different parts of the system. The RFC is quite interesting and has a lot more detail, including the observation that it is intended to be a PoC (Proof of Concept) to get the conversation moving more than the eventual end result of that conversation that might actually be merged.

Wei Wang (Intel) posted a patch series entitled “Extend virtio-balloon for fast (de)inflating & fast live migration”. Balloons aren’t just helium filled goodies that all of us love to play with from a young age. Well, they are that, but, they’re also a concept applied to the memory management of virtual machines, which “inflate” the amount of memory available to them by requesting more from a hypervisor during their lifetime (that they might also return). In Linux, the same concept is applied to the migration of virtual machines, which can use the virtio-balloon abstraction over the virtio bus (a hypervisor communications channel) to transfer “guest unused pages to the host so that they can be skipped to migrate in live migration”. One of the patches in his version 3 series (patch number 3 of 4), entitled “mm: add in[t]erface to offer info about unused pages” had some detailed discussion with Michael S. Tsirkin commenting on better documentation and Andrew Morton suggesting that it might be better for the code to live in the virtio-balloon driver rather than being made too generic as its use case is very targeted.

Elena Reshetova continued her work toward conversion of Linux kernel subsystems to her newer “refcount” explicit reference counting API with a posting entitled “net subsystem refcount conversions”.

Suzuki K Poulose posted a bunch of patches implementing support for detection and reporting of new ARMv8.3 architecture features, including one patch that was entitled “arm64: v8.3: Support for Javascript conversion instruction” (which really means a new double precision float to integer conversion instruction that will likely be used by high performance JavaScript JITs…). He also posted “arm64: v8.3: Support for weaker release consistency”. The new revision of the architecture adds new instructions to “support Release Consistent processor consistent (RCpc) model, which is weaker than the RCsc [Release Consistent sequential consistency] model”. Listeners are encouraged to read the C++ memory model and other fascinating bedtime literature for much more detail on the available RC options.

Markus Mayer (Broadcom) posted “Basic divider clock”, an RFC which aims to provide a generic means of expressing clock dividers that can be leveraged in an embedded system’s “DeviceTree”, for which he also posted bindings (descriptions to be used in creating these textual description “trees”). Stephen Boyd pushed back that the community had so far avoided generic implementations but instead preferred to keep things at the level of having drivers that target certain hardware IP from certain vendors based upon the compatible matching strings.

Michael S. Tsirkin posted “kvm: better MWAIT emulation for guests”. We have previously explained this patchset and the dynamics of MWAIT implementations. His goal for this patch is to handle guests that assume the presence of the (x86) MWAIT feature, which isn’t present on all x86 CPUs. If you were running (for example) MacOS inside a VM on an x86 machine, it would generally assume the presence of MWAIT without checking for it, because it’s present in all x86-based Apple Macs. Emulating MWAIT is useful in such situations.

Romain Perier posted “Replace PCI pool by DMA pool API”. As he notes in his posting, “The current PCI pool API are simple macro functions direct expanded to the appropriate dma pool functions. The prototypes are almost the same and semantically, they are very similar. I propose to use the DMA pool API directly and get rid of the old API”.

Daeseok Youn posted “staging: atomisp: use k{v}zalloc instead of k{v}alloc and memset”. Alan Cox replied “…please don’t apply this. There are about five other layers of indirection for memory allocators that want removing first so that the driver just uses the correct kmalloc/kzalloc/kv* functions in the right places”. Now does seem like a good time not to add more layers.

Peter Zijlstra posted various “x86 optimizations” that aimed to “shrink the kernel and generate better code”.

March 21, 2017 04:22 PM

March 20, 2017

Dave Airlie: how close to conformant is radv?

I spent some time staring into the results of the VK-GL-CTS test suite on radv, which contains the Vulkan 1.0 conformance tests.

In order to be conformant you have to pass all the tests on the mustpass list for the Vulkan version you want to conform to, from the branch of the test suite for that version.

The latest CTS tests for 1.0 are in the vulkan-cts-1.0.2 branch, and the mustpass list is in external/vulkancts/mustpass/1.0.2/vk-default.txt

Using some WIP radv patches in my github radv-wip-conform branch and the 1.0.2 test suite, today's results on my Tonga GPU are:

Test run totals:
Passed: 82551/150950 (54.7%)
Failed: 0/150950 (0.0%)
Not supported: 68397/150950 (45.3%)
Warnings: 2/150950 (0.0%)

That is pretty conformant (in fact it would pass as-is). However I need to clean up the patches in the branch and maybe figure out how to do some bits properly without hacks (particularly some semaphore wait tweaks), but that is most of the work done.
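Those totals are worth a quick arithmetic check (a sketch, assuming, as the post implies, that “not supported” results don’t count against conformance and that warnings still count as passes):

```python
# Totals from the CTS run above.
total = 150950
passed = 82551
failed = 0
not_supported = 68397
warnings = 2

# Tests the driver actually exercised (assumption: "Not supported"
# results are excluded from the conformance pass/fail calculation).
exercised = total - not_supported
print(exercised)                                 # 82553
print(passed + warnings == exercised - failed)   # True: every exercised test passed
```

Under those assumptions, every test the driver ran either passed or produced only a warning, which is why the run would pass as-is.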

Thanks again to Bas and all other radv contributors.

March 20, 2017 07:26 AM

March 19, 2017

Pete Zaitcev: Standards for ARM computers in 2017

I wrote this 8 years ago, in 2009:

Until ARM comes up with a full computer instead of just a CPU, it's no contender in Linux server space.

So, how are things nowadays?

The final question had to do with cross-platform drivers. There is an interpreted executable format known as EFI Byte Code (EBC); drivers compiled to that format can run on multiple architectures. [...]

Graf asked whether drivers could, instead, be shipped as a multiple-architecture binary. Progress is being made in this direction, and EFI supports multiple binary formats.


P.S. Jon reminded me about SBSA, and indeed it's a solid advancement. But using EFI drivers is an idea so monstrously dumb that I don't even know what to say.

March 19, 2017 04:34 PM

Pete Zaitcev: Standards for ARM computers and Linaro

A year ago I posted the following comment at LWN in the context of excitement about Yet Another Little Linux Server:

If you buy an Atom or Geode barebones, it's guaranteed to boot a normal Fedora or Ubuntu. If you buy ARM, you're a hostage of vendor support. If that fails (which is the default), you're all alone in the bizarre maze of incompatible bootloaders and out-of-tree patches which quickly become obsolete. Until ARM comes up with a full computer instead of just a CPU, it's no contender in Linux server space. Instead, the vicious cycle of "make a product, patch something to boot on it, ship it, forget about it immediately" will continue forever, with publications hyping up the next wonderful widget and the platform going nowhere.

It looks like someone else figured it out, ergo Linaro. Unfortunately, they do not seem to be eager to create a real platform, but rather slap a veneer of something OpenFirmware-like on top of existing systems. Also, they are buddying with Ubuntu. So, a half-hearted effort and a top-down deal. But it's a step in the right direction.

March 19, 2017 04:27 PM

March 17, 2017

Andi Kleen: Intel Processor Trace resources

Intel Processor Trace (PT) can be used on modern Intel CPUs to trace execution. This page contains references for learning about and using Intel PT.

Basic information:


JTAG support

Other presentations


Research papers using PT (subset):


March 17, 2017 04:46 PM

March 15, 2017

Michael Kerrisk (manpages): man-pages-4.10 is released

I've released man-pages-4.10. The release tarball, the browsable online pages, and the Git repository for man-pages are all available online.

This release resulted from patches, bug reports, reviews, and comments from over 40 contributors. This release sees a large number of changes: over 600 commits changing around 160 pages. The changes include the addition of 11 pages, significant rewrites of 3 other pages, and enhancements to many other pages.

Among the more significant changes in man-pages-4.10 are the following:

March 15, 2017 05:09 AM

March 14, 2017

Kernel Podcast: Kernel Podcast for March 13th, 2017


In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc2 (including pre-enablement for Intel 5-level paging), VMA based swap readahead, and ongoing development ahead of the next cycle.

Linus Torvalds announced Linux 4.11-rc2. In his announcement, he said that the past week had been “fairly quiet” because “people are still looking for bugs and taking a breather after the merge window”. But he also noted that “we’ve got a healthy number of fixes in, and there’s some cleanup/prep patches for the upcoming 5-level page table support that I took after the merge window just to make the next merge window easier”.

Various fixes and updates have been posted against the previous rc1, over the past week, including an urgent fix from Matthew (Willy) Wilcox for his idr rewrite in 4.11 (freeing the correct IDA bitmap).

Geert Uytterhoeven posted “Build regressions/improvements in v4.11-rc1”. This compared build error/warning regressions and improvements between v4.11-rc1 and v4.10. According to Geert, the 4.11-rc1 kernel saw an increase of 19 build errors and 1108 warnings when compared to 4.10.


Jiri Slaby announced Linux 3.12.71, Greg Kroah-Hartman (GKH) announced 4.4.53, 4.9.14, and 4.10.2 (which started a conversation about git tags being stale that we will address in a moment). Greg took the opportunity of various stable kernel work to prod the i915 graphics driver team with a message entitled “The i915 stable patch marking is totally broken”.

Sebastian Andrzej Siewior announced the v4.9.13-rt12 preempt-rt “Real Time” kernel patch set, which has a known issue that “CPU hotplug got a little better but can deadlock”, suggesting you might not want to try that then.

Julia Cartwright announced 4.1.38-rt46.

Steven Rostedt announced the 3.18.48-rt53 stable release of the RT kernel. He also announced the 3.10.105-rt119 and 3.2.86-rt124 releases.

Jari Ruusu announced “loop-AES-v3.7k file/swap crypto package”, which is available on sourceforge at:

Andy Lutomirski sent out detailed notes (along with a followup with yet more explanation) of the Intel SGX (“Secure Enclave”) feature discussion that occurred at Kernel Summit and Linux Plumbers Conference last fall. The thread is called “SGX notes from KS/LPC”. In the thread, he explains what SGX is (a small region of virtual memory within a Linux process – known as a task inside the kernel – that is not visible to the host OS after the enclave is “launched”) and how it can be used to hide certain data from system administrators or providers – for example, cryptographic keys that one would rather were not compromised. SGX comes with a litany of new requirements upon the Operating System that Andy covers, along with some guidelines for how to expose this feature, and how to make it usable.

New sponsorship is providing kernel.org with various geo-diverse bare metal frontend systems in datacenters around the globe. Each of these (powerful) frontends provides read-only public access to git repositories and the public website. More information, including machine specifications, can be found here:

(this came to light because of a brief outage affecting the Newark, NJ mirror which was lagging behind other mirrors in picking up new git tags pushed, but one hopes that an official announcement and thanks was otherwise forthcoming)

Masahiro Yamada has been added as a Kbuild (co-)maintainer.

Intel 5-level paging

Kirill A. Shutemov posted version 4 of his “5-level paging” patch series that implements support for the la57 (57-bit virtual address space with x86-64 canonical addressing) feature on some future CPUs. We covered the underlying patch series before, explaining the benefit of a larger (virtual) address space, as well as the additional complexities required to implement backward compatibility (including new prctls to limit the virtual address space of certain legacy applications), and the lack (so far) of boot time switching between 4-and-5-level support, which is seen as important for the distros.

Linus responded by saying that he thought “we should just aim for this being in 4.12” as he didn’t “see any real reason to delay merging it”. After some discussion about whose tree to merge it through, it was decided (by Thomas Gleixner) that it could come in through the “-tip” x86 tree. Which resulted in Linus pulling a preparatory “5-level paging: prepare generic code” patch series from Kirill into 4.11 (even after the merge window had closed) to lay the groundwork for pulling the main feature into the next (4.12) cycle. This promptly broke PowerPC, which was promptly fixed by a followup patch. Following the merge of enabling support in 4.11, Kirill posted “5-level paging enabling for v4.12” which aims to complete the merge next cycle.

The earlier version 4 iteration of the patch series noted that the Xen hypervisor currently doesn’t support 5-level paging and thus CONFIG_XEN is disabled automatically when building CONFIG_X86_5LEVEL. It was pointed out by Andrew Cooper that runtime (boottime) switching between 4 and 5 level support would be required in order to provide a clean experience, especially until Xen Dom0 support is available. That boottime switching is on the existing todo list and presumably is going to land at some point.

Separately, Dmitry Safonov posted version 6 of a patch series entitled “Fix compatible mmap() return pointer over 4Gb” which has “some minor conflicts with Kirill’s set for 5-table paging”. Dmitry aims to solve a slightly different problem than Kirill’s PR_{SET,GET}_MAX_VADDR calls (which limit the virtual address ranges returned by mmap to avoid legacy programs breaking when suddenly able to receive much larger “Canonical Addresses” – in Intel parlance – than they were compiled with built-in and broken assumptions about once upon a time) insomuch as he is focused on 32-bit legacy syscalls on 64-bit x64 not returning memory above 4GB that cannot be used by older 32-bit code.

VMA based swap readahead

Ying Huang (Intel) posted an RFC (Request For Comments) entitled “mm, swap: VMA based swap readahead” in which he discussed the current kernel paging implementation for Virtual Memory Areas (VMAs) as well as how it could be improved to facilitate greater awareness of the in-memory access patterns of associated data by changing the corresponding readahead algorithm.

“Readahead” as a concept is what it sounds like. Locality (both spatial, in this case, as well as temporal, in other cases) of data means that when a memory access occurs, it is usually more likely than not that an access to a nearby memory location will soon follow (except in the case of pure random access workloads). Thus, the kernel contains support for preloading nearby data when performing various disk and memory operations. Examples include readahead of nearby disk blocks when loading filesystem data, and loading nearby disk blocks when reading pages back in from swap.
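The payoff of readahead under locality is easy to see in a toy model (illustrative only; the kernel’s real readahead algorithms are adaptive and far more sophisticated, and the window size here is purely hypothetical). On every cache miss we fetch the missed block plus the next few blocks:

```python
# Toy readahead model: a sequential scan over numbered "disk" blocks.
# On a miss we load the missed block plus `window` following blocks.

def count_misses(accesses, window):
    cache = set()
    misses = 0
    for block in accesses:
        if block not in cache:
            misses += 1
            cache.update(range(block, block + window + 1))
    return misses

sequential = list(range(100))
print(count_misses(sequential, 0))  # 100: every access misses with no readahead
print(count_misses(sequential, 7))  # 13: one miss per 8-block run
```

For a sequential workload, each miss now services a whole run of subsequent accesses; for a truly random workload the extra blocks fetched would mostly be wasted, which is exactly the trade-off readahead heuristics try to manage.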

VMAs (Virtual Memory Areas) are regions of memory managed by the Linux kernel. A running application (process), known as a “task” by the kernel, contains a large number of different VMAs which form its overall address space. You can see this by inspecting /proc/self/maps (replacing “self” with a process ID that you have access to). The output will show a series of memory regions representing various memory owned by the task. Memory that doesn’t represent files is known as “anonymous memory” and it is what is paged (swapped) out under memory pressure situations.
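You can poke at this yourself with a few lines of Python (Linux-specific, and it assumes /proc is mounted; the anonymous-mapping heuristic below — lines with no trailing pathname field — is a simplification):

```python
# Inspect this process's own VMAs via /proc/self/maps.
with open("/proc/self/maps") as f:
    maps = f.read().splitlines()

# Each line begins with a "start-end" virtual address range, followed by
# permissions, offset, device, inode, and (for file-backed VMAs) a path.
for line in maps[:5]:
    print(line)

# Mappings with no pathname field are anonymous memory -- the kind that
# gets paged out to swap under memory pressure.
anon = [line for line in maps if len(line.split()) <= 5]
print(f"{len(maps)} VMAs, of which {len(anon)} are anonymous")
```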

As Ying notes in his RFC, the “original swap readahead algorithm does readahead based on the consecutive blocks in [the] swap device” but “the consecutive blocks in [the] swap device just reflect the order of page reclaiming” and not necessarily “the access sequence in RAM”. His patch series aims to change this by teaching the readahead algorithm about VMAs and how to bias the readahead to sequentially walk through the address space of a task (process), reading those parts of the swap space containing this data rather than simply walking through swap sequentially.

But wait! There’s more! Ying also posted a separate patch series entitled “THP swap: Delay splitting THP during swapping out”, which does what it sounds like it would do. THP (Transparent Huge Pages) is a technology used by the Linux kernel to dynamically allocate “huge” (optionally very large – up to 1GB in size, but in this case 2MB) pages of memory to contiguous regions of virtual memory address space, especially those backing shared large memory data (even including a huge zero page used for virtual machine RAM at boot). THP reduces pressure on limited CPU internal microarchitectural caches known as TLBs (Translation Lookaside Buffers) – as well as uTLBs at a lower level than the TLBs – which cache the translation performed by page table entries to physical or intermediate memory addresses. Reducing the number of TLB entries required to map regions of virtual memory reduces the number of times entries must be evicted and refilled by the underlying architecture during memory access operations.
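The TLB-pressure argument is easy to quantify (assuming the usual x86-64 page sizes):

```python
BASE_PAGE = 4 * 1024           # 4KB base page on x86-64
THP_2M = 2 * 1024 * 1024       # 2MB transparent huge page
HUGE_1G = 1024 * 1024 * 1024   # 1GB huge page

# One huge-page TLB entry replaces this many base-page entries:
print(THP_2M // BASE_PAGE)     # 512
print(HUGE_1G // BASE_PAGE)    # 262144
```

A single 2MB THP thus covers with one TLB entry what would otherwise take up to 512 entries, which is the whole point of keeping THPs intact for as long as possible.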

The existing Linux kernel THP code splits THPs back into smaller pages whenever they are swapped (paged) out to disk. Yet it turns out that this is particularly inefficient on contemporary systems in which secondary disk or NVMe storage has far greater bandwidth than a single high end core can saturate if forced to do this work. Ying’s patch instead delays this split and pushes entire THPs out to swap, allowing for larger writes and reads of contiguous memory out to the backing storage.

Ongoing Development

“David F” inquired about RAID mode support for Intel m.2 chipsets. These devices continue the recent-ish legacy of certain Intel storage devices providing dual modes of operation: as an AHCI device, and as a hardware RAID device operating in a proprietary mode for which no Linux drivers exist. David was quite concerned that the lack of a Linux driver was becoming particularly problematic on newer machines, which might not provide a means to switch into AHCI mode (supported by Linux). Christoph Hellwig was…unsympathetic…suggesting that the RAID mode “provides worse performance”, and that its implementation was questionable. He also had a series of other suggestions for what to do with these devices – those are less family friendly to repeat in this podcast.

Michal Hocko posted “kvmalloc” which is a generic replacement for the many “open coded kmalloc with vmalloc fallback instances in the tree”. k-and-vmalloc are two different means by which kernel code allocates memory. The former is used to obtain small allocations (on the order of a few pages – the minimal granule size operated on by the virtual memory subsystem of Linux on contemporary processors) that are also contiguous in physical memory. The latter is for larger allocations of strictly “virtual” memory – contiguous only when accessed using the underlying Memory Management Unit to perform a translation (this is usually automatic for kernel code, since the kernel runs with virtual memory of its own, just like user processes do, but it can be problematic if a driver would like to use this memory for certain hardware operations, such as DMA transfers). The generic wrapper aims to clean up the common case that kernel code just wants a chunk of memory and will try to allocate it with kmalloc, but fall back to the more generic vmalloc if that fails.

Christian Konig (AMD) posted “PCI: add resizeable BAR infrastructure” (version 2, and later an update with some fixes in a version 3 also), which aims to add support to the kernel for a PCI SIG (Peripheral Component Interconnect Special Interest Group) ECN (Engineering Change Notice) that enables BARs (Base Address Registers) to be resized at runtime. PCI(e) BARs are mapping windows (apertures) in the system memory map that are used to talk to hardware add-on cards (or built-in devices within modern platforms) by determining where the device’s memory will live. Traditionally, BARs were fixed size and so on architectures not relying upon firmware configuration of underlying BARs, Linux would have to determine where to place certain PCI(e) resources at boot/hotplug time by checking how much memory a device needed to expose and programming the BARs. With the new extension comes the possibility to increase the size of a BAR to map larger regions of memory. This is a useful feature for graphics cards, which may want to map very large regions of memory. A subsequent patch wires up the AMD GPU driver to use this.

Javi Merino posted “Documentation/EDID fixes”, which aims to correct some broken assumptions in the kernel documentation for EDID (Extended Display Identification Data – the data provided over e.g. I2C from a VGA monitor when the cable is connected). The examples didn’t build correctly due to existing assumptions. This author is probably one of few people who always thinks of EDID and the interaction with Xorg every time he plugs in an external projector to his laptop.

David Howells posted “net: Work around lockdep limitation in sockets that use sockets” in which he corrected an erroneous assumption in the kernel “lockdep” (lock dependency checker) that prevented it from correctly identifying bad call chains involving TCP sockets when there exists a dependency between sockets created purely in the kernel and sockets created purely in userspace (which the lockdep could not distinguish between due to its use of broad lock classes). The AFS (Andrew File System) was generating a false lockdep warning because it was exposing such an implied dependency.

Charles Keepax posted “genirq: Add support for nested shared IRQs” to address an audio CODEC that also acts as an interrupt controller. The details sounded rather painful. Yet it was “fairly easy” to fix.

Steven Rostedt posted “tracing: Allow function tracing to start earlier in boot up”, which does roughly what it says on the can, “moving tracing up further in the boot process”, “right after memory is initialized”. He noted that his RFC was a start and could be further improved upon.

Matthew (Willy) Wilcox posted an RFC entitled “memset_l and memfill” that provides a generic means for architectures to provide optimized functions that “fill regions of memory with patterns larger than those contained in a single byte”. This is intended to be used by zram as well as other code.

Paul McKenney noticed some of his RCU torture tests failing during hotplug early in boot due to calls to smp_store_cpu_info during that operation. The call is not safe because it indirectly invokes schedule_work() which wants to use RCU prior to RCU being enabled as a side effect of dealing with an unstable TSC (Time Stamp Counter) on the afflicted CPU. Peter Zijlstra had an opinion on hotplug, and also a patch to handle this situation.

Vlad Zakharov posted “update timer frequencies”, which inquired about the best means to implement a cpufreq driver for ARC CPUs. These have the special property that “ARC timers (including those are used for timekeeping) are driven by the same clock as ARC CPU core(s)”. Yup, they change frequency according to the current CPU frequency. Which as Thomas Gleixner noted in response is “broken by design and you really should go and tell your hardware folks to fix that”. He added that “It’s well known for more than TWO decades that changing the frequency of the timekeeper clocksource is a complete disaster”.

Thomas Gleixner posted “kexec, x86/purgatory: Cleanup the unholy mess”, which aims to address his opinion that “the whole machinery is undocumented and lacks any form of forward declarations” (of variables which were previously global but had been made static). Purgatory is a special piece of code which is provided by the kernel but runs in the interim period between the kernel crashing (or beginning kexec) and the new crash or kexec kernel that is then subsequently loaded – this is what performs the load and exec.

March 14, 2017 06:36 PM

March 13, 2017

James Morris: LSM mailing list archive: this time for sure!

Following various unresolved issues with existing mail archives for the Linux Security Modules mailing list, I’ve set up a new archive here.

It’s a mailman mirror of the vger list.

March 13, 2017 10:20 PM

March 09, 2017

James Morris: Hardening the LSM API

The Linux Security Modules (LSM) API provides security hooks for all security-relevant access control operations within the kernel. It’s a pluggable API, allowing different security models to be configured during compilation, and selected at boot time. LSM has provided enough flexibility to implement several major access control schemes, including SELinux, AppArmor, and Smack.

A downside of this architecture, however, is that the security hooks throughout the kernel (there are hundreds of them) increase the kernel’s attack surface. An attacker with a pointer overwrite vulnerability may be able to overwrite an LSM security hook and redirect execution to other code. This could be as simple as bypassing an access control decision via existing kernel code, or redirecting flow to an arbitrary payload such as a rootkit.

Minimizing the inherent security risk of security features is, I believe, an essential goal.

Recently, as part of the Kernel Self Protection Project, support for marking kernel pages as read-only after init (ro_after_init) was merged, based on grsecurity/pax code. (You can read more about this in Kees Cook’s blog here). In cases where kernel pages are not modified after the kernel is initialized, hardware RO page protections are set on those pages at the end of the kernel initialization process. This is currently supported on several architectures (including x86 and ARM), with more architectures in progress.

It turns out that the LSM hook operations make an ideal candidate for ro_after_init marking, as these hooks are populated during kernel initialization and then do not change (except in one case, explained below). I’ve implemented support for ro_after_init hardening for LSM hooks in the security-next tree, aiming to merge it to Linus for v4.11.

Note that there is one existing case where hooks need to be updated, for runtime SELinux disabling via the ‘disable’ selinuxfs node. Normally, to disable SELinux, you would use selinux=0 at the kernel command line. The runtime disable feature was requested by Fedora folk to handle platforms where the kernel command line is problematic. I’m not sure if this is still the case anywhere. I strongly suggest migrating away from runtime disablement, as configuring support for it in the kernel (via CONFIG_SECURITY_SELINUX_DISABLE) will cause the ro_after_init protection for LSM to be disabled. Use selinux=0 instead, if you need to disable SELinux.

It should be noted, of course, that an attacker with enough control over the kernel could directly change hardware page protections. We are not trying to mitigate that threat here — rather, the goal is to harden the security hooks against being used to gain that level of control.

March 09, 2017 10:52 AM

Rusty Russell: Quick Stats on zstandard (zstd) Performance

Was looking at using zstd for backup, and wanted to see the effect of different compression levels. I backed up my (built) bitcoin source, which is a decent representation of my home directory, but only weighs in at 2.3GB. zstd -1 compressed it 71.3%, zstd -22 compressed it 78.6%, and here’s a graph showing runtime (on my laptop) and the resulting size:

zstandard compression (bitcoin source code, object files and binaries) times and sizes

For this corpus, sweet spots are 3 (the default), 6 (2.5x slower, 7% smaller), 14 (10x slower, 13% smaller) and 20 (46x slower, 22% smaller). Spreadsheet with results here.

March 09, 2017 12:53 AM

March 08, 2017

Matthew Garrett: The Internet of Microphones

So the CIA has tools to snoop on you via your TV and your Echo is testifying in a murder case and yet people are still buying connected devices with microphones in and why are they doing that the world is on fire surely this is terrible?

You're right that the world is terrible, but this isn't really a contributing factor to it. There's a few reasons why. The first is that there's really not any indication that the CIA and MI5 ever turned this into an actual deployable exploit. The development reports[1] describe a project that still didn't know what would happen to their exploit over firmware updates and a "fake off" mode that left a lit LED which wouldn't be there if the TV were actually off, so there's a potential for failed updates and people noticing that there's something wrong. It's certainly possible that development continued and it was turned into a polished and usable exploit, but it really just comes across as a bunch of nerds wanting to show off a neat demo.

But let's say it did get to the stage of being deployable - there's still not a great deal to worry about. No remote infection mechanism is described, so they'd need to do it locally. If someone is in a position to reflash your TV without you noticing, they're also in a position to, uh, just leave an internet connected microphone of their own. So how would they infect you remotely? TVs don't actually consume a huge amount of untrusted content from arbitrary sources[2], so that's much harder than it sounds and probably not worth it because:


Seriously your phone is like eleven billion times easier to infect than your TV is and you carry it everywhere. If the CIA want to spy on you, they'll do it via your phone. If you're paranoid enough to take the battery out of your phone before certain conversations, don't have those conversations in front of a TV with a microphone in it. But, uh, it's actually worse than that.

These days audio hardware usually consists of a very generic codec containing a bunch of digital→analogue converters, some analogue→digital converters and a bunch of io pins that can basically be wired up in arbitrary ways. Hardcoding the roles of these pins makes board layout more annoying and some people want more inputs than outputs and some people vice versa, so it's not uncommon for it to be possible to reconfigure an input as an output or vice versa. From software.

Anyone who's ever plugged a microphone into a speaker jack probably knows where I'm going with this. An attacker can "turn off" your TV, reconfigure the internal speaker output as an input and listen to you on your "microphoneless" TV. Have a nice day, and stop telling people that putting glue in their laptop microphone is any use unless you're telling them to disconnect the internal speakers as well.

If you're in a situation where you have to worry about an intelligence agency monitoring you, your TV is the least of your concerns - any device with speakers is just as bad. So what about Alexa? The summary here is, again, it's probably easier and more practical to just break your phone - it's probably near you whenever you're using an Echo anyway, and they also get to record you the rest of the time. The Echo platform is very restricted in terms of where it gets data[3], so it'd be incredibly hard to compromise without Amazon's cooperation. Amazon's not going to give their cooperation unless someone turns up with a warrant, and then we're back to you already being screwed enough that you should have got rid of all your electronics way earlier in this process. There are reasons to be worried about always listening devices, but intelligence agencies monitoring you shouldn't generally be one of them.

tl;dr: The CIA probably isn't listening to you through your TV, and if they are then you're almost certainly going to have a bad time anyway.

[1] Which I have obviously not read
[2] I look forward to the first person demonstrating code execution through malformed MPEG over terrestrial broadcast TV
[3] You'd need a vulnerability in its compressed audio codecs, and you'd need to convince the target to install a skill that played content from your servers

comment count unavailable comments

March 08, 2017 01:30 AM

March 06, 2017

Kernel Podcast: Kernel Podcast for March 6th, 2017


In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc1, rants about folks not correctly leveraging linux-next, the remainder of this cycle’s merge window pulls, and announcements concerning end of life for some features.

Linus Torvalds announced Linux 4.11-rc1, noting that “two weeks have passed, the merge window is over, and 4.11 has been tagged and pushed out.” He notes that the latest kernel cycle is set to be “on the smallish side”, but only in comparison with the most recent two cycles, which have been significantly larger than typical; 4.11 has a similar number of commits to 4.1, 4.3, 4.5, and 4.7 before it. With the release of 4.11-rc1 comes the closing of the “merge window” (the period of time during which disruptive changes are allowed into the kernel, prior to the multi-week stabilization and Release Candidate cycle).

We covered most of the major pulls for 4.11 in last week’s podcast. But there were a few more stragglers. Here’s a sample of those:

J. Bruce Fields posted “nfsd changes for 4.11” which included two semantic changes: NFS security labels are “now off by default” and a “new security_label export flag reenables it per export” since this “only makes sense if all your clients and servers have similar enough selinux policies”. Secondly, NFSv4/UDP support is off because “It was never really supported, and the spec explicitly forbids it. We only ever left it on out of laziness; thanks to Jeff Layton for finally fixing that.”

Anna Schumaker followed up a little later with “Please pull NFS client changes for Linux 4.11”, which includes a fix for a memory leak in “_nfs4_open_and_get_state”, as well as various other fixes and new features.

Matthew (Willy) Wilcox posted “Please pull IDR rewrite” which seeks to harmonize the IDR (“Small id to pointer translation service avoiding fixed sized tables”) and in-kernel radix tree code. According to Willy, merging the two codebases “lets us share the memory allocation pools, and results in a net deletion of 500 lines of code. It also opens up the possibility of exposing more of the features of the radix tree to users of the IDR”.

Will Deacon posted “arm64 fixes for -rc1” of which the “main fix here addresses a kernel panic triggered on Qualcomm QDF2400 due to incorrect register usage in an erratum workaround introduced during the merge window”.

Michael S. Tsirkin posted “vhost: cleanups and fixes”, of which there were very few for this kernel cycle.

Nicholas A. Bellinger posted “target updates for v4.11-rc1”, which includes support for “dual mode (initiator + target) qla2xxx operation”, and a number of other fixes and improvements. He pre-warns that things are “shaping up to be a busy cycle for v4.12 with a new fabric driver (efct) in flight, and a number of other patches on the list being discussed”.

Rafael J. Wysocki posted “Additional ACPI update for v4.11-rc1”, which includes a fix for “an apparent, but actually artificial, resource conflict between the ACPI NVS memory region and the ACPI BERT (Boot Error Record Table)”.

Jens Axboe posted “Block fixes for 4.11-rc1”, which includes a “collection of fixes for this merge window, either fixes for existing issues, or parts that were waiting for acks to come in”. These include a performance fix for the allocation of nvme queues on the right node, along with others.

Miklos Szeredi posted “fuse update for 4.11” and “overlayfs update for 4.11”. The latter “allows concurrent copy up of regular files eliminating [the] potential problem” of (previously) serialized copy ups taking a long time.

Bjorn Helgaas posted “PCI fixes for v4.11”, including a couple of fixes for bugs introduced during code refactoring.

Dan Williams posted “libnvdimm fixes for 4.11-rc1”, which includes a fix for the generation of “nvdimm namespace label” (metadata) checksums, which “Linux was not calculating correctly, leading to other environments rejecting the Linux label”.

Helge Deller posted “parisc updates for 4.11”, noting that there was “nothing really important” in this particular cycle to pull in.

James Bottomley posted “final round of SCSI updates for the 4.10+ merge window”, which “is the set of stuff that didn’t quite make the initial pull and a set of fixes for stuff which did”.

Radim Krcmar posted “Second batch of KVM changes for 4.11 merge window”, which includes a number of fixes for PPC and x86.

David Miller posted “Networking”, including many fixes.

A linux-next rant

In his 4.11-rc1 announcement, Linus noted that “it *does* feel like there was more stuff that I was asked to pull than was in linux-next. That always happens, but seems to have happened more now than usually. Comparing to the linux-next tree at the time of the 4.10 release, almost 18% of the non-merge commits were not in Linux-next. That seems higher than usual, although I guess Stephen Rothwell has actual numbers from past merges.” Let’s break down what Linus said a little. Stephen Rothwell is an (overworked) kernel hacker based in Australia who produces a (daily, outside of the merge window) kernel tree (and accompanying test infrastructure, patch tracking, and announcement mechanisms) known as “linux-next”. Its raison d’être is to be the proving ground for new features before they are sent to Linus for merging.

Typically, major new features soak in linux-next for a cycle prior to the one in which they are actually merged (so features landing in 4.11 would have been largely complete and tested via -next during 4.10). Linux kernel development cycles are generally on the order of about two months, so this isn’t an unreasonably long period of time for disruptive changes to languish. Contrast this with the multi-year wait that used to happen back when Linux had an odd/even minor version cycle in which even numbers (2.2, 2.4, 2.6) were the “supported” releases and the odd numbers (2.1, 2.3, 2.5) were development ones. That seems like ancient history now, but it’s really only in the past decade of git that kernel development tooling and community have reached a level of sophistication such that the ship can keep moving while the engine is replaced.

Linus noted that there are a “few different classes” of changes that didn’t come to him following a previous test in linux-next. Those include fixes (which is “obviously ok and inevitable”), a specific example (statx) of a longstanding issue that has been ongoing for years (to which he said, “Yeah, I’ll allow this one too”), and the “quite noticeable <linux/sched.h> split up series” which “had real reasons for late inclusion”. Finally, he includes the class of subsystems such as “drm, Infiniband, watchdog and btrfs”, which he “found rather annoying this merge window”. He reminded folks of the “linux-next sanity checks” and that if folks ignore them “you had better have your own sanity checks that you replaced them with” rather than “screw all the rules and processes we have in place to verify things”.

The bottom line? Linus says “You people know who you are. Next merge window I will not accept anything even remotely like that. Things that haven’t been in linux-next will be rejected, and since you’re already on my sh*t-list you’ll get shouted at again”. And nobody enjoys being shouted at by Linus. Well, almost nobody. There do seem to be a few people who perversely enjoy it.


A couple of questions of code maintenance arose this week. The first was from Natale Patriciello, who asked whether UML (User Mode Linux) is “not maintained anymore?”, citing a few bugs that haven’t been resolved in some time. There were no followups at the time of this recording. The second question came in the form of an RFC (Request For Comments) patch entitled “remove support for AVR32 architecture” from Hans-Christian Noren Egtvedt. He noted that AVR32 is “not keeping up with the development of the kernel”, “shares so much of the drivers with Atmel ARM SoC”, and that “all AVR32 AP7 SoC processors are end of lifed from Atmel (now Microchip)”. This did seem like a fairly compelling set of reasons to kill it, and others agreed. It means that unless someone comes forward soon to maintain AVR32 (along with the associated GCC toolchain and other distribution pieces), its days in the upstream Linux kernel are numbered – with removal probably coming in 4.12.

Sebastian Andrzej Siewior announced Linux v4.9.13-rt11, which includes a fix for a previous fix (allowing the previous lockdep fix to compile on UP).


Logan Gunthorpe posted “New Microsemi PCI Switch Management Driver”, which is in its 7th revision. The RFC (Request for Comments) “proposes a management driver for Microsemi’s Switchtec line of PCI switches. This hardware is still looking to be used in the Open Compute Platform”. Logan notes that “Switchtec products are compliant with the PCI specifications and are supported today with the standard in-kernel driver. However, these devices also expose a management endpoint on a separate PCI function address which can be used to perform some advanced operations”.

Ongoing Development

Michael S. Tsirkin continued his work on “vfio error recovery: kernel support” with version 4 of the patch series, which seeks to do more than simply ignore non-fatal PCIe AER (Advanced Error Reporting) errors that hit devices passed via VFIO into a guest Virtual Machine. Currently, only fatal errors (which cause a PCIe link reset) are reported – they stop the guest. In his summary email, Michael notes that his goal is to handle non-fatal errors by reporting them to the guest and having it handle them. And rather than surprising existing code, he calls out under “issues” that “this behavior should only be enabled with new userspace, old userspace should work without changes”. By “userspace” he means the code driving VFIO, which might be a QEMU process that is backing a KVM virtual machine context, or a container, or merely a bare metal userspace process that is using VFIO directly.

Johannes Weiner posted “mm: kswapd spinning on unreclaimable nodes – fixes and cleanups” in which he notes a previous posting from Jia He that he (and the team at Facebook) have reproduced. In the problem scenario, the kernel’s kswapd (swap space daemon) for a given (memory) node spins indefinitely at 100% CPU usage when there are absolutely no reclaimable pages (granules of the smallest size of memory that can be managed by Linux and the underlying hardware), yet the “condition for backing off is never met”. This results in kswapd busy-looping forever. In his patches, Johannes changes reclaim behavior so that kswapd will eventually really back off after failing 16 times (the same magic number of times we try during an OOM “Out Of Memory” situation) as defined by MAX_RECLAIM_RETRIES. He includes various examples.
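
The shape of the fix is easy to sketch. Below is a minimal userspace model of a bounded reclaim retry loop, not the actual mm code; the callback type and demo functions are invented for illustration:

```c
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16	/* same magic number as the OOM path */

/* A reclaim attempt returns the number of pages it freed. */
typedef unsigned long (*reclaim_fn)(void);

/* Keep attempting reclaim, but genuinely back off after 16 consecutive
 * failed attempts instead of busy-looping forever. */
static bool reclaim_with_backoff(reclaim_fn try_reclaim)
{
	int no_progress_loops = 0;

	while (no_progress_loops < MAX_RECLAIM_RETRIES) {
		if (try_reclaim() > 0)
			return true;	/* made progress */
		no_progress_loops++;
	}
	return false;	/* node is unreclaimable: stop spinning */
}

/* Demo callbacks standing in for real reclaim outcomes. */
static unsigned long reclaim_nothing(void) { return 0; }
static unsigned long reclaim_one_page(void) { return 1; }
```

The buggy behavior corresponds to dropping the loop bound: with no reclaimable pages the loop would never terminate.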

Len Brown posted “cpufreq: Add the “” cmdline option. This is a corollary to “” and comes about for similar reasons for the purpose of testing. This author wonders aloud whether this will allow for buggy platforms that don’t support CPPC (Collaborative Processor Performance Control) to easily disable this at runtime too.

Aleksey Makarov posted “printk: fix double printing with earlycon”. On ACPI compliant platforms (including ARM servers), the SPCR (“Serial Port Console Redirection”) table provides information about the serial console UART that the kernel should be using, rather than having the user provide memory register addresses and baud rates on the kernel command line. This is a feature which is generally useful beyond ARM systems (although most x86 systems follow the traditional “PC” UART design). Prior to this fix, the kernel would double print output if given a “console=” and “earlycon”.

Minchan Kim posted “make try_to_unmap simple”, which aims to remove some of the (apparently somewhat gratuitous) complexity in the return value of this function. Currently it can return SWAP_SUCCESS, SWAP_FAIL, SWAP_AGAIN, SWAP_DIRTY, and SWAP_MLOCK. But Minchan feels that it can simply be a boolean return value, removing the latter three of those codes.
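
The proposed collapse can be sketched like so; the SWAP_* names mirror the kernel's, but these functions are illustrative stand-ins rather than the real try_to_unmap():

```c
#include <stdbool.h>

/* Old convention: five possible outcomes for callers to juggle. */
enum ttu_ret { SWAP_SUCCESS, SWAP_FAIL, SWAP_AGAIN, SWAP_DIRTY, SWAP_MLOCK };

/* What most callers actually do with the old return value. */
static bool unmap_succeeded_old(enum ttu_ret ret)
{
	return ret == SWAP_SUCCESS;
}

/* New convention: answer the only question callers have directly –
 * "is the page no longer mapped?" */
static bool try_to_unmap_new(int mapcount_after)
{
	return mapcount_after == 0;
}
```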

Matthew Gerlach (Intel) posted “Altera Partial Reconfiguration IP”, which adds support to the kernel’s (Alan Tull’s) “fpga-mgr” driver for the “Altera Partial Reconfiguration IP”. Partial Reconfiguration (sometimes known as “PR” in the reconfigurable logic community) allows an FPGA (Field Programmable Gate Array)’s logic fabric to be reconfigured in smaller than whole regions. This (for example) would allow a closely coupled datacenter (Xeon) processor to continue to drive certain FPGA contained IP while other IP were being replaced dynamically. If one were to couple this with support in OpenStack Nomad or Kubernetes for dynamic reconfiguration at VM/container setup it would begin to enable various use cases for the mainstream datacenter around FPGA acceleration.

Andi Kleen posted “pci: Allow lockless access path to PCI mmconfig”. “mmconfig” refers to the memory mapped configuration region used by contemporary PCIe devices during enumeration and configuration. This is a kind of out-of-band mechanism by which the kernel can talk to PCIe devices in a fully standards compliant means prior to having configured them. Intel processors include many “PCIe” devices that are in fact a logical means of expressing so called “uncore” non-compute features on the processor SoC. They’re not real PCIe devices but appear to the kernel as such. This wonderful abstraction comes with some overhead cost, especially when the kernel spends time grabbing the “pci_cfg_lock” which it actually doesn’t need to hold, according to Andi.

Jarkko Sakkinen posted version 3 of “in-kernel resource manager”, which adds support to the kernel for “TPM spaces that provide an isolated execution context for transient objects and HMAC policy sessions”.

Tomas Winkler posted a question about what the community considered to be the “correct usage of arrays of variable length within [the] Linux kernel”. The replies generally included language to the effect of “don’t”. Both for reasons of general language ugliness, and also because (especially in the case of local variables) the Linux kernel’s fixed (and also small) stack size raises serious potential for stack overflow if one is not careful. There was a suggestion that the kernel should be built with a compiler option to disallow VLAs, but that this would require various code to be fixed first.
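
The stack concern is easy to demonstrate in miniature. Here is a hedged userspace sketch of the risky pattern and its fixed-size replacement (BUF_MAX and the function names are invented for illustration):

```c
#include <stddef.h>
#include <string.h>

#define BUF_MAX 256	/* arbitrary illustrative cap */

/* Risky: stack usage scales with n, which may be attacker-influenced;
 * on a small fixed-size kernel stack this can overflow silently. */
static size_t copy_with_vla(const char *src, size_t n)
{
	char buf[n];		/* VLA: unbounded stack growth */

	memcpy(buf, src, n);
	return sizeof(buf);	/* sizeof a VLA is evaluated at runtime */
}

/* Preferred: a fixed upper bound, with oversized requests rejected. */
static int copy_fixed(const char *src, size_t n)
{
	char buf[BUF_MAX];

	if (n > sizeof(buf))
		return -1;	/* would have overflowed: refuse */
	memcpy(buf, src, n);
	return 0;
}
```

In userspace an oversized VLA merely risks a crash; on a kernel stack it can silently overwrite adjacent data, which is why the replies said “don’t”.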

March 06, 2017 12:53 AM

March 02, 2017

Pete Zaitcev: PTG, Sheraton, Atlanta

Serious reports trickle in (one, two), but on the lighter side, how was the venue? It's on the other side of the downtown, you know.

Bum incursions were very mild. Most went to the coffee area at the 1st floor, under the lobby level of Sheraton. One was remarkable though. I saw him practicing an unusual style of fighting on the sidewalk, with deep squats and wild swings - probably one of them prison styles. Very impressive, and also somewhat disturbing since all I have for him is checking with feet as well as I could, then move in for grappling phase, and then it's luck... Once done with his routine, he proceeded right past the coffee into the 2 rooms where some other org was meeting (not OpenStack) and started begging food off the hotel workers setting up tables. I heard him claiming that he was very hungry. Seemed super energetic and powerful a few minutes prior lol. But as much as I know, no laptops were stolen at the PTG (unlike e.g. OLS and UDS). Only goes to show that the main hazard is the venue staff and bums are more of an amusement, unless it's a really rough area.

March 02, 2017 01:51 AM

February 28, 2017

Kernel Podcast: Kernel Podcast for Feb 27th, 2017


In this week’s kernel podcast: the merge window for kernel 4.11 is open and patches are flying into Linus’s inbox, fixing NUMA node determination at runtime, Virtual Machine Aware Caches, Advisory Memory Allocations, and a non-fixed TASK_SIZE to bring excitement to your life. We will have this, and a summary of ongoing development in this week’s Linux Kernel podcast.

The merge window (period of time during which disruptive changes are allowed to be “merged” – incorporated into Linus’s official git tree – prior to a multi-week stabilization and Release Candidate cycle) for Linux 4.11 is currently open. This means that the most recent official kernel remains Linux 4.10. Meanwhile, many “pull requests” and merges are in flight for various kernel subsystems planning updates in 4.11. These include:

For a detailed summary of current merge window pulls and patches, consult this week’s Linux Weekly News at (Thursday).

Geert Uytterhoeven posted a summary of “Build regressions/improvements in v4.10”. These show an increase in build errors and warnings vs the previous 4.9 kernel cycle. He posted a list of configs used, the error and warning messages, and thanked the “linux-next team for providing the build service”.

Pavel Machek has been posting about various problems running 4.10 kernels. In one instance, he saw a corrupted stack that implied a double call to “startup_32_smp” (the secondary CPU boot method on Intel x64 Architecture). This led Josh Poimboeuf to ponder whether the GCC in use was somehow bad.


Greg Kroah-Hartman announced Linux 4.4.52, 4.9.13, and 4.10.1. Ben Hutchings announced Linux 3.16.41, and 3.2.86.

Stephen Hemminger announced iproute2-4.10, including support for “new features in Linux 4.10”. Amongst those new features are “enhanced support for BPF [Berkley Packet Filter], VRF [Virtual Routing and Forwarding], and Flow based classifier (flower)”. The latest version is available here:

Karel Zak announced util-linux v2.29.2, including a fix for a (nasty) “su” security issue, otherwise documented in CVE-2017-2616. According to Karel, it is “possible for any local user to send SIGKILL to other processes with root privileges. To exploit this, the user must be able to perform su with a successful login. SIGKILL can only be sent to processes which were executed after the su process. It is not possible to send SIGKILL to processes which were already running”. A fix entitled “properly clear child PID” against “su” is included among the fixes listed.

Lucas De Marchi announced kmod 24, which includes enhanced support for kernel module dependency loop detection:

Junio C Hamano announced git version 2.12.0:

Con Kolivas announced his Linux-4.10-ck1 MuQSS (Multiple Queue Skiplist Scheduler) version 0.152. More details at:

Ove Kent Karlsen has been performing various Linux gaming experiments. They posted links to YouTube videos showing results with “Doom 3”, which can be found here:

NUMA node determination

Dou Liyang (Fujitsu) posted several revisions of a patch series entitled “Revert works for the mapping of cpuid <-> nodeid”. This is intended to clean up the process by which (Intel x64 Architecture) systems enumerate the mapping of physical processor IDs to NUMA (Non-Uniform Memory Architecture) multi-socket “node” IDs. Conventionally, Linux uses the MADT (Multiple APIC Description Table – otherwise known as the “APIC” table for legacy reasons) ACPI table to map processors to their “Local APIC ID” (the ID of the core connected to the Intel APIC interrupt controller’s LAPIC CPU interface). It then maps these to NUMA nodes using the _PXM node ID in the ACPI DSDT (Differentiated System Description Table) and determines NUMA topology using the SRAT (Static Resource Affinity Table) and SLIT (System Locality Information Table). But this is fragile. Firmware developers are known to make mistakes on occasion, and these have included “duplicated processor IDs in DSDT”, and having the “_PXM in DSDT…inconsistent with the one in [the] MADT”. For this reason, Dou seeks to move the proximity discovery into the system’s hotplug path by reverting two previous commits. Xiaolong Ye (Intel) said he would test these and followup.

As a footnote, it’s worth adding that modern processors have a very loose notion of a “physical” core, since they usually (internally) support dynamic remapping of true physical cores to the IDs exposed even to system programmers. This affords the illusion of contiguously numbered processors, and prevents an easy analysis of binning and yield characteristics. It’s one of the reasons that processors such as Intel’s use various mapping schemes in order to determine NUMA node proximity. But one should never assume that any information given about a processor in any table reflects reality other than as a microprocessor company wanted you to perceive it.

Virtual Machine Aware Caches

Shanker Donthineni (Codeaurora) posted “arm64: Add support for VMID aware PIPT instruction cache”. Caches on the ARMv8 architecture are defined to be PIPT (Physically Indexed, Physically Tagged) from a software perspective (although the underlying implementation might be different – for example, you could index virtually with VIPT underneath a PIPT facade if you implemented expensive logic for automatic homonym detection). The ARMv8.2 specification allows “VMID aware PIPT” which means a cache is PIPT but aware of the existence of Virtual Machine IDs (VMIDs), which might form part of the cache entry. Will Deacon responded that the approach “may well cause problems for KVM with non-VHE [Virtual Host Extension – the ability to run “type 2″ hypervisors with split page tables for the kernel and userspace, as opposed to non-VHE implemented on original ARMv8.0 machines in which a shim running with its own page tables is required for KVM] because the host VMID is different from the guest VMID, yet we assume that I-cache invalidation by the host *will* affect the guest when, for example, invalidating the I-cache for pages holding the guest kernel Image”. He noted that he had some other patches in flight that he would post soon (for 4.12).

Advisory Memory Allocations in real life

Shaohua Li (Facebook) posted “mm: fix some MADV_FREE issues”. MADV_FREE is part of relatively recent(ish) kernel infrastructure to support advisory mmaps that the kernel may need to arbitrarily reclaim later when low on available memory. It’s the kind of thing that other Operating Systems (such as Windows) have done for many years (Windows will even dynamically enlarge its swap (paging) file in low memory situations). Facebook apparently like to use the (alternative) “jemalloc” userspace memory allocator and have found a number of issues when attempting to combine this with MADV_FREE flags to mmap. Shaohua notes that MADV_FREE cannot be used on a machine without swap enabled, that it actually increases memory pressure (due to page reclaim being biased against anonymous pages), and that there is no global accounting of such pages. The patches aim to address these issues.

Non-fixed TASK_SIZE

Martin Schwidefsky and Linus Torvalds had a back and forth discussion about “Using TASK_SIZE for kernel threads”. As kernel programmers know, kernel threads (“tasks”, or “kernel processes” – these show up in brackets in “ps” and “top”) don’t have an associated “mm” struct (they have no userspace). On s390, just to be different, TASK_SIZE is not fixed. It can actually be one of several values that are determined by reading a field in a task’s mm struct (context.asce_limit). This was causing very subtle breakage as the kernel indirected into a null structure which happened to contain a value very close to zero that kinda worked. Martin has a fix queued up but had some suggestions for changes to make to the kernel to avoid such a subtle issue in future. Linus was more convinced that s390 was just doing something that needed fixing.

Ongoing Development

Elena Reshetova (Intel) posted many patches converting various uses of the kernel’s “atomic_t” datatype as a reference counter over to the new “refcount_t”. As she notes, “[b]y doing this we prevent intentional or accidental underflows or overflows that can le[a]d to use-after-free vulnerabilities”. Examples include architecture and VM code fixes.
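
The hazard comes from plain counters wrapping on overflow. A minimal userspace model of the saturating behavior refcount_t provides (this sketches the idea only, not the kernel's actual implementation; the helper names are invented):

```c
#include <limits.h>
#include <stdbool.h>

/* Increment that refuses to wrap: a saturated counter stays saturated,
 * turning a potential use-after-free into a (comparatively safe) leak. */
static bool refcount_inc_checked(unsigned int *r)
{
	if (*r == UINT_MAX)
		return false;	/* saturated: refuse to wrap to 0 */
	(*r)++;
	return true;
}

/* Demo: attempt "times" increments starting from "start". */
static unsigned int bump(unsigned int start, int times)
{
	unsigned int r = start;
	int i;

	for (i = 0; i < times; i++)
		refcount_inc_checked(&r);
	return r;
}
```

With a bare atomic_t, an attacker who can force enough increments wraps the counter to zero and triggers a premature free; the saturating version simply stops counting.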

Xunlei Pang (Red Hat) posted version 2 of a patch entitled “x86/mce: Don’t participate in rendezvous process once nmi_shootdown_cpus() was made”. This aims to juggle a post-crash conundrum: system errors severe enough to generate an MCE (Machine Check Exception) should not be ignored (and thus the machine check handler should run in the kernel) but they might be generated during the process of actively taking a crash/kdump. The existing code might instead cause a panic on exit from the (old kernel provided) MCE handler. Borislav Petkov didn’t like some of the details of the patch. He wanted to also see explicit documentation as to the handling of MCEs.

Andy Lutomirski posted “KVM TSS cleanups and speedups”, which aims to refactor how the kernel handles guest TSS (Task Segment Selector) handling on Intel x64 Architecture systems. These are layered upon a series from Thomas Gleixner aimed at cleaning up GDT (Global Descriptor Table) use. He notes that there “may be a slight speedup, too, because they remove an STR [store] instruction from the VMX [Virtual Machine] entry path”.

Heikki Krogerus posted version 17 of a patch series implementing “USB Type-C Connector class” support. This is “meant to provide [a] unified interface to…userspace to present the USB Type-C ports in a system”. Your author is looking forward to trying this on his Dell XPS Skylake with USB-C.

Rob Herring posted a patch “Add SPDX license tag check for dts files and headers” to the kernel’s “” patch submission checking tool.

Finally this week, Lorenzo Pieralisi posted “PCI: fix config and I/O Address space memory mappings” intended to address the inconvenient fact that “ioremap” on 32-bit and 64-bit ARM platforms was failing to strictly comply with the PCI local bus specification’s “Transaction Ordering and Posting” requirements. These mandate that PCI configuration cycles (during startup or hotplug) and I/O address space accesses must be “non-posted” (in other words, they must always receive a write notification response and not be buffered arbitrarily). Lorenzo addresses this with a 20 part patch series that cleans this up.

February 28, 2017 07:53 AM

Kees Cook: security things in Linux v4.10

Previously: v4.9.

Here’s a quick summary of some of the interesting security things in last week’s v4.10 release of the Linux kernel:

PAN emulation on arm64

Catalin Marinas introduced ARM64_SW_TTBR0_PAN, which is functionally the arm64 equivalent of arm’s CONFIG_CPU_SW_DOMAIN_PAN. While Privileged eXecute Never (PXN) has been available in ARM hardware for a while now, Privileged Access Never (PAN) will only be available in hardware once vendors start manufacturing ARMv8.1 or later CPUs. Right now, everything is still ARMv8.0, which left a bit of a gap in security flaw mitigations on ARM since CONFIG_CPU_SW_DOMAIN_PAN can only provide PAN coverage on ARMv7 systems, but nothing existed on ARMv8.0. This solves that problem and closes a common exploitation method for arm64 systems.

thread_info relocation on arm64

As done earlier for x86, Mark Rutland has moved thread_info off the kernel stack on arm64. With thread_info no longer on the stack, it’s more difficult for attackers to find it, which makes it harder to subvert the very sensitive addr_limit field.

linked list hardening
I added CONFIG_BUG_ON_DATA_CORRUPTION to restore the original CONFIG_DEBUG_LIST behavior that existed prior to v2.6.27 (9 years ago): if list metadata corruption is detected, the kernel refuses to perform the operation, rather than just WARNing and continuing with the corrupted operation anyway. Since linked list corruption (usually via heap overflows) is a common method for attackers to gain a write-what-where primitive, it’s important to stop the list add/del operation if the metadata is obviously corrupted.
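
The check itself is small. A userspace sketch modeled loosely on the kernel's list-add validation (simplified, with invented helper names):

```c
#include <stdbool.h>

struct list_head { struct list_head *next, *prev; };

/* Before linking "entry" between prev and next, verify the neighbours
 * actually point at each other; if not, the metadata is corrupt and we
 * refuse the operation rather than writing through bad pointers. */
static bool list_add_checked(struct list_head *entry,
			     struct list_head *prev,
			     struct list_head *next)
{
	if (next->prev != prev || prev->next != next)
		return false;	/* corruption detected: do nothing */

	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;
	return true;
}

/* Build a two-node circular list for demonstration. */
static void list_init_pair(struct list_head *a, struct list_head *b)
{
	a->next = b; a->prev = b;
	b->next = a; b->prev = a;
}
```

In the unchecked version, the four pointer stores go ahead regardless, which is exactly the write primitive an attacker with a heap overflow wants.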

seeding kernel RNG from UEFI

A problem for many architectures is finding a viable source of early boot entropy to initialize the kernel Random Number Generator. For x86, this is mainly solved with the RDRAND instruction. On ARM, however, the solutions continue to be very vendor-specific. As it turns out, UEFI is supposed to hide various vendor-specific things behind a common set of APIs. The EFI_RNG_PROTOCOL call is designed to provide entropy, but it can’t be called when the kernel is running. To get entropy into the kernel, Ard Biesheuvel created a UEFI config table (LINUX_EFI_RANDOM_SEED_TABLE_GUID) that is populated during the UEFI boot stub and fed into the kernel entropy pool during early boot.

arm64 W^X detection

As done earlier for x86, Laura Abbott implemented CONFIG_DEBUG_WX on arm64. Now any dangerous arm64 kernel memory protections will be loudly reported at boot time.

64-bit get_user() zeroing fix on arm
While the fix itself is pretty minor, I like that this bug was found through a combined improvement to the usercopy test code in lib/test_user_copy.c. Hoeun Ryu added zeroing-on-failure testing, and I expanded the get_user()/put_user() tests to include all sizes. Neither improvement alone would have found the ARM bug, but together they uncovered a typo in a corner case.
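
The contract the test exercises can be modeled in a few lines of userspace C; get_user_u64() here is an invented stand-in for the kernel API, with NULL playing the part of a faulting user address:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* On failure the destination must be fully zeroed so stale kernel
 * stack contents can never leak to the caller. */
static int get_user_u64(uint64_t *dst, const uint64_t *src)
{
	if (src == NULL) {	/* model a faulting user pointer */
		*dst = 0;	/* zero on failure: the property under test */
		return -14;	/* -EFAULT */
	}
	memcpy(dst, src, sizeof(*dst));
	return 0;
}
```

The ARM bug was precisely a corner case where the 64-bit failure path left part of the destination unzeroed.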

no-new-privs visible in /proc/$pid/status
This is a tiny change, but I like being able to introspect processes externally. Prior to this, I wasn’t able to trivially answer the question “is that process setting the no-new-privs flag?” To address this, I exposed the flag in /proc/$pid/status, as NoNewPrivs.
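
Consuming the new field is straightforward. A small sketch, assuming a Linux 4.10+ kernel (the parser function is invented for illustration; prctl() and PR_SET_NO_NEW_PRIVS are the real interfaces):

```c
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#endif

/* Return the NoNewPrivs value from /proc/self/status, or -1 if the
 * field is absent (kernels older than v4.10). */
static int read_no_new_privs(void)
{
	char line[256];
	int val = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "NoNewPrivs: %d", &val) == 1)
			break;
	}
	fclose(f);
	return val;
}
```

After prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) the field reads back as 1, and because the flag is one-way it stays set for the life of the process; substitute another pid for "self" to inspect a different process.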

That’s all for now! Please let me know if you saw anything else you think needs to be called out. :) I’m already excited about the v4.11 merge window opening…

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

February 28, 2017 06:31 AM

February 27, 2017

Dave Airlie: radv + steamvr

If anyone wants to run SteamVR on top of radv, the code is all public now.

The external memory code will be going upstream to master once I clean it up a bit, the semaphore hack is waiting on kernel
changes, and the NIR shader hack is waiting on a new SteamVR build that removes the bad use of SPIR-V.

I've run Serious SAM TFE in VR mode on this branch.

February 27, 2017 07:42 PM

Matthew Garrett: The Fantasyland Code of Professionalism is an abuser's fantasy

The Fantasyland Institute of Learning is the organisation behind Lambdaconf, a functional programming conference perhaps best known for standing behind a racist they had invited as a speaker. The fallout of that has resulted in them trying to band together events in order to reduce disruption caused by sponsors or speakers declining to be associated with conferences that think inviting racists is more important than the comfort of non-racists, which is weird in all sorts of ways but not what I'm talking about here because they've also written a "Code of Professionalism" which is like a Code of Conduct except it protects abusers rather than minorities and no really it is genuinely as bad as it sounds.

The first thing you need to know is that the document uses its own jargon. Important here are the concepts of active and inactive participation - active participation is anything that you do within the community covered by a specific instance of the Code, inactive participation is anything that happens anywhere ever (ie, active participation is a subset of inactive participation). The restrictions based around active participation are broadly those that you'd expect in a very weak code of conduct - it's basically "Don't be mean", but with some quirks. The most significant is that there's a "Don't moralise" provision, which as written means saying "I think people who support slavery are bad" in a community setting is a violation of the code, but the description of discrimination means saying "I volunteer to mentor anybody from a minority background" could also result in any community member not from a minority background complaining that you've discriminated against them. It's just not very good.

Inactive participation is where things go badly wrong. If you engage in community or professional sabotage, or if you shame a member based on their behaviour inside the community, that's a violation. Community sabotage isn't defined and so basically allows a community to throw out whoever they want to. Professional sabotage means doing anything that can hurt a member's professional career. Shaming is saying anything negative about a member to a non-member if that information was obtained from within the community.

So, what does that mean? Here are some things that you are forbidden from doing:

Now, clearly, some of these are unintentional - I don't think the authors of this policy would want to defend the idea that you can't report something to the police, and I'm sure they'd be willing to modify the document to permit this. But it's indicative of the mindset behind it. This policy has been written to protect people who are accused of doing something bad, not to protect people who have something bad done to them.

There are other examples of this. For instance, violations are not publicised unless the verdict is that they deserve banishment. If a member harasses another member but is merely given a warning, the victim is still not permitted to tell anyone else that this happened. The perpetrator is then free to repeat their behaviour in other communities, and the victim has to choose between either staying silent or warning them and risk being banished from the community for shaming.

If you're an abuser then this is perfect. You're in a position where your victims have to choose between their career (which will be harmed if they're unable to function in the community) and preventing the same thing from happening to others. Many will choose the former, which gives you far more freedom to continue abusing others. Which means that communities adopting the Fantasyland code will be more attractive to abusers, and become disproportionately populated by them.

I don't believe this is the intent, but it's an inevitable consequence of the priorities inherent in this code. No matter how many corner cases are cleaned up, if a code prevents you from saying bad things about people or communities it prevents people from being able to make informed choices about whether that community and its members are people they wish to associate with. When there are greater consequences to saying someone's racist than them being racist, you're fucking up badly.

comment count unavailable comments

February 27, 2017 01:40 AM

February 26, 2017

Paul E. Mc Kenney: Stupid RCU Tricks: What if I Knew Then What I Know Now?

During my keynote at the 2017 Multicore World, Mark Moir asked what I would have done differently if I knew then what I know now, with the “then” presumably being the beginning of the RCU effort back in the early 1990s. Because I got the feeling that my admittedly glib response did not fully satisfy Mark, I figured I should try again. So imagine that you traveled back in time to the very end of the year 1993, not long after Jack Slingwine and I came up with read-copy lock (now read-copy update, or just RCU), and tried to pass on a few facts about my younger self's future. The conversation might have gone something like this:

You  By the year 2017, RCU will be part of the concurrency curriculum at numerous universities and will be very well-regarded in some circles.
Me  Nice! That must mean that DYNIX/ptx will also be doing well!

You  Well, no. DYNIX/ptx will disappear by 2005, being replaced by the combination of IBM's AIX and another operating system kernel started as a hobby.
Me  AIX??? Surely you mean Solaris, HP-UX or Ultrix! And I wouldn't say that BSD started as a hobby! It was after all fully funded research.

You  No, Sun Microsystems was acquired by Oracle in 2010, and Solaris was already in decline by that time. IBM's AIX was by then the last proprietary UNIX operating system standing. A new open-source kernel called "Linux" became the dominant OS.
Me  IBM??? But they are currently laying off more people each month than Sequent employs worldwide!!! Why would they even still be in business in 2010?

You  True. But their new CEO, Louis Gerstner, will turn IBM around.
Me  Well, yes, he did just become IBM's CEO, but before that he was CEO of RJR Nabisco. That should work about as well as John Sculley's tenure as CEO of Apple. What does Gerstner know about computers, anyway?

You  He apparently knew enough to get IBM back on its feet. In fact, IBM will buy Sequent, so that you will become an IBM employee on April 1, 2000.
Me  April Fools day? Now I know you are joking!!!

You  No joke. You will become an IBM employee on April 1, 2000, seven years to the day after Louis Gerstner became an IBM employee.
Me  OK, I guess that explains why DYNIX/ptx doesn't make it past 2005. That is really annoying! So the teaching of RCU in universities is some sort of pity play, then?

You  No. Dipankar Sarma will get RCU accepted into Linux in 2002.
Me  I could easily believe that—he is very capable. So what do I do instead?

You  You will take over maintainership of RCU in 2005.
Me  Is Dipankar going to be OK?

You  Of course! He will just move on to other projects. It is just that there will be a lot more work needed on RCU, which you will take on.
Me  What more work could there be? It is a pretty simple mechanism, way simpler than a memory allocator, for example.

You  Well, there will be quite a bit of scalability work needed. For example, you will receive a scalability bug report involving a 512-CPU shared-memory system.
Me  Hmmm... It took Sequent from 1985 to 1997 to get from 30 to 64 CPUs, so that is doubling every 12 years, so I am guessing that I received this bug report somewhere near the year 2019. So what did I do in the meantime?

You  No, you will receive this bug report in 2004.
Me  512-CPU system in 2004??? Well, suspending disbelief, this must be why I will start maintaining RCU in 2005.

You  No, a quick fix will be supplied by a guy named Manfred Spraul, who writes concurrent Linux-kernel code as a hobby. So you didn't do the scalability work until 2008.
Me  Concurrent Linux-kernel coding as a hobby? That sounds unlikely. But never mind. So what did I do between 2005 and 2008? Surely it didn't take me three years to create a highly scalable RCU implementation!

You  You will work with a large group of people adding real-time capabilities to the Linux kernel. You will create an RCU implementation that allows readers to be preempted.
Me  That makes absolutely no sense! A context switch is a quiescent state, so preempting an RCU read-side critical section would result in a too-short grace period. That most certainly isn't going to help anything, given that a crashed kernel isn't going to offer much in the way of real-time response!

You  I don't know the details, but you will make it work. And this work will be absolutely necessary for the Linux kernel to achieve 20-microsecond interrupt and scheduling latencies.
Me  Given that this is a general-purpose OS, you obviously meant 20 milliseconds!!! But what could RCU possibly be doing that would contribute significantly to a 20-millisecond interrupt/scheduling delay???

You  No, I really did mean sub-20-microsecond latencies. By 2010 or so, even the vanilla non-realtime Linux kernel will easily meet 20-millisecond latencies, assuming the hardware and software are properly configured.
Me  Ah, got it! CPU core clock rates should be somewhere around 50GHz by 2010, which might well make those sorts of latencies achievable.

You  No, power-consumption and heat-dissipation constraints will cap CPU core clock frequencies at about 5GHz in 2003. Most systems will run in the 1-3GHz range even as late as 2017.
Me  Then I don't see how a general-purpose OS could possibly achieve sub-20-microsecond latencies, even on a single-CPU system, which wouldn't have all that much use for RCU.

You  No, this will be on SMP systems. In fact, in 2012, you will receive a bug report complaining of excessively long 200-microsecond latencies on a system running 4096 CPUs.
Me  Come on! I believe that Amdahl's Law has something to say about lock contention on such large systems, which would rule out reasonable latencies, let alone 200-microsecond latencies! And there would be horrible reliability problems with that many CPUs! You wouldn't be able to keep the system running long enough to measure the latency!!!

You  Hey, I am just telling you what will happen.
Me  OK, so after I get RCU to handle insane scalability and real-time response, there cannot be anything left to do, right?

You  Actually, wrong. Energy efficiency becomes extremely important, and you will rewrite the energy-efficiency RCU code more than eight times before you get it right.
Me  Eight times??? You must be joking!!! Seems like it would be better to just waste a little energy. After all, computers don't consume all that much energy, especially compared to industrial and transportation systems.

You  No, that would not work. By 2005, there will be quite a few datacenters that are limited by electrical power rather than by floor space, so much so that large data centers will open in Eastern Oregon, on the sites of the old aluminum smelters. When you have that many servers, even a few percent of energy savings translates to millions of dollars a year, which is well worth spending some development effort on.
Me  That is an insanely large number of servers!!! How many Linux instances are running by that time, anyway?

You  By the mid-2010s, the number of Linux instances is well in excess of one billion, but no one knows the exact number.
Me  One billion??? That is almost one server for every family in the world! No way!!!

You  Well, most of the Linux instances are not servers. There are a lot of household appliances running Linux, to say nothing of battery-powered hand-held smartphones. By 2017, most of the smartphones will have multiple CPUs.
Me  Why on earth would you need multiple CPUs to make a phone call? And how would you fit multiple CPUs into a hand-held device? And where do you put the battery, in a large backpack or something???

You  No, the entire device, batteries, CPUs and all, will fit easily into your shirt pocket. And these smartphones can take pictures, record video, do video conference calls, find precise locations using GPS, translate among multiple languages, and much else besides. They are really full-fledged computers that fit in your pocket.
Me  A pocket-sized supercomputer??? And how would I possibly go about testing RCU code sufficiently for your claimed billion instances???

You  Interesting question. You will give a keynote at the 2017 Multicore World in February 2017 at Wellington, New Zealand describing some of your plans. These plans include the use of formal verification in your regression test suite.
Me  Formal verification of highly concurrent code in a regression test suite??? OK, now I know for sure that you are pulling my leg! It has been an interesting conversation, but I must get back to reality!!!

My 1993 self did not have a very accurate view of 2017, did he? As the old saying goes, predictions are hard, especially about the future! So it is quite wise to take such predictions with a considerable supply of salt.

February 26, 2017 11:09 PM

Pavel Machek: Using Linux notebook as an alarm clock

Is anyone using a notebook as an alarm clock? Yes, it would be easy if I did not suspend the machine overnight, but that would waste power and produce fan noise. I'd like a version that suspends the machine...
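One possible approach, sketched below as a dry run: program the RTC wakeup alarm with rtcwake (from util-linux) before suspending to RAM. The 07:00 wake time is an illustrative assumption; removing the echo (and running as root) would actually suspend the machine:

```shell
# Dry-run sketch of a suspend-based alarm clock.
# rtcwake -m mem suspends to RAM; -t takes an absolute wake time in epoch seconds.
wake_epoch=$(date -d 'tomorrow 07:00' +%s)
echo "would run: rtcwake -m mem -t $wake_epoch"
# On resume, something audible could be played, e.g. with aplay or mpv.
```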

February 26, 2017 10:32 PM

February 21, 2017

Pavel Machek: X220 to play with

Nice machine. Slightly bigger than X60, bezel around display way too big, but quite powerful. Biggest problem seems to be that it does not accept 9.5mm high drives...

I tried 4.10 there, and got two nasty messages during bootup. Am I the last one running 32 bit kernels?

I was hoping to get a three-monitor configuration on my desk, but apparently the X220 cannot do that. xrandr reports 8 outputs (!), but it physically only has 3: LVDS, DisplayPort and VGA. Unfortunately, it seems to have only 2 CRTCs, so only 2 outputs can be active at a time. Is there a way around that?

February 21, 2017 10:21 PM

Gustavo F. Padovan: Collabora Contributions to Linux Kernel 4.10

Linux Kernel v4.10 is out and this time Collabora contributed a total of 39 patches by 10 different developers. You can read more about the v4.10 merge window on part 1, part 2 and part 3.

Now here is a look at the changes made by Collaborans. To begin with, Daniel Stone fixed an issue when waiting for fences on the i915 driver, while Emil Velikov added support to read the PCI revision from sysfs, improving start-up time in some applications.

Emilio López added a set of selftests for the Sync File Framework and Enric Balletbo i Serra added support for the ChromeOS Embedded Controller Sensor Hub. Fabien Lahoudere added support for the NVD9128 simple panel and enabled ULPI phy for USB on i.MX.

Gabriel Krisman fixed spurious CARD_INT interrupts for SD cards that were preventing one of our kernelCI machines from booting. On the graphics side, Gustavo Padovan added Explicit Synchronization support to DRM/KMS.

Martyn Welch added GPIO support for the CP2105 USB serial device, while Nicolas Dufresne fixed Exynos4 FIMC to round up the image size to the row size for tiled formats; otherwise there would not be enough space to fit the last row of the image. Last but not least, Tomeu Vizoso added a debugfs interface to capture frame CRCs, which is quite helpful for debugging and automated graphics testing.

And now the complete list of Collabora contributions:

Daniel Stone (1):

Emil Velikov (1):

Emilio López (7):

Enric Balletbo i Serra (3):

Fabien Lahoudere (4):

Gabriel Krisman Bertazi (1):

Gustavo Padovan (18):

Martyn Welch (1):

Nicolas Dufresne (1):

Tomeu Vizoso (2):

February 21, 2017 04:02 PM

February 20, 2017

Kernel Podcast: Kernel Podcast for Feb 20th, 2017

UPDATE: Thanks to LWN for the mention. This podcast is in “alpha”. It will start to show up on iTunes and Google Play (which didn’t exist last time I did this thing!) stores within the next day or two. You can also subscribe (for the moment) by using this link: kernel podcast audio rss feed. This podcast format will be tweaked, and the format/layout will very likely change a bit as I figure out what works, and what does not. Equipment just started to arrive at home (Zoom H4N Pro, condenser mics, etc.), a new content publishing platform needs to get built (I intend ultimately for listeners to help to create summaries by annotating threads as they happen). And yes, my former girlfriend will once again be reprising her role as author of another catchy intro jingle…soon 😉

Audio: Kernel Podcast 20170220

Support for this podcast comes from Jon Masters, who is reviving the Kernel Podcast after its hiatus since 2012.

In this week’s edition: Linus Torvalds announces Linux 4.10, Alan Tull updates his FPGA manager framework, and Intel’s latest 5-level paging patch series is posted for review. We will have this, and a summary of ongoing development in the first of the newly revived Linux Kernel Podcast.

Linux 4.10

Linus Torvalds announced the release of 4.10 final, noting that “it’s been quiet since rc8, but we did end up fixing several small issues, so the extra week was all good”. Linus added a (relatively rare) additional “RC8” (Release Candidate 8) to this kernel cycle due to the timing – many of us were attending the “Open Source Leadership Summit” (OSLS, formerly “Linux Foundation Collaboration Summit”, or “Collab”) over the past week. The 4.10 kernel contains about 13,000 commits, which used to seem large but somehow now…isn’t. The usual summaries of new features and fixes are available online.

With the announcement of 4.10 comes the opening of the merge window for Linux 4.11 (the period of up to two weeks at the beginning of a development cycle, during which new features and disruptive changes are “pulled” into Linus’s kernel (git) tree). The 4.11 merge window begins today.

FPGA Manager Updates

Alan Tull posted a patch series implementing “FPGA Region enhancements and fixes”, which “intends to enable expanding the use of FPGA regions beyond device tree overlays”. Alan’s FPGA manager framework allows the kernel to manage regions within FPGAs (Field Programmable Gate Arrays) known as “partially reconfigurable” regions – areas of the logic fabric that can be loaded with new bitstream configs. Part of the discussion around the latest patches centered on their providing a new sysfs interface for loading FPGA images, and in particular the need to ensure that this ABI handle FPGA bitstream metadata in a standard and portable fashion across different OSes.

Intel 5-level paging

Kirill A. Shutemov posted version 3 of Intel’s 5 level paging patch series that expands the supportable VA (Virtual Address) space on Intel Architecture from 256TiB (64TiB physical) to 128PiB (4PiB physical). Channeling his inner Bill Gates, he suggests that this “ought to be enough for anybody”. Key among the TODO items remains “boot-time switch between 4 and 5-level paging” to avoid the need for custom kernels. The latest patches introduce two new prctl calls to manage the maximum virtual address space available to userspace processes during mmap calls (PR_SET_MAX_VADDR and PR_GET_MAX_VADDR). This is intended to aid in compatibility by preventing certain legacy programs from breaking when confronted with a 56-bit address space they weren’t expecting. In particular, some JITs use high order “canonical” bits in existing x86 addresses to encode pointer tags and other information (that they should not per a strict interpretation of Intel’s “Canonical Addressing”).


Steven Rostedt announced various preempt-rt (“Real Time”) kernel trees (4.4.47-rt59, 4.1.38-rt45, 3.18.47-rt52, 3.12.70-rt94, and 3.10.104-rt118). Sebastian Andrzej Siewior also announced version v4.9.9-rt6 of the preempt-rt “Real Time” Linux patch series. It includes fixes for a spurious softirq wakeup, and a GPL symbol issue. A known issue is that CPU hotplug can still deadlock.

Junio C Hamano announced version v2.12.0-rc2 of git.


Hoeun Ryu posted version 6 of a patch that takes care to properly free up virtually mapped (vmapped) stacks that might be in the kernel’s stack cache when cpus are offlined (otherwise the kernel was leaking these during offline/online operations).

New Drivers

Mahipal Challa posted version 2 of a patch series implementing a compression driver for the Cavium ThunderX “ZIP” IP on their 64-bit ARM server SoC (System-on-Chip) to plumb into the kernel cryptoapi.

Anup Patel posted version 3 of a patch implementing RAID offload
support for the Broadcom “SBA” RAID device on their SoCs.

Ongoing Development

Andi Kleen posted various perf vendor events for Intel uncore devices, Kan Liang posted new core events for Intel Goldmont, and Srinivas Pandruvada posted perf events for Intel Kaby Lake.

Velibor Markovski (Broadcom) posted a patch implementing ARM Cache Coherent Network (CCN) 502 support.

Sven Schmidt posted version 7 of a patch series updating the LZ4 compression module to support a mode known as “LZ4 fast”, in particular for the benefit of its use by the lustre filesystem.

Zhou Xianrong posted a patch (for the ARM Architecture) that attempts to save kernel memory by freeing parts of the linear memmap for physical PFNs (page frame numbers) that are marked reserved in a DeviceTree. This had some pushback. The argument is that it saves memory on resource-constrained machines – 6MB of RAM in the example.

Jessica Yu (who took over maintaining the in-kernel module loader infrastructure from Rusty Russell some time back) posted a link to her module-next tree in the kernel MAINTAINERS document.

Bhupesh Sharma posted a patch moving in-kernel handling of the ACPI BGRT (Boot Graphics Resource Table) out of the x86 architecture tree and into drivers/firmware/efi (so that it can be shared with the 64-bit ARM Architecture).

Jarkko Sakkinen posted version 2 of a patch series implementing a new in-kernel resource manager for “TPM spaces” (these are “isolated execution context(s) for transient objects and HMAC and policy sessions”). Various test scripts were also provided.

That’s all for this week. Tune in next time for the latest happenings in the Linux kernel community. Don’t forget to follow us @kernelpodcast

February 20, 2017 07:50 AM

Kernel Podcast: Hello world!

Welcome to WordPress. This is your first post. Edit or delete it, then start writing!

February 20, 2017 06:43 AM

February 12, 2017

James Bottomley: Using letsencrypt certificates with DANE

If, like me, you run your own cloud server, at some point you need TLS certificates to export secure services.  Like a lot of people I object to paying a so called X.509 authority a yearly fee just to get a certificate, so I’ve been using a free StartCom one for a while.  With web browsers delisting StartCom, I’m unable to get a new usable certificate from them, so I’ve been investigating letsencrypt instead (if you’re in to fun ironies, by the way, you can observe that currently the letsencrypt site isn’t using a letsencrypt certificate, perhaps indicating the administrative difficulty of doing so).

The problem with letsencrypt certificates is that they currently have a 90 day expiry, which means you really need to use an automated tool to keep your TLS certificate current.  Fortunately the EFF has developed such a tool: certbot (they use a letsencrypt certificate for their site, indicating you can have trust that they do know what they’re doing).  However, one of the problems with certbot is that, by default, it generates a new key each time the certificate is renewed.  This isn’t a problem for most people, but if you use DANE records, it causes significant issues.

Why use both DANE and letsencrypt?

The beauty of DANE, as I’ve written before, is that it gives you a much more secure way of identifying your TLS certificate (provided you run DNSSEC).  People verifying your certificate may use DANE as the only verification mechanism (perhaps because they also distrust the X.509 authorities) which means the best practice is to publish a DANE TLSA record for each service and also use an X.509 authority rooted certificate.  That way your site just works for everyone.

The problem here is that being DNS based, DANE records can be cached for a while, so it can take a few days for DANE certificate updates to propagate through the DNS infrastructure. DANE records have an answer for this: they have a mode where the record identifies only the hash of the public key used by the website, not the certificate itself, which means you can change your certificate as much as you want provided you keep the same public/private key pair.  And here’s the rub: if certbot is going to try to give you a new key on each renewal, this isn’t going to work.

The internet society also has posts about this.

Making certbot work with DANE

Fortunately, there is a solution: the certbot manual mode (certonly) takes a --csr flag which allows you to construct your own certificate request to send to letsencrypt, meaning you can keep a fixed key … at the cost of not using most of the certbot automation.  So, how do you construct a correct csr for letsencrypt?  Like most free certificates, letsencrypt will only allow you to specify the certificate commonName, which must be a DNS pointer to the actual website.  If you need a certificate that covers multiple sites, all the other sites must be enumerated in the x509 v3 extensions field subjectAltName.  Let’s look at how openssl can generate such a certificate request.  One of the slight problems is that openssl, being a cranky tool, does not allow you to specify a subjectAltName on the command line, so you have to construct a special configuration file for it.  I called mine letsencrypt.conf

[req]
prompt = no
distinguished_name = req_dn
req_extensions = req_ext

[req_dn]
commonName = <your canonical server name>

[req_ext]
subjectAltName = DNS:<alt name 1>, DNS:<alt name 2>, ...

As you can see, I’ve given my canonical server (bedivere) as the common name and then four other subject alt names.  Once you have this config file tailored to your needs, you merely run

openssl req -new -key <mykey.key> -config letsencrypt.conf -out letsencrypt.csr

Where mykey.key is the path to your private key (you need a private key because even though the CSR only contains the public key, it is also signed).  However, once you’ve produced this letsencrypt.csr, you no longer need the private key and, because it’s undated, it will now work forever, meaning the infrastructure you put into place with certbot doesn’t need to be privileged enough to access your private key.  Once this is done, you make sure you have TLSA 3 1 1 records pointing to the hash of your public key (here’s a handy website to generate them for you) and you never need to alter your DANE record again.  Note, by the way, that letsencrypt certificates can be used for non-web purposes (I use mine for encrypted SMTP as well), so you’ll need one DANE record for every service you use them for.
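For the record payload itself: a 3 1 1 record carries the SHA-256 digest of the DER-encoded public key (the SubjectPublicKeyInfo), so if you would rather not trust a website with the job it can equally be computed locally with openssl. A sketch, reusing the mykey.key naming from above (a throwaway key is generated here purely so the snippet is self-contained; substitute your real key):

```shell
# Compute the TLSA 3 1 1 payload: SHA-256 over the DER-encoded public key.
# The genpkey line creates a throwaway key for illustration only; point the
# second line at the real private key you used for the CSR instead.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out mykey.key 2>/dev/null
openssl pkey -in mykey.key -pubout -outform DER | openssl dgst -sha256 -hex
# The resulting hex digest is the record payload, e.g. (hypothetical name):
# _25._tcp.mail.example.com. IN TLSA 3 1 1 <digest>
```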

Putting it all together

Now that you have your certificate request, depending on what version of certbot you have, you may need it in DER format

openssl req -in letsencrypt.csr -out letsencrypt.der -outform DER

And you’re ready to run the following script from cron

#!/bin/bash
date=$(date +%Y-%m-%d)

# File locations for this run. The paths are illustrative; adjust ${dir}
# and the csr location to your own setup (the csr is the DER-format
# request generated above).
dir=/etc/ssl/certs
csr=${dir}/letsencrypt.der
cert=${dir}/letsencrypt-${date}.crt
chain=${dir}/letsencrypt-chain-${date}.pem
fullchain=${dir}/letsencrypt-full-${date}.pem
out=${dir}/letsencrypt-${date}.log

# certbot handling
# first, certbot cannot replace certs, so the date suffix ensures the
# certificate files are unique on each run.  Next, it's really chatty,
# so the only way to tell if there was a failure is to check whether
# the certificates got updated and then get cron to email the log

certbot certonly --webroot --csr ${csr} --preferred-challenges http-01 -w /var/www --fullchain-path ${fullchain} --chain-path ${chain} --cert-path ${cert} > ${out} 2>&1

if [ ! -f ${fullchain} -o ! -f ${chain} -o ! -f ${cert} ]; then
    cat ${out}
    exit 1
fi

# link into place

# cert only (apache needs)
ln -sf ${cert} ${dir}/letsencrypt.crt
# cert with chain (stunnel needs)
ln -sf ${fullchain} ${dir}/letsencrypt.pem
# chain only (apache needs)
ln -sf ${chain} ${dir}/letsencrypt-chain.pem

# reload the services
sudo systemctl reload apache2
sudo systemctl restart stunnel4
sudo systemctl reload postfix

Note that this script needs the ability to write files and create links in /etc/ssl/certs (can be done by group permission) and the systemctl reloads need the following in /etc/sudoers

%LimitedAdmins ALL=NOPASSWD: /bin/systemctl reload apache2
%LimitedAdmins ALL=NOPASSWD: /bin/systemctl reload postfix
%LimitedAdmins ALL=NOPASSWD: /bin/systemctl restart stunnel4

And finally you can run this as a cron script under whichever user you’ve chosen to have sufficient privilege to write the certificates.  I run this every month, so if anything goes wrong I know I have at least two months to fix it.

Oh, and just in case you need proof that I got this all working, here you are!

February 12, 2017 06:55 PM

February 03, 2017

Daniel Vetter: LCA Hobart: Maintainers Don't Scale

Seems that there was a rift in the spacetime that sucked away the video of my LCA talk, but the awesome NextDayVideo team managed to pull it back out. And there’s still the writeup and slides available.

February 03, 2017 12:00 AM

February 02, 2017

Pete Zaitcev: Richard Feynman on Gerrit reviews

Here's a somewhat romanticized vision of what a code review is:

What am I going to do? I get an idea. Maybe it's a valve. I take my finger and put it down on one of the mysterious little crosses in the middle of one of the blueprints on page three, and I say, ``What happens if this valve gets stuck?'' --figuring they're going to say, ``That's not a valve, sir, that's a window.''

So one looks at the other and says, ``Well, if that valve gets stuck--'' and he goes up and down on the blueprint, up and down, the other guy goes up and down, back and forth, back and forth, and then both look at each other. They turn around to me and they open their mouths like astonished fish and say, ``You're absolutely right, sir.''

Quoted from "Surely You're Joking, Mr. Feynman!".

February 02, 2017 04:37 PM

January 27, 2017

Michael Kerrisk (manpages): Next Linux/UNIX System Programming course in Munich: 15-19 May, 2017

I've scheduled another 5-day Linux/UNIX System Programming course to take place in Munich, Germany, for the week of 15-19 May 2017.

The course is intended for programmers developing system-level, embedded, or network applications for Linux and UNIX systems, or programmers porting such applications from other operating systems (e.g., Windows) to Linux or UNIX. The course is based on my book, The Linux Programming Interface (TLPI), and covers topics such as low-level file I/O; signals and timers; creating processes and executing programs; POSIX threads programming; interprocess communication (pipes, FIFOs, message queues, semaphores, shared memory), and network programming (sockets).
The course has a lecture+lab format, and devotes substantial time to working on some carefully chosen programming exercises that put the "theory" into practice. Students receive printed and electronic copies of TLPI, along with a 600-page course book that includes all slides and exercises presented in the course. A reading knowledge of C is assumed; no previous system programming experience is needed.

Some useful links for anyone interested in the course:

Questions about the course? Email me via

January 27, 2017 01:11 AM

January 26, 2017

Gustavo F. Padovan: Mainline Explicit Fencing – part 3

In the last two articles we talked about how Explicit Fencing can help the graphics pipeline in general and what happened on the effort to upstream the Android Sync Framework. Now on the third post of this series we will go through the Explicit Fencing implementation on DRM and other elements of the graphics stack.

The DRM implementation is built on top of two kernel infrastructures: struct dma_fence, which represents the fence, and struct sync_file, which provides the file descriptors to be shared with userspace (as discussed in the previous articles). With fencing, the display infrastructure needs to wait for a signal on a fence before displaying the buffer on the screen. In an Explicit Fencing implementation that fence is sent from userspace to the kernel. The display infrastructure also sends a fence back to userspace, encapsulated in a struct sync_file, that will be signalled when the buffer is scanned out on the screen. The same process happens on the rendering side.

The use of Atomic Modesetting is mandatory and there is no plan to support the legacy APIs. The fence that DRM will wait on needs to be passed via the IN_FENCE_FD property of each DRM plane, which means it will receive one sync_file fd containing one or more dma_fences per plane. Remember that in DRM a plane directly relates to a framebuffer, so one can also say that there is one sync_file per framebuffer.

On the other hand, for the fences created by the kernel and sent back to userspace, the OUT_FENCE_PTR property is used. It is a DRM CRTC property because only one dma_fence is created per CRTC, as all the buffers on it will be scanned out at the same time. The kernel sends this fence back to userspace by writing the fd number to the pointer provided in the OUT_FENCE_PTR property. Note that, unlike what Android did, when the fence signals it means the previous buffer – the buffer just removed from the screen – is free for reuse. On Android, the signal meant the current buffer was freed. However, the Android folks have already patched SurfaceFlinger to support the Mainline semantics when using Explicit Fencing!

Nonetheless, that is only one side of the equation; to have the full graphics pipeline running with Explicit Fencing we need to support it on the rendering side as well. As every rendering driver has its own userspace API, Explicit Fencing support has to be added to every single driver. The freedreno driver already has its Explicit Fencing support mainline, and there is work in progress to add support to i915 and virtio_gpu.

On the userspace side, Mesa already supports the EGL_ANDROID_native_fence_sync extension needed to use Explicit Fencing on Android. Libdrm has incorporated the headers to access the sync file IOCTL wrappers. On Android, libsync now supports both the old Android Sync and the Mainline Sync File APIs. And finally, on drm_hwcomposer, patches to use Atomic Modesetting and Explicit Fencing are available, but they are not upstream yet.

Validation tests for both Sync Files and fences on the Atomic API were written and added to IGT.

January 26, 2017 03:23 PM

January 23, 2017

Matthew Garrett: Android permissions and hypocrisy

I wrote a piece a few days ago about how the Meitu app asked for a bunch of permissions in ways that might concern people, but which were not actually any worse than many other apps. The fact that Android makes it so easy for apps to obtain data that's personally identifiable is of concern, but in the absence of another stable device identifier this is the sort of thing that capitalism is inherently going to end up making use of. Fundamentally, this is Google's problem to fix.

Around the same time, Kaspersky, the Russian anti-virus company, wrote a blog post that warned people about this specific app. It was framed somewhat misleadingly - "reading, deleting and modifying the data in your phone's memory" would probably be interpreted by most people as something other than "the ability to modify data on your phone's external storage", although it ends with some reasonable advice that users should ask why an app requires some permissions.

So, to that end, here are the permissions that Kaspersky request on Android:

Every single permission that Kaspersky mention Meitu having? They require it as well. And a lot more. Why does Kaspersky want the ability to record audio? Why does it want to be able to send SMSes? Why does it want to read my contacts? Why does it need my fine-grained location? Why is it able to modify my settings?

There's no reason to assume that they're being malicious here. The reason these permissions exist at all is that there are legitimate uses for them, and Kaspersky may well have good reason to request them. But they don't explain that, and they do literally everything that their blog post criticises (including explicitly requesting the phone's IMEI). Why should we trust a Russian company more than a Chinese one?

The moral here isn't that Kaspersky are evil or that Meitu are virtuous. It's that talking about application permissions is difficult and we don't have the language to explain to users what our apps are doing and why they're doing it, and Google are still falling far short of where they should be in terms of making this transparent to users. But the other moral is that you shouldn't complain about the permissions an app requires when you're asking for even more of them because it just makes you look stupid and bad at your job.


January 23, 2017 07:58 AM

January 20, 2017

Daniel Vetter: Maintainers Don't Scale

This is the write-up of my talk at LCA 2017 in Hobart. It’s not exactly the same, because this is a blog and not a talk, but the same contents. The slides for the talk are here, and I will link to the video as soon as it is available. Update: Video is now uploaded.

Linux Kernel Maintainers

First let’s look at how the kernel community works, and how a change gets merged into Linus Torvalds’ repository. Changes are submitted as patches to a mailing list, get some review and are eventually applied by a maintainer to that maintainer’s git tree. Each maintainer then sends pull requests, often directly to Linus. With a few big subsystems (networking, graphics and ARM-SoC are the major ones) there’s a second or third level of sub-maintainers in between. 80% of the patches get merged this way; only 20% are committed by a maintainer directly.

Most maintainers are just that, a single person, and often responsible for a bunch of different areas in the kernel with corresponding different git branches and repositories. To my knowledge there are only three subsystems that have embraced group maintainership models of different kinds: TIP (x86 and core kernel), ARM-SoC and the graphics subsystem (DRM).

The radical change, at least for the kernel community, that we implemented over a year ago for the Intel graphics driver is to hand out commit rights to all regular contributors. Currently there are 19 people with commit rights to the drm-intel repository. After the first year of ramp-up, 70% of all patches are now committed directly by their authors, a big change compared to how things worked before, and how they still work everywhere else outside of the graphics subsystem. More recently we also started to manage the drm-misc tree for subsystem-wide refactorings and core changes in the same way.

I’ve covered the details of the new process in my Kernel Recipes talk “Maintainers Don’t Scale”, and LWN has covered that, and a few other talks, in their article on Linux kernel maintainer scalability. I also covered this topic at the kernel summit; again, LWN covered the group maintainership discussion. I don’t want to go into more detail here, mostly because we’re still learning, too, and not really experts on commit rights for everyone and what it takes to make this work well. If you want to see what a community that really has this all figured out looks like, watch Emily Dunham’s talk “Life is better with Rust’s community automation” from last year’s LCA.

What we are experts on is the Linux kernel’s maintainer model - we’ve run things for years with the traditional model, both as single maintainers and small groups, and have now gained an outside perspective by switching to something completely different. Personally, I’ve come to believe that the maintainer model as implemented by the kernel community just doesn’t scale. Not in the technical sense of big-O scalability, because obviously the kernel community scales to a rather big size. Much larger organizations, entire states, are organized in a hierarchical way; the kernel maintainer hierarchy is nothing special. Besides that, git was developed specifically to support the Linux maintainer hierarchy, and git won. Clearly, the Linux maintainer model scales to big numbers of contributors. Where I think it falls short is the constant factor of how efficiently contributions are reviewed and merged, especially for non-maintainer contributors - who author 80% of all patches.

Cult of Busy

The first issue that routinely comes up when talking about maintainer topics is that everyone is overloaded. There’s a pervasive spirit in our industry (especially in the US) hailing overworked engineers as heroes, with an entire “cult of busy” around it. If you have time, you’re a slacker and probably not worth it. Of course this doesn’t help when being a maintainer, but I don’t believe it’s the cause of why the Linux maintainer model doesn’t work. This cult of busy leads to burnout, which is in my opinion a prime risk when you’re an open source person. Personally, I’ve gone through a few difficult phases until I understood my limits and respected them. When you start as a maintainer for 2-3 people, and it increases to a few dozen within a couple of years, then getting a bit overloaded is rather natural - it’s a new job, with a different set of responsibilities, and I had no clue about a lot of things. That’s no different from suddenly leading a much bigger team anywhere else. A great talk on this topic is “What part of “… for life” don’t you understand?” by Jacob Kaplan-Moss, since it’s by a former maintainer; it also contains a bunch of links to talks on burnout specifically. Ignoring burnout, or not knowing its early warning signs, is not healthy, and it is rampant in our communities - but for now I’ll leave it at that.

Boutique Trees and Bus Factors

The first issue I see is how maintainers usually are made: you scratch an itch somewhere, write a bit of code, suddenly a few more people find it useful, and “tag”, you’re the maintainer. On top of that, you often end up stuck in that position “for life”. If the community keeps growing, or the maintainer becomes otherwise busy with work and life, you have your standard-issue overloaded bottleneck.

That’s the point where I think the kernel community goes wrong. When other projects reach this point they start to build up a more formal community structure, with specialized roles, boards for review and other bits and pieces. One of the oldest, and probably most notorious, is Debian with its constitution. Of course a small project doesn’t need such elaborate structures. But if the goal is world domination, or at least creating something lasting, it helps when there are solid institutions that cope with people turnover. At first, just documenting processes and roles properly goes a long way, long before bylaws and codified decision processes are needed.

The kernel community, at least on the maintainer side, entirely lacks this.

What instead most often happens is that a new set of ad-hoc, chosen-by-default maintainers starts to crop up in a new level of the hierarchy, below your overloaded bottleneck, because becoming your own maintainer is the only way to help out and to get your own features merged. That only perpetuates the problem, since the new maintainers are as likely to be otherwise busy, or occupied with plenty of other kernel parts already. If things go well that area becomes big, and you have another git tree with another overloaded maintainer. More often than not people move around, and accumulate small bits all over under their maintainership. And then the cycle repeats.

The end result is a forest of boutique trees, each covering a tiny part of the project, maintained by a bunch of notoriously overloaded people. The resulting cross-tree coordination issues are pretty impressive - in the graphics subsystem we fairly often end up with simple drivers that somehow need prep patches in 5 different trees before you can even land that simple driver in the graphics tree.

Unfortunately that’s not the bad part. Because these maintainers are all busy with other trees, or their work, or life in general, you’re guaranteed that one of them is unavailable at any given time. Worse, because their tree covers a small area and sees relatively little activity, many only pick up patches once per kernel release, which means a built-in 3 month delay. That’s all because each tree and area has just one maintainer. In the end you don’t even need the proverbial bus to hit anyone to feel the pain of having a single point of failure in your organization - with so many maintainer trees around, some absence is always happening, constantly.

Of course people get fed up trying to get features merged, and often the fix is trying to become a maintainer yourself. That takes a while and isn’t easy - only 20% of all patches are authored by maintainers - and once the new code has landed it makes things worse: now there’s one more semi-absent maintainer with one more boutique tree, adding to all the existing troubles.

Checks and Balances

All patches merged into the Linux kernel are supposed to be reviewed, and rather often that review is done only by the maintainer who merges the patch. When maintainers send out pull requests the next level of maintainers then reviews those patch piles, until they land in Linus’ tree. That’s an organization where control flows entirely top-down, with no checks and balances to rein in maintainers who are not serving their contributors well. History of dictatorships tells us that despite best intentions, the end result tends to heavily favour the few over the many. As a crude measure of how much maintainers subject themselves to checks and balances by their peers and contributors, I looked at how many patches authored and committed by the same person (probably a maintainer) do not also carry a reviewed or acked tag. For the Intel driver that’s less than 3%. But even within the core graphics code it’s only 5%, and that covers the time before we started to experiment with commit rights for that area. And for the graphics subsystem overall the ratio is still only about 25%, including a lot of drivers with essentially just one contributor, who is always volunteered as the maintainer, and hence it’s somewhat natural that those maintainers lack reviewers.

Outside of graphics only roughly 25% of all patches written by maintainers are reviewed by their peers - 75% of all maintainer patches lack any kind of recorded peer review, compared to just 25% for graphics alone. And even looking at core areas like kernel/ or mm/ the ratio is only marginally better at about 30%. In short, in the kernel at large, peer review of maintainers isn’t the norm.
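The metric above is easy to approximate with git alone: count commits whose author and committer match and whose message carries no Reviewed-by or Acked-by trailer. Here is a self-contained sketch of that pipeline on a throwaway repository (the names, emails and commits are made up for the demo; the real numbers come from running the same loop over a kernel tree):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
git config user.name "A Maintainer"
git config user.email "maint@example.com"

# One self-committed patch without review, one with a Reviewed-by trailer.
echo one > f && git add f
git commit -q -m "self-committed, unreviewed"
echo two >> f
git commit -q -am "$(printf 'reviewed change\n\nReviewed-by: Peer <peer@example.com>')"

total=$(git rev-list --count HEAD)
unreviewed=0
for c in $(git rev-list HEAD); do
    # count commits where author == committer and no review/ack trailer exists
    if [ "$(git show -s --format=%ae "$c")" = "$(git show -s --format=%ce "$c")" ] &&
       ! git show -s --format=%B "$c" | grep -qE '^(Reviewed-by|Acked-by):'; then
        unreviewed=$((unreviewed + 1))
    fi
done
echo "self-committed without review: $unreviewed of $total"
```

On the toy repo this reports 1 of 2; run against a real tree, the same loop gives ratios like the ones quoted above.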

And there’s nothing outside of the maintainer hierarchy that could provide some checks and balance either. The only way to escalate disagreement is by starting a revolution, and revolutions tend to be long, drawn-out struggles and generally not worth it. Even Debian only recently learned that they lack a way to depose maintainers, and that maybe going maintainerless would be easier (again, LWN has you covered).

Of course the kernel is not the only hierarchy with no meaningful checks and balances. Professors at universities and managers at work are in a fairly similar position, with minimal options for students or employees to meaningfully appeal decisions. But that’s a recognized problem, and at least somewhat countered by providing ways to give anonymous feedback, often through regular surveys. The results tend to not be all that significant, but at least provide some control and accountability to the wider masses of first-level dwellers in the hierarchy. In the kernel that amounts to about 80% of all contributions, but there’s no such survey. On the contrary, feedback sessions about maintainer happiness only reinforce the control structure, with e.g. the kernel summit featuring an “Is Linus happy?” session each year.

Another closely related aspect of all this is how a project handles personal conflicts between contributors. For a very long time Linux didn’t have any formal structures in this area either, with the only options available to unhappy people being to take it or leave it. Well, or usurping a maintainer with a small revolution, but that’s not really an option. For two years we’ve now had the “Code of Conflict”, which de facto just throws up its hands and declares that conflicts are the normal outcome, essentially just encoding the status quo. Refusing to handle conflicts in a project with thousands of contributors just doesn’t work, except that it results in lots of frustration and ultimately people trying to get away. Again, the lack of an empowered board to enforce a strong code of conduct, independent of the maintainer hierarchy, is in line with the kernel community’s unwillingness to accept checks and balances.

Mesh vs. Hierarchy

The last big issue I see with the Linux kernel model, featuring lots of boutique trees and overloaded maintainers, is that it seems to harm collaboration and the integration of new contributors. In the Intel graphics driver, maintainers only ever reviewed a small minority of all patches over the last few years, with the goal of fostering direct collaboration between contributors. Still, when a patch was stuck, maintainers were the first point of contact, especially, but not only, for newer contributors. No amount of explaining that only the lack of agreement with the reviewer was the gating factor could persuade people to fully collaborate on code reviews and rework the code, tests and documentation as needed. Especially when they come with previous experience where code review is more of a rubber-stamp step, compared to the distributed and asynchronous pair-programming it often resembles in open source. Instead, new contributors often just ended up pinging maintainers to make a decision or merge the patches as-is.

Giving all regular contributors commit rights and fully trusting them to do the right thing entirely fixed that: if the reviewer or author has commit rights there’s no easy excuse anymore to involve maintainers when the author and reviewer can’t reach agreement. Of course that requires a lot of work in mentoring people, making sure requirements for merging are understood and documented, and automating as much as possible to avoid screw-ups. I think maintainers who lament their lack of review bandwidth, but also state they can’t trust anyone else, aren’t really doing their jobs.

At least for me, review isn’t just about ensuring good code quality, but also about diffusing knowledge and improving understanding. At first there’s maybe one person, the author (and that’s not a given), understanding the code. After good review there should be at least two people who fully understand it, including corner cases. And that’s also why I think that group maintainership is the only way to run any project with more than one regular contributor.

On the topic of patch review and maintainers, there’s also the habit of wholesale rewrites of patches written by others. If you want others to contribute to your project, then that means you need to accept other styles and can’t enforce your own all the time. Merging first and polishing later recognizes new contributions, and if you engage newcomers for the polish work they tend to stick around more often. And even when a patch really needs to be reworked before merging it’s better to ask the author to do it: Worst case they don’t have time, best case you’ve improved your documentation and training procedure and maybe gained a new regular contributor on top.

A great take on the consequences of having fixed roles instead of trying to spread responsibilities more evenly is Alice Goldfuss’ talk “Rock Stars, Builders, and Janitors: You’re doing it wrong”. I also think that rigid roles present a bigger barrier for people with different backgrounds, hampering diversity efforts, and in the spirit of Sarah Sharp’s post on what makes a good community, need to be fixed first.

Towards a Maintainer’s Manifest

I think what’s needed in the end is some guidelines and discussions about what a maintainer is, and what a maintainer does. We have ready-made licenses to avoid havoc, there are codes of conduct to copy-paste and implement, handbooks for building communities, and for all of these things, lots of conferences. Maintainer, on the other hand, is something you become by accident, as a default. And then everyone gets to learn how to do it on their own, while hopefully not burning too many bridges - at least I myself was rather lost on that journey at times. I’d like to conclude with a draft of a maintainer’s manifest.

It’s About the People

If you’re maintainer of a project or code area with a bunch of full time contributors (or even a lot of drive-by contributions) then primarily you deal with people. Insisting that you’re only a technical leader just means you don’t acknowledge what your true role really is.

And then, trust them to do a good job, and recognize them for the work they’re doing. The important part is to trust people just a bit more than what they’re ready for, as the occasional challenge, but not so much that they’re bound to fail. In short, give them the keys and hope they don’t wreck the car too badly, but in all cases have insurance ready. And insurance for software is dirt cheap: generally a git revert and the maintainer profusely apologizing to everyone and taking the blame is all it takes.

Recognize Your Power

You’re a maintainer, and you have essentially absolute power over what happens to your code. For successful projects that means you can unleash a lot of harm on people who for better or worse are employed to deal with you. One of the things that annoy me the most is when maintainers engage in petty status fights against subordinates, thinly veiled as technical discussions - you end up looking silly, and it just pisses everyone off. Instead recognize your powers, try to stay on the good side of the force and make sure you share it sufficiently with the contributors of your project.

Accept Your Limits

At the beginning you’re responsible for everything, and for a one-person project that’s all fine. But eventually the project grows too much and you’ll just become a dictator, and then failure is all but assured because we’re all human. Recognize what you don’t do well, build institutions to replace you. Recognize that the responsibility you initially took on might not be the same as that which you’ll end up with and either accept it, or move on. And do all that before you start burning out.

Be a Steward, Not a Lord

I think one of the key advantages of open source is that people stick around for a very long time, even when they switch jobs or move around. Maybe the usual “for life” qualifier isn’t really a great choice, since it sounds more like a mandatory sentence than something done by choice. What I object to is the “dictator” part, since if your goal is to grow a great community and maybe reach world domination, then you as the maintainer need to serve that community, and not have the community serve you.

Thanks a lot to Ben Widawsky, Daniel Stone, Eric Anholt, Jani Nikula, Karen Sandler, Kimmo Nikkanen and Laurent Pinchart for reading and commenting on drafts of this text.

January 20, 2017 12:00 AM

January 19, 2017

Matthew Garrett: Android apps, IMEIs and privacy

There's been a sudden wave of people concerned about the Meitu selfie app's use of unique phone IDs. Here's what we know: the app will transmit your phone's IMEI (a unique per-phone identifier that can't be altered under normal circumstances) to servers in China. It's able to obtain this value because it asks for a permission called READ_PHONE_STATE, which (if granted) means that the app can obtain various bits of information about your phone including those unique IDs and whether you're currently on a call.

Why would anybody want these IDs? The simple answer is that app authors mostly make money by selling advertising, and advertisers like to know who's seeing their advertisements. The more app views they can tie to a single individual, the more they can track that user's response to different kinds of adverts and the more targeted (and, they hope, more profitable) the advertising towards that user. Using the same ID between multiple apps makes this easier, and so using a device-level ID rather than an app-level one is preferred. The IMEI is the most stable ID on Android devices, persisting even across factory resets.

The downside of using a device-level ID is, well, whoever has that data knows a lot about what you're running. That lets them tailor adverts to your tastes, but there are certainly circumstances where that could be embarrassing or even compromising. Using the IMEI for this is even worse, since it's also used for fundamental telephony functions - for instance, when a phone is reported stolen, its IMEI is added to a blacklist and networks will refuse to allow it to join. A sufficiently malicious person could potentially report your phone stolen and get it blocked by providing your IMEI. And phone networks are obviously able to track devices using them, so someone with enough access could figure out who you are from your app usage and then track you via your IMEI. But realistically, anyone with that level of access to the phone network could just identify you via other means. There's no reason to believe that this is part of a nefarious Chinese plot.

Is there anything you can do about this? On Android 6 and later, yes. Go to Settings, hit Apps, hit the gear menu in the top right, choose "App permissions" and scroll down to Phone. Under there you'll see all apps that have permission to obtain this information, and you can turn them off. Doing so may cause some apps to crash or otherwise misbehave, whereas newer apps may simply ask you to grant the permission again and refuse to do so if you don't.

Meitu isn't especially rare in this respect. Over 50% of the Android apps I have handy request your IMEI, although I haven't tracked what they all do with it. It's certainly something to be concerned about, but there are big-name apps that do exactly the same thing. There's a legitimate question over whether Android should make it so easy for apps to obtain this level of identifying information without more explicit informed consent from the user, but until Google do anything to make it more difficult, apps will continue making use of this information. Let's turn this into a conversation about user privacy online rather than blaming one specific example.


January 19, 2017 11:36 PM

January 12, 2017

Pete Zaitcev: git-codereview

Not content with the legacy git-review, Google developed another Gerrit front-end, git-codereview. They use it for contributions to Go. I have to admit, that was a bit less of a special move than Facebook's git-review, which uses the same name but does something entirely different.

P.S. There used to be a post about creating a truly distributed github, which used blockchain in order to vote on globally unique names. Can't find a link though.

January 12, 2017 07:56 PM

January 08, 2017

Pete Zaitcev: Mirantis and the business of OpenStack

It seems that only in November we heard about massive layoffs at Mirantis, "The #1 Pure Play OpenStack Company" (per <title>). Now they are teaching us thus:

And what about companies like Mirantis adding Kubernetes and other container technologies to their slate? Is that a sign of the OpenStack Apocalypse?

In a word, “no”.

Gee, thanks. I'm sure they know what it's like.

January 08, 2017 06:07 PM

January 03, 2017

James Bottomley: TPM2 and Linux

Recently Microsoft started mandating TPM2 as a hardware requirement for all platforms running recent versions of Windows.  This means that eventually all shipping systems (starting with laptops first) will have a TPM2 chip.  The reason this impacts Linux is that TPM2 is radically different from its predecessor TPM1.2; so different, in fact, that none of the existing TPM1.2 software on Linux (trousers, the plug-in for openssl, even my gnome keyring enhancements) will work with TPM2.  The purpose of this blog is to explore the differences and how we can make ready for the transition.

What are the Main 1.2 vs 2.0 Differences?

The big one is termed Algorithm Agility.  TPM1.2 had SHA1 and RSA2048 only.  TPM2 is designed to support many possible algorithms, including elliptic curve and a host of government-mandated (Russian and Chinese) crypto systems.  There’s no requirement for any shipping TPM2 to support any particular algorithms, so you actually have to ask your TPM what it supports.  The bedrock for TPM2 in the West seems to be RSA1024-2048, ECC and AES for crypto and SHA1 and SHA256 for hashes.

What algorithm agility means is that you can no longer have root keys (EK and SRK, see here for details) like TPM1.2 did, because a key requires a specific crypto algorithm.  Instead TPM2 has primary “seeds” and a Key Derivation Function (KDF).  The way this works is that a seed is simply a long string of random numbers, but it is used as input to the KDF along with the key parameters and the algorithm, and out pops a real key based on the seed.  The KDF is deterministic, so if you input the same algorithm and the same parameters you get the same key again.  There are four primary seeds in the TPM2: three permanent ones which only change when the TPM2 is cleared: endorsement (EPS), Platform (PPS) and Storage (SPS).  There’s also a Null seed, which is used for ephemeral keys and changes every reboot.  A key derived from the SPS can be regarded as the SRK and a key derived from the EPS can be regarded as the EK.  Objects descending from these keys are called members of hierarchies.  One of the interesting aspects of the TPM is that the root of a hierarchy is a key not a seed (because you need to exchange secret information with the TPM), and that there can be multiple of these roots with different key algorithms and parameters.

Additionally, the mechanism for making use of keys has changed slightly.  In TPM 1.2 to import a secret key you wrapped it asymmetrically to the SRK and then called LoadKeyByBlob to get a use handle.  In TPM2 this is a two stage operation, firstly you import a wrapped (or otherwise protected) private key with TPM2_Import, but that returns a private key structure encrypted with the parent key’s internal symmetric key.  This symmetrically encrypted key is then loaded (using TPM2_Load) to obtain a use handle whenever needed.  The philosophical change is from online keys in TPM 1.2 (keys which were resident inside the TPM) to offline keys in TPM2 (keys which now can be loaded when needed).  This philosophy has been reinforced by reducing the space available to keep keys loaded in TPM2 (see later).

Playing with TPM2

If you have a recent laptop, chances are you either have or can software-upgrade to a TPM2.  I have a Dell XPS 13 (the Skylake version), which comes with a software-upgradeable Nuvoton TPM.  Dell kindly provides a 1.2->2 switching program here, which seems to work under Freedos (odin boot), so I have a physical TPM2-based system.  For those of you who aren’t so lucky, you can still play along, but you need a TPM2 emulator.  The best one is here; simply download and untar it, then type make in the src directory and run it as ./tpm_server.  It listens on two TCP ports, 2321 and 2322, for TPM commands, so there’s no need to install it anywhere; it can be run directly from the source directory.

After that, you need the interface software called tss2.  The source is here, but Fedora 25 and recent Ubuntu already package it.  I’ve also built openSUSE packages here.  The configuration of tss2 is controlled by environment variables.  The most important one is TPM_INTERFACE_TYPE which tells it how to connect to the TPM2.  If you’re using a simulator, you set this to “socsim” and if you have a real TPM2 device you set it to “dev”.  One final thing about direct device connection: in tss2 there’s no daemon like trousers had to broker the connection, all your users connect directly to the TPM2 device /dev/tpm0.  To do this, the device has to support read and write by arbitrary users, so its permissions need to be 0666.  I’ve got a udev script to achieve this

# tpm 2 devices need to be world readable
SUBSYSTEM=="tpm", ACTION=="add", MODE="0666"

Which goes in /etc/udev/rules.d/80-tpm-2.rules on openSUSE.  The next thing you need to do, if you’re running the simulator, is power it on and start it up (for a real device, this is done by the bios):
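The command listing appears to have been lost from this copy of the post. With the tss2 utilities, powering up and starting the simulator is typically done along these lines (an assumption based on the tss2 tool naming used elsewhere in this post; check the binary names your distribution ships):

```shell
# assumed tss2 utility names; on a real TPM the BIOS performs this step
export TPM_INTERFACE_TYPE=socsim
tsspowerup
tssstartup
```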


The simulator will now create a NVChip file wherever you started it to store NV ram based objects, which it will read on next start up.  The first thing you need to do is create an SRK and store it in NV memory.  Microsoft uses the well known key handle 81000001 for this, so we’ll do the same.  The reason for doing this is that a real TPM takes ages to run the KDF for RSA keys because it has to look for prime numbers:

jejb@jarvis:~> TPM_INTERFACE_TYPE=socsim time tsscreateprimary -hi o -st -rsa
Handle 80000000
0.03 user 0.00 system 0:00.06 elapsed

jejb@jarvis:~> TPM_INTERFACE_TYPE=dev time tsscreateprimary -hi o -st -rsa
Handle 80000000
0.04 user 0.00 system 0:20.51 elapsed

As you can see, the simulator created a primary storage key (the SRK) in a few milliseconds, but it took my real TPM2 20 seconds to do it … not something you want to wait for, hence the need to store this permanently under a well known key handle and get rid of the temporary copy.

tssevictcontrol -hi o -ho 80000000 -hp 81000001
tssflushcontext -ha 80000000

tssevictcontrol tells the TPM to copy the key at transient handle 80000000 to permanent NV handle 81000001, and tssflushcontext erases the transient key.  Flushing transient objects is very important, because TPM2 has a lot less transient storage space than TPM 1.2 did; usually only about three handles’ worth.  You can tell how much you have by doing

tssgetcapability -cap 6|grep -i transient
TPM_PT 0000010e value 00000003 TPM_PT_HR_TRANSIENT_MIN - the minimum number of transient objects that can be held in TPM RAM

Here the value (00000003) tells me that the TPM can store at least 3 transient objects; beyond that you’ll start getting out-of-space errors from it.
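If you want to see which transient slots are actually occupied before flushing, tssgetcapability can list handles as well. The -cap 1 / -pr form below is my recollection of the IBM tss2 handle query (capability TPM_CAP_HANDLES starting at the transient range); verify against your version’s usage output:

```shell
# List currently loaded transient object handles (range starts at 80000000)
tssgetcapability -cap 1 -pr 80000000

# Flush any handle you no longer need to free its slot
tssflushcontext -ha 80000000
```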

The final step in taking ownership of a TPM2 is to set the authorization passwords.  Each of the four hierarchies (Null, Owner, Endorsement, Platform) and the Lockout has a possible authority password.  The Platform authority is cleared on startup, so there’s not much point setting it (it’s used by the BIOS or Firmware to perform TPM functions).  Of the other four, you really only need to set Owner, Endorsement and Lockout (I use the same password for all of them).

tsshierarchychangeauth -hi l -pwdn <your password>
tsshierarchychangeauth -hi e -pwdn <your password>
tsshierarchychangeauth -hi o -pwdn <your password>

After this is done, you’re all set.  Note that as well as these authorizations, each object can have its own authorization (or even policy), so the SRK you created earlier still has no password, allowing it to be used by anyone.  Note also that the owner authorization controls access to the NV memory, so you’ll need to supply it now to make other objects persistent.
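As an example of per-object authorization, you can create a key under the now-persistent SRK that carries its own password. The -pwdk flag here is from the IBM tss2 tsscreate utility as I remember it (a -pwdp parent password would also be needed if you had given the SRK one):

```shell
# Create a storage key under the persistent SRK at 81000001,
# protected by its own password; the wrapped blobs land on disk.
tsscreate -hp 81000001 -st -opu key.pub -opr key.priv -pwdk keypassword

# Loading it gives a transient handle; the password is only demanded
# when the key is used, not at load time.
tssload -hp 81000001 -ipu key.pub -ipr key.priv
```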

An Aside about Real TPM2 devices and the Resource Manager

Although I’m using the code below to store my keys in the TPM2, there are a couple of practical limitations which mean it won’t work for you if you have multiple TPM2-using applications, unless you update your kernel.  The two problems are

  1. The Linux Kernel TPM2 device /dev/tpm0 only allows one user at once.  If a second application tries to open the device it will get an EBUSY which causes TSS_Create() to fail.
  2. Because most applications make use of transient key slots and most TPM2s have only a couple of these, simultaneous users can end up running out of these and getting unexpected out of space errors.

The solution to both of these is something called a Resource Manager (RM).  What the RM does is swap transient objects in and out of the TPM as needed, so that it never runs out of space.  Linux has a complication in that both the kernel and userspace are potential users of TPM keys, so the resource manager has to live inside the kernel.  Jarkko Sakkinen has preliminary resource manager patches here, and they will likely make it into kernel 4.11 or 4.12.  I’m currently running my laptop with the RM patches applied, so multiple applications work for me, but since these are preliminary patches I wouldn’t currently advise others to do this.  The way the patches work is that once you declare to the kernel via an ioctl that you want to use the RM, then every time you send a command to the TPM your context is swapped in, the command is executed, the context is swapped out and the response is sent, so no other user of the TPM sees your transient objects.  From the moment you send the ioctl, the TPM device also allows other users to open it.

Using TPM2 as a keystore

Once the implementation is sorted out, openssl and gnome-keyring patches can be produced for TPM2.  The only slight wrinkle is that create_tpm2_key requires a parent key to exist in NV storage (at the 81000001 handle we talked about previously).  So to convert a password-protected openssh RSA key to a TPM2-based one, you do

create_tpm2_key -a -p 81000001 -w id_rsa id_rsa.tpm
mv id_rsa.tpm id_rsa

And then the gnome keyring manager will work nicely (make sure you keep a copy of your original private key in case you want to move to a new laptop or reseed the TPM2).  If you use the same TPM2 key password as your original key password, you won’t even need to update the gnome login keyring for the new password.


Because of the lack of an in-kernel Resource Manager, TPM2 is ready for experimentation in Linux but definitely not ready for prime time yet (unless you’re willing to patch your kernel).  Hopefully this will change in the 4.11 or 4.12 kernel when the Resource Manager finally goes upstream.

Looking forward to the new stack, the lack of a central daemon is a really nice feature: tcsd crashing used to kill all of my TPM key based applications, but with tss2 having no central daemon, everything has just worked(tm) so far.  A kernel based RM also means that the kernel can happily use the TPM (for its trusted keys and disk encryption) without interfering with whatever userspace is doing.

January 03, 2017 12:55 AM

January 02, 2017

Paul E. Mc Kenney: Parallel Programming: January 2017 Update

Another year, another release of Is Parallel Programming Hard, And, If So, What Can You Do About It?!

Updates include:

  1. More formatting and build-system improvements, along with many bibliography updates, courtesy of Akira Yokosawa.
  2. A great many grammar and typo fixes from Akira and SeongJae Park.
  3. Numerous changes and fixes from Balbir Singh, Boqun Feng, Mike Rapoport, Praveen Kumar, and Tobias Klauser.
  4. Added code for concurrent skiplists, with the hope for added text in a later release.
  5. Added a running example to the deferred-processing chapter.
  6. Merged “Synchronization Primitives” into the “Tools of the Trade” section.
  7. Updated control-dependency discussion in memory-barriers section.
As always, git:// will be updated in real time.

January 02, 2017 05:23 PM

December 27, 2016

Pete Zaitcev: The idea of ARM has gone mainstream

We still don't have any usable servers on which I could install Fedora and have it supported for more than 3 releases, but gamers already debate the merits of ARM. The idea of SPEC-per-Watt has completely gone mainstream, like Marxism.

<sage> new uarch? it's about time
<sage> they just can't make x86 as power efficient as arm
<JTFish> What is the point
<JTFish> it's not like ARM will replace x86 in real servers any time soon
<sage> what is "real" servers?
<JTFish> anything that does REAL WORLD shit
<sage> what is "real world"?
<JTFish> serving internet content etc
<JTFish> database servers
<JTFish> I dunno
<JTFish> mass encoding of files
<sage> lots of startups and established companies are already betting on ARM for their cloud server offerings
<sage> database and mass encoding, ok
<sage> what else
<JTFish> are you saying
<JTFish> i'm 2 to 1
<JTFish> for x86
<JTFish> also I should just go full retard and say minecraft servers
<sage> the power savings are big, if they can run part of their operation on ARM and make it financially viable, they will do it

QUICK UPDATE: In the linked article:

The next Intel uArch will be very similar to the approach used by AMD with Zen – perfect balance of power consumption/performance/price – but with a huge news: in order to save physical space (Smaller Die) and to improve the power consumption/performance ratio, Intel will throw away some old SIMD and old hardware remainders.

The 100% backward hardware x86 compatibility will not guaranteed anymore, but could be not a handicap (Some SIMD today are useless, and also we can use emulators or cloud systems). Nowadays a lot of software house have to develop code for ARM and for x86, but ARM is lacking useful SIMD. So, frequently, these software are a watered-down compromise.

Intel will be able to develop a thin and fast x86 uArch, and ICC will be able to optimize the code both for ARM and for x86 as well.

This new uArch will be ready in 2019-2020.

Curious. Well, as long as they don't go full Transmeta on us, it may be fine.

December 27, 2016 06:51 PM