Kernel Planet

July 20, 2017

Paul E. McKenney: Parallel Programming: Getting the English text out of the way

We have been making good progress on the next release of Is Parallel Programming Hard, And, If So, What Can You Do About It?, and hope to have a new release out soonish.

In the meantime, for those of you for whom the English text in this book has simply gotten in the way, there is now an alternative:

[Image: perfbook Chinese-edition cover]

On the off-chance that any of you are seriously interested, this is available from
Amazon China, JD.com, Taobao.com, and Dangdang.com. For the rest of you, you have at least seen the picture.  ;–)

July 20, 2017 02:37 AM

July 18, 2017

Matthew Garrett: Avoiding TPM PCR fragility using Secure Boot

In measured boot, each component of the boot process is "measured" (ie, hashed and that hash recorded) in a register in the Trusted Platform Module (TPM) built into the system. The TPM has several different registers (Platform Configuration Registers, or PCRs) which are typically used for different purposes - for instance, PCR0 contains measurements of various system firmware components, PCR2 contains any option ROMs, PCR4 contains information about the partition table and the bootloader. The allocation of these is defined by the PC Client working group of the Trusted Computing Group. However, once the boot loader takes over, we're outside the spec[1].

One important thing to note here is that the TPM doesn't actually have any ability to directly interfere with the boot process. If you try to boot modified code on a system, the TPM will contain different measurements but boot will still succeed. What the TPM can do is refuse to hand over secrets unless the measurements are correct. This allows for configurations where your disk encryption key can be stored in the TPM and then handed over automatically if the measurements are unaltered. If anybody interferes with your boot process then the measurements will be different, the TPM will refuse to hand over the key, your disk will remain encrypted and whoever's trying to compromise your machine will be sad.
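
The extend operation these measurements rely on can be sketched in a few lines: a PCR is never written directly, only extended, by hashing the old value together with a new measurement. This is an illustrative Python model (SHA-256 chosen purely for the example; real TPMs offer several hash banks):

```python
import hashlib

def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
    # A PCR can only be extended: new = H(old || measurement)
    return hashlib.sha256(pcr + measurement).digest()

# A PCR starts at all zeroes and accumulates each boot component's hash.
pcr = bytes(32)
for component in [b"firmware", b"bootloader", b"kernel"]:
    pcr = pcr_extend(pcr, hashlib.sha256(component).digest())

# Change any component (or the order) and the final value differs,
# so a secret sealed to the original value stays locked.
tampered = bytes(32)
for component in [b"firmware", b"evil-bootloader", b"kernel"]:
    tampered = pcr_extend(tampered, hashlib.sha256(component).digest())

assert pcr != tampered
```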

The problem here is that a lot of things can affect the measurements. Upgrading your bootloader or kernel will do so. At that point if you reboot your disk fails to unlock and you become unhappy. To get around this your update system needs to notice that a new component is about to be installed, generate the new expected hashes and re-seal the secret to the TPM using the new hashes. If there are several different points in the update where this can happen, this can quite easily go wrong. And if it goes wrong, you're back to being unhappy.

Is there a way to improve this? Surprisingly, the answer is "yes" and the people to thank are Microsoft. Appendix A of a basically entirely unrelated spec defines a mechanism for measuring the UEFI Secure Boot policy and the keys used during boot into PCR 7 of the TPM. The idea here is that you trust your OS vendor (since otherwise they could just backdoor your system anyway), so anything signed by your OS vendor is acceptable. If someone tries to boot something signed by a different vendor then PCR 7 will be different. If someone disables secure boot, PCR 7 will be different. If you upgrade your bootloader or kernel, PCR 7 will be the same. This simplifies things significantly.

I've put together a (not well-tested) patchset for Shim that adds support for including Shim's measurements in PCR 7. In conjunction with appropriate firmware, it should then be straightforward to seal secrets to PCR 7 and not worry about things breaking over system updates. This makes tying things like disk encryption keys to the TPM much more reasonable.

However, there's still one pretty major problem, which is that the initramfs (ie, the component responsible for setting up the disk encryption in the first place) isn't signed and isn't included in PCR 7[2]. An attacker can simply modify it to stash any TPM-backed secrets or mount the encrypted filesystem and then drop to a root prompt. This, uh, reduces the utility of the entire exercise.

The simplest solution to this that I've come up with depends on how Linux implements initramfs files. In its simplest form, an initramfs is just a cpio archive. In its slightly more complicated form, it's a compressed cpio archive. And in its peak form of evolution, it's a series of compressed cpio archives concatenated together. As the kernel reads each one in turn, it extracts it over the previous ones. That means that any files in the final archive will overwrite files of the same name in previous archives.
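
The overwrite behaviour can be modelled in a few lines; plain dicts stand in for cpio archives here (no real cpio parsing, file names purely illustrative):

```python
def unpack_layers(archives):
    """Model of the kernel's initramfs unpacking: each archive is
    extracted over the previous ones, so later files win."""
    rootfs = {}
    for archive in archives:
        rootfs.update(archive)  # same-name entries overwrite earlier ones
    return rootfs

user_initramfs = {"/init": "distro init", "/etc/conf": "user config"}
signed_tail = {"/init": "measured TPM-unseal init"}

# Appending the signed archive last means its /init replaces the user
# one, while unrelated user files survive.
final = unpack_layers([user_initramfs, signed_tail])
assert final["/init"] == "measured TPM-unseal init"
assert final["/etc/conf"] == "user config"
```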

My proposal is to generate a small initramfs whose sole job is to get secrets from the TPM and stash them in the kernel keyring, and then measure an additional value into PCR 7 in order to ensure that the secrets can't be obtained again. Later disk encryption setup will then be able to set up dm-crypt using the secret already stored within the kernel. This small initramfs will be built into the signed kernel image, and the bootloader will be responsible for appending it to the end of any user-provided initramfs. This means that the TPM will only grant access to the secrets while trustworthy code is running - once the secret is in the kernel it will only be available for in-kernel use, and once PCR 7 has been modified the TPM won't give it to anyone else. A similar approach for some kernel command-line arguments (the kernel, module-init-tools and systemd all interpret the kernel command line left-to-right, with later arguments overriding earlier ones) would make it possible to ensure that certain kernel configuration options (such as the iommu) weren't overridable by an attacker.
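
The same last-wins idea applies to the kernel command line; a simplified parser makes it concrete (iommu=force here is just an example option for illustration):

```python
def effective_options(cmdline: str) -> dict:
    """Simplified model: the kernel command line is read left to right,
    so a later occurrence of an option overrides an earlier one."""
    opts = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        opts[key] = value or True  # last assignment wins
    return opts

user_part = "root=/dev/sda2 iommu=off"   # attacker-editable portion
signed_part = "iommu=force"              # appended by trusted bootloader

opts = effective_options(user_part + " " + signed_part)
assert opts["iommu"] == "force"          # the signed suffix wins
assert opts["root"] == "/dev/sda2"
```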

There's obviously a few things that have to be done here (standardise how to embed such an initramfs in the kernel image, ensure that luks knows how to use the kernel keyring, teach all relevant bootloaders how to handle these images), but overall this should make it practical to use PCR 7 as a mechanism for supporting TPM-backed disk encryption secrets on Linux without introducing a huge support burden in the process.

[1] The patchset I've posted to add measured boot support to Grub uses PCRs 8 and 9 to measure various components during the boot process, but other bootloaders may have different policies.

[2] This is because most Linux systems generate the initramfs locally rather than shipping it pre-built. It may also get rebuilt on various userspace updates, even if the kernel hasn't changed. Including it in PCR 7 would reintroduce all the fragility this approach avoids and defeat the point of the entire exercise.


July 18, 2017 06:48 AM

July 13, 2017

Linux Plumbers Conference: VFIO/IOMMU/PCI Microconference Accepted into Linux Plumbers Conference

Following on from the successful PCI Microconference at Plumbers last year, we’re pleased to announce a follow-on this year with an expanded scope.

The agenda this year will focus on overlap and common development between the VFIO/IOMMU/PCI subsystems, and in particular how consolidation of the shared virtual memory (SVM) API can drive an even tighter coupling between them.

This year we will also focus on user visible aspects such as using SVM to share page tables with devices and reporting I/O page faults to userspace in addition to discussing PCI and IOMMU interfaces and potential improvements.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

July 13, 2017 05:20 PM

July 11, 2017

Linux Plumbers Conference: Power Management and Energy-awareness Microconference Accepted into Linux Plumbers Conference

Following on from the successful Power Management and Energy-awareness Microconference at Plumbers last year, we’re pleased to announce a follow-on this year.

The agenda this year will focus on a range of topics, including CPUfreq core improvements and schedutil governor extensions, how best to use scheduler signals to balance energy consumption and performance, and user space interfaces to control capacity and utilization estimates. We’ll also discuss selective throttling in thermally constrained systems, runtime PM for ACPI, CPU cluster idling, and the possibility of implementing resume from hibernation in a bootloader.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

July 11, 2017 04:15 PM

James Morris: Linux Security Summit 2017 Schedule Published

The schedule for the 2017 Linux Security Summit (LSS) is now published.

LSS will be held on September 14th and 15th in Los Angeles, CA, co-located with the new Open Source Summit (which includes LinuxCon, ContainerCon, and CloudCon).

The cost of LSS for attendees is $100 USD. Register here.

Highlights from the schedule include the following refereed presentations:

There’ll also be the usual Linux kernel security subsystem updates, and BoF sessions (with LSM namespacing and LSM stacking sessions already planned).

See the schedule for full details of the program, and follow the twitter feed for the event.

This year, we’ll also be co-located with the Linux Plumbers Conference, which will include a containers microconference with several security development topics, and likely also a TPMs microconference.

A good critical mass of Linux security folk should be present across all of these events!

Thanks to the LSS program committee for carefully reviewing all of the submissions, and to the event staff at Linux Foundation for expertly planning the logistics of the event.

See you in Los Angeles!

July 11, 2017 11:30 AM

July 10, 2017

Kees Cook: security things in Linux v4.12

Previously: v4.11.

Here’s a quick summary of some of the interesting security things in last week’s v4.12 release of the Linux kernel:

x86 read-only and fixed-location GDT
With kernel memory base randomization, it was still possible to figure out the per-cpu base address via the “sgdt” instruction, since it would reveal the per-cpu GDT location. To solve this, Thomas Garnier moved the GDT to a fixed location. And to solve the risk of an attacker targeting the GDT directly with a kernel bug, he also made it read-only.

usercopy consolidation
After hardened usercopy landed, Al Viro decided to take a closer look at all the usercopy routines and then consolidated the per-architecture uaccess code into a single implementation. The per-architecture code was functionally very similar to each other, so it made sense to remove the redundancy. In the process, he uncovered a number of unhandled corner cases in various architectures (that got fixed by the consolidation), and made hardened usercopy available on all remaining architectures.

ASLR entropy sysctl on PowerPC
Continuing to expand architecture support for the ASLR entropy sysctl, Michael Ellerman implemented the calculations needed for PowerPC. This lets userspace choose to crank up the entropy used for memory layouts.

LSM structures read-only
James Morris used __ro_after_init to make the LSM structures read-only after boot. This removes them as a desirable target for attackers. Since the hooks are called from all kinds of places in the kernel this was a favorite method for attackers to use to hijack execution of the kernel. (A similar target used to be the system call table, but that has long since been made read-only.)

KASLR enabled by default on x86
With many distros already enabling KASLR on x86 with CONFIG_RANDOMIZE_BASE and CONFIG_RANDOMIZE_MEMORY, Ingo Molnar felt the feature was mature enough to be enabled by default.

Expand stack canary to 64 bits on 64-bit systems
The stack canary value used by CONFIG_CC_STACKPROTECTOR is most powerful on x86 since it is different per task. (Other architectures run with a single canary for all tasks.) While the first canary chosen on x86 (and other architectures) was a full unsigned long, the subsequent canaries chosen per-task for x86 were being truncated to 32 bits. Daniel Micay fixed this, so now x86 (and future architectures that gain per-task canary support) have significantly increased entropy for stack-protector.
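
The entropy gain is easy to quantify with a toy sketch (illustrative only; the kernel derives its canaries differently):

```python
import secrets

def per_task_canary(bits: int) -> int:
    # Model of a per-task canary drawn with the given number of random bits.
    return secrets.randbits(bits)

old = per_task_canary(32)   # pre-fix: truncated to 32 bits on x86-64
new = per_task_canary(64)   # post-fix: full unsigned long

assert old < 2**32
assert new < 2**64
# The fix multiplies the brute-force search space by 2**32 (~4 billion).
assert (2**64) // (2**32) == 2**32
```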

Expanded stack/heap gap
Hugh Dickins, with input from many other folks, improved the kernel’s mitigation against having the stack and heap crash into each other. This is a stop-gap measure to help defend against the Stack Clash attacks. Additional hardening needs to come from the compiler to produce “stack probes” when doing large stack expansions. Any Variable Length Arrays on the stack or alloca() usage needs to have machine code generated to touch each page of memory within those areas to let the kernel know that the stack is expanding, but with single-page granularity.

That’s it for now; please let me know if I missed anything. The v4.13 merge window is open!

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

July 10, 2017 08:24 AM

Dave Airlie (blogspot): radv and the vulkan deferred demo - no fps left behind!

A little while back I took to wondering why one particular demo from the Sascha Willems vulkan demos was a lot slower on radv compared to amdgpu-pro. Like half the speed slow.

I internally titled this my "no fps left behind" project.

The deferred demo does an offscreen rendering pass to three 2048x2048 color attachments and one 2048x2048 D32S8 depth attachment. It then renders those down to a 1280x720 screen image.

Bas identified that the first cause was probably the fact that we were doing clear color eliminations on the offscreen surfaces when we didn't need to. AMD GPUs have a delta-color compression feature, and with certain clear values you don't need to do the clear color elimination step. This brought me back from about 1/2 the FPS to about 3/4; however, it took me quite a while to figure out where the rest of the FPS were hiding.

I took a few diversions in my testing. I pulled in some experimental patches to allow the depth buffer to be texture-cache compatible, so we could bypass the depth decompression pass; however, this didn't seem to budge the numbers much.

I found a bunch of registers we were setting to different values from -pro; nothing much came of these.

I found some places where we were using a compute shader to fill some DCC or htile surfaces to a value, then doing a clear and overwriting the values. Not much help.

I noticed the vertex descriptions and buffer attachments on amdgpu-pro were done quite differently from how radv does it. With vulkan you have vertex descriptors and bindings; with radv we generate a set of hw descriptors from the combination of both descriptors and bindings. The pro driver uses typed buffer loads in the shader to embed the descriptor contents in the shader, then only updates the hw descriptors for the buffer bindings. This seems like it might be more efficient. Guess what: no help. (LLVM just grew support for typed buffer loads, so we could probably move to this scheme if we wished now.)

I dug out some patches that inline all the push constants and some descriptors so our shaders had less overhead (this really helps our meta shaders have less impact). No help.

I noticed they export the fragment shader results in a different order, and always at the end (no help). The vertex shader emits pos first (no help). The vertex shader uses off exports for unused channels (no help).

I went on holidays for a week and came back to stare at the traces again, when my brain finally noticed something I'd missed. When binding the 3 color buffers, the addresses given as the base address were unusual. A surface has a 40-bit address; normally, for alignment and tiling, the bottom 16 bits are 0, and we shift 8 of those off completely before writing them. That should leave the bottom 8 bits of the written base address at 0, and the CIK docs from AMD say they must be. However, the pro traces didn't have these at 0. It appears from earlier evergreen/cayman documents that these register bits control some tiling offset. After writing a hacky patch to set the values, I managed to get back the rest of the FPS I was missing in the deferred demo. I discussed this with AMD developers, and we worked out that the addrlib library has an API for working out these values, and it seems that using them allows better memory bandwidth utilisation. I've written a patch to try and use these values correctly and sent it out along with the DCC avoidance patch.
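
The address encoding in question can be sketched as follows (a hypothetical helper following the numbers in the paragraph: a 40-bit base address, bottom 16 bits zero by alignment, 8 bits shifted off before the register write):

```python
def encode_base(addr: int) -> int:
    """Model of the colour-buffer base register write: the 40-bit
    surface address is stored shifted right by 8."""
    assert addr < 1 << 40 and addr & 0xFFFF == 0  # 64 KiB aligned
    return addr >> 8

addr = 0x80_0001_0000          # example aligned 40-bit address
reg = encode_base(addr)
assert reg == 0x8000_0100
# Per the CIK docs the low 8 bits of the stored value would be zero,
# which is exactly where the pro driver tucks its tiling offset bits.
assert reg & 0xFF == 0
```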

Now, I'm not sure this will help any real apps; we may not be hitting limitations in that area, and I'm never happy with the benchmarks I run myself. I thought I saw some FPS difference with some madmax scenes, but I might be lying to myself. Once the patches land in mesa I'm sure others will run benchmarks and we can see if there is any use case where they have an effect. The AMD radeonsi OpenGL driver can also apply the same tweaks, so hopefully there will be some benefit there as well.

Otherwise I can just write this off as bringing the deferred demo to parity and removing at least one of the deltas that radv has compared to the pro driver. Some of the other differences I discovered along the way might also show promise in other scenarios, so I'll keep an eye on them.

Thanks to Bas, Marek and Christian for looking into what the magic meant!

July 10, 2017 08:08 AM

Dave Airlie: Migrating to blogspot

Due to lots of people telling me LJ is bad, mm'kay, I've migrated to blogspot.

New blog is/will be here: https://airlied.blogspot.com

July 10, 2017 06:36 AM

Dave Airlie (blogspot): Migrating my blog here

I'm moving my blog from LJ to blogspot, because people keep telling me LJ is up to no good, like hacking DNC servers and interfering in elections.

July 10, 2017 06:29 AM

July 08, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/07/07

Audio: http://traffic.libsyn.com/jcm/20170707.mp3

Linux 4.12 final is released, the 4.13 merge window opens, and various assorted ongoing kernel development is described in detail.

Editorial note

Reports of this podcast’s demise are greatly exaggerated. But it is worth noting that recording this weekly is HARD. That said, I am going to work on automation (I want the podcast to effectively write itself by providing a web UI over LKML threads that allows anyone to write summaries, add author bios, links, etc. – and expand this to other communities), but that will all take some time. Until that happens, we’ll just have to live with some breaks.

Announcements

Linus Torvalds announced Linux 4.12 final. In his announcement mail, Linus reflects that “4.12 is just plain big”, noting that, this was “one of the bigger releases historically, and I think only 4.9 ends up having had more commits. And 4.9 was big at least partly because Greg announced it was an LTS [Long Term Support – receiving updates for several years] kernel”. In pure numbers, 4.12 adds over a million lines of code over 4.11, about half of which can be attributed to enablement for the AMD Vega GPU support. As usual, both Linux Weekly News (LWN) and KernelNewbies have excellent, and highly detailed summaries. Listeners are encouraged to support real kernel journalism by subscribing to Linux Weekly News and visiting lwn.net.

Theodore (Ted) Ts’o posted “Next steps and plans for the 2017 Maintainer and Kernel Summits”. He reminds everyone of the (slightly) revised format of this year’s Kernel Summit (which is, as is often the case, co-located with a Linux Foundation event, in this case the Open Source Summit Prague in October). Notably, a program committee has been established to help encourage submissions from those who feel they should be present at the event. To learn more, see the mailing list archives containing the announcement: https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss (technically the deadline has already passed, or is tomorrow, depending)

Greg K-H (Kroah-Hartman) announced Linux 4.4.76, 4.9.36, and 4.11.9.

Willy Tarreau announced Linux 3.10.106, including a reminder that this “LTS” [Long Term Stable] kernel is “scheduled for end of life on end of October”.

Steven Rostedt released preempt-rt (“Real Time”) kernels 3.10.107-rt122, 3.18.59-rt65, 4.4.75-rt88, and 4.9.35-rt25, all of which were simply rebases to stable kernel updates and had “no RT specific changes”. It will be interesting to see if some of the hotplug fixes Thomas Gleixner has sent for Linux 4.13 will resolve issues seen by some RT users when doing hotplug.

Sebastian Andrzej Siewior announced preempt-rt (“Real time”) kernels v4.9.33-rt23, and v4.11.7-rt3, which still notes potential for a deadlock under CPU hotplug.

Stephen Hemminger announced iproute2 version 4.12.0, matching Linux 4.12. This includes support for features present in the new kernel, including flower support and enhancements to the TC (Traffic Control) code: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.12.0.tar.gz

Bartosz Golaszewski posted libgpiod v0.3:
https://github.com/brgl/libgpiod/releases/tag/v0.3

Mathieu Desnoyers announced LTTng modules 2.10.0-rc2, 2.9.3, 2.8.6, including support for “4.12 release candidate kernels”.

The 4.13 merge window

With the opening of the 4.13 merge window, many pull requests have begun flowing for what will become the new hotness in another couple of months. We won’t summarize each in detail (that resulted in a one hour long podcast the last time…) but will instead call out a few “interesting” changes of note. Stephen Rothwell also promptly updated his daily linux-next tree with the usual disclaimer that “Please do not add any v4.14 material to you[r] linux-next included branches until after v4.13-rc1 has been released”.

ACPI. Rafael J. Wysocki posted “ACPI updates for v4.13-rc1”, which includes an update to the ACPICA (ACPI Component Architecture) release of 20170531 that adds support to the OS-independent ACPICA layer for ACPI 6.2. This includes a number of new tables, including the PPTT (Processor Properties and Topology Table) that some of us have wanted to see for many years (as a means to more fully describe the NUMA properties of ARM servers, as just a random example…). In addition, Kees Cook has done some work to clean up the use of function pointer structures in ACPICA to use “designated initializers” so as “to make the structure layout randomization GCC plugin work with it”. All in all, this is a nice set of updates for all architectures.

AppArmor. John Johansen noted in his earlier pull request (to James Morris, who owns overall security subsystem pull requests headed to Linus) that an attempt was being made to get many of the Ubuntu specific AppArmor patches upstreamed. The 4.13 patches “introduces the domain labeling base code that Ubuntu has been carrying for several years”. He then plans to begin to RFC other Ubuntu-specific patches in later cycles.

ARM. Arnd Bergmann notes a number of changes to 64-bit ARM platforms, including work done by Timur Tabi to change kernel def(ault)config files to “enable[s] a number of options that are typically required for server platforms”. It’s only been many years since this should have been the case in upstream Linux. Meanwhile, in a separate pull for “ARM: 64-bit DT [DeviceTree] updates”, support is added for many new boards (“For the first time I can remember, this is actually larger than the corresponding branch for 32-bit platforms”) including new varieties of “OrangePi” based on Allwinner chipsets.

Docs. Jon(athan) Corbet had noted that “You’ll also encounter more than the usual number of conflicts, which is saying something”. Linus “fixed the ones that were actual data conflicts”, but he had some suggestions for how Kbuild could be modified such that a “make allmodconfig” checked for the existence of the various files referenced in the rst documentation source files. He also noted that he was happy to see docbook “finally gone” but that sphinx, the tool used to generate documentation now, “isn’t exactly a speed demon”.

Hotplug. As noted elsewhere, Thomas Gleixner posted a pull request for various smp hotplug fixes that includes replacing an “open coded RWSEM [Read Write Semaphore] with a percpu RWSEM”. This is done to enable full coverage by the kernel’s “lockdep” locking dependency checker in order to catch hotplug deadlocks that have been seen on certain RT (Real Time) systems.

IRQ. Thomas Gleixner posted “irq updates for 4.13”, which includes “Expand the generic infrastructure handling the irq migration on CPU hotplug and convert X86 over to it” in preparation for cleaning up affinity management on blk multiqueue devices (preventing interrupts being moved around during hotplug by instead shutting down affine interrupts intended to be always routed to a specific CPU). Thomas notes that “Jens [the blk maintainer] acked them and agreed that they should go with the irq changes”, but Linus later pushed back strongly after hitting merge conflicts that made him feel that some of these changes should have gone in via the blk tree instead of clashing with it. Linus was also concerned if the onlining code worked at all.

Objtool. Ingo Molnar posted a pull request including changes to objtool intended to allow the tracking of stack pointer modifications through “machine instructions of disassembled functions found in kernel .o files”. The idea is to remove the dependency on compiling the kernel with the CONFIG_FRAME_POINTERS=y option (which causes a larger stack frame and possible additional register pressure on some architectures) while still retaining the ability to generate correct kernel debuginfo data in the future.

PCI. Thomas Gleixner posted “x86/PCI updates for 4.13”, which includes work to separate PCI config space accessors from using a global PCI lock. Apparently, x86 already had an additional PCI config lock and so two layers of redundant locking were being employed, while neither was strictly necessary in the case of ECAM (“mmconfig”) based configuration, since “access to the extended configuration space [MMIO based configuration in PCIe] does not require locking”. Thomas also notes that a commit which had switched x86 to use ECAM [the MMIO mode] by default was removed so it will still use “type1 accessors” (the “old fashioned way” that Linus is so happy with) serialized by x86 internal locking for primary configuration space. This set of patches came in through x86 via Thomas with Bjorn Helgaas’s (PCI maintainer) permission.

RCU. Ingo Molnar noted that “The sole purpose of these changes is to shrink and simplify the RCU code base, which has suffered from creeping bloat”.

Scheduler. Ingo Molnar posted a pull request that included a number of changes, among them being NUMA scheduling improvements to address regressions seen when comparing 4.11 based kernels to older ones, from Rik van Riel.

VFS. Al Viro went to town with VFS updates split into more than 10 parts (yes, really, actually 11 as of this writing). These are caused by various intrusive changes which impact many parts of the kernel tree. Linus said he would “*much* rather do five separate pull requests where each pull has a stated reason and target, than do one big mixed-up one”. Which is good because Viro promised many more than 5. Patch series number 11 got the most feedback so far.

X86. Ingo Molnar also went to town, in typical fashion, with many different updates to the kernel. These included mm changes enabling more Intel 5-level paging features (switching the “GUP” or “Get User Pages” code over to the newer generic kernel implementation shared by other architectures), and “[C]ontinued work to add PCID [Process Context ID] support”. Per-process context IDs allow for TLB (Translation Lookaside Buffer – the micro caches that store virtual to physical memory translations following page table walks by the hardware walkers) flush infrastructure optimizations on legacy architectures such as x86 that do not have certain TLB hardware optimizations. Ingo also posted microcode updates that include support for saving microcode pointers and wiring them up for use early in the “resume-from-RAM” case, and fixes to the Hyper-V guest support that add a synthetic CPU MSR (Model Specific Register) providing the CPU TSC frequency to the guest.

Ongoing Development

ARM. Will Deacon posted the fifth version of a patch series entitled “Add support for the ARMv8.3 Statistical Profiling Extension”, which provides a linear, virtually addressed memory buffer containing statistical samples (subject to various filtering) related to processor operations of interest that are performed by running (application) code. Sample records take the form of “packets”, which contain very detailed amounts of information, such as the virtual PC (Program Counter) address of a branch instruction, its type (conditional, unconditional, etc.), number of cycles waiting for the instruction to issue, the target, cycles spent executing the branch instruction, associated events (e.g. misprediction), and so on. Detailed information about the new extension is available in the ARM ARM, and is summarized in a blog post, here: https://community.arm.com/processors/b/blog/posts/statistical-profiling-extension-for-armv8-a

RISC-V. Palmer Dabbelt posted v4 of the enablement patch series adding support for the Open Source RISC-V architecture (which will then require various enablement for specific platforms that implement the architecture). In his patch posting, he notes changes from the previous version 3 that include disabling cmpxchg64 (a 64-bit instruction that performs an “atomic” compare and exchange operation, but which isn’t atomic on 32-bit systems) on 32-bit, adding an ELF_HWCAP (hardware capability) within binaries in order for users to determine the ISA of the machine, and various other miscellaneous changes. He asks for consideration that this be merged during the ongoing merge window for 4.13, which remains to be seen. We will track this in future episodes.
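
The compare-and-exchange semantics mentioned above, and why a 64-bit version can't simply be offered on a 32-bit machine, can be sketched with a lock-based emulation (purely illustrative; a real cmpxchg is a single hardware instruction):

```python
import threading

class AtomicCell:
    """Emulated atomic cell: cmpxchg writes `new` only if the current
    value still equals `expected`, and returns the value it observed.
    On a 32-bit machine a 64-bit value spans two words, so the update
    cannot be one atomic instruction -- hence cmpxchg64 is disabled on
    32-bit RISC-V."""
    def __init__(self, value: int):
        self._value = value
        self._lock = threading.Lock()

    def cmpxchg(self, expected: int, new: int) -> int:
        with self._lock:
            observed = self._value
            if observed == expected:
                self._value = new
            return observed

cell = AtomicCell(5)
assert cell.cmpxchg(5, 7) == 5   # matched: value is now 7
assert cell.cmpxchg(5, 9) == 7   # stale expectation: write refused
```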

FOLL_FORCE. Keno Fischer noted that “Yes, people use FOLL_FORCE”, referencing a commit from Linus in which an effort had been made to “try to remove use of FOLL_FORCE entirely” on the procfs (/proc) filesystem. Keno says “We used these semantics as a hardening mechanism in the julia JIT. By opening /proc/self/mem and using these semantics, we could avoid needing RWX pages, or a dual mapping approach”. In other words, they cheat and don’t setup direct RWX mappings ahead of time but instead get access to them via the backdoor using the kernel’s “/proc/self/mem” interface directly. Linus replied, “Oh, we’ll just re-instate the kernel behavior, it was more an optimistic “maybe nobody will notice” thing, and apparently people did notice”.

GICv4. Marc Zyngier posted version 2 of a patch series entitled “irqchip: KVM: Add support for GICv4”, a “(monster of a) series [that] implements full support for GICv4, bringing direct injection of MSIs [Message Signalled Interrupts] to KVM on arm and arm64, assuming you have the right hardware (which is quite unlikely)”. Marc says that the “stack has been *very lightly* tested on an arm64 model, with a PCI virtio block device passed from the host to a guest (using kvmtool and Jean-Philippe Brucker’s excellent VFIO support patches). As it has never seen any HW, I expect things to be subtly broken, so go forward and test if you can, though I’m mostly interested in people reviewing the code at the moment”. It’s awesome to see 64-bit ARM systems on par with legacy architectures when it comes to VM interrupt injection.

GPIO. Andy Shevchenko posted a patch (with Linus Walleij’s approval) noting that Intel would help to maintain GPIO ACPI support in the GPIO subsystem.

Hardlockup. Nicholas Piggin posted “[RFC] arch hardlockup detector interfaces improvement” which aims to “make it easier for architectures that have their own NMI / hard lockup detector to reuse various configuration interfaces that are provided by generic detectors (cmdline, sysctl, suspend/resume calls)”. He “do[es] this by adding a separate CONFIG_SOFTLOCKUP_DETECTOR [kernel configuration option], and juggling around what goes under config options. HAVE_NMI_WATCHDOG continues to be the config for arch to override the hard lockup detector, which is expanded to cover a few more cases”.

HMM. Jérôme Glisse posted “Cache coherent device memory (CDM) with HMM” which layers above his previous HMM (Heterogeneous Memory Management) patches to provide a generic means to manage device memory that behaves much like regular system memory but may still need managing “in isolation from regular memory” (for any number of reasons, including NUMA effects). This is particularly useful in the case of a coherently attached system bus being used to connect on-device memory, such as CAPI or CCIX. [disclaimer: this author chairs the CCIX software working group]

Hyper-V. KY Srinivasan posted an updated version of his “Hyper-V: paravirtualized remote TLB flushing and hypercall improvements” patches, which aim to optimize the case of remote TLB flushing on other vCPUs within a guest. TLBs are small caches that store VA (Virtual Address) to PA (Physical Address) translations; entries covering a process’s VMAs (Virtual Memory Areas) must be invalidated during a context switch from one process to another. Typically, an Operating System may either utilize an IPI (Inter-Processor Interrupt) to schedule a remote function on other CPUs that will tear down their TLB entries, or – on more enlightened and sophisticated modern computer architectures – may perform a hardware broadcast invalidation instruction that achieves the same without the gratuitous overhead. On x86 systems, IPIs are commonly used by guest operating systems, and their impact can be reduced by providing special guest hypercalls allowing for hypervisor assistance in place of broadcast IPIs. Jork Loeser also posted a patch updating the Hyper-V vPCI driver to “use the Server-2016 version of the vPCI protocol, fixing MSI creation”.

ILP32. Yury Norov posted version 8 of a patch series entitled “ILP32 for ARM64” which aims to enable support for the optional “ILP32” userspace ABI – in which ints, longs, and pointers are all 32 bits – on 64-bit ARM processors. In ways similar to “x32” on 64-bit “x86” systems, ILP32 aims to provide the benefits of the new ARMv8 ISA without having to use 64-bit data types and pointers for code that doesn’t actually require such large data or a large address space. Pointers (pun intended) are provided to an example kernel, GLIBC, and an OpenSuSE-based Linux distribution built against the newer ABI.

IMC Instrumentation Support. Madhavan Srinivasan posted version 10 of a patch series entitled “IMC Instrumentation Support” which aims to provide support for “In-Memory-Collection” infrastructure present in IBM POWER9 processors. IMC apparently “contains various Performance Monitoring Units (PMUs) at Nest level (these are on-chip but off-core), Core level and Thread level. The Nest PMU counters are handled by a Nest IMC microcode which runs in the OCC (On-Chip Controller) complex. The microcode collects the counter data and moves the nest IMC counter data to memory”. This effectively seems to be a microcontroller managed mechanism for providing certain core and uncore counter data using a standardized interface.

Intel FPGA Device Drivers. Wu Hao posted version 2 of a patch series entitled “Intel FPGA Device Drivers”, which “provides interfaces for userspace applications to configure, enumerate, open and access FPGA accelerators on platforms equipped with Intel(R) PCIe based FPGA solutions and enables system level management functions such as FPGA partial reconfiguration, power management and virtualization”. In other words, many of the capabilities required for datacenter level deployment of PCIe-attached FPGA accelerators.

Interconnects. Georgi Djakov posted version 2 of a patch series entitled “Introduce on-chip interconnect API”, which aims to provide a generic API to help manage the many varied high performance interconnects present on modern high-end System-on-Chip “processors”. As he notes, “Modern SoCs have multiple processors and various dedicated cores (video, gpu, graphics, modem). These cores are talking to each other and can generate a lot of data flowing through the on-chip interconnects. These interconnect buses could form different topologies such as crossbar, point to point buses, hierarchical buses or use the network-on-chip concept”. The API provides an ability (subject to hardware support thereof) to control bandwidth use, QoS (Quality-of-Service), and other settings. It also includes code to enable the Qualcomm msm8916 interconnect with a layered driver.

IRQs. Daniel Lezcano posted version 10 of a patch series entitled “irq: next irq tracking” which aims to predict future IRQ occurrences based upon previous system behavior. “As previously discussed the code is not enabled by default, hence compiled out”. A small circular buffer is used to keep track of non-timer interrupt sources. “A third patch provides the mathematics to compute the regular intervals”. The goal is to predict future expected system wakeups, which is useful from a latency perspective, as well as for various scheduling or energy calculations later on.

Memory Allocation Watchdog. Tetsuo Handa posted version 9 of a patch series entitled “mm: Add memory allocation watchdog kernel thread”, which “adds a watchdog which periodically reports number of memory allocating tasks, dying tasks and OOM victim tasks when some task is spending too long time inside __alloc_pages_slowpath() [the code path called when a running program – known as a task within the kernel – must synchronously block and wait for new memory pages to become available for allocation]”. Tetsuo adds, “Thanks to [the] OOM [Out-Of-Memory] reaper which can guarantee forward progress (by selecting [the] next OOM victim) as long as the OOM killer can be invoked, we can start testing low memory situations which were previously too difficult to test. And we are now aware that there are still corner cases remaining where the system hangs without invoking the OOM killer”. The patch aims to help determine whether long hangs are caused by tasks stuck waiting for memory allocation.

Memory Protection Keys. Ram Pai posted version 5 of a patch series entitled “powerpc: Memory Protection Keys”, which aims to enable a feature in future ISA 3.0-compliant POWER architecture platforms comparable to the “memory protection keys” Intel added to their 64-bit x86 architecture. As Ram notes, “The overall idea: A process allocates a key and associates it with an address range within its address space. The process then can dynamically set read/write permissions on the key without involving the kernel. Any code that violates the permissions of the address space; as defined by its associated key, will receive a segmentation fault”. The patches enable support on the “PPC64 HPTE platform” and are noted to have passed all of the same tests as on x86.

Modules. Djalal Harouni posted version 4 of a patch series entitled “modules: automatic module loading restrictions”, which adds a new global sysctl flag, as well as a per-task one, called “modules_autoload_mode”. “This new flag allows to control only automatic module loading [the kernel-invoked auto loading of certain modules in response to user or system actions] and if it is allowed or not, aligning in the process the implicit operation with the explicit [existing option to disable all module loading] one where both are now covered by capabilities checks”. The idea is to prevent certain classes of security exploit wherein – for example – a system can be caused to load a vulnerable network module by sending it a certain packet, or an application calling a certain kernel function. Other such classes of attack exist against automatic module loading, and have been the subject of a number of CVE [Common Vulnerabilities and Exposures] releases requiring frantic system patching. This feature will allow sysadmins to limit module auto loading on some classes of systems (especially embedded/IoT devices).

Network filtering. Shubham Bansal posted an RFC patch entitled “RFC: arm eBPF JIT compiler” which “is the first implementation of eBPF JIT for [32-bit] ARM”. Russell King had various questions, including whether the code handled “endian issues” well, to which Shubham replied that he had not tested it with BE (Big Endian) but was interested in setting up qemu to run Big Endian ARM models and would welcome help improving the code.

NMI. Adrien Mahieux posted “x86/kernel: Add generic handler for NMI events” which “adds a generic handler where sysadmins can specify the behavior to adopt for each NMI event code. List of events is provided at module load or on kernel cmdline, so can also generate kdump upon boot error”. The options include silently ignoring NMIs (which actually passes them through to the next handler), dropping NMIs (actually discarding them), or panicking the kernel immediately. An example given is using the drop parameter during kdump in order to prevent a second NMI from triggering a panic while a crash dump is already being captured from the first.

Randomness. Jason A. Donenfeld posted version 4 of a patch series entitled “Unseeded In-Kernel Randomness Fixes” which aims to address “a problem with get_random_bytes being used before the RNG [Random Number Generator] has actually been seeded [given an initial set of values following boot time]. The solution for fixing this appears to be multi-pronged. One of those prongs involves adding a simple blocking API so that modules that use the RNG in process context can just sleep (in an interruptible manner) until the RNG is ready to be used. This winds up being a very useful API that covers a few use cases, several of which are included in this patch set”.

Scheduler. Nico[las] Pitre posted “scheduler tinification” which “makes it possible to configure out some parts of the scheduler such as the deadline and realtime scheduler classes. The saving in kernel footprint is non negligible”. In the examples cited, kernel text shrinks by almost 8K, which is significant in some very small Linux systems, such as in IoT.

S.A.R.A. Salvatore Mesoraca posted “S.A.R.A. a new stacked LSM” (which your author is choosing to pronounce as in “Sarah”, for various reasons, and apparently actually stands for “S.A.R.A is Another Recursive Acronym”). This is “a stacked Linux Security Module that aims to collect heterogeneous security measures, providing a common interface to manage them. It can be useful to allow minor security features to use advanced management options, like user-space configuration files and tools, without too much overhead”.

Secure Memory Encryption (SME). Tom Lendacky posted version 8 of a patch series that implements support in Linux for this feature of certain future AMD CPUs. “SME can be used to mark individual pages of memory as encrypted through the page tables. A page of memory that is marked encrypted will be automatically decrypted when read from DRAM and will be automatically encrypted when written to DRAM”. In other words, SME allows a datacenter operator to build systems in which all data leaving the SoC is encrypted either at rest (on disk), or when hitting external memory buses that might (theoretically) be monitored. When combined with other features, such as “another AMD processor feature called Secure Encrypted Virtualization (SEV)”, it becomes possible to protect user data from intrusive monitoring by hypervisor operators (whether malicious or coerced). This is the correct way to provide memory encryption. While others have built a nonsense known as “enclaves”, the AMD approach correctly solves a more general problem. The AMD patches update various pieces of kernel infrastructure, from the UEFI code to IOMMU support, in order to carry page encryption state throughout.

SMIs. Kan Liang posted version 2 of a patch entitled “measure SMI cost (user)” which adds a “new sysfs entry /sys/device/cpu/freeze_on_smi” which will cause the “FREEZE_WHILE_SMM” bit in the Intel “IA32_DEBUGCTL” processor control register to be set. Once it is set, “the PMU core counters will freeze on SMI handler”. This can be used with a “new --smi-cost mode in perf stat…to measure the SMI cost by calculating unhalted core cycles and aperf results”. SMIs, or “System Management Interrupts”, are also referred to as “cycle stealing” in that they are used by platform firmware to perform various housekeeping tasks using the application processor cores, usually without either the Operating System’s or the user’s knowledge. SMIs are used by OEMs and ODMs to “add value”, but they are also used for such things as system fan control and other essentials. What should happen, of course, is that a generic management controller should be defined to handle this, but it was easier for the industry to build the mess that is SMIs, and for Intel to then add tracking for users to see where bad latencies come from.

Speculative Page Faults. Laurent Dufour posted version 5 of a patch series entitled “Speculative page faults”, which is “a port on kernel 4.12 of the work done by Peter Zijlstra to handle page fault without holding the mm semaphore”. As he notes, “The idea is to try to handle user space page faults without holding the mmap_sem [a per-task – the kernel side name for a running process – semaphore that is shared by all threads within a process]. This should allow better concurrency for massively threaded processes since the page fault handler will not wait for other threads[’] memory layout change to be done, assuming that this change is done in another part of the process’s memory space. This type of page fault is named speculative page fault. If the speculative page fault fails because concurrency is detected or because underlying PMD [Page Middle Directory] or PTE [Page Table Entry] tables are not yet allocat[ed], it [fails] its processing and a classic page fault is then tried”.

THP. Kirill A. Shutemov posted a “HELP-NEEDED” thread entitled “Do not lose dirty bit on THP pages”, in which he notes that Vlastimil Babka “noted that pmdp_invalidate [Page Middle Directory Pointer invalidate] is not atomic and we can lose dirty and access bits if CPU sets them after pmdp dereference, but before set_pmd_at()”. Kirill notes that this doesn’t lead to user-visible problems in the current kernel, but “fixing this would be critical for future work on THP: both huge-ext4 and THP [Transparent Huge Pages] swap out rely on proper dirty tracking”. By access and dirty tracking, Kirill means page table bits that indicate whether a page has been accessed or contains dirty data which should be written back to storage. Such bits are updated by hardware automatically on memory access. He adds that “Unfortunately, there’s no way to address the issue in a generic way. We need to fix all architectures that support THP one-by-one”. Hence the topic of the thread containing the words “HELP-NEEDED”. Martin Schwidefsky had some feedback on the proposed solution, noting that it would not work on s390, but that if pmdp_invalidate returned the old entry, that could be used in order to update certain logic based on the dirty bits. Andrea Arcangeli replied to Martin, “That to me seems the simplest fix”. Separately, Kirill posted the “Last bits for initial 5-level paging” on x86.

Timers. Christoph Hellwig posted “RFC: better timer interface”, a patch series which “attempts to provide a “modern” timer interface where the callback gets the timer_list structure as an argument so that it can use container_of instead of having to cast to/from unsigned long all the time”. Arnd Bergmann noted that “This looks really nice, but what is the long-term plan for the interface? Do you expect that we will eventually change all 700+ users of timer_list to the new type, or do we keep both variants around indefinitely to avoid having to do mass-conversions?”. Christoph thought it was possible to perform a wholesale conversion, but that “it might take some time”.

Thunderbolt. Mika Westerberg posted version 3 of a patch series implementing “Thunderbolt security levels and NVM firmware upgrade”. Apparently, “PCs running Intel Falcon Ridge or newer need these in order to connect devices if the security level is set to “user(SL1) or secure(SL2)” from BIOS” and “The security levels were added to prevent DMA attacks when PCIe is tunneled over Thunderbolt fabric where IOMMU is not available or cannot be enabled for different reasons”. While cool, it is slightly saddening that some of the awesome demos from recent DEFCONs will be harder to reproduce by nation-state actors and those who really need to get outside more often.

VAS. Sukadev Bhattiprolu posted version 5 of a patch series entitled “Enable VAS”, a “hardware subsystem referred to as the Virtual Accelerator Switchboard” in the IBM POWER9 architecture. According to Sukadev, “VAS allows kernel subsystems and user space processes to directly access the Nest Accelerator (NX) engines which implement compression and encryption algorithms in the hardware”. In other words, these are simple workload acceleration engines that were previously only available using special (“icswx”) privileged instructions in earlier versions of POWER machines and are now to be available to userspace applications through a multiplexing API.

WMI. Darren Hart posted an updated “Convert WMI to a proper bus” patch series, which “converts WMI [Windows Management Instrumentation] into a proper bus, adds some useful information via sysfs, and exposes the embedded MOF binary. It converts dell-wmi to use the WMI bus architecture”. WMI is required to manage various contemporary (especially laptop) hardware, including backlights.

Xen. Juergen Gross posted “xen: add sysfs node for guest type” which provides information known to the guest kernel but not previously exposed to userspace, including the type of virtualization in use (HVM, PV, or PVH), and so on.

zRam. Minchan Kim posted an RFC patch entitled “writeback incompressible pages to storage”, which seeks to have the best of both worlds – the compression of RAM while handling cases where memory is incompressible. In the case that an admin sets up a suitable block device, it can be arranged that incompressible pages are written out to storage instead of using RAM.

zswap. Srividya Desireddy posted version 2 of a patch that seeks to explicitly test for so-called “zero-filled” pages before submitting them for compression. This saves time and energy, and reduces application startup time (on the order of about 3% in the example given).

 

July 08, 2017 09:31 PM

July 06, 2017

Rusty Russell: Broadband Speeds, 2 Years Later

Two years ago, considering the blocksize debate, I made two attempts to measure average bandwidth growth, first using Akamai serving numbers (which gave an answer of 17% per year), and then using fixed-line broadband data from OFCOM UK, which gave an answer of 30% per annum.

We have two years more of data since then, so let’s take another look.

OFCOM (UK) Fixed Broadband Data

First, the OFCOM data:

So in the last two years, we’ve seen a 26% increase in download speed and a 22% increase in upload, bringing us down from 36/37% to 33% over the 8 years. The divergence of download and upload improvements is concerning (I previously assumed they were the same, but we have to design for the lesser of the two for a peer-to-peer system).

The idea that upload speed may be topping out is reflected in the Nov-2016 report, which notes only an 8% upload increase in services advertised as “30Mbit” or above.

Akamai’s State Of The Internet Reports

Now let’s look at Akamai’s Q1 2016 report and Q1-2017 report.

This gives an estimate of 19% per annum in the last two years. Reassuringly, the US and UK (both fairly high-bandwidth countries, considered in my previous post to be a good estimate for the future of other countries) have increased by 26% and 19% in the last two years, indicating there’s no immediate ceiling to bandwidth.

You can play with the numbers for different geographies on the Akamai site.

Conclusion: 19% Is A Conservative Estimate

17% growth now seems a little pessimistic: in the last 9 years the US Akamai numbers suggest the US has increased by 19% per annum, the UK by almost 21%.  The gloss seems to be coming off the UK fixed-broadband numbers, but they’re still 22% upload increase for the last two years.  Even Australia and the Philippines have managed almost 21%.

July 06, 2017 10:01 AM

June 29, 2017

Linux Plumbers Conference: Containers Microconference accepted into Linux Plumbers Conference

Following on from the Containers Microconference last year, we’re pleased to announce there will be a follow on at Plumbers in Los Angeles this year.

The agenda for this year will focus on unsolved issues and other problem areas in the Linux Kernel Container interfaces with the goal of allowing all container runtimes and orchestration systems to provide enhanced services. Of particular interest is the unprivileged use of container APIs, which we can use both to enable self-containerising applications and to deprivilege (make more secure) container orchestration systems. In addition we will be discussing the potential addition of new namespaces (an LSM namespace for per-container security modules; an IMA namespace for per-container integrity and appraisal; and file capabilities to allow setcap binaries to run within unprivileged containers).

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

June 29, 2017 05:59 PM

June 20, 2017

Arnaldo Carvalho de Melo: Pahole in the news

Found another interesting article, this time mentioning a tool I wrote long ago and that, at least for kernel object files, has been working for a long time without much care on my part: pahole, go read a bit about it at Will Cohen’s “How to avoid wasting megabytes of memory a few bytes at a time” article.

Guess I should try running a companion script that tries to process all .o files in debuginfo packages to see how bad it is for non-kernel files, with all the DWARF changes over these years…


June 20, 2017 03:49 PM

June 15, 2017

Linux Plumbers Conference: Early Bird Rate Registration Ending Soon

A reminder that our Early Bird registration rate is ending soon. The last day at the Early Bird rate of $400 is Sunday, June 18th. We are also almost sold out of Early Bird slots (15% of our quota left). Get yours soon!
Starting June 19th, registration will be at the regular rate of $550.
Please see the Attend page for info.

June 15, 2017 11:20 PM

June 14, 2017

Paul E. Mc Kenney: Stupid RCU Tricks: Simplifying Linux-kernel RCU

The last month or two has seen a lot of work simplifying the Linux-kernel RCU implementation, with more than 2700 net lines of code removed. The remainder of this post lists the user-visible changes, along with alternative ways to get the corresponding job done.


  1. The infamous CONFIG_RCU_KTHREAD_PRIO Kconfig parameter is now defunct, but the rcutree.kthread_prio kernel boot parameter gets the job done.
  2. The CONFIG_NO_HZ_FULL_SYSIDLE Kconfig parameter has kicked the bucket. There is no replacement because no one was using it. If you need it, revert the -rcu commit tagged by sysidle.2017.05.11a.
  3. The CONFIG_PROVE_RCU_REPEATEDLY Kconfig parameter is no more. There is no replacement because as far as I know, no one has used it for many years. It was a great help in tracking down lockdep-RCU warnings back in the day, but these warnings are now sufficiently rare that finding them one boot at a time is no longer a problem. If you need it, do the obvious hacking on Kconfig and lockdep.c.
  4. The CONFIG_SPARSE_RCU_POINTER Kconfig parameter now rests in peace. There is no replacement because there doesn't seem to be any reason for RCU's sparse checking to be the only such checking that is optional. If you really need to disable RCU's sparse checking, hand-edit the definition as needed.
  5. The CONFIG_CLASSIC_SRCU Kconfig parameter bought the farm. This was only present to handle massive failures of the new Tree/Tiny SRCU implementations, but these appear to be quite reliable and should be used instead of Classic SRCU.
  6. RCU's debugfs tracing is done for. As far as I know, I was the only real user, and I haven't used it in years. If you need it, revert the -rcu commit tagged by debugfs.2017.05.15a.
  7. The CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO, and CONFIG_RCU_NOCB_CPU_ALL Kconfig parameters have departed. Use the rcu_nocbs kernel boot parameter instead, which can do quite a bit more than those Kconfig parameters ever could.
  8. Tiny RCU's event tracing and RCU CPU stall warnings are now pushing up daisies. The point of Tiny RCU is to be tiny and educational, and these added features were not helping reach either of these two goals. The replacement is to reproduce the problem with Tree RCU.
  9. These changes should matter only to people running rcutorture:

    1. The CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT and CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT_DELAY Kconfig parameters have been entombed: Use the rcutree.gp_preinit_delay kernel boot parameter instead.
    2. The CONFIG_RCU_TORTURE_TEST_SLOW_INIT and CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY Kconfig parameters have given up the ghost: Use the rcutree.gp_init_delay kernel boot parameter instead.
    3. The CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP and CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig parameters have passed on: Use the rcutree.gp_cleanup_delay kernel boot parameter instead.
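For the common thread in the list above – compile-time Kconfig options replaced by boot-time parameters – the migration simply means the settings move onto the kernel command line. The values below are examples only, not recommendations:

```
# Before: CONFIG_RCU_KTHREAD_PRIO=1, CONFIG_RCU_NOCB_CPU_ALL=y, etc.
# After: equivalent kernel boot parameters (example values):
rcutree.kthread_prio=1 rcu_nocbs=0-7 rcutree.gp_init_delay=3
```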
There will probably be a few more simplifications in the near future, but this should be at least enough for one merge window!

June 14, 2017 09:03 PM

June 12, 2017

Linux Plumbers Conference: RDMA Microconference Accepted into the Linux Plumbers Conference

Following on from the successful RDMA Microconference last year, which resulted in a lot of fruitful discussions we’re pleased to announce there will be a follow on at Plumbers in Los Angeles this year.

In addition to looking at the usual kernel core gaps and ABI issues, documentation, and testing, we’ll also be looking at new fabrics (including NVMe), challenges in implementing virtual RDMA devices, and integration possibilities with netdev.

For more details on this, please see this microconference’s wiki page.

June 12, 2017 09:53 PM

Arnaldo Carvalho de Melo: Article about ‘perf annotate’

Just found out about Ravi’s article about ‘perf annotate’, concise yet covers most features, including cross-annotation, go read it!


June 12, 2017 07:21 PM

June 09, 2017

Paul E. Mc Kenney: Stupid RCU Tricks: rcutorture Accidentally Catches an RCU Bug

With the Linux-kernel v4.13 merge window coming up, it is time to do at least a little heavy-duty testing of the patches destined for v4.14, which had been but lightly tested on my laptop. An overnight run on a larger test machine looked very good—with the exception of scenario TREE01 (defined by tools/testing/selftests/rcutorture/configs/rcu/TREE01{.boot,} in the Linux-kernel source tree), which got no fewer than 190 failures in a half-hour run. In other words, rcutorture saw 190 too-short grace periods in 30 minutes, for about one every 20 seconds.

This is not just bad. This is RCU completely and utterly failing to be RCU.

My first action was to re-run the tests on the commits slated for v4.13. You can imagine my relief to see them pass on all scenarios, including TREE01.

Then it was time for bisection. I have been burned many times by false bisections due to RCU's probabilistic failure modes, so I ran 24 30-minute tests on each commit. Fortunately, I could run six in parallel, so that each commit only consumed about two hours of test time. The bisection converged on a commit that adds a --kconfig argument to the rcutorture scripts, which allows me to do things like force lockdep to run in all scenarios. However, this commit should have absolutely no effect on the inner workings of RCU.

OK, perhaps this commit managed to fatally mess up the .config file. But no, the .config files from this commit compare equal to those from the preceding commit. Some additional poking gives me confidence that the kernels being built are also identical. Still, the one fails and the other does not.

The next step is to look very carefully at the console output from the failing runs, most of which contain many complaints about RCU grace periods being too short. Except that one of them also contains RCU CPU stall warnings. In fact, one of the stall warnings lists no fewer than 26 CPUs as stalling the current RCU grace period.

This came as a bit of a surprise, partly because I don't ever recall ever seeing that many CPUs stalling a single grace period, but mostly because the test was only supposed to use eight CPUs.

A look at the beginning of the console output showed that RCU was inexplicably prepared to deal with 43 CPUs instead of the expected eight. A bit more digging showed that the qemu command used to run the failing test had “-smp 43”, while the qemu command for the successful test instead had “-smp 8”. In both cases, the qemu command also included the kernel boot parameter “maxcpus=8”. And a very stupid bug in the --kconfig change to the scripts turned out to be responsible for the bogus -smp argument.

The next step is to swap the values of qemu's -smp argument. And the failure follows the “-smp 43” setting. This means that it is possible that the RCU failures are due to a latent timing bug in RCU. After all, the test system has only 64 CPUs, and I was running 43*6=258 CPUs worth of tests on it. But running six concurrent rcutorture tests with both -smp and maxcpus set to 43 passes with flying colors. So RCU must be suffering from some other problem.

The next question is exactly what is supposed to happen when qemu and the kernel have very different ideas of how many CPUs there are. The ever-helpful Documentation/admin-guide/kernel-parameters.txt file states that maxcpus= limits not the overall number of CPUs, but rather the number that are brought up at boot time. Another look at the console output confirms that in the failing case, eight CPUs are brought up at boot time. However, the other 35 come online some time after boot, sometimes taking a few minutes to come up. Which explains another anomaly I noticed while bisecting, namely that about half the tests ran 30 minutes without failure, but the ones that failed did so within the first five minutes of the run. Apparently the RCU failures are connected somehow to the late arrival of the extra 35 CPUs.

Except that RCU configured itself for the full 43 CPUs, and RCU is supposed to be able to handle CPUs coming and going. In fact, RCU has repeatedly demonstrated its ability to handle CPUs coming and going for more than a decade. So it is time to enable event tracing on a failure scenario (thank you, Steve!). One of the traces shows that there is no RCU callback connected with the first failure, which points the finger of suspicion at RCU expedited grace periods.

A quick inspection of the expedited code shows missing synchronization for the case where a CPU makes its very first appearance just as an expedited grace period starts. Oh, the leaf rcu_node structure's ->lock is held both when updating the number of CPUs that have ever been seen (which is the rcu_state structure's ->ncpus field) and when updating the bitmasks indicating exactly which CPUs have ever been seen (which is the leaf rcu_node structure's ->expmaskinitnext field), but it drops that lock between those two updates.

This means that the expedited grace period might sample the ->ncpus field, notice the change, and therefore check all the ->expmaskinitnext fields—but before those fields had been updated. Not a problem for this grace period, since the new CPUs haven't yet started and thus cannot yet be running any RCU read-side critical sections, which means that there is no reason whatsoever for this grace period to pay any attention to them. However, the next expedited grace period would again sample the ->ncpus field, see no change, and thus not bother checking the ->expmaskinitnext fields. Thus, this grace period would also ignore the new CPUs, which by this time could be very much alive and running RCU read-side critical sections. Hence the too-short grace periods, and hence them showing up within the first few minutes of the run, during the time that the extra 35 CPUs are in the process of coming online.

The fix is easy: Just move the update of ->ncpus to the same critical section as the update of ->expmaskinitnext. With this fix, rcutorture passes the TREE01 scenario even with bogus -smp arguments to qemu. There is therefore once again a bug in rcutorture: There are still bugs in RCU somewhere, and rcutorture is failing to find them!

Strangely enough, I might never have noticed the bug in expedited grace periods had I not made a stupid mistake in the scripting. Sometimes it takes a bug to locate a bug!

June 09, 2017 08:49 PM

June 07, 2017

Paul E. Mc Kenney: Verification Challenge 6: Linux-Kernel Tree RCU

It has been more than two years since I posted my last verification challenge, so it is only natural to ask what, if anything, has happened in the meantime. The answer is “Quite a bit!”

I had the privilege of attending The Royal Society Verified Trustworthy Software Systems Meeting, where I was on a panel on “Verification in Industry”. I also attended the follow-on Verified Trustworthy Software Systems Specialist Meeting, where I presented on formal verification and RCU. There were many interesting presentations (see slides 9-12 of this presentation), the most memorable being a grand challenge to apply formal verification to machine-learning systems. If you think that challenge is not all that grand, you should watch this video, which provides an entertaining demonstration of a few of the difficulties.

Closer to home, in the past year there have been three successful applications of automated formal-verification tools to Linux-kernel RCU:


  1. Lihao Liang applied the C Bounded Model Checker (CBMC) to Tree RCU (draft paper). This was a bit of a tour de force, converting Linux-kernel C code (with a bit of manual preprocessing) to a logic expression, then invoking a SAT solver on that expression. The expression's variables correspond to the inputs to the code, the possible multiprocessor scheduling decisions, and the possible memory-model reorderings. The expression evaluates to true if there exists an execution that triggers an assertion. The largest expression had 90 million boolean variables, 450 million clauses, occupied some tens of gigabytes of memory, and, stunningly, was solved with less than 80 hours of CPU time. Pretty amazing considering that SAT is NP-complete and that two to the ninety millionth power is an excessively large number!!!
  2. Michalis Kokologiannakis applied Nidhugg to Tree RCU (draft paper). Nidhugg might not make quite as macho an attack on an NP-complete problem as does CBMC, but there is some reason to believe that it can handle larger chunks of the Linux-kernel RCU code.
  3. Lance Roy applied CBMC to Classic SRCU. Interestingly enough, this verification could be carried out completely automatically, without manual preprocessing. This approach is therefore available in my -rcu tree (git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git) on the branch formal.2017.06.07a.


In all three cases, the tools verified portions of RCU and SRCU as correct, and in all three cases, the tools successfully located injected bugs. (Hey, any of us could write a program that consisted of “printf("Validated\n");”, so you do have to validate the verifier!) And yes, I have given these guys trouble about the fact that their tools didn't find any bugs that I didn't already know about, but these results are nevertheless extremely impressive. Had you told me ten years ago that this would happen, I have no idea how I would have responded, but I most certainly would not have believed you.

In theory, Nidhugg is more scalable but less thorough than CBMC. In practice, it is too early to tell.

So what is the status of the first five verification challenges?


  1. rcu_preempt_offline_tasks(): Still open. That said, Michalis found the infamous RCU bug at Linux-kernel commit 281d150c5f88 and further showed that my analysis of the bug was incorrect, though my fixes did actually fix the bug. So this challenge is still open, but the tools have proven their ability to diagnose rather ornate concurrency bugs.
  2. RCU NO_HZ_FULL_SYSIDLE: Still open. Perhaps less pressing given that it will soon be removed from the kernel, but the challenge still stands!
  3. Apply CBMC to something: This is an ongoing challenge to developers to give CBMC a try on concurrent code. And why not also try Nidhugg?
  4. Tiny RCU: This self-directed challenge was “born surmounted”.
  5. Uses of RCU: Lihao, Michalis, and Lance verified some simple RCU uses as part of their work, but this is an ongoing challenge. If you are a formal-verification researcher and really want to prove your tool's mettle, take on the Linux kernel's dcache subsystem!


But enough about the past! Given the progress over the past two years, a new verification challenge is clearly needed!

And this sixth challenge is available on 20 branches whose names start with “Mutation” at https://github.com/paulmckrcu/linux.git. Some of these branches are harmless transformations, but others inject bugs. These bugs range from deterministic failures to concurrent data races to forward-progress failures. Can your tool tell which is which?

If you give any of these a try, please let me know how it goes!

June 07, 2017 08:41 PM

June 04, 2017

Kernel Podcast: Catching up on podcasts…new one drops Monday!

Sorry for the delay with getting podcasts out. I’m working on a new one! Coming Monday!

June 04, 2017 06:44 AM

May 31, 2017

Eric Sandeen: 2012 Nissan LEAF battery deathwatch

First of all – I think EVs are great.  They are the future of personal transportation.  But this is the story of a first-gen EV battery with some … issues.

I bought a used 2012 Nissan LEAF with about 38k miles for a great price – in part because it started life as a leased car in Texas, and the early LEAF batteries didn’t much like the heat.  As a result, the battery is not super healthy, with only about 60 miles of range on a full charge on a balmy day.  While this is enough to get me around on most days, there are times when a bit more range would be nice.  Thankfully, Nissan has retroactively warrantied LEAF batteries to retain 70% of their capacity (really, closer to 66%) for the first 5 years or 60,000 miles.

The LEAF dash shows remaining battery capacity (as opposed to current charge) on a 12-bar scale; when new, it showed 12 bars, and Nissan will warranty the battery if it gets to 8 bars or less.  My car currently has 9 bars.  1 to go.

So this was a gamble.  I’d actually like my battery to lose enough capacity before January 2018 to get a warranty replacement.

Thanks to a cool app called LeafSpy, I can monitor battery health, and correlate it to what others have said about when they dropped that 9th bar.  I’ll try to remember to update this periodically, but here are the readings so far, with trend lines and “target” values based on when The Internet said they lost their 9th bar, on average.  The aHr metric seems most relevant. With luck, it looks like I may make it, though I can’t explain the recent plateau after the initial steady decline…

I’ll try to remember to update this occasionally as time goes by.
Update: Here’s a constantly updated version of my stats:

May 31, 2017 01:41 AM

May 24, 2017

Pete Zaitcev: Community Meeting

<notmyname> first, the idea of having a regular meeting in addition to this one for people in different timezones
<cschwede_> +2!
<notmyname> specifically, mahatic and pavel/onovy/seznam. but of course we've all seen various chinese contributors too
<notmyname> but the point is that it's a place to bring up stuff that those in the other time zones are working on
<mattoliverau> Cool
<notmyname> I think it's a terrific idea
<tdasilva> i bet the guys working on tape would like that too
<notmyname> my goal is to find a time for it that is so horrible for US timezones that it will be obvious that not everyone needs to be there
<zaitcev> Yeah, if only there was a way to send a message... like a mail... to a list of people. And then it could be stored on a computer somewhere, ready to be read in any timezone recepient is in.
<notmyname> zaitcev: crazytown!
<mattoliverau> zaitcev: now your just talkin crazy

May 24, 2017 09:52 PM

May 23, 2017

Michael Kerrisk (manpages): Linux Shared Libraries course, Munich, Germany, 20 July 2017

I've scheduled a public instance of my "Building and Using Shared Libraries on Linux" course to take place in Munich, Germany on 20 July 2017.  This one-day course provides a thorough introduction to building and using shared libraries, covering topics such as: the basics of creating, installing, and using shared libraries; shared library versioning and naming conventions; the role of the dynamic linker; run-time symbol resolution; controlling symbol visibility; symbol versioning; preloading shared libraries; and dynamically loaded libraries (dlopen). The course format is a mixture of theory and practical.

The course is aimed at programmers who create and use shared libraries. Systems administrators who are managing and troubleshooting applications that use shared libraries will also find the course useful.

You can find out more about the course (such as expected background and course pricing) at http://man7.org/training/shlib/ and see a detailed course outline at
http://man7.org/training/shlib/shlib_course_outline.html.

May 23, 2017 02:14 PM

Michael Kerrisk (manpages): Cgroups/namespaces/seccomp/capabilities course

There are still some places available on my "Linux Security and Isolation APIs" that will take place in Munich, Germany on 17-19 July 2017.  This three-day course provides a deep understanding of the low-level Linux features (set-UID/set-GID programs, capabilities, namespaces, cgroups, and seccomp) used to implement privileged applications and build container, virtualization, and sandboxing technologies. The course format is a mixture of theory and practical.

The course is aimed at designers and programmers building privileged applications, container applications, and sandboxing applications. Systems administrators who are managing such applications are also likely to find the course of benefit.

You can find out more about the course (such as expected background and course pricing) at
http://man7.org/training/sec_isol_apis/
and see a detailed course outline at
http://man7.org/training/sec_isol_apis/sec_isol_apis_course_outline.html

May 23, 2017 02:01 PM

May 18, 2017

Linux Plumbers Conference: Linux Kernel Memory Model Workshop Accepted into Linux Plumbers Conference

A good understanding of the Linux kernel memory model is essential for a great many kernel-hacking and code-review tasks.  Unfortunately, the current documentation (memory-barriers.txt) has been said to frighten small children, so this workshop’s goal is to demystify this memory model, including hands-on demos of the tools, help installing/running the tools, and help constructing appropriate litmus tests.  These tools should go a long way toward the ultimate goal of automating the process of using memory models to frighten small children.
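For a flavor of what these litmus tests look like, here is the classic message-passing pattern in the C-flavored format the memory-model tools consume; a test of this shape ships in the kernel's litmus-test collection, though the exact file name should be treated as approximate. The "exists" clause asks whether the reader can see the flag set by the release store yet miss the data written before it; the tools answer that question under the memory model.

```C
C MP+pooncerelease+poacquireonce

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(y, 1);
}

P1(int *x, int *y)
{
	int r0;
	int r1;

	r0 = smp_load_acquire(y);
	r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=0)
```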

For more information, please see this microconference’s wiki page.  For those who like getting a head start, this page also includes information on downloading and installing the tools, the memory model, and thousands of pre-existing litmus tests.  (Collect the whole set!!!) We also welcome experience reports from early adopters of these tools.

We hope to see you there!

May 18, 2017 05:04 PM

May 15, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/05/14

Audio: http://traffic.libsyn.com/jcm/20170514.mp3

In this week’s catchup mega-issue: Linux 4.12-rc1 (including a full summary of the 4.12 merge window), Linux 4.11 final is released, saving TLB flushes, various ongoing development, and a bunch of announcements.

Editorial Note

This podcast is a free service that I provide to the community in my spare time. It takes many, many hours to prepare and produce a single episode, much more during the merge window. This means that when I have major events (such as Red Hat Summit followed by OpenStack Summit) it will be delayed, as was the case this last week. Over the coming months, I hope to automate the production in order to reduce the overhead but there will be some weeks where I need to skip a show. I am however covering the whole 4.12 merge window regardless. So while I would usually have just moved on, the circumstance warrants a mega-length catchup episode. I hope you’re still awake by the end.

Linux 4.12-rc1

Linus Torvalds announced Linux 4.12-rc1, “one day early, because I don’t like last-minute pull requests during the merge window anyway, and tomorrow is mother’s day [in the US], so I may end up being roped into various happenings”. He also noted “Besides, this has actually been a pretty large merge window, so despite there technically being time for one more day of pulls, I actually do have enough changes already. So there.” In his announcement, he says things look smooth so far, but calls those also “Famous last words”. Finally, he calls out the “odd” diffstat which is dominated by the AMD Vega10 headers. As was noted in the pull requests, unlike certain other graphics companies, AMD actually provides nice automatically generated headers and other information about their graphics chipsets, which is why the Vega10 update is plentiful.

Later in the day yesterday, following the 4.12-rc1 announcement, Guenter Roeck posted “watchdog updates for v4.12”, and Jon Mason posted “NTB bug fixes for v4.12”, along with apologies for tardiness.

Linux 4.11

Linus Torvalds announced Linux 4.11 noting that the extra week due to a (rare-ish) “rc8” (Release Candidate 8) meant that he had felt “much happier releasing a final 4.11 now”. As usual, Linux Kernel Newbies has a writeup of 4.11, here: https://kernelnewbies.org/Linux_4.11

Announcements

Greg K-H (Kroah-Hartman) announced Linux 4.4.68, 4.9.28, 4.10.16, and 4.11.1. He later sent “Bad signatures on recent stable updates” in which he noted that “The stable kernels I just released have had [bad] signatures due to a mixup using pixz in the new kernel.org backend. It will be fixed soon…”, which were later corrected. He would like to hear from anyone still seeing problems.

Greg also announced (separately) Linux 3.18.52. While Jiri Slaby announced Linux 3.12.74.

Stephen Hemminger announced iproute2 4.11 matching the new kernel release.

Michael Kerrisk announced man-pages-4.11.

Steven Rostedt announced trace-cmd 2.6.1.

Steven also announced Linux 4.4.66-rt79, 3.18.51-rt57, and 3.12.73-rt98 (preempt-rt) kernels.

Con Kolivas posted an updated version of his (renamed) “MuQSS CPU scheduler” [renamed from the BFS – Brain F*** Scheduler] in Linux 4.11-ck1.

Karel Zak announced util-linux v2.30-rc1, which includes a fix to libblkid that “has been fixed to extract LABEL= and UUID= from UDF rather than ISO9660 header on hybrid CDROM/DVD media. This change[] makes UDF media on Linux user-space more compatible with another operation systems.” but he calls it out since it could also introduce regressions for some other users.

Junio C Hamano announced Git version 2.13.0. Separately, he released maintenance versions of “Git v2.12.3 and others” which include fixes for
“a recently disclosed problem with “git shell”, which may allow a user who comes over SSH to run an interactive pager by causing it to spawn “git upload-pack –help” (CVE-2017-8386).”

Jan Kiszka announced version 0.7 of the Jailhouse hypervisor, which includes various debug and regular console driver updates and gcov debug statistics.

Bartosz Golaszewski announced libgpiod v0.2: “The most prominent new feature is the test suite working together with the gpio-mockup module”.

Christoph Hellwig notes that the Open OSD [an in-kernel OSD – Object-Based Storage Device] SCSI initiator library for Linux seems to be dead. He does this by posting a patch to the MAINTAINERS file “update OSD entries” in which he removes the (now defunct) open-osd.org website, and the bouncing email address for Benny Halevy. Benny appeared and ACKed.

In a similar vein, Ben Hutchings pondered aloud about the “Future of liblockdep”, which apparently “hasn’t been buildable since (I think) Linux 4.6”. Sasha Levin said things would be cleaned up promptly. And they were, with a pull request soon following with fixes for Linux 4.12.

Masahiro Yamada posted an RFC patch entitled “Increase Minimal GNU Make version for Linux Kernel from 3.80 to 3.81” in which he essentially noted that the kernel hadn’t actually worked with 3.80 (which is 15 years old!) in a very long time, but instead actually really needs 3.81 (which was itself released in 2006). It was apparently “broken” 3 years ago, but nobody noticed. Neither Greg K-H (Kroah-Hartman) nor Linus seemed to lose any sleep over this, with Linus saying “you make a strong case of “it hasn’t worked for a while already and nobody even noticed””.

Paolo Bonzini posted “CFP: KVM Forum 2017” announcing that the KVM Forum will be held October 25-27 at the Hilton in Prague, CZ, and that all submissions for proposed topics must be made by midnight June 15.

Thomas Gleixner announced “[CFP] RT-Summit – Call for Presentations” noting that the Real-Time Summit 2017 is being organized by the Linux Foundation Real-Time Linux (RTL) collaborative project in cooperation with OSADL/RTLWS and will be held also in Prague on October 21st. The cutoff for submissions is July 14th via rt-cfp@linutronix.de.

4.12 Merge Window

In his 4.11 announcement, Linus reminded us that the release of 4.11 meant that “the merge window [for kernel 4.12] is obviously open. I already have two pull request[s] for 4.12 in my inbox, I expect that overnight I’ll get a lot more.” He wasn’t disappointed. The flood gates well and truly opened. And they continued going for the entire two week (less one day) period. Let’s dive into what has been posted so far for 4.12 during the (now closed) merge window.

Stephen Rothwell [linux-next pre-merge development kernel tree maintainer] noted in a heads-up that Linus was going to see a “Large new drm driver” [drm – Direct Rendering Manager, not the “digital rights” technology]. Dave Airlie (the drm maintainer) had a reply but Stephen said everything was just fine and he was simply seeking to avoid surprising Linus (again). Once the pull came in, and Linus had pulled it, he quickly followed up to note that he was getting a lot of warnings about “Atomic update on pipe (A) took”. Daniel Vetter followed up to say that “We [Intel] did improve evasion a lot to the point that it didn’t show up in our CI machines anymore, so we felt we could risk enabling this everywhere. But of course it pops up all over the place as soon as drm-next hits mainline”.

4.12 git Pulls for existing subsystems

Hans-Christian Noren Egtvedt posted “AVR32 change for 4.12 – architecture removal” in which he removes AVR32 and “clean away the most obvious architecture related parts”. He posted followups to pick off more leftovers.

Ingo Molnar posted “RCU changes for 4.12” which includes “Parallelize SRCU callback handling”, performance improvements, documentation updates, and various other fixes. Linus pulled it. But then “after looking at it, ended up un-pulling it again”. He posted a rant about a new header file (linux/rcu_segcblist.h) which was a “header file from hell”, saying “I see absolutely no point in taking a header file of several hundred lines of code”, along with more venting about the use of too much inline code (code that is always expanded in-place rather than called as a function – leading to a larger footprint sometimes). Finally, Linus said “The RCU code needs to start showing some good taste”. Sir Paul McKenney, the one and only author of RCU followed up swiftly, apologizing for the transgression in attempting to model “the various *list*.h header files”, proposing a fix, which Linus liked. Ingo Molnar implemented the suggestions, in “srcu: Debloat the <linux/rcu_segcblist.h> header”, which Paul provided a minor fix against for the case of !SMP (non-multi-processor kernel) builds.

Ingo Molnar also posted “EFI changes for 4.12” including fixes to the BGRT ACPI table (used for boottime graphics information) to allow it to be shared between x86 and ARM systems, an update to the arm64 boot protocol, improvements to the EFI stub’s command line parsing, and support for randomizing the virtual mapping of UEFI runtime services on arm64. The latter means that the function pointers for UEFI Runtime Services callbacks will be placed into random virtual address locations during the call to ExitBootServices which sets up the new mappings – it’s a good way to look for problems with platforms containing broken firmware that doesn’t correctly handle the change in location of runtime service calls.

Ingo Molnar also posted “x86/process changes for 4.12” which includes a new ARCH_[GET|SET]_CPUID prctl (process control) ABI extension that a running process can use in order to determine whether it has access to call the CPUID instruction directly. This is to support a userspace debugger known as “rr” that would like to trap and emulate calls to “CPUID” which are otherwise normally unprivileged on x86 systems.

Separately, Ingo posted “x86 fixes”, which includes “mostly misc fixes” for such things as “two boot crash fixes”, etc.

Ingo Molnar also posted “perf changes for 4.12” which includes updates to kprobes and uprobes, making their trampolines (the codepaths jumped through when executing the probe sequence) read-only while they are used, changing UPROBES_EVENTS to be default yes in the Kconfig (since distros do this), and various other fixes. He also includes support for AMD IOMMU events, and new events for Intel Goldmont CPUs. The perf tooling itself gets many more updates, including PERF_RECORD_NAMESPACES, which allows the kernel to record information “required to associate samples to namespaces”.

Separately, Ingo posted “perf fixes”, which includes “mostly tooling updates”.

Ingo Molnar also posted “RAS changes for v4.12” which includes a “Corrected Errors Collector” kernel feature that will gather statistics about correctable errors affecting physical memory pages. Once a certain watermark is reached, pages generating many correctable errors will be permanently offlined [this is useful both for DDR and NV-DIMMs]. Finally, he deprecates the existing /dev/mcelog driver and includes cleanups for MCE (Machine Check Exception) errors during kexec on x86 (which we covered in previous editions of this podcast).

Ingo Molnar also posted “x86/asm changes for v4.12”, which includes various fixes, among which are cleanups to stack trace unwinding.

Ingo Molnar also posted “x86/cpu changes for v4.12”, which includes support for “an extension of the Intel RDT code to extend it with Intel Memory Bandwidth Allocation CPU support: MBA allows bandwidth allocation between cores, while CBM (already upstream) allows CPU cache partitioning”. Effectively, Intel incorporate changes to their memory controller’s hardware scheduling algorithms as part of RDT. These allow the DDR interface to manage bandwidth for specific cores, which will almost certainly include both explicit data operations, as well as separate algorithms for prefetching and speculative fetching of instructions and data. [This author has spent many hours reading about memory controller scheduling over the past year]

Ingo Molnar also posted “x86/debug changes for v4.12”, which includes support for the USB3 “debug port” based early console. As we have mentioned previously, USB3 includes a built-in “debug port” which no longer requires a special dongle to connect a remote machine for debug. It’s common in Windows kernel development to use a debug port, and since USB3 includes baseline support without the need for additional hardware, serial over USB3 is likely to become more common when developing for Linux – especially with the demise of DB9 headers on systems or even IDC10 headers on motherboards internally (to say nothing of laptop systems). As a reminder, with debug ports, usually only one USB port will support debug mode. I guess my old USB debug port dongle can go in the pile of obsolete gear.

Ingo Molnar also posted “x86/platform changes for v4.12” which includes “continued SGI UV4 hardware-enablement changes, plus there’s also new Bluetooth support for the Intel Edison [a low cost IoT board] platform”.

Ingo Molnar also posted “x86/vdso changes for v4.12” which includes support for a “hyper-V TSC page” which is what it sounds like – a special shared page made available to guests under Microsoft’s Hyper-V hypervisor and providing a fast means to enumerate the current time. This is plumbed into the kernel’s vDSO mechanism (Virtual Dynamic Shared Objects look a bit like software libraries that are automatically linked against every running program when it launches) to allow fast clock reading.

Ingo Molnar also posted “x86/mm changes for v4.12”, which includes yet more work toward Intel 5-level paging among many other updates.

Separately Ingo posted a single “core kernel fix” to “increase stackprotector canary randomness on 64-bit kernels with very little cost”.

Thomas Gleixner posted “irq updates for 4.12”, which include a new driver for a MediaTek SoC, ACPI support for ITS (Interrupt Translation Services) when using a GICv3 on ARM systems, support for shared nested interrupts, and “the usual pile of fixes and updates all over t[h]e place”.

Thomas Gleixner also posted “timer updates for 4.12” that include more reworking of year 2038 support (the infamous wrap of the Unix epoch), a “massive rework of the arm architected timer”, and various other work.

Separately, Ingo Molnar followed up with “timer fix” including “A single ARM Juno clocksource driver fix”.

Corey Minyard posted “4.12 for IPMI” including a watchdog fix. He “switched over to github at Stephen Rothwell’s [linux-next maintainer] request”.

Jonathan Corbet posted “Docs for 4.12” which includes “a new guide for user-space API documents” along with many other updates. Anil Nair noted “Missing File REPORTING-BUGS in Linux Kernel” which suggests that the Debian kernel package tools need to be taught about the recent changes in the kernel’s documentation layout. Separately, Jonathan replied to a thread entitled “Find more sane first words we have to say about Linux” noting that the kernel’s documentation files might not be the first place that someone completely new to Linux is going to go looking for information: “So I don’t doubt we could put something better there, but can we think for a moment about who the audience is here? If you’re “completely new to Linux”, will you really start by jumping into the kernel source tree?” The guy should do kernel standup in addition to LWN. It’d be hilarious.

Later, Jon posted “A few small documentation updates” which “Connect the newly RST-formatted documentation to the rest; this had to wait until the input pull was done. There’s also a few small fixes that wandered in”.

Tejun Heo posted “libata changes for 4.12-rc1” which includes “removal of SCT WRITE SAME support, which never worked properly”. SCT stands for “SMART [Self Monitoring And Reporting Technology – an error management mechanism common in contemporary disks] Command Transport”. The “write same” part means to set the drive content to a specific pattern (e.g. to zero it out) in cases that TRIM is not available. One wonders if that is also a feature used during destruction, though apparently the only (NSA) trusted way to destroy disks today is shredding and burning after zeroing.

Tejun Heo also posted “workqueue changes for v4.12-rc1”, which includes “One trivial patch to use setup_deferrable_timer() instead of open-coding the initialization”.

Tejun Heo also posted “cgroup changes for v4.12-rc1”, which includes a “second stab at fixing the long-standard race condition in the mount path and suppression of spurious warning from cgroup_get”.

Rafael J. Wysocki posted “Power management updates for v4.12-rc1, part 1” which includes many updates to the cpufreq subsystem and “to the intel_pstate driver in particular”. Its sysfs interface has apparently also been reworked to be more consistent with general expectations. He adds “Apart from that, the AnalyzeSuspend utility for system suspend profiling gets a companion called AnalyzeBoot for the analogous profiling of system boot and they both go into one place”.

Separately, he posted “Power management updates for v4.12-rc1, part 2”, which “add new CPU IDs [Intel Gemini Lake] to a couple of drivers [intel_idle and intel_rapl – Running Average Power Limit], fix a possible NULL pointer deference in the cpuidle core, update DT [DeviceTree]-related things in the generic power domains fram[e]work and finally update the suspend/resume infrastructure to improve the handling of wakeups from suspend-to-idle”.

Rafael J. Wysocki also posted “ACPI updates for v4.12-rc1, part 1”, which includes a new Operation Region driver for the Intel CHT [Cherry Trail] Whiskey Cove PMIC [Power Management Integrated Circuit], and new sysfs entries for CPPC [Collaborative Processor Performance Control], which is a much more fine grained means for OS and firmware to coordinate on power management and CPU frequency/performance state transitions.

Separately, he posted “ACPI updates for v4.12-rc1, part 2”, which “update the ACPICA [ACPI – Advanced Configuration and Power Interface – Component Architecture, the cross-Operating System reference code]” to “add a few minor fixes and improvements”, and also “update ACPI SoC drivers with new device IDs, platform-related information and similar, fix the register information in xpower PMIC [Power Management IC] driver, introduce a concept of “always present” devices to the ACPI device enumeration code and use it to fix a problem with one platform [INT0002, Intel Cherry Trail], and fix a system resume issue related to power resources”.

Separately, Benjamin Tissoires posted a patch reverting some ACPI laptop lid logic that had been introduced in Linux 4.10 but was breaking laptops from booting with the lid closed (a feature folks especially in QE use).

Rafael J. Wysocki also posted “Generic device properties framework updates for v4.12-rc1”, which includes various updates to the ACPI _DSD [Device Properties] method call to recognize “ports and endpoints”.

Shaohua Li posted “MD update for 4.12” which includes support for the “Partial Parity Log” feature present on the Intel IMSM RAID array, and a rewrite of the underlying MD bio (the basic storage IO concept used in Linux) handling. He notes “Now MD doesn’t directly access bio bvec, bi_phys_segments and uses modern bio API for bio split”.

Ulf Hansson posted “MMC for v[.]4.12” which includes many driver updates as well as refactoring of the code to “prepare for eMMC CMDQ and blkmq”. This is the planned transition to blkmq (block-multiqueue) for such storage devices. Previously it had stalled due to the performance hit when trying to use a multi-queue approach on legacy and contemporary non-mq devices.

Linus Walleij posted “pin control bulk changes for v4.12” in which he notes that “The extra week before the merge window actually resulted in some of the type of fixes that usually arrive after the merge window already starting to trickle in from eager developers using -next, I’m impressed”. He’s also impressed with the new “Samsung subsystem maintainer” (Krzysztof). Of the many changes, he says “The most pleasing to see is Julia Cartwright[‘]s work to audit the irqchip-providing drivers for realtime locking compliance. It’s one of those “I should really get around to looking into that” things that have been on my TODO list since forever”.

Linus Walleij also posted “Bulk GPIO changes for v4.12”, which has “Nothing really exciting goes on here this time, the most exciting for me is the same as for pin control: realtime is advancing thanks [t]o Julia Cartwright”.

Petr Mladek posted “printk for 4.12” which includes a fix for the “situation when early console is not deregistered because the preferred one matches a wrong entry. It caused messages to appear twice”.

Jiri Kosina posted “HID for 4.12” which includes various fixes, amongst them an inversion of the HID_QUIRK_NO_INIT_REPORTS quirk into its opposite, because it is apparently easier to whitelist working devices.

Jiri Kosina also posted “livepatching for 4.12” which includes a new “per-task consistency model” that is “being added for architectures that support reliable stack dumping”, which apparently “extends the nature of the types of patches that can be applied by live patching”.

Lee Jones posted “Backlight for 4.12” which includes various fixes.

Lee Jones also posted “MFD for v4.12” which includes some new drivers, new device support, and various new functionality and fixes.

Juergen Gross posted “xen: fixes and features for 4.12” which includes support for building the kernel with Xen enabled but without enabling paravirtualization, a new 9pfs xen frontend driver(!), and support for EFI “reset_system” (needed for an ARMv8 Dom0 host to reboot), among various other fixes and cleanups.

Alex Williamson posted “VFIO updates for v4.12-rc1”.

Joerg Roedel posted “IOMMU Updates for Linux v4.12”, which includes some code optimizations for the Intel VT-d driver, code to “switch off a previously enabled Intel IOMMU” (presumably in order to place it into bypass mode for performance or other reasons?), and “ACPI/IORT updates and fixes” (which enable full support for ACPI IORT on 64-bit ARM).

Dmitry Torokhov posted “Input updates for v.4.11-rc0” which includes a documentation conversion to ReST (RST, the new kernel doc format), an update to the venerable Synaptics “PS/2” driver to be aware of companion “SMBus” devices, and various other miscellaneous fixes.

Darren Hart posted “platform-drivers-x86 for 4.12-1” which includes “a significantly larger and more complex set of changes than those of prior merge windows”. These include “several changes with dependencies on other subsystems which we felt were best managed through merges of immutable branches”.

James Bottomley posted “first round of SCSI updates for the 4.11+ merge window”, which includes many driver updates, but also comes with a warning to Linus that “The major thing you should be aware of is that there’s a clash between a char dev change in the char-misc tree (adding the new cdev_device_add) and the [change to] make checking the return value of scsi_device_get() mandatory”. Linus and Greg would later clarify what cdev_device_add does in response to Greg’s request to pull “Char/Misc driver patches for 4.12-rc1”.

David Miller posted “Networking” which includes many fixes.

David also posted “Sparc”, which includes a “bug fix for handling exceptions during bzero on some sparc64 cpus”.

David also posted “IDE”, which includes “two small cleanups”.

Greg K-H (Kroah-Hartman) posted “USB driver patches for 4.12-rc1”, which includes “Lots of good stuff here, after many many many attempts, the kernel finally has a working typeC interface, many thanks to Heikki and Guenter and others who have taken the time to get this merged. It wasn’t an easy path for them at all.” It will be interesting to test that out!

Greg K-H also posted “Driver core patches for 4.12-rc1”, which is “very tiny” this time around and consists mostly of documentation fixes, etc.

Greg K-H also posted “Char/Misc driver patches for 4.12-rc1” which features “lots of new drivers” including Google firmware drivers, FPGA drivers, etc. This led to a reaction from Linus about how the tree conflicted with James Bottomley’s tree (which he had already pulled, “as per James’ suggestion”), a back and forth between James and Greg about how to better handle such a conflict next time, and Linus noting that he prefers to fix merge conflicts himself but would “*also* really really prefer the two sides of the conflict having been more aware of the clash”, giving him a heads-up in the pull request.

Greg K-H also posted “Staging/IIO driver fixes for 4.12-rc1”, which adds “about 350k new lines of crap^Wcode, mostly all in a big dump of media drivers from Intel”. He notes that the Android low memory killer driver has finally been deleted “much to the celebration of the -mm developers”.

Greg K-H also posted “TTY patches for 4.12-rc1”, which wasn’t big.

Dan Williams posted “libnvdimm for 4.12” which includes “Region media error reporting [a generic interface more friendly to use with multiple namespaces]”, a new “struct dax_device” to allow drivers to have their own custom direct access operations, and various other updates. Dan also posted “libnvdimm: band aid btt vs clear poison locking”, a patch which “continues the 4.11 status quo of disabling of error clearing from the BTT [Block Translation Table] I/O path” and notes that “A solution for tracking and handling media errors natively in the BTT is needed”. The BTT or Block Translation Table is a mechanism used by NV-DIMMs to handle “torn sectors” (partially complete writes) in hardware during error or power failure. As the “btt.txt” in the kernel documentation notes, NV-DIMMs do not have the same atomicity guarantees as regular flash drives do. Flash drives have internal logic and store enough energy in capacitors to complete outstanding writes during a power failure (rotational drives have similar mechanisms for flushing their memory based caches and syncing remap block state), but NV-DIMMs are designed differently. Thus the BTT provides a level of indirection that is used to provide atomic sector semantics.

Separately, Dan posted “libnvdimm fixes for 4.12-rc1” which includes “incremental fixes and a small feature addition relative to the libnvdimm 4.12 pull request”. Geert had “noticed that tinyconfig was bloated by BLOCK selecting DAX [Direct Access]”, while “Vishal adds a feature that missed the initial pull due to pending review feedback. It allows the kernel to clear media errors when initializing a BTT (atomic sector update driver) instance on a pmem namespace”.

Dave Airlie posted “drm tegra for 4.12-rc1” containing additional updates because he had missed a pull from Thierry Reding for NVIDIA Tegra patches. He also followed up with a “drm document code of conduct” patch that documents the code of conduct for graphics development adopted by freedesktop.org.

Stafford Horne posted “Initramfs fix for 4.12-rc1” containing a fix “for an issue that has caused 4.11 to not boot on OpenRISC”.

Catalin Marinas posted “arm64 updates for 4.12” including kdump support, “ARMv8.3 HWCAP bits for JavaScript conversion instructions, complex numbers and weaker release consistency [memory ordering]”, and support for platform (non-enumerated bus) MSIs when using ACPI, among other patches. He also removed support for ASID-tagged VIVT [Virtually Indexed, Virtually Tagged] caches since there is “no ARMv8 implementation using it and [it is] deprecated in the architecture” [caches are PIPT – Physically Indexed, Physically Tagged – except that an implementation might internally use VIPT or other CAM optimizations].

Catalin later posted “arm64 2nd set of updates for 4.12”, which include “Silence module allocation failures when CONFIG_ARM*_MODULE_PLTS is enabled”.

Olof Johansson posted “ARM: SoC contents for 4.12 merge window”. In his pull request, Olof notes that “It’s been a relatively quiet release cycle here. Patch count is about the usual (818 commits, which includes merges).”
He goes on to add, “Besides dts [DeviceTree files], the mach-gemini cleanup by Linus Walleij is the only platform that pops up on its own”. He called out the separate post for the TEE [Trusted Execution Environment] subsystem. Olof also removed Alexandre Courbot and Stephen Warren from NVIDIA Tegra maintainership, and added Jon Hunter in their place.

Rob Herring posted “DeviceTree for 4.12”, which includes updates to the Device Tree Compiler (dtc), and more DeviceTree overlay unit tests, among various other changes.

Darrick J. Wong posted “xfs: updates for 4.12”, which includes the “big new feature for this release” of a “new space mapping ioctl that we’ve been discussing since LSF2016 [Linux Storage and Filesystem conference]”.

Max Filippov posted “Xtensa improvements for 4.12”.

Ted Ts’o posted “ext4 updates for 4.12”, which adds “GETFSMAP support” (discussed previously in this podcast) among other new features.

Ted also posted “fscrypt updates for 4.12” which has “only bug fixes”.

Paul Moore posted “Audit patches for 4.12” which includes 14 patches that “span the full range of fixes, new features, and internal cleanups”. These include a move to 64-bit timestamps, converting refcounts to the new refcount_t type from atomic_t, and so on.

Wolfram Sang posted “i2c for 4.12”.

Mark Brown posted “regulator updates for 4.12”, which includes “Quite a lot going on with the regulator API for this release, much more in the core than in the drivers for a change”. This includes “Fixes for voltage change propagation through dumb power switches, a notification when regulators are enabled, a new settling time property for regulators where the time taken to move to a new voltage is not related to the size of the change”, etc.

Mark also posted “SPI updates for 4.12”, which includes “quite a lot of small driver specific fixes and enhancements”.

Jessica Yu posted “module updates for 4.12”, containing minor fixes.

Mauro Carvalho Chehab posted “media updates” including mostly driver updates and the removal of “two staging LIRC drivers for obscure hardware”. He also posted a 5 part patch series entitled “Convert more books to ReST”, which converted three kernel DocBook format documentation file sets to RST, the new format being used for kernel documentation (discussed on the kernel-doc mailing list, and maintained by Jonathan Corbet of LWN): librs, mtdnand, and sh. He noted that “After this series, there will be just one DocBook pending conversion”: lsm (Linux Security Modules). He also notes that the existing LSM documentation is very out of date and no longer describes the current API.

Michael Ellerman posted “Please pull powerpc/linux.git powerpc-4.12-1 tag”, which includes support for “Larger virtual address space on 64-bit server CPUs. By default we use a 128TB virtual address space, but a process can request access to the full 512TB by passing a hint to mmap() [this seems very similar to the 56-bit la57 feature from Intel]”. It also includes “TLB flushing optimisations for the radix MMU on Power9” and “Support for CAPI cards on Power9, using the “Coherent Accelerator Interface Architecture 2.0″ [which definitely sounds like juicy reading]”.

Separately Michael Ellerman posted “Please pull powerpc-linux.git powerpc-4.12-2 tag” which includes “rework the Linux page table geometry to lower memory usage on 64-bit Book3S (IBM chips) using the Hash MMU [IBM uses a special inverse page tables “reverse lookup” hashing format]”.

Eric W. Biederman posted “namespace related changes for v4.12-rc1”, which includes a “set of small fixes that were mostly stumbled over during more significant development. This proc fix and the fix to posix-timers are the most significant of the lot. There is a lot of good development going on but unfortunately it didn’t quite make the merge window”.

Takashi Iwai posted “sound updates for 4.12-rc1”, noting that it was “a relatively calm development cycle, no scaring changes are seen”.

Steven Rostedt posted “tracing: Updates for v4.12” which includes “Pretty much a full rewrite of the process of function probes”. He followed up with “Three more updates for 4.12” that contained “three simple changes”.

Martin Schwidefsky posted “s390 patches for 4.12 merge window” which includes improvements to VFIO support on mainframe(!) [this author was recently amazed to see there are also DPDK ports for s390x], a new true random number generator, perf counters for the new z13 CPU, and many others besides.

Geert Uytterhoeven posted “m68k updates for 4.12” with a couple fixes.

Jacek Anaszewski posted “LED updates for 4.12” with various fixes.

Kees Cook posted “usercopy updates for v4.12-rc1” with a couple fixes.

Kees also posted “pstore updates for v4.12-rc1”, which included “large internal refactoring along with several smaller fixes”.

James Morris posted “Security subsystem updates for v4.12”.

Sebastian Reichel posted “hsi changes for hsi-4.12”.

Sebastian also posted “power-supply changes for 4.12”, which includes a couple of new drivers and various fixes.

Separately, Sebastian posted “power-supply changes for 4.12 (part 2)”, which includes some new drivers and some fixes.

Paolo Bonzini posted “First batch of KVM changes for 4.12 merge window” which includes kexec/kdump support on 32-bit ARM, support for a userspace virtual interrupt controller to handle the “weird” Raspberry Pi 3, in-kernel acceleration for VFIO on POWER, nested EPT support for accessed and dirty bits on x86, and many other fixes and improvements besides.

Separately Paolo posted “Second round of KVM changes for 4.12”, which include various ARM (32 and 64-bit) cleanups, support for PPC [POWER] XIVE (eXternal Interrupt Virtualization Engine), and “x86: nVMX improvements, including emulated page modification logging (PML) which brings nice performance improvements [under nested virtualization] on some workloads”.

Ilya Dryomov posted “Ceph updates for 4.12-rc1”, which include “support for disabling automatic rbd [RADOS block device] exclusive lock transfers” and “the long awaited -ENOSPC [no space] handling series”. The latter finally handles out of space situations by aborting with -ENOSPC rather than “having them [writers] block indefinitely”.

Miklos Szeredi posted “fuse updates for 4.12”, which “contains support for pid namespaces from Seth and refcount_t work from Elena”.

Miklos also posted “overlayfs update for 4.12”, which includes “making st_dev/st_ino on the overlay behave like a normal filesystem”. “Currently this only works if all layers are on the same filesystem, but future work will move the general case towards more sane behavior”.

Bjorn Helgaas posted “PCI changes for v4.12” which includes a framework for supporting PCIe devices in Endpoint mode from Kishon Vijay Abraham, fixes for using non-posted PCI config space on ARM from Lorenzo Pieralisi, allowing slots below PCI-to-PCIe “reverse bridges”, a bunch of quirks, and many other fixes and enhancements.

Jaegeuk Kim posted “f2fs for 4.12-rc1”, which “focused on enhancing performance with regards to block allocation, GC [Garbage Collection], and discard/in-place-update IO controls”.

Shuah Khan posted “Kselftest update for 4.12-rc1” with a few fixes.

Richard Weinberger posted “UML changes for v4.12-rc1” which includes “No new stuff, just fixes” to the “User Mode Linux” architecture in 4.12. Separately, Masami Hiramatsu posted an RFC patch entitled “Output messages to stderr and support quiet option” intended to “fix[] some boot time printf output to stderr by adding os_info() and os_warn(). The information-level messages via os_info() are suppressed when “quiet” kernel option is specified”.

Richard also posted “UBI/UBIFS updates for 4.12-rc1”, which “contains updates for both UBI and UBIFS”. It has a new CONFIG_UBIFS_FS_SECURITY option, among “minor improvements” and “random fixes”.

Thierry Reding posted “pwm: Changes for v4.12-rc1”, which amongst other things includes “a new driver for the PWM controller found on MediaTek SoCs”.

Vinod Koul posted “dmaengine updates” which includes “a smaller update consisting of support for TI DA8xx dma controller” among others.

Chris Mason posted “Btrfs” which “Has fixes and cleanups” as well as “The biggest functional fixes [being] between btrfs raid5/6 and scrub”.

Trond Myklebust posted “Please pull NFS client fixes for 4.12”, which includes various fixes, and new features (such as “Remove the v3-only data server limitation on pNFS/flexfiles”).

J. Bruce Fields posted “nfsd changes for 4.12”, which includes various RDMA updates from Chuck Lever.

Stephen Boyd posted “clk changes for v4.12”. Of the changes, the “biggest things are the TI clk driver rework to lay the groundwork for clkctrl support in the next merge window and the AmLogic audio/graphics clk support”.

Alexandre Belloni posted “RTC [Real Time Clock] for 4.12”, which uses a new GPG subkey that he also let Linus know about at the same time.

Nicholas A. Bellinger posted “target updates for v4.12-rc1”, which was “a lot more calm than previously expected. It’s primarily fixes in various areas, with most of the new functionality centering around TCMU [TCM – Linux iSCSI Target Support in Userspace] backend work”, which Xiubo Li has been driving.

Zhang Rui posted “Thermal management updates for v4.12-rc1”, which includes a number of fixes, as well as some new drivers, and a new interface in “thermal devfreq_cooling code so that the driver can provide more precise data regarding actual power to the thermal governor every time the power budget is calculated”.

4.12 git pulls for new subsystems and features

David Howells posted “Hardware module parameter annotation for secure boot” in which he requested that Linus pull in new “kmod” macros (the same name is used for the userspace module tooling, but in this case refers to the in-kernel kernel module infrastructure of the same name). The new macros add annotations to “module_param” of the new form “module_param_hw” with a “hwtype” such as “ioport” or “iomem”, and so forth. These are used by the kernel to prevent those parameters from being used under a UEFI Secure Boot situation in which the kernel is “locked down” (to prevent someone from loading a signed kernel image and then compromising it to circumvent the secure boot mechanism).

Arnd Bergmann sent a special pull request to Linus Torvalds for “TEE driver infrastructure and OP-TEE drivers”, which “introduces a generic TEE [Trusted Execution Environment] framework in the kernel, to handle trusted environ[ments] (security coprocessor or software implementations such as OP-TEE/TrustZone)”. He sent the pull separately from the other arm-soc pull specifically to call it out, and to make sure everyone knew that this was finally headed upstream, but he noted it would probably be maintained through the arm-soc kernel tree. He included a lengthy defense of why now was the right time to merge TEE support into upstream Linux.

Saving TLB flushes on Intel x86 Architecture

Andy Lutomirski posted an RFC patch series entitled “x86 TLB flush cleanups, moving toward PCID support”. Modern (non-legacy) architectures implement a per-process context identifier that can be used in order to tag VMA (Virtual Memory Area) translations that end up in the TLB (Translation Lookaside Buffer) caches within the microprocessor core. The processor’s hardware (or in some mostly embedded cases, software) (page table) “walkers” will navigate the page tables for a process and populate the TLBs (except in the embedded software case, such as on certain PowerPC and MIPS processors, in which the kernel contains special assembly routines to perform this in software). On legacy architectures, the TLB is fairly simple, containing a simple virtual address to physical (or intermediate, in the case of virtualization) address. But on more sophisticated architectures, the TLB includes address space identification information that allows the TLB to distinguish between hits to the same virtual address that are from two different processes (known as tasks from within the kernel). Using additional tagging in the TLB avoids the traditional need to invalidate the entire TLB on process context switch.

Modern architectures, such as AArch64, have implemented context tagging support in their architecture code for some time, and now x86 is finally set to follow, enabling a feature that has actually been present in x86 for some time (but was not wired up), thanks to Andy’s work on PCID (Process Context IDentifier) support. In his patch series, Andy notes that as he has been “polishing [his] PCID code, a major problem [he’s] encountered is that there are too many x86 TLB flushing code paths and that they have too many inconsequential differences”. This patch series aims to “clean up the mess”. Now if x86 finally gains hardware broadcast TLB invalidations it will also be able to remove the wasted IPIs (Inter-Processor-Interrupts) that it implements to cause remote processors to invalidate TLB entries, too. Linus liked Andy’s initial work, but said he is “always a bit nervous about TLB changes like this just because any potential bugs tend to be really really hard to see and catch”. Those of us who have debugged nasty TLB issues on other architectures would be inclined to agree with him.

Ongoing Development

Laurent Dufour posted version 3 of a patch series entitled “Speculative page faults”. This is a contemporary development inspired by Peter Zijlstra’s earlier work, which was based upon ideas of still others. The whole concept dates back to at least 2009 and generally involves removing the traditional locking constraints on updates to VMAs (Virtual Memory Areas) used by Linux tasks (processes) to represent the memory of running programs. Essentially, a “speculative fault” means “not holding mmap_sem” (a semaphore guarding a task’s current memory map). Laurent (and Peter) make VMA lookups lockless, and perform updates speculatively, using a seqlock to detect a change to the underlying VMA during the fault. “Once we’ve obtained the page and are ready to update the PTE, we validate if the state we started the fault with is still valid, if not, we’ll fail the fault with VM_FAULT_RETRY, otherwise we update the PTE and we’re done”. Earlier testing showed very significant performance upside to this work due to the reduced lock contention.

Aaron Lu posted “smp: do not send IPI if call_single_queue not empty”. The Linux kernel (and most others) uses a construct known as an IPI – or Inter-Processor-Interrupt – a form of software generated interrupt that a processor will send to one or more others when it needs them to perform some housekeeping work on the kernel’s behalf. Usually, this is to handle such things as a TLB shootdown (invalidating a virtual address translation in a remote processor due to a virtual address space being removed), especially on less sophisticated legacy architectures that do not feature invalidation of TLBs through hardware broadcast, though there are many other uses for IPIs. Aaron’s patch realizes, effectively, that if a remote processor is already going to process a queue of CSD (call_single_data) function calls it has been asked to via IPI then there is no need to send another IPI and generate additional interrupts – the queue will be drained of this entry as well as existing entries by the IPI management code.

Romain Perier posted version 8 of “Replace PCI pool by DMA pool API” which realizes that the current PCI pool API uses “simple macro functions direct expanded to the appropriate dma pool functions”, so it simply replaces them with a direct use of the corresponding DMA pool API instead.

Sandhya Bankar posted “vfs: Convert file allocation code to use the IDR”. This replaces existing filesystem code that allocates file descriptors using a custom allocator with Matthew (Willy) Wilcox’s idr (ID Radix) tree allocator.

Serge E. Hallyn posted a resend of version 2 of a patch series entitled “Introduce v3 namespaced file capabilities”. We covered this last time.

Heinrich Schuchardt posted “arm64: Always provide “model name” in /proc/cpuinfo”, which was quickly shot down (for the moment).

Christian König posted version 5 of his “Resizeable PCI BAR support” patch series. We have featured this in a previous episode of the podcast.

Prakash Sangappa posted “hugetlbfs ‘noautofill’ mount option” which aims to allow (optionally) for hugetlbfs pseudo-filesystems to be mounted with an option which will not automatically populate holes in files with zeros during a page fault when the file is accessed through the mapped address. This is intended to benefit applications such as Oracle databases, which make heavy use of such mechanisms but don’t take kindly to the kernel having side effects that change on-disk files, even if only zero fill. Dave Hansen pushed back against this change, saying that it was “further specializing hugetlbfs” and that Oracle should be using userfaultfd or “an madvise() option that disallows backing allocations”. Prakash replied that they had considered those but with a database there are such a large number of single threaded processes that “The concern with using userfaultf[d] is the overhead of setup and having an additional thread per process”.

Sameer Goel posted “arm64: Add translation functions for /dev/mem read/write” which “Port[s] architecture specific xlate [translate] and unxlate [untranslate] functions for /dev/mem read/write. This sets up the mapping for a valid physical address if a kernel direct mapping is not alread[y] present”. Depending upon the ARM platform, access to a bad address in /dev/mem could result in a synchronous exception in the core, or a System Error (SError) generated by a system memory controller interface. In either case, it is handled as a fatal error, where the same is not true on x86. While access to /dev/mem is restricted, increasingly being deprecated, and has other semantics to prevent its use on 64-bit ARM systems, it still exists and is used – in this case, to read the ACPI FPDT table, which provides performance pointer records. Nevertheless, both Will Deacon and Leif Lindholm objected to the reasoning given here, saying that the kernel should instead be taught how to parse this table and expose its information via /sys rather than having userspace tools go poking in /dev/mem to try to read from the table directly.

Minchan Kim posted “vmscan: scan pages until it f[inds] eligible pages” in which he notes that “There are premature OOM [Out Of Memory killer invocations] happening. Although there are ton of free swap and anonymous LRU list of eligible zones, OOM happened. With investigation, skipping page of isolate_lru_pages makes reclaim void because it returns zero nr_taken easily so LRU shrinking is effectively nothing and just increases priority aggressively. Finally, OOM happens”.

Julius Werner posted version 3 of his “Memconsole changes for new coreboot format” which teaches the Google firmware driver for their memconsole to deal with the newer type of persistent ring buffer console they introduced.

Olliver Schinagl and Jamie Iles had a back and forth about the latter’s work on “glue-code” (generic handling code) for the DW (DesignWare) 8250 (a type of serial port interface made popular by PC) IP block as used in many different designs. Depending upon how the block is configured, it can behave differently, and there was some discussion about how to handle that. In particular the location of the UART_USR register.

Xiao Guangrong posted “KVM: MMU: fast write protect” which “introduces a[n] extremely fast way to write protec[t] all the guest memory. Comparing with the ordinary algor[i]thm which write protects last level sptes [the page table entries used by the guest] based on the rmap [the “reverse” map, the means that Linux uses to encode page table information within the kernel] one by one, it just simply updates the generation number to ask all vCPUs to reload its root page table, particularly it can be done out of mmu-lock”. The idea was apparently originally proposed by Avi (Kivity). Paolo Bonzini thought “This is clever” and wondered “how the alternative write protection mechanism would affect performance of the dirty page ring buffer patches”. Xiao thought it could be used to speed up those patches after merging, too [Paolo noted that he aims to merge these early in 4.13 development].

Bogdan Mirea posted version 2 of “Add “Preserve Boot Time Support””, which follows up on a previous discussion about retaining “Boot Time Preservation between Bootloader and Linux Kernel. It is based on the idea that the Bootloader (or any other early firmware) will start the HW Timer and Linux Kernel will count the time starting with the cycles elapsed since timer start”. By “Bootloader” he means “firmware” to those who live in x86-land.

Igor Stoppa posted “post-init-read-only protection for data allocated dynamically” which aims to provide a mechanism for dynamically allocated data similar to the “__read_only” special linker section that certain annotated (using special GCC directives) code will be placed into. That works great for read-only data (which is protected by the MMU locking down the corresponding region early in boot). His “wish” is to start with the “policy DB of SE Linux and the LSM Hooks, but eventually I would like to extend the protection also to other subsystems, in a way that can merged into mainline.” His patch includes an analysis of how he feels he can be as “little invasive as possible”, noting that “In most, if not all, the cases that could be enhanced, the code will be calling kmalloc/vmalloc, including GFP_KERNEL [Get Free Pages of Kernel Type Memory] as the desired type of memory”. Consequently, he says, “I suspect/hope that the various maintainer[s] won’t object too much if my changes are limited to replacing GFP_KERNEL with some other macro, for example what I previously called GFP_LOCKABLE”. Michal Hocko had some feedback, largely along the lines that a “master toggle” (that would allow protection to be disabled for small periods in order to make changes to “read only” data) was largely pointless – due to it re-exposing the data. Instead, he wanted to see the protection being done at kmem_cache_create time by adding a “SLAB_SEAL” parameter that would later be enabled on a per kmem_cache basis using “kmem_cache_seal(cache)” or a similar mechanism.

Bharat Bhushan posted “ARM64/PCI: Allow userspace to mmap PCI resources”, which Lorenzo Pieralisi noted was already implemented by another patch.

A lengthy, and “spirited” discussion took place between Timur Tabi and the various maintainers of the 64-bit ARM Architecture and SoC platform trees over the desire for the maintainers to have changes to “defconfigs” for the architecture go through a special “arm@kernel.org” alias. Except that after they had told Timur to use that, they objected to him posting a patch informing others of this alias in the kernel documentation. Instead, as Timur put it “without a MAINTAINERS entry, how would anyone know to CC: that address? I posted 3 versions of my defconfig patchset before someone told me that I had to send it to arm@kernel.org.” The discussion thread is entitled “MAINTAINERS: add arm@kernel.org as the list for arm64 defconfig changes”.

Xunlei Pang posted version 3 of his “x86/mm/ident_map: Add PUD level 1GB page support” which helps “kernel_ident_mapping_init” to create a single very large identity page mapping in order to reduce TLB (Translation Lookaside Buffer – the caches that store virtual to physical memory lookups performed by hardware) pressure on an architecture that currently uses many 2MB (PMD – Page Middle Directory) level pages for this purpose.

Anju T Sudhakar posted version 8 of “IMC Instrumentation Support”, which provides support for POWER9’s “In-Memory-Collection” or IMC infrastructure, which “contains various Performance Monitoring Units (PMUs) at Nest level (these are on-chip but off-core), Core level and Thread level.”

Greg K-H (Kroah-Hartman) posted an RFC patch entitled “add more new kernel pointer filter options” which “implement[s] some new restrictions when printing out kernel pointers, as well as the ability to whitelist kernel pointers where needed.”

Kees Cook posted “x86/refcount: Implement fast refcount overflow protection”, which seeks to upstream a “modified version of the x86 PAX_REFCOUNT defense from PaX/grsecurity. This speeds up the refcount_t API by duplicating the existing atomic_t implementation with a single instruction added to detect if the refcount has wrapped past INT_MAX (or below 0) resulting in a negative value, where the handler then restores the refcount_t to INT_MAX”.

David Howells posted an RFC patch entitled “VFS: Introduce superblock configuration context”, which is a “set of patches to create a superblock configuration context prior to setting up a new mount, populating it with the parsed options/binary data, creating the superblock and then effecting the mount. This allows namespaces and other information to be conveyed through the mount procedure. It also allows extra error information”.

The Google Chromebook team let folks know that they were (rarely, like one in a million) seeing “Threads stuck in zap_pid_ns_processes()”. Guenter Roeck noted that the “Problem is that if the main task [which has children that are being ptraced] doesn’t exit, it [the child] hangs forever. Chrome OS (where we see the problem in the field, and the application is chrome) is configured to reboot on hung tasks – if a task is hung for 120 seconds on those systems, it tends to be in a bad shape. This makes it a quite severe problem for us”. He asked “Are there other conditions besides ptrace where a task isn’t reaped?”. Reaping refers to the collection of a child’s exit state by its parent; tasks orphaned by an exiting parent are reparented to the init task, which “reaps” them (cleans up and makes sure the state they exit with is seen). In this case, under ptrace, the parent task spawning the children “was outside of the pid namespace and was choosing not to reap the child”. Various proposals as to how to deal with this in the namespace code were discussed.

Mahesh Bandewar posted “kmod: don’t load module unless req process has CAP_SYS_MODULE”, which notes that “A process inside random user-ns [a user namespace] should not load a module, which is currently possible”. He shows how a user namespace can be created in which access to a file node indirectly causes the kernel to load a module. This could be a security risk if the approach were used to make a host kernel load a vulnerable but otherwise unloaded kernel driver through the privileged permissions in the namespace.

Marc Zyngier posted “irqdomain: Improve irq_domain_mapping facility”, in which he “Update[s] IRQ-domain.txt to document irq_domain_mapping” while otherwise seeking to make this kernel feature easier to access and understand.

Jens Axboe accepted a patch from Ulf Hansson adding Paolo Valente as a MAINTAINER of the BFQ I/O scheduler.

Cyrille Pitchen updated the git repos for the SPI NOR subsystem, which is “now hosted on MTD repos, spi-nor/next is on l2-mtd and spi-nor/fixes will be on linux-mtd”.

Alexandre Courbot posted “MAINTAINERS: remove self from GPIO maintainers”.

The folks at Codeaurora posted a lengthy analysis of the Linux kernel scheduler and specific problems with load_balance that will be covered next time around, along with work by Peter Zijlstra on the “cgroup/PELT overhaul (again)”.

Finally, Paul McKenney previously posted “Make SRCU be once again optional”, after having noted that the need to build it in by default (caused by other recent changes in header files) increased the kernel by 2K. Nico(las) Pitre was happy to hear this, saying “If every maintainer finds a way to (optionally) reduce the size of the code they maintain by 2K then we’ll get a much smaller kernel pretty soon”.

May 15, 2017 04:07 AM

May 13, 2017

Linux Plumbers Conference: Today is the very last day for Plumbers refereed track submissions

The submission site

https://linuxplumbersconf.org/2017/ocw/events/LPC2017TALKS/proposals

Will close at midnight pacific tonight


May 13, 2017 03:48 PM

May 09, 2017

Matthew Garrett: Intel AMT on wireless networks

More details about Intel's AMT vulnerability have been released - it's about the worst case scenario, in that it's a total authentication bypass that appears to exist independent of whether the AMT is being used in Small Business or Enterprise modes (more background in my previous post here). One thing I claimed was that even though this was pretty bad it probably wasn't super bad, since Shodan indicated that there were only a few thousand machines on the public internet accessible via AMT. Most deployments were probably behind corporate firewalls, which meant that it was plausibly a vector for spreading within a company but probably wasn't a likely initial vector.

I've since done some more playing and come to the conclusion that it's rather worse than that. AMT actually supports being accessed over wireless networks. Enabling this is a separate option - if you simply provision AMT it won't be accessible over wireless by default, you need to perform additional configuration (although this is as simple as logging into the web UI and turning on the option). Once enabled, there are two cases:

  1. The system is not running an operating system, or the operating system has not taken control of the wireless hardware. In this case AMT will attempt to join any network that it's been explicitly told about. Note that in default configuration, joining a wireless network from the OS is not sufficient for AMT to know about it - there needs to be explicit synchronisation of the network credentials to AMT. Intel provide a wireless manager that does this, but the stock behaviour in Windows (even after you've installed the AMT support drivers) is not to do this.
  2. The system is running an operating system that has taken control of the wireless hardware. In this state, AMT is no longer able to drive the wireless hardware directly and counts on OS support to pass packets on. Under Linux, Intel's wireless drivers do not appear to implement this feature. Under Windows, they do. This does not require any application level support, and uninstalling LMS will not disable this functionality. This also appears to happen at the driver level, which means it bypasses the Windows firewall.
Case 2 is the scary one. If you have a laptop that supports AMT, and if AMT has been provisioned, and if AMT has had wireless support turned on, and if you're running Windows, then connecting your laptop to a public wireless network means that AMT is accessible to anyone else on that network[1]. If it hasn't received a firmware update, they'll be able to do so without needing any valid credentials.

If you're a corporate IT department, and if you have AMT enabled over wifi, turn it off. Now.

[1] Assuming that the network doesn't block client to client traffic, of course


May 09, 2017 08:18 PM

Matthew Garrett: Intel's remote AMT vulnerability

Intel just announced a vulnerability in their Active Management Technology stack. Here's what we know so far.

Background

Intel chipsets for some years have included a Management Engine, a small microprocessor that runs independently of the main CPU and operating system. Various pieces of software run on the ME, ranging from code to handle media DRM to an implementation of a TPM. AMT is another piece of software running on the ME, albeit one that takes advantage of a wide range of ME features.

Active Management Technology

AMT is intended to provide IT departments with a means to manage client systems. When AMT is enabled, any packets sent to the machine's wired network port on port 16992 or 16993 will be redirected to the ME and passed on to AMT - the OS never sees these packets. AMT provides a web UI that allows you to do things like reboot a machine, provide remote install media or even (if the OS is configured appropriately) get a remote console. Access to AMT requires a password - the implication of this vulnerability is that that password can be bypassed.

Remote management

AMT has two types of remote console: emulated serial and full graphical. The emulated serial console requires only that the operating system run a console on that serial port, while the graphical console requires that the OS set a compatible video mode but is otherwise OS-independent[2]. However, an attacker who enables emulated serial support may be able to use that to configure grub to enable serial console. Remote graphical console seems to be problematic under Linux but some people claim to have it working, so an attacker would be able to interact with your graphical console as if you were physically present. Yes, this is terrifying.

Remote media

AMT supports providing an ISO remotely. In older versions of AMT (before 11.0) this was in the form of an emulated IDE controller. In 11.0 and later, this takes the form of an emulated USB device. The nice thing about the latter is that any image provided that way will probably be automounted if there's a logged in user, which probably means it's possible to use a malformed filesystem to get arbitrary code execution in the kernel. Fun!

The other part of the remote media is that systems will happily boot off it. An attacker can reboot a system into their own OS and examine drive contents at their leisure. This doesn't let them bypass disk encryption in a straightforward way[1], so you should probably enable that.

How bad is this

That depends. Unless you've explicitly enabled AMT at any point, you're probably fine. The drivers that allow local users to provision the system would require administrative rights to install, so as long as you don't have them installed then the only local users who can do anything are the ones who are admins anyway. If you do have it enabled, though…

How do I know if I have it enabled?

Yeah this is way more annoying than it should be. First of all, does your system even support AMT? AMT requires a few things:

1) A supported CPU
2) A supported chipset
3) Supported network hardware
4) The ME firmware to contain the AMT firmware

Merely having a "vPRO" CPU and chipset isn't sufficient - your system vendor also needs to have licensed the AMT code. Under Linux, if lspci doesn't show a communication controller with "MEI" or "HECI" in the description, AMT isn't running and you're safe. If it does show an MEI controller, that still doesn't mean you're vulnerable - AMT may still not be provisioned. If you reboot you should see a brief firmware splash mentioning the ME. Hitting ctrl+p at this point should get you into a menu which should let you disable AMT.

How about over Wifi?

Turning on AMT doesn't automatically turn it on for wifi. AMT will also only connect itself to networks it's been explicitly told about. Where things get more confusing is that once the OS is running, responsibility for wifi is switched from the ME to the OS and it forwards packets to AMT. I haven't been able to find good documentation on whether having AMT enabled for wifi results in the OS forwarding packets to AMT on all wifi networks or only ones that are explicitly configured.

What do we not know?

We have zero information about the vulnerability, other than that it allows unauthenticated access to AMT. One big thing that's not clear at the moment is whether this affects all AMT setups, setups that are in Small Business Mode, or setups that are in Enterprise Mode. If the latter, the impact on individual end-users will be basically zero - Enterprise Mode involves a bunch of effort to configure and nobody's doing that for their home systems. If it affects all systems, or just systems in Small Business Mode, things are likely to be worse.
We now know that the vulnerability exists in all configurations.

What should I do?

Make sure AMT is disabled. If it's your own computer, you should then have nothing else to worry about. If you're a Windows admin with untrusted users, you should also disable or uninstall LMS by following these instructions.

Does this mean every Intel system built since 2008 can be taken over by hackers?

No. Most Intel systems don't ship with AMT. Most Intel systems with AMT don't have it turned on.

Does this allow persistent compromise of the system?

Not in any novel way. An attacker could disable Secure Boot and install a backdoored bootloader, just as they could with physical access.

But isn't the ME a giant backdoor with arbitrary access to RAM?

Yes, but there's no indication that this vulnerability allows execution of arbitrary code on the ME - it looks like it's just (ha ha) an authentication bypass for AMT.

Is this a big deal anyway?

Yes. Fixing this requires a system firmware update in order to provide new ME firmware (including an updated copy of the AMT code). Many of the affected machines are no longer receiving firmware updates from their manufacturers, and so will probably never get a fix. Anyone who ever enables AMT on one of these devices will be vulnerable. That's ignoring the fact that firmware updates are rarely flagged as security critical (they don't generally come via Windows update), so even when updates are made available, users probably won't know about them or install them.

Avoiding this kind of thing in future

Users ought to have full control over what's running on their systems, including the ME. If a vendor is no longer providing updates then it should at least be possible for a sufficiently desperate user to pay someone else to do a firmware build with the appropriate fixes. Leaving firmware updates at the whims of hardware manufacturers who will only support systems for a fraction of their useful lifespan is inevitably going to end badly.

How certain are you about any of this?

Not hugely - the quality of public documentation on AMT isn't wonderful, and while I've spent some time playing with it (and related technologies) I'm not an expert. If anything above seems inaccurate, let me know and I'll fix it.

[1] Eh well. They could reboot into their own OS, modify your initramfs (because that's not signed even if you're using UEFI Secure Boot) such that it writes a copy of your disk passphrase to /boot before unlocking it, wait for you to type in your passphrase, reboot again and gain access. Sealing the encryption key to the TPM would avoid this.

[2] Updated after this comment - I thought I'd fixed this before publishing but left that claim in by accident.

(Updated to add the section on wifi)

(Updated to typo replace LSM with LMS)

(Updated to indicate that the vulnerability affects all configurations)


May 09, 2017 08:04 PM

May 07, 2017

Linux Plumbers Conference: Submission deadline for Linux Plumbers Conference refereed track proposals extended by a week

The deadline for submitting refereed track proposals for the 2017 Linux Plumbers Conference has been extended until 13 May at 11:59PM Pacific Time.  The refereed track will have 50-minute presentations on a specific aspect of Linux “plumbing” (e.g. core libraries, media creation/playback, display managers, init systems, kernel APIs/ABIs, etc.) that are chosen by the LPC committee to be given during all three days of the conference.

May 07, 2017 09:26 PM

May 04, 2017

Linux Plumbers Conference: Android/Mobile Microconference Accepted into Linux Plumbers Conference

Android continues to find interesting new applications and problems to
solve, both within and outside the mobile arena.  Mainlining continues
to be an area of focus, as do a number of areas of core Android
functionality, including the kernel.  Other areas where there is ongoing
work include eBPF, Lowmemory alternatives, the Android emulator, and
SDCardFS.

For the latest details, please see this microconference’s wiki page

http://wiki.linuxplumbersconf.org/2017:android_mobile

We hope to see you there!

May 04, 2017 07:25 PM

Dave Airlie: how much of the conformance test suite does radv pass now?

Test run totals:
Passed: 109293/150992 (72.4%)
Failed: 0/150992 (0.0%)
Not supported: 41697/150992 (27.6%)
Warnings: 2/150992 (0.0%)

This is effectively a pass. The Not Supported stuff isn't missing features as uneducated people are quick to spout, it's more stuff the hardware doesn't support or is pointless to expose on the hardware. (lots of image formats).

This is the results from the Vulkan CTS 1.0.2 branch, against mesa master with one patch (a workaround for some InternalErrors that CTS throws up).

Do not call the driver conformant as that is against the Khronos rules as we haven't paid or filed for approval, but the driver does now effectively pass the latest conformance test suite. I'll update on things if that changes.

Thanks again to everyone involved.

May 04, 2017 04:22 AM

May 03, 2017

Pete Zaitcev: AWS price reductions

Amazon announced another price cut today and analysts are all about "there's more margin in VMs", "Microsoft Azure finds it tough to compete", and so on.

Being in storage I'm obviously biased, but I think the price cuts in S3 and Glacier are more consequential. All that data is like savings account for Amazon. Any time they need cash, they can just increase prices right back, and what are you going to do? Data is not VMs, you cannot just download it all overnight, you cannot migrate off the service.

Obviously things are not that simple, and they are trying to bind code to AWS in ways that yield similar inertia to the data. As Adrian Cockcroft tweeted, "OpenStack not closing the gap". Cut him some slack though, he was at Netflix, and it was 4 years ago. One way or the other, data's inertia is natural and code's inertia is artificial — it's something engineers can fix.

May 03, 2017 08:05 PM

Michael Kerrisk (manpages): man-pages-4.11 is released

I've released man-pages-4.11. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from over 30 contributors. It includes more than 300 commits changing over 100 pages. The changes include the addition of 5 pages, significant rewriting of 1 other page, and enhancements to many other pages.

Among the more significant changes in man-pages-4.11 are the following:

May 03, 2017 07:59 PM

Michael Kerrisk (manpages): man-pages-4.09 is released

I've released man-pages-4.09. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from 44 contributors. This is one of the more substantial releases in recent times, with more than 500 commits changing around 190 pages. The changes include the addition of eight new pages and significant enhancements or rewrites to many existing pages.

Among the more significant changes in man-pages-4.09 are the following:

In addition to the above, substantial changes were also made to the close(2), getpriority(2), nice(2), timer_create(2), timerfd_create(2), random(4), and proc(5) pages.

May 03, 2017 07:42 PM

Michael Kerrisk (manpages): man-pages-4.10 is released

I've released man-pages-4.10. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from over 40 contributors. This release sees a large number of changes: over 600 commits changing around 160 pages. The changes include the addition of 11 pages, significant rewrites of 3 other pages, and enhancements to many other pages.

Among the more significant changes in man-pages-4.10 are the following:

May 03, 2017 07:41 PM

May 02, 2017

Kees Cook: security things in Linux v4.11

Previously: v4.10.

Here’s a quick summary of some of the interesting security things in this week’s v4.11 release of the Linux kernel:

refcount_t infrastructure

Building on the efforts of Elena Reshetova, Hans Liljestrand, and David Windsor to port PaX’s PAX_REFCOUNT protection, Peter Zijlstra implemented a new kernel API for reference counting with the addition of the refcount_t type. Until now, all reference counters were implemented in the kernel using the atomic_t type, but it has a wide and general-purpose API that offers no reasonable way to provide protection against reference counter overflow vulnerabilities. With a dedicated type, a specialized API can be designed so that reference counting can be sanity-checked and provide a way to block overflows. With 2016 alone seeing at least a couple public exploitable reference counting vulnerabilities (e.g. CVE-2016-0728, CVE-2016-4558), this is going to be a welcome addition to the kernel. The arduous task of converting all the atomic_t reference counters to refcount_t will continue for a while to come.

CONFIG_DEBUG_RODATA renamed to CONFIG_STRICT_KERNEL_RWX

Laura Abbott landed changes to rename the kernel memory protection feature. The protection hadn’t been “debug” for over a decade, and it covers all kernel memory sections, not just “rodata”. Getting it consolidated under the top-level arch Kconfig file also brings some sanity to what was a per-architecture config, and signals that this is a fundamental kernel protection needed to be enabled on all architectures.

read-only usermodehelper

A common way attackers use to escape confinement is by rewriting the user-mode helper sysctls (e.g. /proc/sys/kernel/modprobe) to run something of their choosing in the init namespace. To reduce attack surface within the kernel, Greg KH introduced CONFIG_STATIC_USERMODEHELPER, which switches all user-mode helper binaries to a single read-only path (which defaults to /sbin/usermode-helper). Userspace will need to support this with a new helper tool that can demultiplex the kernel request to a set of known binaries.

seccomp coredumps

Mike Frysinger noticed that it wasn’t possible to get coredumps out of processes killed by seccomp, which could make debugging frustrating, especially for automated crash dump analysis tools. In keeping with the existing documentation for SIGSYS, which says a coredump should be generated, he added support to dump core on seccomp SECCOMP_RET_KILL results.

structleak plugin

Ported from PaX, I landed the structleak plugin which enforces that any structure containing a __user annotation is fully initialized to 0 so that stack content exposures of these kinds of structures are entirely eliminated from the kernel. This was originally designed to stop a specific vulnerability, and will now continue to block similar exposures.

ASLR entropy sysctl on MIPS

Matt Redfearn implemented the ASLR entropy sysctl for MIPS, letting userspace choose to crank up the entropy used for memory layouts.

NX brk on powerpc

Denys Vlasenko fixed a long-standing bug where the kernel made assumptions about ELF memory layouts and defaulted the brk section on powerpc to be executable. Now it’s not, and that’ll keep process heap from being abused.

That’s it for now; please let me know if I missed anything. The v4.12 merge window is open!

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

May 02, 2017 09:17 PM

May 01, 2017

Matthew Garrett: Looking at the Netgear Arlo home IP camera

Another in the series of looking at the security of IoT type objects. This time I've gone for the Arlo network connected cameras produced by Netgear, specifically the stock Arlo base system with a single camera. The base station is based on a Broadcom 5358 SoC with an 802.11n radio, along with a single Broadcom gigabit ethernet interface. Other than it only having a single ethernet port, this looks pretty much like a standard Netgear router. There's a convenient unpopulated header on the board that turns out to be a serial console, so getting a shell is only a few minutes work.

Normal setup is straightforward. You plug the base station into a router, wait for all the lights to come on and then you visit arlo.netgear.com and follow the setup instructions - by this point the base station has connected to Netgear's cloud service and you're just associating it to your account. Security here is straightforward: you need to be coming from the same IP address as the Arlo. For most home users with NAT this works fine. I sat frustrated as it repeatedly failed to find any devices, before finally moving everything behind a backup router (my main network isn't NATted) for initial setup. Once you and the Arlo are on the same IP address, the site shows you the base station's serial number for confirmation and then you attach it to your account. Next step is adding cameras. Each base station is broadcasting an 802.11 network on the 2.4GHz spectrum. You connect a camera by pressing the sync button on the base station and then the sync button on the camera. The camera associates with the base station via WPS and now you're up and running.

This is the point where I get bored and stop following instructions, but if you're using a desktop browser (rather than using the mobile app) you appear to need Flash in order to actually see any of the camera footage. Bleah.

But back to the device itself. The first thing I traced was the initial device association. What I found was that once the device is associated with an account, it can't be attached to another account. This is good - I can't simply request that devices be rebound to my account from someone else's. Further, while the serial number is displayed to the user to disambiguate between devices, it doesn't seem to be what's used internally. Tracing the logon traffic from the base station shows it sending a long random device ID along with an authentication token. If you perform a factory reset, these values are regenerated. The device to account mapping seems to be based on this random device ID, which means that once the device is reset and bound to another account there's no way for the initial account owner to regain access (other than resetting it again and binding it back to their account). This is far better than many devices I've looked at.

Performing a factory reset also changes the WPA PSK for the camera network. Newsky Security discovered that doing so originally reset it to 12345678, which is, uh, suboptimal? That's been fixed in newer firmware, along with their discovery that the original random password choice was not terribly random.

All communication from the base station to the cloud seems to be over SSL, and everything validates certificates properly. This also seems to be true for client communication with the cloud service - camera footage is streamed back over port 443 as well.

Most of the functionality of the base station is provided by two daemons, xagent and vzdaemon. xagent appears to be responsible for registering the device with the cloud service, while vzdaemon handles the camera side of things (including motion detection). All of this is running as root, so in the event of any kind of vulnerability the entire platform is owned. For such a single purpose device this isn't really a big deal (the only sensitive data it has is the camera feed - if someone has access to that then root doesn't really buy them anything else). They're statically linked and stripped so I couldn't be bothered spending any significant amount of time digging into them. In any case, they don't expose any remotely accessible ports and only connect to services with verified SSL certificates. They're probably not a big risk.

Other than the dependence on Flash, there's nothing immediately concerning here. What is a little worrying is a family of daemons running on the device and listening to various high numbered UDP ports. These appear to be provided by Broadcom and a standard part of all their router platforms - they're intended for handling various bits of wireless authentication. It's not clear why they're listening on 0.0.0.0 rather than 127.0.0.1, and it's not obvious whether they're vulnerable (they mostly appear to receive packets from the driver itself, process them and then stick packets back into the kernel so who knows what's actually going on), but since you can't set one of these devices up in the first place without it being behind a NAT gateway it's unlikely to be of real concern to most users. On the other hand, the same daemons seem to be present on several Broadcom-based router platforms where they may end up being visible to the outside world. That's probably investigation for another day, though.

Overall: pretty solid, frustrating to set up if your network doesn't match their expectations, wouldn't have grave concerns over having it on an appropriately firewalled network.

(Edited to replace a mistaken reference to WDS with WPS)


May 01, 2017 06:17 PM

April 27, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/27

Audio: http://traffic.libsyn.com/jcm/20170427.mp3

In this week’s edition: Linux 4.11-rc8, updating kernel.org cross compilers, Intel 5-level paging, v3 namespaced file capabilities, and ongoing development.

Editorial Notes

Apologies for the delay to this week’s podcast. I got flu around the time I was preparing last week’s podcast, limped along to the weekend, and then had to stay in bed for a long time. On the other hand, it let me play with a bunch of new SDRs (HackRF, RTL-SDR, and friends, for the curious) on Sunday when I skipped the 5K I was supposed to run 🙂

I would also like to note my thanks for the first 10,000 downloads of the new series of this podcast. It’s a work in progress. I am going to make (positive!) changes over the coming months, including a web interface that will track all LKML posts and allow for community-directed collaboration on creating this (and hopefully other) podcasts. I will include automatic patch tracking (showing when patches have landed in upstream trees, and so on), info on post authors, and allow you to edit personal bios, links, employer info, and so on. After some discussions around the best way to handle author employer attribution (to make sure everyone is treated fairly), I’ve decided to take a little time away from including employer names until I have a populated database of mappings. Jon Corbet from LWN has something similar already, which I believe is also stored in git, but there’s more to be done here (thanks to Alex and others for the G+ feedback and discussion on this).

Linux 4.11-rc8

Linus Torvalds announced Linux 4.11-rc8, saying “So originally I was just planning on releasing the final 4.11 today, but while we didn’t have a *lot* of changes the last week, we had a couple of really annoying ones, so I’m doing another rc release instead”. As he also notes, “The most noticeable of the issues is that we’ve quirked off some NVMe power management that apparently causes problems on some machines. It’s not entirely clear what caused the issue (it wasn’t just limited to some NVMe hardware, but also particular platforms), but let’s test it”.

With the release of Linux 4.11-rc8 comes that impending moment of both elation and dread that is a final kernel. It’ll be great to see 4.11 out there. It’s an awesome kernel, with lots of new features, and it will be well summarized in kernelnewbies and elsewhere. But upon its release comes the opening of the merge window for 4.12. Tracking that was exciting for 4.11. Hopefully it doesn’t finish me off trying to do that for 4.12 😉

Geert Uytterhoeven posted “Build regressions/improvements in v4.11-rc8”, in which he noted that (compared with v4.10) an additional build error and several hundred more warnings had recently been added to the kernel. The error he points to is in the AVR32 architecture when applying a relocation in the linker, probably due to an unsupported offset.

Announcements

Greg K-H (Kroah-Hartman) announced Linux 4.4.64, 4.9.25, and 4.10.13

Junio C Hamano announced Git v2.13.0-rc1

Alex Williams posted “Generic DMA-capable streaming device driver looking for home” in which he describes some generic features of his device (the ability to “carry generic data to/from userspace”) and inquired as to where it should live in the kernel. It could do with some followup.

Updating kernel.org cross compilers

Andre Przywara inquired as to the state of the kernel.org cross compilers. This was a project, initiated by Tony Breeds and located on kernel.org, to maintain current Intel x86 Architecture builds of cross compiler toolchains for various architecture targets (a cross compiler is one that runs on one architecture, targeting another, which is incidentally different from a “Canadian cross” compiler – look it up if you’re ever bored or want to bootstrap compilers for fun). It was a great project, but like so many others one day (three years ago) there were no more updates. That is something Andre would like to see changed. He posted, noting that many people still use the compilers on kernel.org (including yours truly, in a pinch) and that “The latest compiler I find there is 4.9.0, which celebrated its third birthday at the weekend, also has been superseded by 4.9.4 meanwhile”.

Andre used build scripts from Segher Boessenkool to build binutils (the GNU binary utilities, including the assembler) 2.28 and GCC (the GNU Compiler Collection) 6.3.0. With some tweaks, he was able to build for “all architectures except arc, m68k, tilegx and tilepro”. He wondered “what the process is to get these [the compilers linked from the kernel website] updated?”. It seems like he is keen to clean this up, which is to be commended and encouraged. And hopefully (since he works for ARM) that will eventually also include cross compiler targets for x86 that run on ARMv8 server systems.

Intel 5-level paging

Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he provides an “updated version [of] the fourth and the last bunch of [] patches that brings initial 5-level paging enabling.” This is in support of Intel’s “la57” feature of future microprocessors that allows them to exceed the traditional 48-bit “Canonical Addressing” in order to address up to 56 bits of Virtual Address space (a big benefit to those who want to map large non-volatile storage devices and accelerators into virtual memory). His latest patch series includes a fix for a “KASLR [Kernel Address Space Layout Randomization] bug due to rewriting [] startup_64() in C”.

Separately, John Paul Adrian Glaubitz inquired about Kirill’s patch series, saying, “I recently read the LWN article on your and your colleagues work to add five-level page table support for x86 to the Linux kernel. Since this extends the address space beyond 48-bits, as you know, it will cause potential headaches with Javascript engines which use tagged pointers. On SPARC, the virtual address space already extends to 52 bits and we are running into these very issues with Javascript engines on SPARC”.

He goes on to discuss passing the “hint” parameter to mmap() “in order to tell the kernel not to allocate memory beyond the 48 bits address space. Unfortunately, on Linux this will only work when the area pointed to by “hint” is unallocated which means one cannot simply use a hardcoded “hint” to mitigate this problem”. What he means here is that the mmap call to map a virtual memory area into a userspace process allows an application to specify where it would like that mapping to occur, but Linux isn’t required to respect this. Contemporary Linux implements “MAP_FIXED” as an option to mmap, which will either map a region where requested or explicitly fail (as Andy Lutomirski pointed out). This is different from a legacy behavior where Linux used to take a hint and might just not respect placement (as Andi Kleen alluded to in followup).

This whole discussion is actually the reason that Kirill had (thoughtfully) already included a feature bit setting in his patches that allows an application to effectively override the existing kernel logic and always allocate below 48 bits (preserving as close to existing behavior as possible on a per application basis while allowing a larger VA elsewhere). The thread resulted in this being pointed out, but it’s a timely reminder of the problems faced as the pressure continues upon architectures to grow their VA (Virtual Address) space size.

Often, efforts at growing virtual memory address spaces run up against uses of the higher order bits that were never sanctioned but are in widespread use. Many people strongly dislike pointer tagging of this kind (your author included), but it is not going away. It is great that Kirill’s patches have a form of solution that can be used for the time being by applications that want to retain a smaller address space, but that’s framed in the context of legacy support, not to enable runtimes to continue to use high order bits forevermore.

Introduce v3 namespaced file capabilities

Serge E. Hallyn posted “Introduce v3 namespaced file capabilities”. Linux includes a comprehensive capability mechanism that allows applications to limit what privileged operations may be performed by them. In the “good old days” when Unix hacker beards were more likely than today’s scruffy look, root was root and nobody really cared about remote compromise because they were still fighting having to have login passwords at all. But in today’s wonderful world of awesome, in which anything not bolted down is often not long for this world, “root” can mean very little. The traditionally privileged users can be extremely restricted by security policy frameworks, such as SELinux, but even more fundamentally can be subject to restrictions imposed by the growth in use of “capabilities”.

A classic example of a capability is CAP_NET_RAW, which the “ping” utility needs in order to create a raw socket. Traditionally, such utilities were created on Unix and Linux filesystems as “setuid root”, which means that they had the “s” bit set in their permissions to “run as root” when they were executed by regular users. This allowed the utility to operate, but it also allowed any user who could trick the utility into providing a shell to conveniently gain a root login. Many security exploits over the years later, we have filesystem capabilities, which allow binaries to exist on disk tagged with just those extra capabilities they require to get the job done, through the filesystem’s “xattr” extended attributes. “ping” has CAP_NET_RAW, so it can create raw sockets, but it doesn’t need to run as root, so it isn’t marked “setuid root” on modern distros.

Fast forward still further into the modern era of containers and namespaces, and things get more complex. As Serge notes in his patch, “Root in a non-initial user ns [namespace] cannot be trusted to write a traditional security.capability xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the xattr, and execute the file with privilege on the host”. However, as he also notes, “supporting file capabilities in a user namespace is very desirable. Not doing so means that [any] programs designed to run with limited privilege must continue to support other methods of gaining and dropping privilege. For instance a program installer must detect whether file capabilities can be assigned, and assign them if so but set setuid-root otherwise. The program in turn must know how to drop partial capabilities [which is a mess to get right], and do so only if setuid-root”. This is, of course, far from desirable.

In the patch series, Serge “builds a vfs_ns_cap_data struct by appending a uid_t [user ID] rootid to struct vfs_cap_data. This is the absolute uid_t (that is, the uid_t in the user namespace which mounted the filesystem, usually init_user_ns [the global default]) of the root id in whose namespace the file capabilities may take effect”. He then rewrites xattrs within the namespace for unprivileged “root” users with the appropriate notion of capabilities for that environment (in a “v3” xattr that is transparently converted to/from the conventional “v2” security.capability xattr), in accordance with capabilities that have been granted to the namespace from outside by a process holding CAP_SETFCAP. This allows capability use without undermining host system security and seems like a nice solution.

Ongoing Development

Ashish Kalra posted “Fix BSS corruption/overwrite issue in early x86 kernel setup”. The BSS (Block Started by Symbol) is the longstanding name used to refer to statically allocated (and pre-zeroed) variables that have memory set aside at compile time. It’s a common feature of almost every ELF (Executable and Linking Format) Linux binary you will come across, the kernel not being much different. Linux also uses stacks for small runtime allocations: a page (or several) of memory with a stack pointer that descends in address (it’s actually a “fully descending” type of stack) as more (small) items are allocated within it. At boot time, the kernel typically expects that the bootloader will have set up a stack that can be used for very early code, but Linux is willing to handle its own setup if the bootloader isn’t sophisticated enough. The latter code isn’t well exercised, and it turns out it doesn’t reserve quite enough space, which causes the stack to descend into (collide with) the BSS segment, resulting in corruption. Ashish fixes this by increasing the fallback stack allocation size from 512 to 1024 bytes in arch/x86/boot/boot.h.

Vladimir Murzin posted “ARM: Fix dma_alloc_coherent()” and friends for NOMMU”, noting “It seem that addition of cache support for M-class CPUs uncovered [a] latent bug in DMA usage. NOMMU memory model has been treated as being always consistent; however, for R/M [Real Time and Microcontroller] classes [of ARM cores] memory can be covered by MPU [Memory Protection Unit] which in turn might configure RAM as Normal i.e. bufferable and cacheable. It breaks dma_alloc_coherent() and friends, since data can stuck in caches”.

Andrew Pinski posted “arm64/vdso: Rewrite gettimeofday into C”, which improves performance by up to 32% when compared to the existing in-kernel implementation on a Cavium ThunderX system (because there are division operations that the compiler can optimize). On their next generation, it apparently improves performance by 18% while also benefitting other ARM platforms that were tested. This is a significant improvement since that function is often called by userspace applications many times per second.

Baoquan He posted “x86/KASLR: Use old ident map page table if physical randomization failed”. Dave Young discovered a problem with the physical memory map setup of kexec/kdump kernels when KASLR (Kernel Address Space Layout Randomization) is enabled. KASLR does what it says on the tin. It applies a level of randomization to the placement of (most) physical pages of the kernel such that it is harder for an attacker to guess where in memory the kernel is located. This reduces the ability for “off the shelf” buffer overflow/ROP/similar attacks to leverage known kernel layout. But when the kernel kexec’s into a kdump kernel upon a crash, it’s loading a second kernel while attempting to leave physical memory not allocated to the crash kernel alone (so that it can be dumped). This can lead to KASLR allocation failures in the crash kernel, which (until this patch) would result in the crash kernel not correctly setting up an identity mapping for the original (older) kernel, resulting in immediately resetting the machine. With the patch, the crash kernel will fallback to the original kernel’s identity mapping page tables when KASLR setup fails.

On a separate, but related, note, Xunlei Pang posted “x86_64/kexec: Use PUD level 1GB page for identity mapping if available” which seeks to change how the kexec identity mapping is established, favoring a new top-level 1GB PUD (Page Upper Directory) allocation for the identity mappings needed prior to booting into the new kernel. This can save considerable memory (128MB “On one 32TB machine”…) vs using the current approach of many 2MB PTEs (Page Table Entries) for the region. Rather than many PTEs, an effective huge page can be mapped. PTEs are grouped into “directories” in memory that the microprocessor’s walker engines can navigate when handling a “page fault” (the process of loading the TLB – Translation Lookaside Buffer – and microTLB caches). Middle Directories are collections of PTEs, and these are then grouped into even larger collections at upper levels, depending upon nesting depth. For more about how paging works, see Mel Gorman’s “Linux Memory Management”, a classic text that is still very much relevant for the fundamentals.

Janakarajan Natarajan posted “Prevent timer value 0 for MWAITX”, which prevents the kernel from providing a value of zero to the privileged x86 “MWAITX” instruction. MWAIT (Memory Wait) is a family of instructions on contemporary x86 systems that allows the kernel to temporarily block execution (in place of a spinloop, or other solution) until a memory location has been updated. Various trickery at the micro-architectural level (a dedicated engine in the core that snoops for updates to that memory address) then handles resuming execution later. This is intended for waiting relatively short periods of time in an energy-efficient and high-performance (low wakeup latency) manner. The instruction accepts a timeout period after which a wakeup will happen regardless, but it can also accept a zero parameter. Zero is supposed to mean “never timeout” (i.e. always wait for the memory update). It turns out that existing Linux kernels do use zero on some occasions, incorrectly, and that this isn’t noticed on older microprocessors due to other events eventually triggering a wakeup regardless. On the new AMD Zen core, which behaves correctly, MWAITX may never wake up with a zero parameter, and this was causing NMI soft lockup warnings. The patch corrects Linux to do the right thing, removing the zero option.

Paul E. McKenney posted “Make SRCU be built by default”. SRCU (Sleepable) RCU (Read Copy Update) is an optional feature of the Linux kernel that provides an implementation of RCU which can sleep. Conventionally, RCU had spinlock semantics (it could not sleep). By definition, its purpose was to provide a cunning lockless update mechanism for data structures, relying upon the passage of a “grace period” defined by every processor having gone into the scheduler once (a gross simplification of RCU). But under some circumstances (for example, in a Real Time kernel) there is a need for a sleepable (and pre-emptable, but that’s another issue) RCU. And so SRCU was created more than 8 years ago. It has a companion in “Tiny SRCU” for embedded systems. A “surprisingly common case” exists now where parts of the kernel are including srcu.h so Paul’s patch builds it by default.

Laurent Dufour posted “BUG raised when onlining HWPoisoned page” in which he noted that the (being onlined) page “has already the mem_cgroup field set” (this is shown in the stack trace he posts with “page dumped because: page still charged to cgroup”). He cleans this up by clearing the mem_cgroup when a page is poisoned. His second patch skips poisoned pages altogether when performing a memory block onlining operation.

Laurent also posted an RFC (Request For Comment) patch series entitled “Replace mmap_sem by a range lock” which “implements the first step of the attempt to replace the mmap_sem by a range lock”. We will summarize this patch series in more detail the next time it is posted upstream.

Christian König posted version 4 of his “Resizable PCI BAR support” patches. PCI (and its derivatives, such as PCI Express) use BARs (Base Address Registers) to convey regions of the host physical memory map that the device will use to map in its memory. BARs themselves are just registers, but the memory they refer to must be linearly placed into the physical map (or interim IOVA map in the case that the BAR is within a virtual machine). Fitting large, multi GB windows can be a challenge, sometimes resulting in failure, but many devices can also manage with smaller memory windows. Christian’s patches attempt to provide for the best of both by adding support for a contemporary feature of PCI (Express) that allows devices with such an ability to convey a minimal BAR size and then increase the allocation if that is available. His changes since version 3 include “Fail if any BAR is still in use…”.

Ying Huang posted version 10 of his “THP swap: Delay splitting THP during swapping out” which allows for swapping of Transparent Huge Pages directly. We have previously covered iterations of this patch series. The latest changes are minimal, suggesting this is close to being merged.

Jérôme Glisse posted version 21 of his “Heterogeneous Memory Management” (HMM) patch series. This is very similar to the version we covered last week. As a reminder, HMM provides an API through which the kernel can manage devices that want to share memory with a host processing environment in a more seamless fashion, using shared address spaces and regular pointers. His latest version changes the concept of “device unaddressable” memory to “device private” (MEMORY_DEVICE_PRIVATE vs MEMORY_DEVICE_PUBLIC) memory, following the feedback from Dan Nellans that devices are changing over time such that “memory may not remain CPU-unaddressable in the future” and that, even though this would likely result in subsequent changes to HMM, it was worthwhile starting out with nomenclature correctly referring to memory that is considered private to a device and will not be managed by HMM.

Intel’s test robot noticed a 12.8% performance improvement in one of their scalability benchmarks when running with a recent linux-next tree containing Al Viro’s “amd64: get rid of zeroing” patch. This is part of his larger “uaccess unification” patch series, which aims to simplify and clean up the process of copying data to/from kernel and userspace. In particular, when asking the kernel to copy data from one userspace virtual address to another, there is no need to apply the level of data zeroing that typically applies to buffers the kernel copies (for security purposes – preventing leakage of extra data beyond structures returned from kernel calls, for example). When both source and destination are already in userspace, there is no security issue, but there was a performance degradation that Viro had noticed and fixed.

Julien Grall posted “Xen: Implement EFI reset_system callback”, which provides a means to correctly reboot and power off Dom0 host Xen Hypervisors when running on EFI systems for which reset_system is used by reference (ARM).

 

April 27, 2017 03:50 PM

April 26, 2017

Michael Kerrisk (manpages): Linux Security and Isolation APIs course in Munich (17-19 July 2017)

I've scheduled the first public instance of my "Linux Security and Isolation APIs" course to take place in Munich, Germany on 17-19 July 2017. (I've already run the course a few times very successfully in non-public settings.) This three-day course provides a deep understanding of the low-level Linux features (set-UID/set-GID programs, capabilities, namespaces, cgroups, and seccomp) used to build container, virtualization, and sandboxing technologies. The course format is a mixture of theory and practical.

The course is aimed at designers and programmers building privileged applications, container applications, and sandboxing applications. Systems administrators who are managing such applications are also likely to find the course of benefit.

You can find out more about the course (such as expected background and course pricing) at
http://man7.org/training/sec_isol_apis/
and see a detailed course outline at
http://man7.org/training/sec_isol_apis/sec_isol_apis_course_outline.html

April 26, 2017 07:38 PM

April 23, 2017

Pete Zaitcev: SDSC Petabyte scale Swift cluster

It has been almost two years since the last Swift numbers, but here are a few from San Diego (the whole presentation is available on GitHub):

> 5+ PB data
> 42 servers and 1000+ disks
> 3, 4, 6 TB SAS drives for objects, SATA SSD drives for Account/Container
> 10 GbE network

The 5 PB size is about a quarter of the scale of the largest known Swift cluster. V.impressive. The 100 PB installation that RAX runs consists of 6 federated clusters. Number of objects and request rate are unknown. 1000/42 comes to about 25-30 disks per server, but they mention 45-disk JBODs later, with plans to move to 90-disk JBODs. Nodes of large clusters continue getting fatter.

The cluster is in operation since 2011 (started with the Diablo release). They still use Pound for load-balancing.

April 23, 2017 03:04 AM

April 20, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/19

Audio: http://traffic.libsyn.com/jcm/20170419.mp3

[ Apologies for the delay – I have been a little sick for the past day or so and was out on Monday volunteering at the Boston Marathon, so my evenings have been in scarce supply to get this week’s issue completed ]

In this week’s edition: Linus Torvalds announces Linux 4.11-rc7, a kernel security update bonanza, the end of Kconfig maintenance, automatic NUMA balancing, movable memory, a bug in synchronize_rcu_tasks, and ongoing development. The Linux 4.12 merge window should open before next week.

Linus Torvalds announced Linux 4.11-rc7, noting that “You all know the drill by now. We’re in the late rc phase, and this may be the last rc if nothing surprising happens”. He also pointed out how things had been calm, and then, “as usual Friday happened”, leading to a number of reverts for “things that didn’t work out and aren’t worth trying to fix at this point”. In anticipation of the imminent opening of the 4.12 merge window (period of time during which disruptive changes are allowed) Linux Weekly News posted their usual excellent summary of the 4.11 development cycle. If you want to support quality Linux journalism, you should subscribe to LWN today.

Ted (Theodore) Ts’o posted “[REGRESSION] 4.11-rc: systemd doesn’t see most devices” in which he noted that “[t]here is a frustrating regression in 4.11 that I’ve been trying to track down. The symptoms are that a large number of systemd devices don’t show up.” (which was affecting the encrypted device mapper target backing his filesystem). He had a back and forth with Greg K-H (Kroah Hartman) about it with Greg suggesting Ted watch with udevadm and Ted pointing out that this happens at boot and is hard to trace. Ted’s final comment was interesting: “I’d do more debugging, but there’s a lot of magic these days in the kernel to udev/systemd communications that I’m quite ignorant about. Is this a good place I can learn more about how this all works, other than diving into the udev and systemd sources?”. Indeed. In somewhat interesting timing, Enric Balletbo i Serra later posted a 5 part patch series entitled “dm: boot a mapped device without an initramfs”.

Rafael J. Wysocki posted some late breaking 4.11-rc7 fixes for ACPI, including one patch reverting a “recent ACPICA commit [to the ACPI – Advanced Configuration and Power Interface – Component Architecture, aka the reference code upon which the kernel’s runtime interpreter is based] targeted at catching firmware bugs” that did do so, but also caused “functional problems”.

Announcements

Jiri Slaby announced Linux 3.12.73.

Greg KH (Kroah-Hartman) announced Linux 3.18.49, 3.19.49, 4.4.62, 4.9.23, and 4.10.11. As he noted in his review posting prior to announcing the latest 3.18 kernel, 3.18 was indeed “dead and forgotten and left to rot on the side of the road” but “unfortunately, there’s a few million or so devices out there in the wild that still rely on this kernel”. Important security fixes are included in all of these updates. Greg doesn’t commit to bring 3.18 out of retirement for very long, but he does note that Google is assisting a little for the moment to make sure 3.18 based devices get some updates.

Steven Rostedt announced “Real Time” (preempt-rt) kernels 3.2.88-rt126 (“just an update to the new stable 3.2.88 version”), 3.12.72-rt97, and 4.4.60-rt73. Separately, Paul E. McKenney noted that Hannes Weisbach of TU Dresden has published a master’s thesis on quasi-real-time scheduling:
http://os.inf.tu-dresden.de/papers_ps/weisbach-master.pdf

Rafael J. Wysocki announced a CFP (Call For Papers) targeting the upcoming LPC (Linux Plumbers Conference) Power Management and Energy-Awareness microconference “Call for topics”. Registration for LPC just opened.

Yann E. MORIN posted “MAINTAINERS: relinquish kconfig” in which he apologized for not having enough time to maintain Kconfig with “I’ve been almost entirely absent, which totally sucks, and there is no excuse for my behavior and for not having relinquished this earlier”. With such harsh friends as yourself, who needs enemies? Joking aside, this is sad news, since Kconfig is the core infrastructure used to configure the kernel. It wasn’t long before someone else (Randy Dunlap) posted a patch for the now-unmaintained Kconfig (Randy’s patch implements a sort method for config options).

[as an aside, as usual, I have pinged folks who might be looking for an opportunity to encourage them to consider stepping up to take this on].

Automatic NUMA balancing, movable memory, and more!

Mel Gorman posted “mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa”. Modern Linux kernels include a feature known as automatic NUMA balancing which relies upon marking regions of virtual memory as inaccessible via their page table entries (PTEs) and setting a special prot_numa protection hinting bit. The idea is that a later “NUMA hinting fault” on access to the page will allow the Operating System to determine whether it should migrate the page to another NUMA node. Pages are simply small granular units of system memory that are managed by the kernel in setting up translations from virtual to physical memory. When an access to a virtual address occurs, hardware (or, on some architectures, special software) “walkers” navigate the “page tables” pointed to by a special system register. The walker will traverse various “directories” formed from collections of pages in a hierarchical fashion intended to require less space to store page tables than if entries were required for every possible virtual address in a 32 or 64-bit space.

Contemporary microprocessors also support multiple page (granule) sizes, with a fundamental size (commonly 4K or 64K) being supplemented by the ability for larger pages (aka “hugepages”) to be used for very large regions of contiguous virtual memory at less overhead. Common sizes of huge pages are 2MB, 4MB, 512M, and even multi-GB, with “contiguous hint bits” on some modern architectures allowing for even greater flexibility in the footprint of page table and TLB (Translation Lookaside Buffer) entries by only requiring physical entries for a fraction of a contiguous region. On Intel x86 Architecture, huge pages are implemented using the Page Size Extensions (PSE), which allows for a PMD (Page Middle Directory) to be replaced by an entry that effectively allocates the entire range to a single page entry. When a hardware walker sees this, a single TLB entry can be used for an entire range of a few MB instead of many 4K entries.

A bug known as a “race condition” exist(ed) in the automatic NUMA hinting code, in which change_pmd_range would perform a number of checks without holding a lock to protect against a parallel protection update (which does happen under a lock) that would clear the PMD and fill it with a prot_numa entry. Mel adds a new pmd_none_or_trans_huge_or_clear_bad function that correctly handles this rare corner case sequence, and documents it (in mm/mprotect.c). Michal Hocko responded with “you will probably win the_longer_function_name_contest but I do not have [a] much better suggestion”.

Speaking of Michal Hocko, he posted version 2 of a patch series entitled “mm: make movable onlining suck less” in which he described the current status quo of “Movable onlining” as “a real hack with many downsides”. Linux divides memory into regions describing zones with names like ZONE_NORMAL (for regular system memory) and ZONE_MOVABLE (for memory the contents of which is entirely pages that don’t contain unmovable system data, firmware data, or for other reasons cannot be trivially moved/offlined/etc.).

The existing implementation has a number of constraints around which pages can be onlined. In particular, around the relative placement of the memory being onlined vs the ZONE_NORMAL memory. This, Michal described as “mainly reintroduction of lowmem/highmem issues we used to have on 32b systems – but it is the only way to make the memory hotremove more reliable which is something that people are asking for”. His patch series aims to make “the onlining semantic more usable [especially when driven by udev]…it allows to online memory movable as long as it doesn’t clash with the existing ZONE_NORMAL. That means that ZONE_NORMAL and ZONE_MOVABLE cannot overlap”. He noted that he had discussed this patch series with Jérôme Glisse (author of the HMM – Heterogenous Memory Management – patches) which were to be rebased on top of this patch series. Michal said he would assist with resolving any conflicts.

Igor Mammedov (Red Hat) noted that he had “given [the movable onlining] series some dumb testing” and had found three issues with it, which he described fully. In summary, these were “unable to online memblock as NORMAL adjacent to onlined MOVABLE”, “dimm1 assigned to node 1 on qemu CLI memblock is onlined as movable by default”, and “removable flag flipped to non-removable state”. Michal wasn’t initially able to reproduce the second issue (because he didn’t have ACPI_HOTPLUG_MEMORY enabled in his kernel) but was then able to followup noting that it was similar to another bug he had already fixed. Jérôme subsequently followed up with an updated HMM patchset as well.

Joonsoo Kim (LGE) posted version 7 of a patch series entitled “Introduce ZONE_CMA” in which he reworks the CMA (Contiguous Memory Allocator) used by Linux to manage large regions of physically contiguous memory that must be allocated (for device DMA buffers in cases where scatter-gather DMA or an IOMMU is not available for managed translations). In the existing CMA implementation, physically contiguous pages are reserved at boot time, but they operate much as reserved memory that happens to fall within ZONE_NORMAL (but with a special “migratetype”, MIGRATE_CMA), and will not generally be used by the system for regular memory allocations unless there are no movable freepages available. In other words, only as a last possible resort.

This means that on a system with 1024MB of memory, kswapd “is mostly woke[n] up when roughly 512MB free memory is left”. The new patches instead create a distinct ZONE_CMA which has some special properties intended to address utilization issues with the existing implementation. As he notes, he had a lengthy discussion with Mel Gorman after the LSF/MM 2016 conference last year, in which Mel stated “I’m not going to outright NAK your series but I won’t ACK it either”. A lot of further discussion is anticipated. Michal Hocko might have summarized it best with, “the cover letter didn’t really help me to understand the basic concepts to have a good starting point before diving into the implementation details [to review the patches]”. Joonsoo followed up with an even longer set of answers to Michal.

A bug in synchronize_rcu_tasks()

Paul E. McKenney posted “There is a Tasks RCU stall warning” in which he noted that he and Steven Rostedt were seeing a stall that didn’t report until it had waited 10 minutes (and recommended that Steven try setting the kernel rcupdate.rcu_task_stall_timeout boot parameter). RCU (Read Copy Update) is a clever mechanism used by Linux (under a GPL license from IBM, who own a patent on the underlying technology) to perform lockless updates to certain types of data structure, by tracking versions of the structure and freeing the older version once references to it have reached an RCU quiescent state (defined by each CPU in the system having scheduled synchronize_rcu once).

Steven noted that for the issue under discussion there was a thread that “never goes to sleep, but will call cond_resched() periodically [a function that is intended to possibly call into the scheduler if there is work to be done there]”. On the RT (Real Time, “preempt-rt”) kernel, Steven noted that cond_resched() is a nop and that the code he had been working on should have made a call directly to the schedule() function. This led him to suggest that he had “found a bug in synchronize_rcu_tasks()” in the case that a task frequently calls schedule() but never actually performs a context switch. In that case, per Paul’s subsequent patch, the kernel is patched to specially handle calls to schedule() not due to regular preemption.

Ongoing Development

Anshuman Khandual posted “mm/madvise: Clean up MADV_SOFT_OFFLINE and MADV_HWPOISON” noting that “madvise_memory_failure() was misleading to accommodate handling of both memory_failure() as well as soft_offline_page() functions. Basically it handles memory error injection from user space which can go either way as memory failure or soft offline. Renamed as madvise_inject_error() instead.” The madvise infrastructure allows for coordination between kernel and userspace about how the latter intends to use regions of its virtual memory address space. Using this interface, it is possible for applications to provide hints as to their future usage patterns, relinquish memory that they no longer require, inject errors, and much more. This is particularly useful to KVM virtual machines, which appear as regular processes and can use madvise() to control their “RAM”.

Sricharan R (Codeaurora) posted version 11 of a patch series entitled “IOMMU probe deferral support”, which “calls the dma ops configuration for the devices at a generic place so that it works for all busses”.

Kishon Vijay Abraham sent a pull request to Greg K-H (Kroah Hartman) for Linux 4.12 that included individual patches in addition to the pull itself. This resulted in an interesting side discussion between Kishon and Lee Jones (Linaro) about how this was “a strange practice” Lee hadn’t seen before.

Thomas Garnier (Google) posted version 7 of a patch series entitled “syscalls: Restore address limit after a syscall” which “ensures a syscall does not return to user-mode with a kernel address limit. If that happened, a process can corrupt kernel-mode memory and elevate privileges”. Once again, he cites how this would have preemptively mitigated a Google Project Zero security bug.

Christopher Bostic posted version 6 of a patch series enabling support for the “Flexible Support Interface” (FSI) high fan out bus on IBM POWER systems.

Dan Williams (Intel) posted “x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions” in which he says “Before we rework the “pmem api” to stop abusing __copy_user_nocache() for memcpy_to_pmem() we need to fix cases where we may strand dirty data in the cpu cache.”

Leo Yan (Linaro) posted an RFC (Request For Comments) patch series entitled “coresight: support dump ETB RAM” which enables support for the Embedded Trace Buffer (ETB) on-chip storage of trace data. This is a small buffer (usually 2KB to 8KB) containing profiling data used for postmortem debug.

Thierry Escande posted “Google VPD sysfs driver”, which provides support for “accessing Google Vital Product Data (VPD) through the sysfs”.

Alex(ander) Graf posted version 6 of “kvm: better MWAIT emulation for guests”, which provides new capability information to user space in order for it to inform a KVM guest of the availability of native MWAIT instruction support. MWAIT allows a (guest) CPU to sleep until a monitored memory location is written, so a remote (v)CPU can be woken without an IPI – InterProcessor Interrupt – and the associated vmexit that would then occur to schedule the remote vCPU for execution. The availability of MWAIT is deliberately not provided in the normal CPUID bitmap since “most people will want to benefit from sleeping vCPUs to allow for over commit” (in other words, with MWAIT support one can arrange to keep virtual CPUs runnable for longer, and this might impact the latency of hosting many tenants on the same machine).

David Woodhouse posted version 2 of his patch series entitled “PCI resource mmap cleanup” which “pursues my previous patch set all the way to its logical conclusion”, killing off “the legacy arch-provided pci_mmap_page_range() completely, along with its vile ‘address converted by pci_resource_to_user()’ API and the various bugs and other strange behavior that various architectures had”. He noted that to “accommodate the ARM64 maintainers’ desire *not* to support [the legacy] mmap through /proc/bus/pci I have separated HAVE_PCI_MMAP from the sysfs implementation”. This had previously been called out since older versions of DPDK were looking for the legacy API and failing as a result on newer ARM server platforms.

Darren Hart posted an RFC (Request For Comments) patch series entitled “WMI Enhancements” that seeks to clean up the “parallel efforts involving the Windows Management Instrumentation (WMI) and dependent/related drivers”. He wanted to have a “round of discussion among those of you that have been involved in this space before we decide on a direction”. The proposed direction is to “convert[] wmi into a platform device and a proper bus, providing devices for dependent drivers to bind to, and a mechanism for sibling devices to communicate with each other”. In particular, it includes a capability to expose WMI devices directly to userspace, which resulted in some pushback (from Pali Rohár) and a suggestion that some form of explicit whitelisting of wmi identifiers (GUIDs) should be used instead. Mario Limonciello (Dell) had many useful suggestions.

Wei Wang (Intel) posted version 9 of a patch series entitled “Extend virtio-balloon for fast (de)inflating & fast live migration” in which he “implements two optimizations”. The first “transfer[s] pages in chunks between the guest and host”. The second “transfer[s] the guest unused pages to the host so that they can be skipped in live migration”.

Dmitry Safonov posted “ARM32: Support mremap() for sigpage/vDSO” which allows CRIU (Checkpoint and Restart in Userspace) to complete its process of restoring all application VMA (Virtual Memory Area) mappings on restart by adding the ability to move the vDSO (Virtual Dynamic Shared Object) and sigpage kernel pages (data explicitly mapped into every process by the kernel to accelerate certain operations) into “the same place where they were before C/R”.

Matias Bjørling (Cnex Labs) prepared a git pull request for “LightNVM” targeting Linux 4.12. This is “a new host-side translation layer that implements support for exposing Open-Channel SSDs as block devices”.

Greg Thelen (Google) posted “slab: avoid IPIs when creating kmem caches”. Linux’s SLAB memory allocator (see also the paper on the original Solaris memory allocator) can be used to pre-allocate small caches of objects that can then be efficiently used by various kernel code. When these are allocated, per-cpu array caches are created, and a call is made to kick_all_cpus_sync() which will schedule all processors to run code to ensure that there are no stale references to the old array caches. This global call is performed using an IPI (InterProcessor Interrupt), which is relatively expensive, and is wasted work when a new cache is being created (rather than replacing an old one), since there can be no stale references. In the example given, an unpatched kernel issued 47,741 such IPI broadcasts versus 1,170 in a patched kernel.

April 20, 2017 08:32 AM

April 19, 2017

Kernel Podcast: One Day Delay Due to Boston Marathon

The Podcast is delayed until Wednesday evening this week. Usually, I try to get it out on a Monday night (or at least write it up then and actually post on Tuesday), but when holidays or other events fall on a Monday, I will generally delay the podcast by a day. This week, I was volunteering at the Marathon all of Monday, which means the prep is taking place Tuesday night instead.

April 19, 2017 04:15 AM

April 16, 2017

Paul E. Mc Kenney: Book review: "Fooled by Randomness" and "The Black Swan"

I avoided “The Black Swan” for some years because I was completely unimpressed with the reviews. However, I was sufficiently impressed by a recent Nassim Taleb essay to purchase his “Incerto” series. I have read the first two books so far (“Fooled by Randomness” and “The Black Swan”), and highly recommend both of them.

The key point of these two books is that in real life, extremely low-probability events can have extreme effects, and such events are the black swans of the second book's title. This should be well in the realm of common sense: Things like earthquakes, volcanoes, tidal waves, and asteroid strikes should illustrate this point. A follow-on point is that low-probability events are inherently difficult to predict. This also should be non-controversial: The lower the probability, the less the frequency, and thus the less the experience with that event. And of my four examples, we are getting semi-OK at predicting volcanic eruptions (Mt. St. Helens being perhaps the best example of a predicted eruption), not bad at tidal waves (getting this information to those who need it still being a challenge), and hopeless at earthquakes and asteroid strikes.

Taleb then argues that the increasing winner-takes-all nature of our economy increases the frequency and severity of economic black-swan events, in part by rendering normal-distribution-based statistics impotent. If you doubt this point, feel free to review the economic events of the year 2008. He further argues that this process began with the invention of writing, which allowed one person to have an outsized effect on contemporaries and on history. I grant that modern transportation and communication systems can amplify black-swan events in ways that weren't possible in prehistoric times, but would argue that individual prehistoric people had just as much fun with the black swans of the time, including plague, animal attacks, raids by neighboring tribes, changes in the habits of prey, and so on. Nevertheless, I grant Taleb's point that most prehistoric black swans didn't threaten the human race as a whole, at least with the exception of asteroid strikes.

My favorite quote of the book is “As individuals, we should love free markets because operators in them can be as incompetent as they wish.” My favorite question is implied by his surprise that so few people embrace both sexual and economic freedom. Well, ask a stupid question around me and you are likely to get a stupid answer. Here goes: Contraceptives have not been in widespread use for long enough for human natural selection to have taken much account of their existence. Therefore, one should expect the deep subconscious to assume that sexual freedom will produce lots of babies, and that these babies will need care and feeding. Who will pay for this? The usual answer is “everyone” with consequent restrictions on economic freedom. If you don't like this answer, fine, but please consider that it is worth at least what you are paying for it. ;–)

So what does all of this have to do with parallel programming???

As it turns out, quite a lot.

But first, I will also point out my favorite misconception in the book, which is that NP has all that much to do with incomputability. On the other hand, the real surprise is that the trader-philosopher author would say anything at all about them. Furthermore, Taleb would likely point out that in the real world, the distinction between “infeasible to compute” and “impossible to compute” is a distinction without a difference.

The biggest surprise for me personally from these books is that one of the most feared category of bugs, race conditions, are not black-swan bugs, but are instead white-swan bugs. They are quite random, and very amenable to the Gaussian statistical tools that Taleb so rightly denigrates for black-swan situations. You can even do finite amounts of testing and derive good confidence bounds for the reliability of your software—but only with respect to white-swan bugs such as race conditions. So I once again feel lucky to have the privilege of working primarily on race conditions in concurrent code!

What is a black-swan bug? One class of such bugs caused me considerable pain at Sequent in the 1990s. You see, we didn't have many single-CPU systems, and we not infrequently produced software that worked only on systems with at least two CPUs. Arbitrarily large amounts of testing on multi-CPU systems would fail to spot such bugs. And perhaps you have encountered bugs that happened only at specific times in specific states, or as they are sometimes called, “new moon on Tuesdays” bugs.

Taleb talks about using mathematics from fractals to turn some classes of black-swan events into grey-swan events, and something roughly similar can be done with validation. We have an ever-increasing suite of bugs that people have injected in the past, and we can make some statements about how likely someone is to make that same error again. We can then use this experience to guide our testing efforts, as I try to do with the rcutorture test suite. That said, I expect to continue pursuing additional bug-spotting methods, including formal verification. After all, the fact that race conditions are not black swans does not necessarily make them easy, particularly in cases, such as the Linux kernel, where there are billions of users.

In short, ignore the reviews of “Fooled by Randomness” and “The Black Swan”, including this one, and go read the actual books. If you only have time to read one of them, you should of course pick one at random. ;–)

April 16, 2017 09:13 PM

April 14, 2017

Linux Plumbers Conference: Registration for Linux Plumbers Conference is Now Open

The 2017 Linux Plumbers Conference organizing committee is pleased to announce that the registration for this year’s conference is now open. Information on how to register can be found here. Registration prices and cutoff dates are published in the ATTEND page. A reminder that we are following a quota system to release registration slots. Therefore the early registration rate will remain in effect until early registration closes on June 18 2017, or the quota limit (150) is reached, whichever comes earlier. As usual, contact us if you have questions.

April 14, 2017 12:06 PM

April 13, 2017

Pete Zaitcev: Amazon Snowmobile

I don't know how I missed this, it should've been hyped. But here it is: an Amazon truck trailer, which is basically a giant Snowball, used to overcome the data inertia. Apparently, its capacity is 100 PB (the article is not clear about it, but it mentions that 10 Snowmobiles transfer an EB). The service apparently works one way only: you cannot use a Snowmobile to download your data from Amazon.

P.S. Amazon's official page for Snowmobile confirms the 100 PB capacity.

April 13, 2017 04:59 AM

April 12, 2017

Matthew Garrett: Disabling SSL validation in binary apps

Reverse engineering protocols is a great deal easier when they're not encrypted. Thankfully most apps I've dealt with have been doing something convenient like using AES with a key embedded in the app, but others use remote protocols over HTTPS and that makes things much less straightforward. MITMProxy will solve this, as long as you're able to get the app to trust its certificate, but if there's a built-in pinned certificate that's going to be a pain. So, given an app written in C running on an embedded device, and without an easy way to inject new certificates into that device, what do you do?

First: The app is probably using libcurl, because it's free, works and is under a license that allows you to link it into proprietary apps. This is also bad news, because libcurl defaults to having sensible security settings. In the worst case we've got a statically linked binary with all the symbols stripped out, so we're left with the problem of (a) finding the relevant code and (b) replacing it with modified code. Fortunately, this is much less difficult than you might imagine.

First, let's find where curl sets up its defaults. Curl_init_userdefined() in curl/lib/url.c has the following code:
set->ssl.primary.verifypeer = TRUE;
set->ssl.primary.verifyhost = TRUE;
#ifdef USE_TLS_SRP
set->ssl.authtype = CURL_TLSAUTH_NONE;
#endif
set->ssh_auth_types = CURLSSH_AUTH_DEFAULT; /* defaults to any auth type */
set->general_ssl.sessionid = TRUE; /* session ID caching enabled by default */
set->proxy_ssl = set->ssl;

set->new_file_perms = 0644; /* Default permissions */
set->new_directory_perms = 0755; /* Default permissions */

TRUE is defined as 1, so we want to change the code that currently sets verifypeer and verifyhost to 1 to instead set them to 0. How to find it? Look further down - new_file_perms is set to 0644 and new_directory_perms is set to 0755. The leading 0 indicates octal, so these correspond to decimal 420 and 493. Passing the file to objdump -d (assuming a build of objdump that supports this architecture) will give us a disassembled version of the code, so time to fix our problems with grep:
objdump -d target | grep --after=20 ,420 | grep ,493

This gives us the disassembly of target, searches for any occurrence of ",420" (indicating that 420 is being used as an argument in an instruction), prints the following 20 lines and then searches for a reference to 493. It spits out a single hit:
43e864: 240301ed li v1,493
Which is promising. Looking at the surrounding code gives:
43e820: 24030001 li v1,1
43e824: a0430138 sb v1,312(v0)
43e828: 8fc20018 lw v0,24(s8)
43e82c: 24030001 li v1,1
43e830: a0430139 sb v1,313(v0)
43e834: 8fc20018 lw v0,24(s8)
43e838: ac400170 sw zero,368(v0)
43e83c: 8fc20018 lw v0,24(s8)
43e840: 2403ffff li v1,-1
43e844: ac4301dc sw v1,476(v0)
43e848: 8fc20018 lw v0,24(s8)
43e84c: 24030001 li v1,1
43e850: a0430164 sb v1,356(v0)
43e854: 8fc20018 lw v0,24(s8)
43e858: 240301a4 li v1,420
43e85c: ac4301e4 sw v1,484(v0)
43e860: 8fc20018 lw v0,24(s8)
43e864: 240301ed li v1,493
43e868: ac4301e8 sw v1,488(v0)

Towards the end we can see 493 being loaded into v1, and v1 then being copied into an offset from v0. This looks like a structure member being set to 493, which is what we expected. Above that we see the same thing being done to 420. Further up we have some more stuff being set, including a -1 - that corresponds to CURLSSH_AUTH_DEFAULT, so we seem to be in the right place. There's a zero above that, which corresponds to CURL_TLSAUTH_NONE. That means that the two 1 operations above the -1 are the code we want, and simply changing 43e820 and 43e82c to 24030000 instead of 24030001 means that our targets will be set to 0 (ie, FALSE) rather than 1 (ie, TRUE). Copy the modified binary back to the device, run it and now it happily talks to MITMProxy. Huge success.

(If the app calls Curl_setopt() to reconfigure the state of these values, you'll need to stub those out as well - thankfully, recent versions of curl include a convenient string "CURLOPT_SSL_VERIFYHOST no longer supports 1 as value!" in this function, so if the code in question is using semi-recent curl it's easy to find. Then it's just a matter of looking for the constants that CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER are set to, following the jumps and hacking the code to always set them to 0 regardless of the argument)

comment count unavailable comments

April 12, 2017 06:10 PM

April 11, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/11

Audio: http://traffic.libsyn.com/jcm/20170411.mp3

In this week’s edition: Linus Torvalds announces Linux 4.11-rc6, Intel Memory Bandwidth Allocation (MBA), Coherent Device Memory (CDM), Paravirtualized Remote TLB Flushing, kernel lockdown, the latest on Intel 5-level paging, and other assorted ongoing development activities.

Linus Torvalds announced Linux 4.11-rc6. In his mail, Linus notes that “Things are looking fairly normal [for this point in the development cycle]…The only slightly unusual thing is how the patches are spread out, with almost equal parts of arch updates, drivers, filesystems, networking and “misc”.” He ends “Go and get it”. Thorsten Leemhuis followed up with “Linux 4.11: Reported regressions as of Sunday, 2017-04-09”, his third regression report for 4.11, which “lists 15 regressions I’m currently aware of. 5 regressions mentioned in last week[’]s report got fixed”. Most appear to be driver problems, but there is one relating to audit, and one in inet6_fill_ifaddr that is stalled waiting for “feedback from reporter”.

Stable kernels

Greg K-H (Kroah-Hartman) announced Linux kernels 4.4.60, 4.9.21, and 4.10.9

Ben Hutchings announced Linux 3.2.88 and 3.16.43

Jason A. Donenfeld pointed out that Linux 3.10 “is inexplicably missing crypto_memneq, making all crypto mac [Message Authentication Code] comparisons use non constant-time comparisons. Bad news bears” [presumably a concern due to timing side channel attacks]. Willy followed up noting that he would “check if the 3.12 patches…can be safely backported”.

Memory Bandwidth Allocation (Intel Resource Director Technology, RDT)

Vikas Shivappa (Intel) posted version 4 of a patch series entitled “x86/intel_rdt: Intel Memory bandwidth allocation”, addressing feedback from the previous iteration that he had received from Thomas Gleixner. The MBA (Memory Bandwidth Allocation) technology is described both in the kernel Documentation patch (provided) as well as in various Intel papers and materials available online. Intel provide a construct known as a “Class of Service” (CLOS) on certain contemporary Xeon processors, as part of their CAT (Cache Allocation Technology) feature, which is itself part of a larger family of technologies known as “Intel Resource Director Technology” (RDT). These CLOSes “act as a resource control tag into which a thread/app/VM/container can be grouped”.

It appears that a feature of Intel’s L3 cache (LLC in Intel-speak) in these parts is that they can not only assign specific proportions of the L3 cache slices on the Xeon’s ring interconnect to specific resources (e.g. “tasks” – otherwise known as processes, or applications) but also can control the amount of memory bandwidth granted to these. This is easier than it sounds. From a technical perspective, Intel integrate their memory controller onto their dies, and contemporary memory controllers already perform fine grained scheduling (this is how they bias memory reads for speculative loads of the instruction stream in among the other traffic, as just one simple example). Therefore, exposing memory bandwidth control to the cache slices isn’t all that more complex. But it is cute, and looks great in marketing materials.

Coherent Device Memory (CDM) on top of HMM

Jérôme Glisse posted an RFC [Request for Comments] patch series entitled “Coherent Device Memory (CDM) on top of HMM”. His previous HMM (Heterogeneous Memory Management) patch series, now in version 19, implemented support for (non-coherent) device memory to be mapped into regular process address space, by leveraging the ability for certain contemporary devices to fault on access to untranslated addresses managed in device page tables, thus allowing for a kind of pageable device memory and transparent management of ownership of the memory pages between application processor cores and (e.g.) a GPU or other acceleration device. The latest patch series builds upon HMM to also support coherent device memory (via a new ZONE_DEVICE memory – see also the recent postings from IBM in this area). As Jérôme notes, “Unlike the unaddressable memory type added with HMM patchset, the CDM [Coherent Device Memory] type can be access[ed] by [the] CPU.” He notes that he wanted to kick off this RFC more for the conversation it might provoke.

In his mail, Jérôme says, “My personal belief is that the hierarchy of memory is getting deeper (DDR, HBM stack memory, persistent memory, device memory, …) and it may make sense to try to mirror this complexity within mm concept. Generalizing the NUMA abstraction is probably the best starting point for this. I know there are strong feelings against changing NUMA so i believe now is the time to pick a direction”. He’s right of course. There have been a number of patch series recently also targeting accelerators (such as FPGAs), and more can be anticipated for coherently attached devices in the future. [This author is personally involved in CCIX]

Hyper-V: Paravirtualized Remote TLB Flushing and Hypercall Improvements

Vitaly Kuznetsov (Red Hat) posted “Hyper-V: paravirtualized remote TLB flushing and hypercall improvements”. It turns out that Microsoft’s Hyper-V hypervisor supports hypercalls (calls into the hypervisor from the guest OS) for “doing local and remote TLB [Translation Lookaside Buffer] flushing”. Translation Lookaside Buffers [TLBs] are caches built into microprocessors that store a translation of a CPU virtual address to a “physical” (or, for a virtual machine, an intermediate hypervisor) address. They avoid an otherwise-necessary page table walk (of the hardware- or software-managed structure – depending upon architecture – that “walkers” navigate to perform a translation during a “page fault” or unhandled memory access, such as happens constantly when demand loading/faulting in application code and data, or sharing read-only data provided by shared libraries, etc.). TLBs are generally transparent to the OS, except that they must be explicitly managed under certain conditions – such as when invalidating regions of virtual memory or performing certain context switches (depending upon the provisioning of address and virtual memory space tag IDs in the architecture).

TLB invalidates on local processor cores normally use special CPU instructions, and this is certainly also true under virtualization. But virtual addresses used by a particular process (known as a task within the kernel) might be also used by other cores that have touched the same virtual memory space. And those translations need to be invalidated too. Some architectures include sophisticated hardware broadcast invalidation of TLBs, but some other legacy architectures don’t provide these kinds of capabilities. On those architectures that don’t provide for a hardware broadcast, it is typically necessary to use a construct known as an IPI (Inter Processor Interrupt) to cause an IRQ (interrupt message) to be delivered to the remote interrupt controller CPU interface (e.g. LAPIC on Intel x86 architecture) of the destination core, which will run an IPI handler in response that does the TLB teardown.

As Vitaly notes, nobody is recommending doing local TLB flushes using a hypercall, but there can be significant performance improvement in using a hypercall for the remote invalidates. In the example cited, which uses “a special ‘TLB trasher’”, he demonstrates how a 16 vCPU guest experienced a greater than 25% performance improvement using the hypercall approach.

Ongoing Development

David Howells posted a magnum opus entitled “Kernel lockdown”, which aims to “provide a facility by which a variety of avenues by which userspace can feasibly modify the running kernel image can be locked down”. As he says, “The lock-down can be configured to be triggered by the EFI secure boot status, provided the shim isn’t insecure. The lock-down can be lifted by typing SysRq+x on a keyboard attached to the system” [physical presence]. Among many other things, these patches (versions of which have been in distribution kernels for a while) change kernel behavior to include “No unsigned modules and no modules for which [we] can’t validate the signature”, disable many hardware access functions, turn off hibernation, prevent kexec_load(), and limit some debugging features. Justin Forbes of the Fedora Project noted that he had (obviously) tested these. One of the many interesting sets of patches included a feature to “Annotate hardware config module parameters”, which allows modules to mark unsafe options. Following some pushback, David also followed up with a rationale for doing kernel lockdown, entitled “Why kernel lockdown?”. Worth reading.

Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he (bravely) took Ingo’s request to “rewrite assembly parts of boot process into C before bringing 5-level paging support”. He says, “The only part where I succeed is startup_64 in arch/x86/kernel/head_64.S. Most of the logic is now in C.” He also renames the level 4 page tables “init_level4_pgt” and “early_level4_pgt” to “init_top_pgt” and “early_top_pgt”. There was another lengthy discussion around his “Allow to have userspace mappings above 47-bits”, a patch which tells the kernel to prefer to do memory allocations below 47-bits (the previous “Canonical Addressing” limit of Intel x86 processors, which some JITs and other code exploit by abusing the top bits of the address space in pointers for illegal tags, breaking compatibility with an extended virtual address space). The patch allows mmap calls with MAP_FIXED hints to cause larger allocations. There was some concern that larger VM space is ABI and must be handled with care. A footnote here is that (apparently, from the patch) Intel MPX (Memory Protection Extension) doesn’t yet work with LA57 (the larger address space feature) and so Kirill avoids both in the same process.

Christopher Bostic posted version 5 of a patch series entitled “FSI driver implementation”. This is support for the POWER’s [Performance Optimization With Enhanced RISC, for those who ever wondered – this author used to have a lot of interest in PowerPC back in the day] “Flexible Support Interface” (FSI), a “high fan out serial bus” whose specification seems to have appeared on the OpenPower Foundation website recently also.

Kishon Vijay Abraham posted “PCI: Support for configurable PCI endpoint”, which Bjorn finally pulled into his tree in anticipation of the upcoming 4.12 merge cycle. For those who haven’t seen Kishon’s awesome presentation “Overview of PCI(e) Subsystem” for Embedded Linux Conference Europe, you are encouraged to watch it at least several times. He really knows his stuff, and has done an excellent job producing a high quality generic PCIe endpoint driver for Linux: https://www.youtube.com/watch?v=uccPR6X8vy8

Ard Biesheuvel posted “EFI fixes for v4.11”, which among other goodies includes a fix for EFI GOP (Graphics Output Protocol) support on systems built using the 64-bit ARM Architecture, which uses firmware assignment of PCIe BAR resources. Ard and Alex Graf have done some really fun work with graphics cards on 64-bit ARM lately – including emulating x86 option ROMs. Ard also had some fixes prepared for v4.12 that he announced, including a bunch of cleanup to the handling of FDT (Flattened Device Tree) memory allocation. Finally, he added support for the kernel’s “quiet” command line option, to remove extraneous output from the EFI stub on boot.

Srikar Dronamraju and Michal Hocko had a back and forth on the former’s “sched: Fix numabalancing to work with isolated cpus” patch, which does what it says on the tin. Michal was a little concerned that NUMA balancing wasn’t automatically applied even to isolated CPUs, but others (including Peter Zijlstra) noted that this absolutely is the intended behavior.

Ying Huang (Intel) posted version 8 of his “THP swap: Delay splitting THP during swapping out”, which essentially allows paging of (certain) huge pages. He also posted version 2 of “mm, swap: Sort swap entries before free”, which sorts consecutive swap entries in a per-CPU buffer according to their backing swap device before freeing those entries. This reduces needless acquiring/releasing of locks and improves performance.

Will Deacon posted version 2 of a patch series entitled “drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension”. The “SPE” (Statistical Profiling Extension) “can be used to profile a population of operations in the CPU pipeline after instruction decode. These are either architected instructions (i.e. a dynamic instruction trace) or CPU-specific uops and the choice is fixed statically in the hardware and advertised to userspace via caps. Sampling is controlled using a sampling interval, similar to a regular PMU counter, but also with an optional random perturbation”. He notes that the “in-memory buffer is linear and virtually addressed, raising an interrupt when it fills up” [which makes using it nice for software folks].

Binoy Jayan posted “IV [Initial Vector] Generation algorithms for dm-crypt”, the goal of which “is to move these algorithms from the dm layer to the kernel crypto layer by implementing them as template ciphers”.

Joerg Roedel posted “PCI: Add ATS-disable quirk for AMD Stoney GPUs”, then posted a followup with a minor fix based upon feedback. This should resolve bug reports from those using an IOMMU on a Stoney platform and seeing lockups under heavy TLB invalidation.

Bjorn Helgaas posted “PCI fixes for v4.11”, which includes “fix ThunderX legacy firmware resources”, a PCI quirk for certain ARM server platforms.

Paul Menzel reported “`pci_apply_final_quirks()` taking half a second”, which David Woodhouse (who wrote the code to match PCIe devices against the quirk list “back in the mists of time”) posited was perhaps down to “spending a fair amount of time just attempting to match each device against the list”. He wondered “if it’s worth sorting the list by vendor ID or something, at least for the common case of the quirks which match on vendor/device”. There was a general consensus that cleanup would be nice, if only someone had the time and the inclination to take a poke at it.

Seth Forshee (Canonical) posted “audit regressions in 4.11”, in which he noted that ever since the merging of “audit: fix auditd/kernel connection state tracking”, the kernel will queue up audit messages indefinitely for delivery to the (userspace) audit daemon if it is not running – ultimately crashing the machine. Paul Moore thanked him for the report and there was a back and forth on the best way to handle the case of no audit daemon running.

Neil Brown posted a patch entitled “NFS: fix usage of mempools”. As he notes in his patch, “When passed GFP [Get Free Page] flags that allow sleeping (such as GFP_NOIO), mempool_alloc() will never return NULL, it will wait until memory is available…This means that we don’t need to handle failure, but that we do need to ensure one thread doesn’t call mempool_alloc twice on the one pool without queuing or freeing the first allocation”. He then cites “pnfs_generic_alloc_ds_commits” as an unsafe function and provides a fix.

Finally, Kees Cook followed up (as he had promised) on a discussion from last week with an RFC (Request for Comments) patch series entitled “mm: Tighten x86 /dev/mem with zeroing”, incorporating the suggestion from Linus that disallowed reads from /dev/mem simply return zeroed data. This was just one of many security discussions he was involved in (as usual). Another involved a patch he had suggested, posted by Eddie Kovsky and entitled “module: verify address is read-only”, which modifies kernel functions that take module addresses to verify that those addresses fall within the correct kernel ro_after_init memory area and “reject structures not marked ro_after_init”.

April 11, 2017 03:15 PM

April 10, 2017

Daniel Vetter: Review, not Rocket Science

About a week ago there were two articles on LWN, the first covering memory management patch review and the second covering the trouble with making review happen. The takeaway from these two articles seems to be that review is hard, there’s a constant lack of capable and willing reviewers, and this has been the state of review since forever. I’d like to contrast this with our experience in the graphics subsystem, where we’ve rolled out a well-working review process for the Intel driver, the core subsystem, and now the co-maintained small-driver efforts, with success and not all that much pain.

tl;dr: require review, no exceptions, but document your expectations

Aside: This is written with a kernel focus, from the point of view of a maintainer or group of maintainers trying to establish review within their subsystem. But the principles really work anywhere.

Require Review

When review doesn’t happen, that generally means no one regards it as important enough. You can try to improve the situation by highlighting review work more, and giving less focus to top committer stats. But that only goes so far; in the end, when you want to make review happen, the one way to get there is to require it.

Of course if that then results in massive screaming, then maybe you need to start with improving the recognition of review and valuing it more. Trying to put a new process into place over persistent resistance is not going to work. But as long as there’s general agreement that review is good, this is the easy part.

No Exceptions

The trouble is that there are a few really easy ways to torpedo review before you even get started, and they’re all around special privileges and exceptions. From one of the LWN articles:

… requiring reviews might be fair, but there should be one exception: when developers modify their own code.

Another similar exception is often demanded by maintainers for applying their own patches to code they maintain - in the Linux kernel only about 25% of all maintainer patches have any kind of review tag attached when they land. This is in contrast to other contributors, who always have to get past at least their direct maintainer to get a patch applied.

There’s a few reasons why having exceptions for the original developer of some code, or a maintainer of a subsystem, is a really bad idea:

On the flip side, requiring review from all your main contributors is a really easy way to kickstart a working review economy: Instantly you both have a big demand for review. And capable reviewers who are very much willing to trade a bit of review for getting reviews on their own patches.

Another easy pitfall is maintainers who demand unconditional NAck rights for the code they maintain, sometimes spiced up by claiming they don’t even need to provide reasons for the rejection. Of course more experienced people know more about the pitfalls of a code base, and hence are more likely to find serious defects in a change. But most often these rejections aren’t about clear bugs, but about design dogmas established long ago (and perhaps no longer valid), or just plain personal style preferences. Again, this is a great way to prevent review from happening:

And again, I haven’t yet seen unicorns who write perfect code, nor anyone whose review feedback was consistently impeccable.

But Document your Expectations

Training reviews through direct mentoring is great, but it doesn’t scale. Document what you expect from a review as much as possible. This includes everything from coding style, to how much and in which detail code correctness should be checked. But also related things like documentation, test-cases, and process details on how exactly, when and where review is happening.

And like always, executable documentation is much better, hence try to script as much as possible. That’s where build-bots, CI bots, coding style bots, and all these things come in - they free the carbon-based reviewers from wasting time on the easy things and let them concentrate on the harder parts of review, like code design and overall architecture, and how to best get there from the current code base. But please make sure your scripting and automated testing is of high quality, because if the results need interpretation by someone experienced you haven’t gained anything. The kernel’s checkpatch.pl coding style checker is a pretty bad example here, since it’s widely accepted that it’s too opinionated and its suggestions can’t be blindly followed.

As some examples we have the dim inglorious maintainer scripts and some fairly extensive documentation on what is expected from reviewers for Intel graphics driver patches. Contrast that with the comparatively lax review guidelines for small drivers in drm-misc. At Intel we’ve also done internal trainings on review best practices and guidelines. Another big thing we’re working on, on the automation front, is CI crunching through patchwork series to properly regression test new patches before they land.

April 10, 2017 12:00 AM

April 09, 2017

Matthew Garrett: A quick look at the Ikea Trådfri lighting platform

Ikea recently launched their Trådfri smart lighting platform in the US. The idea of Ikea plus internet security together at last seems like a pretty terrible one, but having taken a look it's surprisingly competent. Hardware-wise, the device is pretty minimal - it seems to be based on the Cypress[1] WICED IoT platform, with 100MBit ethernet and a Silicon Labs Zigbee chipset. It's running the Express Logic ThreadX RTOS, has no running services on any TCP ports and appears to listen on a single UDP port. As IoT devices go, it's pleasingly minimal.

That single port seems to be a COAP server running with DTLS and a pre-shared key that's printed on the bottom of the device. When you start the app for the first time it prompts you to scan a QR code that's just a machine-readable version of that key. The Android app has code for using the insecure COAP port rather than the encrypted one, but the device doesn't respond to queries there so it's presumably disabled in release builds. It's also local only, with no cloud support. You can program timers, but they run on the device. The only other service it seems to run is an mdns responder, which responds to the _coap._udp.local query to allow for discovery.

From a security perspective, this is pretty close to ideal. Having no remote APIs means that security is limited to what's exposed locally. The local traffic is all encrypted. You can only authenticate with the device if you have physical access to read the (decently long) key off the bottom. I haven't checked whether the DTLS server is actually well-implemented, but it doesn't seem to respond unless you authenticate first which probably covers off a lot of potential risks. The SoC has wireless support, but it seems to be disabled - there's no antenna on board and no mechanism for configuring it.

However, there's one minor issue. On boot the device grabs the current time from pool.ntp.org (fine) but also hits http://fw.ota.homesmart.ikea.net/feed/version_info.json . That file contains a bunch of links to firmware updates, all of which are also downloaded over http (and not https). The firmware images themselves appear to be signed, but downloading untrusted objects and then parsing them isn't ideal. Realistically, this is only a problem if someone already has enough control over your network to mess with your DNS, and being wired-only makes this pretty unlikely. I'd be surprised if it's ever used as a real avenue of attack.

Overall: as far as design goes, this is one of the most secure IoT-style devices I've looked at. I haven't examined the COAP stack in detail to figure out whether it has any exploitable bugs, but the attack surface is pretty much as minimal as it could be while still retaining any functionality at all. I'm impressed.

[1] Formerly Broadcom

comment count unavailable comments

April 09, 2017 12:16 AM

April 07, 2017

Andi Kleen: Cheat sheet for Intel Processor Trace with Linux perf and gdb

What is Processor Trace

Intel Processor Trace (PT) traces program execution (every branch) with low overhead.

This is a cheat sheet of how to use PT with perf for common tasks

It is not a full introduction to PT. Please read Adding PT to Linux perf or the links from the general PT reference page.

PT support in hardware

CPU Support
Broadwell (5th generation Core, Xeon v4) More overhead. No fine grained timing.
Skylake (6th generation Core, Xeon v5) Fine grained timing. Address filtering.
Goldmont (Apollo Lake, Denverton) Fine grained timing. Address filtering.

PT support in Linux

PT is supported in Linux perf, which is integrated in the Linux kernel.
It can be used through the “perf” command or through gdb.

There are also other tools that support PT: VTune, simple-pt, gdb, JTAG debuggers.

In general it is best to use the newest kernel and the newest Linux perf tools. If that is not possible, older tools and kernels can be used. Newer tools can be used on an older kernel, but may not support all features.

Linux version Support
Linux 4.1 Initial PT driver
Linux 4.2 Support for Skylake and Goldmont
Linux 4.3 Initial user tools support in Linux perf
Linux 4.5 Support for JIT decoding using agent
Linux 4.6 Bug fixes. Support address filtering.
Linux 4.8 Bug fixes.
Linux 4.10 Bug fixes. Support for PTWRITE and power tracing

Many commands require recent perf tools; you may need to update them from a recent kernel tree.

This article covers mainly Linux perf and briefly gdb.

Preparations

Only needed once.

Allow seeing kernel symbols (as root)


echo 'kernel.kptr_restrict=0' >> /etc/sysctl.conf
sysctl -p

Basic perf command lines for recording PT

ls /sys/devices/intel_pt/format

Check whether PT is supported and which capabilities are available.

perf record -e intel_pt// program

Trace program

perf record -e intel_pt// -a sleep 1

Trace whole system for 1 second

perf record -C 0 -e intel_pt// -a sleep 1

Trace CPU 0 for 1 second

perf record --pid $(pidof program) -e intel_pt//

Trace already running program.

perf has to save the data to disk. The CPU can execute branches much faster than the disk can keep up, so there will be some data loss for code that executes
many instructions. perf has no way to slow down the CPU, so when trace bandwidth > disk bandwidth there will be gaps in the trace. Because of this it is usually not a good idea
to try to save a long trace; work with shorter traces instead. Long traces also take a lot of time to decode.

When decoding kernel data the decoder usually has to run as root.
An alternative is to use the perf-with-kcore.sh script included with perf.

perf script --ns --itrace=cr

Record program execution and display function call graph.

perf script by default “samples” the data (it only dumps a sample every 100us).
This can be configured using the --itrace option (see reference below).

Install xed first.

perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64

Show every assembly instruction executed with disassembler.

For this it is also useful to get more accurate time stamps (see below)

perf script --itrace=i0ns --ns -F time,sym,srcline,ip

Show source lines executed (requires debug information)

perf script --itrace=s1Mi0ns ....

Skip the initial 1M instructions while decoding; often initialization code is not interesting.

perf script --time 1.000,2.000 ...

Slice the trace into different time regions. Generally the time stamps need to be looked up in the trace first, as they are absolute.

perf report --itrace=g32l64i100us --branch-history

Print hot paths every 100us as call graph histograms

Install Flame graph tools first.


perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workload.folded > workload.svg
google-chrome workload.svg

Generate flame graph from execution, sampled every 100us

Other ways to record data

perf record -a -e intel_pt// sleep 1

Capture whole system for 1 second

Use snapshot mode

This collects data, but does not continuously save it all to disk. When an event of interest happens a data dump of the current buffer can be triggered by sending a SIGUSR2 signal to the perf process.


perf record -a --snapshot -e intel_pt// &
PERF_PID=$!
*execute workload*

*event happens*
kill -USR2 $PERF_PID

*end of recording*
kill $PERF_PID

Record kernel only, complete system

perf record -a -e intel_pt//k sleep 1

Record user space only, complete system

perf record -a -e intel_pt//u

Enable fine grained timing (needs Skylake/Goldmont, adds more overhead)

perf record -a -e intel_pt/cyc=1,cyc_thresh=2/ ...


echo $[100*1024*1024] > /proc/sys/kernel/perf_event_mlock_kb
perf record -m 512,100000 -e intel_pt// ...

Increase perf buffer to limit data loss


perf record -e intel_pt// --filter 'filter main @ /path/to/program' ...

Only record main function in program


perf record -e intel_pt// -a --filter 'filter sys_write' program

Filter kernel code (needs 4.11+ kernel)


perf record -e intel_pt// -a --filter 'start func1 @ program' --filter 'stop func2 @ program' program

Start tracing in program at func1 and stop tracing at func2.


perf archive
rsync -r ~/.debug perf.data other-system:

Transfer data to decode a trace on another system. Decoding kernel data may also
require using perf-with-kcore.sh.

Using gdb

Requires a new enough gdb built with libipt. For user space only.


gdb program
start
record btrace pt
cont

record instruction-history /m # show instructions
record function-call-history # show functions executed
reverse-next # step backwards in time

For more information on gdb pt see the gdb documentation

References

The perf PT documentation

Reference for the --itrace option (from the perf documentation)


i synthesize "instructions" events
b synthesize "branches" events
x synthesize "transactions" events
c synthesize branches events (calls only)
r synthesize branches events (returns only)
e synthesize tracing error events
d create a debug log
g synthesize a call chain (use with i or x)
l synthesize last branch entries (use with i or x)
s skip initial number of events

Reference for the --filter option (from the perf documentation)

A hardware trace PMU advertises its ability to accept a number of
address filters by specifying a non-zero value in
/sys/bus/event_source/devices/<pmu>/nr_addr_filters.

Address filters have the format:

filter|start|stop|tracestop <start> [/ <size>] [@<file name>]

Where:
- 'filter': defines a region that will be traced.
- 'start': defines an address at which tracing will begin.
- 'stop': defines an address at which tracing will stop.
- 'tracestop': defines a region in which tracing will stop.

<file name> is the name of the object file, <start> is the offset to the
code to trace in that file, and <size> is the size of the region to
trace. 'start' and 'stop' filters need not specify a <size>.

If no object file is specified then the kernel is assumed, in which case
the start address must be a current kernel memory address.

<start> can also be specified by providing the name of a symbol. If the
symbol name is not unique, it can be disambiguated by inserting #n where
'n' selects the n'th symbol in address order. Alternately #0, #g or #G
select only a global symbol. <size> can also be specified by providing
the name of a symbol, in which case the size is calculated to the end
of that symbol. For 'filter' and 'tracestop' filters, if <size> is
omitted and <start> is a symbol, then the size is calculated to the end
of that symbol.

If <file name> is omitted and <start> is '*', then the start and size will
be calculated from the first and last symbols, i.e. to trace the whole
file.
If symbol names (or '*') are provided, they must be surrounded by white
space.

The filter passed to the kernel is not necessarily the same as entered.
To see the filter that is passed, use the -v option.

The kernel may not be able to configure a trace region if it is not
within a single mapping. MMAP events (or /proc/<pid>/maps) can be
examined to determine if that is a possibility.

Multiple filters can be separated with space or comma.

v2: Fix some typos/broken links

April 07, 2017 08:55 PM

April 05, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/04

Audio: http://traffic.libsyn.com/jcm/20170404v2.mp3

Linus Torvalds announces Linux 4.11-rc5, Donald Drumpf drains the maintainer swamp in April, Intel FPGA Device Drivers, FPU state caching, /dev/mem access crashing machines, and assorted ongoing development.

Linus Torvalds announced Linux 4.11-rc5. In his announcement mail, Linus notes that “things have definitely started to calm down, let’s hope it stays this way and it wasn’t just a fluke this week”. He calls out the oddity that “half the arch updates are to parisc” due to parisc user copy fixes.

It’s worth noting that rc5 includes a fix for virtio_pci which removes an “out of bounds access for msix_names” (the “name strings for interrupts” provided in the virtio_pci_device structure). According to Jason Wang (Red Hat), “Fedora has received multiple reports of crashes when running 4.11 as a guest” (in fact, your author has seen this one too). Quoting Jason, “The crashes are not always consistent but they are generally some flavor of oops or GPF [General Protection Fault – Intel x86 term referring to the general case of an access violation into memory by an offending instruction in various different ISAs – Instruction Set Architectures] in virtio related code. Multiple people have done bisections (Thank you Thorsten Leemhuis and Richard W.M. Jones)”. An example rediscovery of this issue came from a Mellanox engineer who reported that their test and regression VMs were crashing occasionally with 4.11 kernels.

Announcements

Sebastian Andrzej Siewior announced preempt-rt Linux version 4.9.20-rt16. This includes a “Re-write of the R/W semaphores code. In RT we did not allow multiple readers because a writer blocking on the semaphore would have [to] deal with all the readers in terms of priority or budget inheritance [by which he is referring to the Priority Inheritance or “PI” feature common to “real time” kernels]. It’s obvious that the single reader restriction has severe performance problems for situations with heavy reader contention.” He notes that CPU hotplug got “better but can deadlock”.

Greg Kroah-Hartman posted Linux stable kernels 4.4.59, 4.9.20, and 4.10.8.

Draining the Swamp (in April)

Donald Drumpf (trump.kremlin.gov@gmail.com) posted “MAINTAINERS: Drain the swamp”, an inspired patch aiming to finally address the problem of having “a small group of elites listed in the corrupt MAINTAINERS file” who, “For too long” have “reaped the rewards of maintainership”. He notes that over the past year the world has seen a great Linux Exit (“Lexit”) movement in which “People all over the Internet have come together and demanded that power be restored to the developers”, creating “a historic fork based on Linux 2.4, back to a better time, before Linux was controlled by corporate interests”. He notes that the “FAKE NEWS site LWN.net said it wouldn’t happen, but we knew better”.

Donald says that all of the groundwork laid over the past year was just an “important first step”. And that “now, we are taking back what’s rightfully ours. We are transferring power from “Lyin’ Linus” and giving it back to you, the people. With the below patch, the job-killing MAINTAINERS file is finally being ROLLED BACK.” He also notes his intention to return “LAW and ORDER” to the Linux kernel repository by building a wall around kernel.org and “THE LINUX FOUNDATION IS GOING TO PAY FOR IT”. Additional changes will include the repeal and replacement of the “bloated merge window”, the introduction of a distribution import tax, and other key innovations that will serve to improve the world and to MAKE LINUX GREAT AGAIN!

Everyone around the world immediately and enthusiastically leaped upon this inspired and life altering patch, which was of course perfect from the moment of its inception. It was then immediately merged without so much as a dissenting voice (or any review). The private email servers used to host Linus’s deleted patch emails were investigated and a special administrator appointed to investigate the investigators.

Intel FPGA Device Drivers

Wu Hao (Intel) posted a sixteen-part patch series entitled “Intel FPGA Drivers”, which “provides interfaces for userspace applications to configure, enumerate, open, and access FPGA [Field Programmable Gate Arrays, flexible logic fabrics containing millions of gates that can be connected programmatically by bitstreams describing the intended configuration] accelerators on platforms equipped with Intel(R) FPGA solutions and enables system level management functions such as FPGA partial reconfiguration [the dynamic updating of partial regions of the FPGA fabric with new logic], power management, and virtualization”. This support differs from the existing in-kernel fpga-mgr from Alan Tull in that it seems to relate to the so-called Xeon-FPGA hybrid designs that Intel have presented on in various forums.

The first patch (01/16) provides a lengthy summary of the proposed design in the form of documentation added to the kernel’s Documentation directory, specifically in the file Documentation/fpga/intel-fpga.txt. It notes that “From the OS’s point of view, the FPGA hardware appears as a regular PCIe device. The FPGA device memory is organized using a predefined structure (Device Feature List). Features supported by the particular FPGA device are exposed through these data structures”. An FME (FPGA Management Engine) is provided which “performs power and thermal management, error reporting, reconfiguration, performance reporting, and other infrastructure functions. Each FPGA has one FME, which is always accessed through the physical function (PF)”. The FPGA also provides a series of Virtual Functions that can be individually mapped into virtual machines using SR-IOV.

This design allows a CPU attached using PCIe to communicate with various Accelerated Function Units (AFUs) contained within the FPGA, and which are individually assignable into VMs or used in aggregate by the host CPU. One presumes that a series of userspace management utilities will follow this posting. It’s actually quite nice to see how they implemented the discovery of individual AFU features, since this is very close to something a certain author has proposed for use elsewhere for similar purposes. It’s always nicely validating to see different groups having similar thoughts.

Copy Offload with Peer-to-Peer PCI Memory

Logan Gunthorpe posted an RFC (Request for Comments) patch series entitled “Copy Offload with Peer-to-Peer PCI Memory” which relates to work discussed at the recent LSF/MM (Linux Storage Filesystem and Memory Management) conference, in Cambridge MA (side note: I did find some of you haha!). To quote Logan, “The concept here is to use memory that’s exposed on a PCI BAR [Base Address Register – a configuration register that tells the device where in the physical memory map of a system to place memory owned by the device, under the control of the Operating System or the platform firmware, or both] as data buffers in the NVMe target code such that data can be transferred from an RDMA NIC to the special memory and then directly to an NVMe device avoiding system memory entirely”. He notes a number of positives from this, including better QoS (Quality of Service), and a need for fewer (relatively still quite precious even in 2017) PCIe lanes from the CPU into a PCIe switch placed downstream of its Root Complex on which peer-to-peer PCIe devices talk to one another without the intervening step of hopping through the Root Complex and into the system memory via the CPU. As a consequence, Logan has focused his work on “cases where the NIC, NVMe devices and memory are all behind the same PCI switch”.

To facilitate this new feature, Logan has a second patch in the series, entitled “Introduce Peer-to-Peer memory (p2mem) device”, which supports partitioning and management of memory used in direct peer-to-peer transfers between two PCIe devices (endpoints, or “cards”) with a BAR that “points to regular memory”. As Logan notes, “Depending on hardware, this may reduce the bandwidth of the transfer but could significantly reduce pressure on system memory” (again by not hopping up through the PCIe topology). In his patch, Logan had also noted that “older PCI root complexes” might have problems with peer-to-peer memory operations, so he had decided to limit the feature to be available only for devices behind the same PCIe switch. This led to a back and forth with Sinan Kaya, who asked (rhetorically) “What is so special about being connected to the same switch?”. Sinan noted that there are plenty of ways in Linux to handle blacklisting known older bad hardware and platforms, such as requiring that the DMI/SMBIOS-provided BIOS date of manufacture of the system be greater than a certain date, in combination with all devices exposing the p2p capability and a fallback blacklist. Ultimately, however, it was discovered that the peer-to-peer feature isn’t enabled by default, leading Sinan to suggest “Push the decision all the way to the user. Let them decide whether they want this feature to work on a root port connected port or under the switch”.

FPU state caching

Kees Cook (Google) posted a patch entitled “x86/fpu: move FPU state into separate cache”, which aims to remove the dependency within the Intel x86 Architecture port upon an internal kernel config setting known as ARCH_WANTS_DYNAMIC_TASK_STRUCT. This configuration setting (set by each architecture’s code automatically, not by the person building the kernel in the configuration file) says that the true size of the task_struct cannot be known in advance on Intel x86 Architecture because it contains a variable sized array (VSA) within the thread_struct that is at the end of the task_struct to support context save/restore of the CPU’s FPU (Floating Point Unit) co-processor. Indeed, the kernel definition of task_struct (see include/linux/sched.h) includes a scary and ominous warning: “on x86, ‘thread_struct’ contains a variable-sized structure. It *MUST* be at the end of ‘task_struct'”. Which is fairly explicit.

The reason to remove the dependency upon dynamic task_struct sizing is because this “support[s] future structure layout randomization of the task_struct”, which requires that “none of the structure fields are allowed to have a specific position or a dynamic size”. The idea is to leverage a GCC (GNU Compiler Collection) plugin that will change the ordering of C structure members (such as task_struct) randomly at compile time, in order to reduce the ability for an attacker to guess the layout of the structure (highly useful in various exploits). In the case of distribution kernels of course, an attacker has access to the same kernel binaries that may be running on a system, and could use those to calculate likely structure layout for use in a compromise. But the same is not true of the big hyperscale service providers like Google and Facebook. They don’t have to publish the binaries for their own internal kernels running on their public infrastructure servers.

This patch led to a back and forth with Linus, who was concerned about why the task_struct would need changing in order to prevent the GCC struct layout randomization plugin from blowing up. In particular, he was worried that it sounded like the plugin was moving variable sized arrays from the last member of structures (not legally permitted). Kees, Linus, and Andy Lutomirski went through the fact that, yes, the plugin can handle trailing VSAs and so forth. In the end, it was suggested that Kees look at making task_struct “be something that contains a fixed beginning and end, and just have an unnamed randomized part in the middle”. Kees said “That could work. I’ll play around with it”.

/dev/mem access crashing machines

Dave Jones (x86info maintainer) had a back and forth with Kees Cook, Linus, and Tommi Rantala about the latter’s discovery that running Dave’s “x86info” tool crashed his machine with an illegal memory access. It turns out that x86info reads /dev/mem (a requirement to get the data it needs), which is a special file representing the contents of physical memory. Normally, when access is granted to this file, it is restricted to the root user, and then only to certain parts of memory as determined by STRICT_DEVMEM. The latter is intended only to allow reads of “reserved RAM” (normal system memory reserved for specific device purposes, not that allocated for use by programs). But in Tommi’s case, he was running a kernel that didn’t have STRICT_DEVMEM set on a system booting with EFI for which the legacy “EBDA” (Extended BIOS Data Area) that normally lives at a fixed location in the sub-1MB memory window on x86 was not provided by the platform. This meant that the x86info tool was trying to read memory at a legal address which wasn’t reserved in the EFI System Table (memory map), and was mapped for use elsewhere.

All of this led Linus to point out that simply doing a “dd” read on the first MB of memory on the offending system would be enough to crash it. He noted that (on x86 systems) the kernel allows access to the sub-1MB region of physical memory unconditionally (regardless of the setting of the kernel STRICT_DEVMEM option) because of the wealth of platform data that lives there and which is expected to be read by various tools. He proposed effectively changing the logic for this region such that memory not explicitly marked as reserved would simply “just read zero” rather than returning random kernel data in the case that the memory is used for other purposes.

This author certainly welcomes a day when /dev/mem dies a death. We’ve gone to great lengths on 64-bit ARM systems to kill it, in part because it is so legacy, but in another part because there are two possible ways we might trap a bad access – one as in this case (synchronous exception) but another in which the access might manifest as a System Error due to hitting in the memory controller or other SoC logic later as an errant access.

Ongoing Development

Steve Longerbeam posted version 6 of a patch series entitled “i.MX Media Driver”, which implements a V4L2 (Video for Linux 2) driver for i.MX6.

David Gstir (on behalf of Daniel Walter) posted “fscrypt: Add support for AES-128-CBC” which “adds support for using AES-128-CBC for file contents and AES-128-CBC-CTS for file name encryption. To mitigate watermarking attacks, IVs [Initialization Vectors] are generated using the ESSIV algorithm”.

Djalal Harouni posted an RFC (Request for Comments) patch entitled “proc: support multiple separate proc instances per pidnamespace”. In his patch, Djalal notes that “Historically procfs was tied to pid namespaces, and mount options were propagated to all other procfs instances in the same pid namespace. This solved several use cases at that time. However today we face new problems: there are multiple container implementations out there, some of them want to hide pid entries, others want to hide non-pid entries, others want to have sysctlfs, others want to share pid namespace with private procfs mounts. All these with the current implementation won’t work since all options will be propagated to all procfs mounts. This series allows new instances of procfs per pid namespace where each instance can have its own mount options”.

Zhou Chengming (Huawei) posted “reduce the time of finding symbols for module” which aims to reduce the time taken for a Kernel Live Patch (klp) module to be loaded on a system when the module uses many static local variables. The patch replaces the use of kallsyms_on_each_symbol with a variant that limits the search to the symbols needed for the module (rather than every symbol in the kernel). As Jessica Yu notes, “it means that you have a lot of relocation records which reference your out-of-tree module. Then for each such entry klp_resolve_symbol() is called and then klp_find_object_symbol() to actually resolve it. So if you have 20k entries, you walk through vmlinux kallsyms table 20k times…But if there were 20k modules loaded, the problem would still be there”. She would like to see a more generic fix, but was also interested to see that the Huawei report referenced live patching support for AArch64 (the 64-bit ARM Architecture), which isn’t in upstream. She had a number of questions about whether this code was public, and in what form, to which links to works in progress from several years ago were posted. It appears that Huawei have been maintaining an internal version of these patches in their kernels ever since.

Ying Huang (Intel) posted version 7 of “THP swap: Delay splitting THP during swapping out”, which as we previously noted aims to swap out actual whole “huge” (within certain limits) pages rather than splitting them down to the smallest atom of size supported by the architecture during swap. There was a specific request to various maintainers that they review the patch.

Andi Kleen posted a patch removing the printing of MCEs to the kernel log when the “mcelog” daemon is running (and hopefully logging these events).

Laura Abbott posted a RESEND of “config: Add Fedora config fragments”, which does what it says on the tin. Quoting her mail, “Fedora is a popular distribution for people who like to build their own kernels. To make this easier, add a set of reasonable common config options for Fedora”. She adds files in kernel/configs for “fedora-core.config”, “fedora-fs.config” and “fedora-networking.config” which should prove very useful next time someone complains at me that “building kernels for Red Hat distributions is hard”.

Eric Biggers posted “KEYS: encrypted: avoid encrypting/decrypting stack buffers”, which notes that “Since [Linux] v4.9, the crypto API cannot (normally) be used to encrypt/decrypt stack buffers because the stack may be virtually mapped. Fix this for the padding buffers in encrypted-keys by using ZERO_PAGE for the encryption padding and by allocating a temporary heap buffer for the decryption padding”. Eric is referring to the virtually mapped stack support introduced by Andy Lutomirski, which has the side effect of incidentally flagging up various previous misuse of stacks.

Mark Rutland posted an RFC (Request For Comments) patch series entitled “ARMv8.3 pointer authentication userspace support”. ARMv8.3 includes a new architectural extension that “adds functionality to detect modification of pointer values, mitigating certain classes of attack such as stack smashing, and making return oriented programming [ROP] attacks harder”. [aside: If you’re bored, want some really interesting (well, I think so) bedtime reading, and haven’t already read all about ROP, you really should do so]. Continuing to quote Mark, the “extension introduces the concept of a pointer authentication code (PAC), which is stored in some upper bits of pointers. Each PAC is derived from the original pointer, another 64-bit value (e.g. the stack pointer), and a secret 128-bit key”. The extension includes new instructions to “insert a PAC into a pointer”, to “strip a PAC from a pointer”, and to authenticate and strip a PAC from a pointer (the latter having the side effect of poisoning the pointer, causing a later fault, if authentication fails – allowing for detection of malicious intent).

Mark’s patch makes for great reading and summarizes this feature well. It notes that it has various counterparts in userspace to add ELF (Executable and Linkable Format, the executable container used on modern Linux and Unix systems) notes sections to programs to provide the necessary annotations and presumably other data necessary to implement pointer authentication in application programs. It will be great to see those posted too.

Joerg Roedel followed up to a posting from Samuel Sieb entitled “AMD IOMMU causing filesystem corruption” to note that it has recently been discovered (and was documented in another thread this past week entitled “PCI: Blacklist AMD Stoney GPU devices for ATS”) that the AMD “Stoney” platform features a GPU for which PCI-ATS is known to be broken. ATS (Address Translation Services) is the mechanism by which PCIe endpoint devices (such as plugin adapter cards, including AMD GPUs) may obtain virtual to physical address translations for use in inbound DMA operations initiated by a PCIe device into a virtual machine (VM’s) memory (the VM talks the other way through the CPU MMU).

In ATS, the device utilizes an Address Translation Cache (ATC) which is essentially a TLB (Translation Lookaside Buffer) but not called that because of handwavy reasons intended not to confuse CPU and non-CPU TLBs. When a device sitting behind an IOMMU needs to perform an address translation, it asks a Translation Agent (TA) typically contained within the PCIe Root Complex to which it is ultimately attached. In the case of AMD’s Stoney Platform, this blows up under address invalidation load: “the GPU does not reply to invalidations anymore, causing Completion-wait loop timeouts on the AMD IOMMU driver side”. Somehow (but this isn’t clear) this is suspected as the possible cause of the filesystem corruption seen by Samuel, who is waiting to rebuild a system that ate its disk testing this.

Calvin Owens (Facebook) posted “printk: Introduce per-console filtering of messages by loglevel”, which notes that “Not all consoles are created equal”. It essentially allows the user to set a different loglevel for consoles that might each be capable of very different performance. For example, a serial console might be severely limited in its baud rate (115,200 in many cases, but perhaps as low as 9,600 or lower is still commonplace in 2017), while a graphics console might be capable of much higher. Calvin mentions netconsole as the preferred (higher speed) console that Facebook use to “monitor our fleet” but that “we still have serial consoles attached on each host for live debugging, and the latter has caused problems”. He doesn’t specifically mention USB debug consoles, or the EFI console, but one assumes that listeners are possibly aware of the many console types.

Christopher Bostic (IBM) posted version 5 of a patch series entitled “FSI device driver implementation”. FSI stands for “Flexible Support Interface”, a “high fan out [a term referring to splitting of digital signals into many additional outputs] serial bus consisting of a clock and a serial data line capable of running at speeds up to 166MHz”. His patches add core support to the Linux bus and device models (including “probing and discovery of slaves and slave engines”), along with additional handling for CFAM (Common Field Replaceable Unit Access Macro) – an ASIC (chip) “residing in any device requiring FSI communications” that provides these various “engines”, and an FSI engine driver that manages devices on the FSI bus.

Finally, Adam Borowski posted “n_tty: don’t mangle tty codes in OLCUC mode” which aims to correct a bug that is “reproducible as of Linux 0.11” and all the way back to 0.01. OLCUC is not part of POSIX, but this termios structure flag tells Linux to map lowercase characters to uppercase ones. The posting cites an obvious desire by Linus to support “Great Runes” (archaic Operating Systems in which everything was uppercase), to which Linus (obviously in jest, and in keeping with the April 1 date) asked Adam why he “didn’t make this the default state of a tty?”.

April 05, 2017 07:31 AM

April 03, 2017

Arnaldo Carvalho de Melo: Looking for users of new syscalls

Recently Linux got a new syscall to get extended information about files, a super ‘stat’, if you will, read more about it at LWN.

So I grabbed the headers with the definitions for the statx arguments into tools/include/ so that ‘perf trace’ can use them to beautify the syscall, i.e. to make its arguments appear as a bitmap of strings, as described in this cset.

To test it I used one of the things ‘perf trace’ can do that ‘strace’ does not: system wide stracing. To see whether any of the programs running on my machine was using the new syscall I simply did, using strace-like syntax:

# perf trace -e statx

After a few minutes, nothing… So this Fedora 25 system isn’t using it in any of the utilities I used in those moments; not surprising, as glibc still needs statx wired up.

So I found out about samples/statx/test-statx.c, and after installing the kernel headers and pointing the compiler to where those files were installed, I restarted that system wide ‘perf trace’ session and ran the test program, much better:

# trace -e statx
16612.967 ( 0.028 ms): statx/562 statx(dfd: CWD, filename: /etc/passwd, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffef195d660) = 0
33064.447 ( 0.011 ms): statx/569 statx(dfd: CWD, filename: /tmp/statx, flags: SYMLINK_NOFOLLOW|STATX_FORCE_SYNC, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffc5484c790) = 0
36050.891 ( 0.023 ms): statx/576 statx(dfd: CWD, filename: /etc/motd, flags: SYMLINK_NOFOLLOW, mask: BTIME, buffer: 0x7ffeb18b66e0) = 0
38039.889 ( 0.023 ms): statx/584 statx(dfd: CWD, filename: /home/acme/.bashrc, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7fff1db0ea90) = 0
^C#

Ah, to get filenames fetched we need to put in place a special probe, that will collect filenames passed to the kernel right after the kernel copies them from user memory:

[root@jouet ~]# perf probe 'vfs_getname=getname_flags:72 pathname=result->name:string'
Added new event:
probe:vfs_getname    (on getname_flags:72 with pathname=result->name:string)

You can now use it in all perf tools, such as:

perf record -e probe:vfs_getname -aR sleep 1

[root@jouet ~]# trace -e open touch /etc/passwd
0.024 ( 0.011 ms): touch/649 open(filename: /etc/ld.so.cache, flags: CLOEXEC) = 3
0.056 ( 0.018 ms): touch/649 open(filename: /lib64/libc.so.6, flags: CLOEXEC) = 3
0.481 ( 0.014 ms): touch/649 open(filename: /usr/lib/locale/locale-archive, flags: CLOEXEC) = 3
0.553 ( 0.012 ms): touch/649 open(filename: /etc/passwd, flags: CREAT|NOCTTY|NONBLOCK|WRONLY, mode: IRUGO|IWUGO) = 3
[root@jouet ~]#

Make sure you have CONFIG_DEBUG_INFO set in your kernel build or that the matching debuginfo packages are installed. This needs to be done just once per boot; ‘perf trace’ will find it in place and use it.

Lastly, if ‘perf’ is hardlinked to ‘trace’, then the latter will be the same as ‘perf trace’.


April 03, 2017 03:23 PM

March 31, 2017

Daniel Vetter: X.org Foundation Election - Vote Now!

It is election season again for the X.org Foundation. Beside electing half of the board seats we again have some paperwork changes - after updating the bylaws last year we realized that the membership agreement hasn’t been changed in over 10 years. It talks about the previous-previous legal org, has old addresses and a bunch of other things that just don’t fit anymore. In the board we’ve updated it to reflect our latest bylaws (thanks a lot to Rob Clark for doing the editing), with no material changes intended.

Like bylaw changes, any change to the membership agreement needs a qualified supermajority of all members; every vote counts, and not voting essentially means voting no.

To vote, please go to https://members.x.org, log in and hit the “Cast” button on the listed ballot.

Voting closes by 23:59 UTC on 11 April 2017, but please don’t cut it close; it’s a computer that decides when it’s over …

March 31, 2017 12:00 AM

March 28, 2017

Arnaldo Carvalho de Melo: Getting backtraces from arbitrary places

Needs debuginfo, either in a package-debuginfo rpm or equivalent or by building with ‘cc -g’:

[root@jouet ~]# perf probe -L icmp_rcv:52 | head -15

  52  	if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) {
      		/*
      		 * RFC 1122: 3.2.2.6 An ICMP_ECHO to broadcast MAY be
      		 *  silently ignored (we let user decide with a sysctl).
      		 * RFC 1122: 3.2.2.8 An ICMP_TIMESTAMP MAY be silently
      		 *  discarded if to broadcast/multicast.
      		 */
  59  		if ((icmph->type == ICMP_ECHO ||
  60  		     icmph->type == ICMP_TIMESTAMP) &&
      		    net->ipv4.sysctl_icmp_echo_ignore_broadcasts) {
      			goto error;
      		}
      		if (icmph->type != ICMP_ECHO &&
      		    icmph->type != ICMP_TIMESTAMP &&
[root@jouet ~]# perf probe icmp_rcv:59
Added new event:
  probe:icmp_rcv       (on icmp_rcv:59)

You can now use it in all perf tools, such as:

	perf record -e probe:icmp_rcv -aR sleep 1

[root@jouet ~]# perf trace --no-syscalls --event probe:icmp_rcv/max-stack=5/
     0.000 probe:icmp_rcv:(ffffffffb47b7f9b))
                          icmp_rcv ([kernel.kallsyms])
                          ip_local_deliver_finish ([kernel.kallsyms])
                          ip_local_deliver ([kernel.kallsyms])
                          ip_rcv_finish ([kernel.kallsyms])
                          ip_rcv ([kernel.kallsyms])
  1025.876 probe:icmp_rcv:(ffffffffb47b7f9b))
                          icmp_rcv ([kernel.kallsyms])
                          ip_local_deliver_finish ([kernel.kallsyms])
                          ip_local_deliver ([kernel.kallsyms])
                          ip_rcv_finish ([kernel.kallsyms])
                          ip_rcv ([kernel.kallsyms])
^C[root@jouet ~]#

Humm, lots of redundant info, guess we could do away with those ([kernel.kallsyms]) in all the callchain lines…


March 28, 2017 08:23 PM

Kernel Podcast: Linux Kernel Podcast for 2017/03/28

Audio: http://traffic.libsyn.com/jcm/20170328v2.mp3

Author’s Note: Apologies to Ulrich Drepper for incorrectly attributing his paper “Futexes are Tricky” to Rusty. Oops. In any case, everyone should probably read Uli’s paper: https://www.akkadia.org/drepper/futex.pdf

In this week’s edition: Linus Torvalds announces Linux 4.11-rc4, early debug with USB3 earlycon, upcoming support for USB-C in 4.12, and ongoing development including various work on boot time speed ups, logging, futexes, and IOMMUs.

Linus Torvalds announced Linux 4.11-rc4, noting that “So last week, I said that I was hoping that rc3 was the point where we’d start to shrink the rc’s, and yes, rc4 is smaller than rc3. By a tiny tiny smidgen. It does touch a few more files, but it has a couple fewer commits, and fewer lines changed overall. But on the whole the two are almost identical in size. Which isn’t actually all that bad, considering that rc4 has both a networking merge and the usual driver suspects from Greg [Kroah Hartman], _and_ some drm fixes”.

Announcements

Junio C Hamano announced Git v2.12.2.

Greg Kroah-Hartman announced Linux 4.4.57, 4.9.18, and 4.10.6.

Sebastian Andrzej Siewior announced Linux v4.9.18-rt14, which includes a “larger rework of the futex / rtmutex code. In v4.8-rt1 we added a workaround so we don’t de-boost too early in the unlock path. A small window remained in which the locking thread could de-boost the unlocking thread. This rework by Peter Zijlstra fixes the issue.”

Upcoming features

Greg K-H finally accepted the latest “USB Type-C Connector class” patch series from Heikki Krogerus (Intel). This patch series aims to provide various control over the capability for USB-C to be used both as a power source and as a delivery interface to supply power to external devices (enabling the oft-cited use case of selecting between charging your cellphone/mobile device or using said device to charge your laptop). This will land a new generic management framework exposed to userspace in Linux 4.12, including a driver for the “Intel Whiskey Cove PMIC [Power Management IC] USB Type-C PHY”. Your author looks forward to playing with it. Greg thanked Heikki for the 18(!) iterations this patch went through prior to being merged – not quite a record, but a lot of effort!

Kishon Vijay Abraham (TI) posted “PCI: Support for configurable PCI endpoint”, which provides generic infrastructure to handle PCI endpoint devices (Linux operating as a PCI endpoint “device”), such as those based upon IP blocks from DesignWare (DW). He’s only tested the design on his “dra7xx” boards and requires “the help of others to test the platforms they have access to”. The driver adds a configfs interface including an entry to which userspace should write “start” to bring up an endpoint device. He adds himself as the maintainer for this new kernel feature.

Rob Herring posted “dtc updates for 4.12”, which “syncs dtc [Device Tree Compiler] with current mainline [dtc]”. His “primary motivation is to pull in the new checks [he’s] worked on. This gives lots of new warnings which are turned off by default”.

60Hz vs 59.94Hz (Handling of reduced FPS in V4L2)

Jose Abreu (Synopsys) posted a patch series entitled “Handling of reduced FPS in V4L2”, which aims to provide a mechanism for the kernel to measure (in a generic way) the actual Frames Per Second for a Video For Linux (V4L) video device. The patches rely upon hardware drivers being able to signal that they can distinguish “between regular fps and 1000/1001 fps”.

This took your author on a journey of discovery. It turns out that (most of the time), when a video device claims to be “60fps” it’s actually running at 59.94fps, but not always. The latter frame rate is an artifact of the NTSC (National Television System Committee) color television standard in the United States. Early televisions used the 60Hz frequency (which is nationally synchronized, at least in each of the traditional three independent grids operated in the US, which are now interconnected using HVDC interconnects but presumably are still not directly in phase with one another – feel free to educate me!) of the AC supply to lock individual frame scan times. When color TV was introduced, a small frequency offset was used to make room in each frame for a color sub-carrier signal while retaining backward compatibility for black and white transmissions. This is where the frequencies of 29.97 and 59.94 frames per second originate. In case you always wondered.

Jose and Hans Verkuil had a back and forth discussion about various real-world measured pixelclock frequencies that they had obtained using a variety of equipment (signal analyzers, a certified HDMI analyzer, and the Synopsys IP supported by the patch series under discussion) to see whether it was in reality possible to reliably distinguish frame rates.

Early Debug with USB3 earlycon (early printk)

Lu Baolu (Intel) posted version 8 of a patch series entitled “usb: early: add support for early printk through USB3 debug port”. Contemporary (especially x86) desktop and server class systems don’t expose low level hardware debug interfaces, such as JTAG debug chains, which are used during chip bringup and early firmware and OS enablement activities, and which allow developers with suitable tools to directly control and interrogate hardware state. Or just dump out the kernel ringbuffer (the dmesg “log”).

Actually, all such systems do have low level debug capabilities, they’re just fused out during the production process (by blowing efuses embedded into the processor) and either not exposed on the external pins of the chip at all, or are simply disabled in the chip logic. Probably most of these can be re-enabled by writing the magic cryptographically signed hashes to undocumented memory regions in on-chip coprocessor spaces. In any case, vendors such as Intel aren’t going to tell you how.

Yet it is often desirable to have certain low level debug functionality for systems that are deployed into field settings, even if only to reliably dump out kernel console messages at the DEBUG log level somewhere. Traditionally this was done using PC serial ports, but most desktop (and all laptop) systems no longer ship with those exposed on the rear panel. If you’re lucky you’ll see an IDC10 connector on your motherboard to which you can attach a DB9 breakout cable. Consumers and end users have no idea what any of this means, and probably shouldn’t be encouraged to open the machine up and poke at things. Yet even in the case that IDC10 connectors exist and can be hooked up, this is still a cumbersome interface that cannot be relied upon today.

Microsoft (who are often criticized but actually are full of many good ideas and usually help to drive industry standardization for the broader market) instituted sanity years ago by working with the USB Implementers Forum (USB-IF) to ensure that the USB3 specification included a standardized feature known as xHCI debug capability (DbC), an “optional but standalone functionality by an xHCI host controller”. This suited Windows, which traditionally requires two UARTs (serial ports) for kernel development, and uses one of them for simple direct control of the running kernel without going through complex driver frameworks. Debug port (which also existed on USB2) traditionally required a special external partner hardware dongle but is cleaner in USB3, requiring only a USB A-to-A crossover cable connecting the USB3.0 data lines.

As Lu Baolu notes in his patch, “With DbC hardware initialized, the system will present a debug device through the USB3 debug port (normally the first USB3 port)”. The patch series enables this as a high speed console log target on Linux, but it could be used for much more interesting purposes via KDB.

[Separately, but only really related to console drivers and not debugging, Thierry Escande posted “firmware: google memconsole” which adds support for importing the boot time BIOS memory based console into the kernel ringbuffer on Google Coreboot systems].

Ongoing Development

Pavel Tatashin (Oracle) posted “parallelized “struct page” zeroing”, which improves boot time performance significantly in the case that the “deferred struct page initialization” feature is enabled. In this case, zeroing out of the kernel’s vmemmap (Virtual Memory Map) is delayed until after the secondary CPU cores on a machine have been started. When this is done, those cores can be used to run zeroing threads that write to memory, taking the boot time of one SPARC system down from 97.89 seconds to 46.91. Pavel notes that the savings are considerable on x86 systems too.

Thomas Gleixner had a lengthy back and forth with Pasha Tatashin (Oracle) over the latter’s posting of “Early boot time stamps for x86”, which uses the TSC (Time Stamp Counter) on Intel x86 Architecture. The goal is to log how long the machine actually took to boot, including firmware, rather than just how long Linux took to boot from the time it was started. Peter Zijlstra responded (to Pasha), “Lol, how cute. You assume TSC starts at 0 on reset” (alluding to the fact that firmware often does crazy things playing with the TSC offset or directly writing to it). Thomas was unimpressed with Pavel’s posting of a v2 patch series, noting “Did you actually read my last reply on V1 of this? I made it clear that the way this is done, i.e. hacking it into the earliest boot stage is not going to happen…I don’t care about you wasting your time, but I very much care about my time”. He provided a further, more lengthy response, including various commentary on the best ways to handle feedback.

Peter Zijlstra posted version 6 of a patch series entitled “The arduous story of FUTEX_UNLOCK_PI” in which he adds “Another installment of the futex patches that give you nightmares”. Futexes (Fast User-space Mutexes) are a mechanism provided by the Linux kernel which leverages shared memory to provide a low overhead mutex (mutual exclusion primitive) to userspace in the case that such mutexes are uncontended (no conflicting processes – tasks within the kernel – exist trying to acquire the same resource), but with a “slow path” through the kernel in the case of contention. They are used by many userspace applications, including extensively in the C library (see the famous paper by Ulrich Drepper entitled “Futexes Are Tricky”). Peter is working on solving problems introduced by having to have Priority Inheritance (PI) aware futexes in Real Time kernels. These adjust the priority of the tasks holding mutexes for short periods in order to prevent Priority Inversion (see the Mars Pathfinder study papers), in which a low priority task holds a mutex that a high priority task wants to acquire. Peter’s patches “rework[] and document[] the locking” of the existing code.

Separately, Waiman Long (Red Hat) posted version 6 of “futex: Introduce throughput-optimized (TP) futexes”, which “introduces a new futex implementation called throughput-optimized (TP) futexes. It is similar to PI futexes in its calling convention, but provides better throughput than the wait-wake (WW) futexes by encouraging lock stealing and optimistic spinning. The new TP futexes can be used in implementing both userspace mutexes and rwlocks. They provide[] better performance while simplifying the userspace locking implementation at the same time. The WW futexes are still needed to implement other synchronization primitives like conditional variables and semaphores that cannot be handled by the TP futexes”.

David Woodhouse posted “PCI resource mmap cleanup” which aims to clean up the use of various kernel interfaces that provide “user visible” resource addresses through (legacy) proc and (contemporary) sysfs. The purpose of these interfaces is to provide information about regions of PCI address space memory that can be directly mapped by userspace applications such as those linked against the DPDK (Data Plane Development Kit) library. An example of his cleanup included “Only allow WC [Write Combining] mmap on prefetchable resources” for the /proc/bus/pci mmap interface, because this was the case for the preferred sysfs interface already. This led some to debate why the 64-bit ARM Architecture didn’t provide the legacy procfs interface (since there was a little confusion about the dependencies for DPDK), but it was ultimately re-concluded that it shouldn’t.

Tyler Baicar (Codeaurora) posted version 13 of a patch series entitled “Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64”, which aims to introduce support to the 64-bit ARM Architecture for logging of RAS events using the shared “GHES” (Generic Hardware Error Source) memory location “with the proper GHES structures to notify the OS of the error”. This dovetails nicely with platforms performing “firmware first” error handling, in which errors are trapped to secure firmware which first handles them and subsequently informs the Operating System using this ACPI feature.

Shaohua Li (Facebook) posted a patch entitled “add an option to disable iommu force on” in the case of the (x86) Trusted Boot (TBOOT) feature being enabled. The reason cited was that under a certain 40GBit networking load XDP (eXpress Data Path) test there were high numbers of IOTLB (IO Translation Lookaside Buffer) misses “which kills the performance”. What he is referring to is the mechanism through which an IOMMU (which sits logically between a hardware device, such as a network card, and memory, often as part of an integrated PCI Root Complex) translates underlying memory accesses by the adapter card into real host memory transactions. These are cached by the IOMMU in small caches (known as IOTLBs) after it performs such translations using its “page tables” (similar to how a host CPU’s MMU – Memory Management Unit – performs host memory translations). Badly designed IOMMU implementations or poor utilization can result in large numbers of misses that lead users to disable the feature. Alas, without an IOMMU, there’s little protection during boot from rogue devices that maliciously want to trash host memory. Nobody has noted this in the RFC (Request For Comments) discussion, yet.

Bodong Wang (Mellanox) posted a patch entitled “Add an option to probe VFs or not before enabling SR-IOV”, which aims to allow administrators to limit the probing of (PCIe) Virtual Functions (VFs) on adapters that will have those resources passed through to Virtual Machines (VMs) (using VFIO). This “can save host side resource usage by VF instances which would be eventually probed to VMs”. It adds a new sysfs interface to control this.

Viresh Kumar posted a patch entitled “cpufreq: Restore policy min/max limits on CPU online”. Apparently, existing code behavior was that “On CPU online the cpufreq core restores the previous governor [the in kernel logic that determines CPU frequency transitions based upon various metrics, such as saving energy, or prioritizing performance]…but it does not restore min/max limits at the same time”. The patch addresses this shortcoming.

Wanpeng Li posted a patch entitled “KVM: nVMX: Fix nested VPID vmx exec control” that aims to “hide and forbid” Virtual Processor IDentifiers in nested virtualization contexts where the hardware doesn’t support this. Apparently it was unconditionally being enabled (based upon real hardware realities of existing implementation) regardless of feature information (INVVPID) provided in the “vmx” capabilities.

Joerg Roedel posted a patch entitled “ACPI: Don’t create a platform_device for IOAPIC/IOxAPIC” since this was causing problems during hot remove (of CPUs). Rafael J. Wysocki noted that “it’s better to avoid using platform_device for hot-removable stuff” since it is “inherently fragile”.

Kees Cook (Google) posted a patch disabling hibernation support on 32-bit systems in the case that KASLR (Kernel Address Space Layout Randomization) was enabled at boot time, but allowing for “nokaslr” on the kernel command line to change this. Evgenii Shatokhin initially noted that “nokaslr” didn’t re-enable hibernation support correctly, but eventually determined that the ordering and placement of the “nokaslr” option on the command line was to blame, which led to Kees saying he would look into the command line parsing sequence and interaction with other options, such as “resume=”.

Separately, Baoquan He (Red Hat) noted that with KASLR an implicit assumption that EFI_VA_START < EFI_VA_END existed, while “In fact [the] EFI [(Unified) Extensible Firmware Interface] region reserved for runtime services [these are callbacks into firmware from Linux] virtual mapping will be allocated using a top-down schema”. His patches addressed this problem, and being “RESEND”s, he was keen to see that they get taken up soon.

Also separately, Kees posted “syscalls: Restore address limit after a syscall” which “ensures a syscall does not return to user-mode with a kernel address limit. If that happened, a process can corrupt kernel-mode memory and elevate privileges”. He cites a bug it would have prevented.

Kan Liang (Intel) posted “measure SMI cost”. This patch series aims to leverage hardware counters to inform perf of the amount of time spent (on Intel x86 Architecture systems) inside System Management Mode (SMM). SMIs (System Management Interrupts) are events generated (usually) by the Intel Platform Controller Hub and similar chipset logic, which can be programmed by firmware to generate regular interrupts that target a secure execution context known as SMM (System Management Mode). It is here that firmware temporarily steals CPU cycles from the Operating System (without its knowledge) to perform such things as CPU fan control, errata handling, and wholesale VGA graphics emulation in BMC “value add” from OEMs. Over the years, the amount of gunk hidden in SMIs has grown so much that this author once wrote a latency detector (hwlat) and has a patent on SMI detection without using such dedicated counters, due to the impact of SMIs on system performance. SMM is necessary on x86 due to its lack of a standardized on-SoC platform management controller, but so is accounting for its bloat.

Finally, yes, Kirill A. Shutemov snuck in another iteration of his Intel “5-level paging support” in preparation for a 4.12 merge.

 

March 28, 2017 05:23 PM

March 27, 2017

Matthew Garrett: Buying a Utah teapot

The Utah teapot was one of the early 3D reference objects. It's canonically a Melitta but hasn't been part of their range in a long time, so I'd been watching Ebay in the hope of one turning up. Until last week, when I discovered that a company called Friesland had apparently bought a chunk of Melitta's range some years ago and sell the original teapot[1]. I've just ordered one, and am utterly unreasonably excited about this.

Update: Friesland have apparently always produced the Utah teapot, but were part of the Melitta group for some time - they didn't buy the range from Melitta.

[1] They have them in 0.35, 0.85 and 1.4 litre sizes. I believe (based on the measurements here) that the 1.4 litre one matches the Utah teapot.

comment count unavailable comments

March 27, 2017 11:45 PM

Pete Zaitcev: It was surprising

So here I was watching ACCA at Crunchyroll, when a commercial comes up... of VMware OpenStack.

I still remember times in 2010 when VMware was going to have their own cloud (possibly called VxCloud), with blackjack and hookers, as they say, or much better than OpenStack anyway. Looks like things have changed.

Also, what's up with this targeting? How did they link my account at Crunchyroll with OpenStack?

{Update: Thanks, Andreas!}

March 27, 2017 06:33 PM

March 26, 2017

Vegard Nossum: Writing a reverb filter from first principles

WARNING/DISCLAIMER: Audio programming always carries the risk of damaging your speakers and/or your ears if you make a mistake. Therefore, remember to always turn down the volume completely before and after testing your program. And whatever you do, don't use headphones or earphones. I take no responsibility for damage that may occur as a result of this blog post!

Have you ever wondered how a reverb filter works? I have... and here's what I came up with.

Reverb is the sound effect you commonly get when you make sound inside a room or building, as opposed to when you are outdoors. The stairwell in my old apartment building had an excellent reverb. Most live musicians hate reverb because it muddles the sound they're trying to create and can even throw them off while playing. On the other hand, reverb is very often used (and overused) on studio vocals because it also has the effect of smoothing out rough edges and imperfections in a recording.

We typically distinguish reverb from echo in that an echo is a single delayed "replay" of the original sound you made. The delay is also typically rather large (think yelling into a distant hill- or mountainside and hearing your HEY! come back a second or more later). In more detail, the two things that distinguish reverb from an echo are:

  1. The reverb inside a room or a hall has a much shorter delay than an echo. The speed of sound is roughly 340 meters/second, so if you're in the middle of a room that is 20 meters by 20 meters, the sound will come back to you (from one wall) after travelling 10 meters each way: 20 / 340 ≈ 0.059 seconds, which is such a short duration of time that we can hardly notice it (by comparison, a 30 FPS video would display each frame for ~0.033 seconds).
  2. After bouncing off one wall, the sound reflects back and reflects off the other wall. It also reflects off the perpendicular walls and any and all objects that are in the room. Even more, the sound has to travel slightly longer to reach the corners of the room (~14 meters instead of 10). All these echoes themselves go on to combine and echo off all the other surfaces in the room until all the energy of the original sound has dissipated.

Intuitively, it should be possible to use multiple echoes at different delays to simulate reverb.

We can implement a single echo using a very simple ring buffer:

    class FeedbackBuffer {
    public:
        unsigned int nr_samples;
        int16_t *samples;

        unsigned int pos;

        FeedbackBuffer(unsigned int nr_samples):
            nr_samples(nr_samples),
            samples(new int16_t[nr_samples]()), /* value-initialise to zero so the first nr_samples reads are silence */
            pos(0)
        {
        }

        ~FeedbackBuffer()
        {
            delete[] samples;
        }

        int16_t get() const
        {
            return samples[pos];
        }

        void add(int16_t sample)
        {
            samples[pos] = sample;

            /* If we reach the end of the buffer, wrap around */
            if (++pos == nr_samples)
                pos = 0;
        }
    };

The constructor takes one argument: the number of samples in the buffer, which is exactly how much time we will delay the signal by; when we write a sample to the buffer using the add() function, it will come back after a delay of exactly nr_samples using the get() function. Easy, right?

Since this is an audio filter, we need to be able to read an input signal and write an output signal. For simplicity, I'm going to use stdin and stdout for this -- we will read 8 KiB at a time using read(), process that, and then use write() to output the result. It will look something like this:

    #include <cstdio>
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <unistd.h>


    int main(int argc, char *argv[])
    {
        while (true) {
            int16_t buf[8192];
            ssize_t in = read(STDIN_FILENO, buf, sizeof(buf));
            if (in == -1) {
                /* Error */
                return 1;
            }
            if (in == 0) {
                /* EOF */
                break;
            }

            for (unsigned int j = 0; j < in / sizeof(*buf); ++j) {
                /* TODO: Apply filter to each sample here */
            }

            write(STDOUT_FILENO, buf, in);
        }

        return 0;
    }

On Linux you can use e.g. 'arecord' to get samples from the microphone and 'aplay' to play samples on the speakers, and you can do the whole thing on the command line:

    $ arecord -t raw -c 1 -f s16 -r 44100 |\
        ./reverb | aplay -t raw -c 1 -f s16 -r 44100

(-c 1 means 1 channel; -f s16 means "signed 16-bit" which corresponds to the int16_t type we've used for our buffers; -r 44100 means a sample rate of 44100 samples per second; and ./reverb is the name of our executable.)

So how do we use class FeedbackBuffer to generate the reverb effect?

Remember how I said that reverb is essentially many echoes? Let's add a few of them at the top of main():

    FeedbackBuffer fb0(1229);
    FeedbackBuffer fb1(1559);
    FeedbackBuffer fb2(1907);
    FeedbackBuffer fb3(4057);
    FeedbackBuffer fb4(8117);
    FeedbackBuffer fb5(8311);
    FeedbackBuffer fb6(9931);

The buffer sizes that I've chosen here are somewhat arbitrary (I played with a bunch of different combinations and this sounded okay to me). But I used this as a rough guideline: simulating the 20m-by-20m room at a sample rate of 44100 samples per second means we would need delays roughly on the order of 44100 × (20 / 340) ≈ 2594 samples.

Another thing to keep in mind is that we generally do not want our feedback buffers to be multiples of each other. The reason for this is that it creates a consonance between them and will cause certain frequencies to be amplified much more than others. As an example, if you count from 1 to 500 (and continue again from 1), and you have a friend who counts from 1 to 1000 (and continues again from 1), then you would start out 1-1, 2-2, 3-3, etc. up to 500-500, then you would go 1-501, 2-502, 3-503, etc. up to 500-1000. But then, as you both wrap around, you start at 1-1 again. And your friend will always be on 1 when you are on 1. This has everything to do with periodicity and -- in fact -- prime numbers! If you want to maximise the combined period of two counters, you have to make sure that they are coprime, i.e. that they don't share any common factors. The easiest way to achieve this is to only pick prime numbers to start with, so that's what I did for my feedback buffers above.

Having created the feedback buffers (which each represent one echo of the original sound), it's time to put them to use. The effect I want to create is not simply overlaying echoes at fixed intervals, but to have the echoes bounce off each other and feed back into each other. The way we do this is by first combining them into the output signal... (since we have 8 signals to combine including the original one, I give each one a 1/8 weight)

    float x = .125 * buf[j];
    x += .125 * fb0.get();
    x += .125 * fb1.get();
    x += .125 * fb2.get();
    x += .125 * fb3.get();
    x += .125 * fb4.get();
    x += .125 * fb5.get();
    x += .125 * fb6.get();
    int16_t out = x;

...then feeding the result back into each of them:

    fb0.add(out);
    fb1.add(out);
    fb2.add(out);
    fb3.add(out);
    fb4.add(out);
    fb5.add(out);
    fb6.add(out);

And finally we also write the result back into the buffer. I found that the original signal loses some of its power, so I use a factor 4 gain to bring it roughly back to its original strength; this number is an arbitrary choice by me, I don't have any specific calculations to support it:

    buf[j] = 4 * out;

That's it! 88 lines of code is enough to write a very basic reverb filter from first principles. Be careful when you run it, though, even the smallest mistake could cause very loud and unpleasant sounds to be played.

If you play with different buffer sizes or a different number of feedback buffers, let me know if you discover anything interesting :-)

March 26, 2017 10:07 AM

Vegard Nossum: Fuzzing the OpenSSH daemon using AFL

(EDIT 2017-03-25: All my patches to make OpenSSH more amenable to fuzzing with AFL are available at https://github.com/vegard/openssh-portable. This also includes improvements to the patches found in this post.)

American Fuzzy Lop is a great tool. It does take a little bit of extra setup and tweaking if you want to go into advanced usage, but mostly it just works out of the box.

In this post, I’ll detail some of the steps you need to get started with fuzzing the OpenSSH daemon (sshd) and show you some tricks that will help get results more quickly.

The AFL home page already displays 4 OpenSSH bugs in its trophy case; these were found by Hanno Böck who used an approach similar to that outlined by Jonathan Foote on how to fuzz servers with AFL.

I take a slightly different approach, which I think is simpler: instead of intercepting system calls to fake network activity, we just run the daemon in “inetd mode”. The inet daemon is not used very much anymore on modern Linux distributions, but the short story is that it sets up the listening network socket for you and launches a new process to handle each new incoming connection. inetd then passes the network socket to the target program as stdin/stdout. Thus, when sshd is started in inetd mode, it communicates with a single client over stdin/stdout, which is exactly what we need for AFL.

Configuring and building AFL

If you are just starting out with AFL, you can probably just type make in the top-level AFL directory to compile everything, and it will just work. However, I want to use some more advanced features, in particular I would like to compile sshd using LLVM-based instrumentation (which is slightly faster than the “assembly transformation by sed” that AFL uses by default). Using LLVM also allows us to move the target program’s “fork point” from just before entering main() to an arbitrary location (known as “deferred forkserver mode” in AFL-speak); this means that we can skip some of the setup operations in OpenSSH, most notably reading/parsing configs and loading private keys.

Most of the steps for using LLVM mode are detailed in AFL’s llvm_mode/README.llvm. On Ubuntu, you should install the clang and llvm packages, then run make -C llvm_mode from the top-level AFL directory, and that’s pretty much it. You should get a binary called afl-clang-fast, which is what we’re going to use to compile sshd.

Configuring and building OpenSSH

I’m on Linux so I use the “portable” flavour of OpenSSH, which conveniently also uses git (as opposed to the OpenBSD version which still uses CVS – WTF!?). Go ahead and clone it from git://anongit.mindrot.org/openssh.git.

Run autoreconf to generate the configure script. This is how I run the config script:

./configure \
CC="$PWD/afl-2.39b/afl-clang-fast" \
CFLAGS="-g -O3" \
--prefix=$PWD/install \
--with-privsep-path=$PWD/var-empty \
--with-sandbox=no \
--with-privsep-user=vegard

You obviously need to pass the right path to afl-clang-fast. I’ve also created two directories in the current (top-level OpenSSH directory), install and var-empty. This is so that we can run make install without being root (although var-empty needs to have mode 700 and be owned by root) and without risking clobbering any system files (which would be extremely bad, as we’re later going to disable authentication and encryption!). We really do need to run make install, even though we’re not going to be running sshd from the installation directory. This is because sshd needs some private keys to run, and that is where it will look for them.

(EDIT 2017-03-25: Passing --without-pie to configure may help make the resulting binaries easier to debug since instruction pointers will not be randomised.)

If everything goes well, running make should display the AFL banner as OpenSSH is compiled.

You may need some extra libraries (zlib1g-dev and libssl-dev on Ubuntu) for the build to succeed.

Run make install to install sshd into the install/ subdirectory (and again, please don’t run this as root).

We will have to rebuild OpenSSH a few times as we apply some patches to it, but this gives you the basic ingredients for a build. One particular annoying thing I’ve noticed is that OpenSSH doesn’t always detect source changes when you run make (and so your changes may not actually make it into the binary). For this reason I just adopted the habit of always running make clean before recompiling anything. Just a heads up!

Running sshd

Before we can actually run sshd under AFL, we need to figure out exactly how to invoke it with all the right flags and options. This is what I use:

./sshd -d -e -p 2200 -r -f sshd_config -i

This is what it means:

-d
“Debug mode”. Keeps the daemon from forking, makes it accept only a single connection, and keeps it from putting itself in the background. All useful things that we need.
-e
This makes it log to stderr instead of syslog; this first of all prevents clobbering your system log with debug messages from our fuzzing instance, and also gives a small speed boost.
-p 2200
The TCP port to listen to. This is not really used in inetd mode (-i), but is useful later on when we want to generate our first input testcase.
-r
This option is not documented (or not in my man page, at least), but you can find it in the source code, which should hopefully also explain what it does: preventing sshd from re-execing itself. I think this is a security feature, since it allows the process to isolate itself from the original environment. In our case, it complicates and slows things down unnecessarily, so we disable it by passing -r.
-f sshd_config
Use the sshd_config from the current directory. This just allows us to customise the config later without having to reinstall it or be unsure about which location it’s really loaded from.
-i
“Inetd mode”. As already mentioned, inetd mode will make the server process a single connection on stdin/stdout, which is a perfect fit for AFL (as it will write testcases on the program’s stdin by default).

Go ahead and run it. It should hopefully print something like this:

$ ./sshd -d -e -p 2200 -r -f sshd_config -i
debug1: sshd version OpenSSH_7.4, OpenSSL 1.0.2g 1 Mar 2016
debug1: private host key #0: ssh-rsa SHA256:f9xyp3dC+9jCajEBOdhjVRAhxp4RU0amQoj0QJAI9J0
debug1: private host key #1: ssh-dss SHA256:sGRlJclqfI2z63JzwjNlHtCmT4D1WkfPmW3Zdof7SGw
debug1: private host key #2: ecdsa-sha2-nistp256 SHA256:02NDjij34MUhDnifUDVESUdJ14jbzkusoerBq1ghS0s
debug1: private host key #3: ssh-ed25519 SHA256:RsHu96ANXZ+Rk3KL8VUu1DBzxwfZAPF9AxhVANkekNE
debug1: setgroups() failed: Operation not permitted
debug1: inetd sockets after dupping: 3, 4
Connection from UNKNOWN port 65535 on UNKNOWN port 65535
SSH-2.0-OpenSSH_7.4

If you type some garbage and press enter, it will probably give you “Protocol mismatch.” and exit. This is good!

Detecting crashes/disabling privilege separation mode

One of the first obstacles I ran into was the fact that I saw sshd crashing in my system logs, but AFL wasn’t detecting them as crashes:

[726976.333225] sshd[29691]: segfault at 0 ip 000055d3f3139890 sp 00007fff21faa268 error 4 in sshd[55d3f30ca000+bf000]
[726984.822798] sshd[29702]: segfault at 4 ip 00007f503b4f3435 sp 00007fff84c05248 error 4 in libc-2.23.so[7f503b3a6000+1bf000]

The problem is that OpenSSH comes with a “privilege separation mode” that forks a child process and runs most of the code inside the child. If the child segfaults, the parent still exits normally, so it masks the segfault from AFL (which only observes the parent process directly).

In version 7.4 and earlier, privilege separation mode can easily be disabled by adding “UsePrivilegeSeparation no” to sshd_config or passing -o UsePrivilegeSeparation=no on the command line.

Unfortunately it looks like the OpenSSH developers are removing the ability to disable privilege separation mode in version 7.5 and onwards. This is not a big deal, as OpenSSH maintainer Damien Miller writes on Twitter: “the infrastructure will be there for a while and it’s a 1-line change to turn privsep off”. So you may have to dive into sshd.c to disable it in the future.

(EDIT 2017-03-25: I’ve pushed the source tweak for disabling privilege separation for 7.5 and newer to my OpenSSH GitHub repo. This also obsoletes the need for a config change.)

Reducing randomness

OpenSSH uses random nonces during the handshake to prevent “replay attacks” where you would record somebody’s (encrypted) SSH session and then you feed the same data to the server again to authenticate again. When random numbers are used, the server and the client will calculate a new set of keys and thus thwart the replay attack.

In our case, we explicitly want to be able to replay traffic and obtain the same result two times in a row; otherwise, the fuzzer would not be able to gain any useful data from a single connection attempt (as the testcase it found would not be usable for further fuzzing).

There’s also the possibility that randomness introduces variabilities in other code paths not related to the handshake, but I don’t really know. In any case, we can easily disable the random number generator. On my system, with the configure line above, all or most random numbers seem to come from arc4random_buf() in openbsd-compat/arc4random.c, so to make random numbers very predictable, we can apply this patch:

diff --git openbsd-compat/arc4random.c openbsd-compat/arc4random.c
--- openbsd-compat/arc4random.c
+++ openbsd-compat/arc4random.c
@@ -242,7 +242,7 @@ void
arc4random_buf(void *buf, size_t n)
{
_ARC4_LOCK();
- _rs_random_buf(buf, n);
+ memset(buf, 0, n);
_ARC4_UNLOCK();
}
# endif /* !HAVE_ARC4RANDOM_BUF */

One way to test whether this patch is effective is to try to packet-capture an SSH session and see if it can be replayed successfully. We’re going to have to do that later anyway in order to create our first input testcase, so skip below if you want to see how that’s done. In any case, AFL would also tell us using its “stability” indicator if something was really off with regards to random numbers (>95% stability is generally good, <90% would indicate that something is off and needs to be fixed).

Increasing coverage

Disabling message CRCs

When fuzzing, we really want to disable as many checksums as we can, as Damien Miller also wrote on twitter: “fuzzing usually wants other code changes too, like ignoring MAC/signature failures to make more stuff reachable”. This may sound a little strange at first, but makes perfect sense: In a real attack scenario, we can always fix up CRCs and other checksums to match what the program expects.

If we don’t disable checksums (and we don’t try to fix them up), then the fuzzer will make very little progress. A single bit flip in a checksum-protected area will just fail the checksum test and never allow the fuzzer to proceed.

We could of course also fix the checksum up before passing the data to the SSH server, but this is slow and complicated. It’s better to disable the checksum test in the server and then try to fix it up if we do happen to find a testcase which can crash the modified server.

The first thing we can disable is the packet CRC test:

diff --git a/packet.c b/packet.c
--- a/packet.c
+++ b/packet.c
@@ -1635,7 +1635,7 @@ ssh_packet_read_poll1(struct ssh *ssh, u_char *typep)

cp = sshbuf_ptr(state->incoming_packet) + len - 4;
stored_checksum = PEEK_U32(cp);
- if (checksum != stored_checksum) {
+ if (0 && checksum != stored_checksum) {
error("Corrupted check bytes on input");
if ((r = sshpkt_disconnect(ssh, "connection corrupted")) != 0 ||
(r = ssh_packet_write_wait(ssh)) != 0)

As far as I understand, this is a simple (non-cryptographic) integrity check meant just as a sanity check against bit flips or incorrectly encoded data.

Disabling MACs

We can also disable Message Authentication Codes (MACs), which are the cryptographic equivalent of checksums, but which also guarantees that the message came from the expected sender:

diff --git mac.c mac.c
index 5ba7fae1..ced66fe6 100644
--- mac.c
+++ mac.c
@@ -229,8 +229,10 @@ mac_check(struct sshmac *mac, u_int32_t seqno,
if ((r = mac_compute(mac, seqno, data, dlen,
ourmac, sizeof(ourmac))) != 0)
return r;
+#if 0
if (timingsafe_bcmp(ourmac, theirmac, mac->mac_len) != 0)
return SSH_ERR_MAC_INVALID;
+#endif
return 0;
}

We do have to be very careful when making these changes. We want to try to preserve the original behaviour of the program as much as we can, in the sense that we have to be very careful not to introduce bugs of our own. For example, we have to be very sure that we don’t accidentally skip the test which checks that the packet is large enough to contain a checksum in the first place. If we had accidentally skipped that, it is possible that the program being fuzzed would try to access memory beyond the end of the buffer, which would be a bug which is not present in the original program.

This is also a good reason to never submit crashing testcases to the developers of a program unless you can show that they also crash a completely unmodified program.

Disabling encryption

The last thing we can do, unless you wish to only fuzz the unencrypted initial protocol handshake and key exchange, is to disable encryption altogether.

The reason for doing this is exactly the same as the reason for disabling checksums and MACs, namely that the fuzzer would have no hope of being able to fuzz the protocol itself if it had to work with the encrypted data (since touching the encrypted data with overwhelming probability will just cause it to decrypt to random and utter garbage).

Making the change is surprisingly simple, as OpenSSH already comes with a pseudo-cipher that just passes data through without actually encrypting/decrypting it. All we have to do is to make it available as a cipher that can be negotiated between the client and the server. We can use this patch:

diff --git a/cipher.c b/cipher.c
index 2def333..64cdadf 100644
--- a/cipher.c
+++ b/cipher.c
@@ -95,7 +95,7 @@ static const struct sshcipher ciphers[] = {
# endif /* OPENSSL_NO_BF */
#endif /* WITH_SSH1 */
#ifdef WITH_OPENSSL
- { "none", SSH_CIPHER_NONE, 8, 0, 0, 0, 0, 0, EVP_enc_null },
+ { "none", SSH_CIPHER_SSH2, 8, 0, 0, 0, 0, 0, EVP_enc_null },
{ "3des-cbc", SSH_CIPHER_SSH2, 8, 24, 0, 0, 0, 1, EVP_des_ede3_cbc },
# ifndef OPENSSL_NO_BF
{ "blowfish-cbc",

To use this cipher by default, just put “Ciphers none” in your sshd_config. Of course, the client doesn’t support it out of the box either, so if you make any test connections, you have to use the ssh binary compiled with the patched cipher.c above as well.

You may have to pass -o Ciphers=none from the client as well if it prefers to use a different cipher by default. Use strace or wireshark to verify that communication beyond the initial protocol setup happens in plaintext.

Making it fast

afl-clang-fast/LLVM “deferred forkserver mode”

I mentioned above that using afl-clang-fast (i.e. AFL’s LLVM deferred forkserver mode) allows us to move the “fork point” to skip some of the sshd initialisation steps which are the same for every single testcase we can throw at it.

To make a long story short, we need to put a call to __AFL_INIT() at the right spot in the program, separating the stuff that doesn’t depend on a specific input to happen before it and the testcase-specific handling to happen after it. I’ve used this patch:

diff --git a/sshd.c b/sshd.c
--- a/sshd.c
+++ b/sshd.c
@@ -1840,6 +1840,8 @@ main(int ac, char **av)
/* ignore SIGPIPE */
signal(SIGPIPE, SIG_IGN);

+ __AFL_INIT();
+
/* Get a connection, either from inetd or a listening TCP socket */
if (inetd_flag) {
server_accept_inetd(&sock_in, &sock_out);

AFL should be able to automatically detect that you no longer wish to start the program from the top of main() every time. However, with only the patch above, I got this scary-looking error message:

Hmm, looks like the target binary terminated before we could complete a
handshake with the injected code. Perhaps there is a horrible bug in the
fuzzer. Poke <lcamtuf@coredump.cx> for troubleshooting tips.

So there is obviously some AFL magic code here to make the fuzzer and the fuzzed program communicate. After poking around in afl-fuzz.c, I found FORKSRV_FD, which is a file descriptor pointing to a pipe used for this purpose. The value is 198 (and the other end of the pipe is 199).

To try to figure out what was going wrong, I ran afl-fuzz under strace, and it showed that file descriptors 198 and 199 were getting closed by sshd. With some more digging, I found the call to closefrom(), which is a function that closes all inherited (and presumed unused) file descriptors starting at a given number. Again, the reason for this code to exist in the first place is probably to reduce the attack surface in case an attacker is able to gain control of the process. Anyway, the solution is to protect these special file descriptors using a patch like this:

diff --git openbsd-compat/bsd-closefrom.c openbsd-compat/bsd-closefrom.c
--- openbsd-compat/bsd-closefrom.c
+++ openbsd-compat/bsd-closefrom.c
@@ -81,7 +81,7 @@ closefrom(int lowfd)
while ((dent = readdir(dirp)) != NULL) {
fd = strtol(dent->d_name, &endp, 10);
if (dent->d_name != endp && *endp == '\0' &&
- fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp))
+ fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp) && fd != 198 && fd != 199)
(void) close((int) fd);
}
(void) closedir(dirp);

Skipping expensive DH/curve and key derivation operations

At this point, I still wasn’t happy with the execution speed: Some testcases were as low as 10 execs/second, which is really slow.

I tried compiling sshd with -pg (for gprof) to try to figure out where the time was going, but there are many obstacles to getting this to work properly: First of all, sshd exits using _exit(255) through its cleanup_exit() function. This is not a “normal” exit and so the gmon.out file (containing the profile data) is not written out at all. Applying a source patch to fix that, sshd will give you a “Permission denied” error as it tries to open the file for writing. The problem now is that sshd does a chdir("/"), meaning that it’s trying to write the profile data in a directory where it doesn’t have access. The solution is again simple: just add another chdir() to a writable location before calling exit(). Even with this in place, the profile came out completely empty for me. Maybe it’s another one of those privilege separation things. In any case, I decided to just use valgrind and its “cachegrind” tool to obtain the profile. It’s much easier and gives me the data I need without the hassle of reconfiguring, patching, and recompiling.

The profile showed one very specific hot spot, coming from two different locations: elliptic curve point multiplication.

I don’t really know too much about elliptic curve cryptography, but apparently it’s pretty expensive to calculate. However, we don’t really need to deal with it; we can assume that the key exchange between the server and the client succeeds. Similar to how we increased coverage above by skipping message CRC checks and replacing the encryption with a dummy cipher, we can simply skip the expensive operations and assume they always succeed. This is a trade-off: we are no longer fuzzing all the verification steps, but it allows the fuzzer to concentrate more on the protocol parsing itself. I applied this patch:

diff --git kexc25519.c kexc25519.c
--- kexc25519.c
+++ kexc25519.c
@@ -68,10 +68,13 @@ kexc25519_shared_key(const u_char key[CURVE25519_SIZE],

/* Check for all-zero public key */
explicit_bzero(shared_key, CURVE25519_SIZE);
+#if 0
if (timingsafe_bcmp(pub, shared_key, CURVE25519_SIZE) == 0)
return SSH_ERR_KEY_INVALID_EC_VALUE;

crypto_scalarmult_curve25519(shared_key, key, pub);
+#endif
+
#ifdef DEBUG_KEXECDH
dump_digest("shared secret", shared_key, CURVE25519_SIZE);
#endif
diff --git kexc25519s.c kexc25519s.c
--- kexc25519s.c
+++ kexc25519s.c
@@ -67,7 +67,12 @@ input_kex_c25519_init(int type, u_int32_t seq, void *ctxt)
int r;

/* generate private key */
+#if 0
kexc25519_keygen(server_key, server_pubkey);
+#else
+ explicit_bzero(server_key, sizeof(server_key));
+ explicit_bzero(server_pubkey, sizeof(server_pubkey));
+#endif
#ifdef DEBUG_KEXECDH
dump_digest("server private key:", server_key, sizeof(server_key));
#endif

With this patch in place, execs/second went to ~2,000 per core, which is a much better speed to be fuzzing at.

(EDIT 2017-03-25: As it turns out, this patch is not very good, because it causes a later key validity check to fail (dh_pub_is_valid() in input_kex_dh_init()). We could perhaps make dh_pub_is_valid() always return true, but then there is a question of whether this in turn makes something else fail down the line.)

Creating the first input testcases

Before we can start fuzzing for real, we have to create the first few input testcases. Actually, a single one is enough to get started, but if you know how to create different ones that take different code paths in the server, that may help jumpstart the fuzzing process.

The way I created the first testcase was to record the traffic from the client to the server using strace. Start the server without -i:

./sshd -d -e -p 2200 -r -f sshd_config
[...]
Server listening on :: port 2200.

Then start a client (using the ssh binary you’ve just compiled) under strace:

$ strace -e trace=write -o strace.log -f -s 8192 ./ssh -c none -p 2200 localhost

This should hopefully log you in (if not, you may have to fiddle with users, keys, and passwords until you succeed in logging in to the server you just started).

The first few lines of the strace log should read something like this:

2945  write(3, "SSH-2.0-OpenSSH_7.4\r\n", 21) = 21
2945 write(3, "\0\0\4|\5\24\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0010curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,ext-info-c\0\0\1\"ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ssh-ed25519,rsa-sha2-512,rsa-sha2-256,ssh-rsa\0\0\0\4none\0\0\0\4none\0\0\0\325umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\0\0\0\325umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\0\0\0\32none,zlib@openssh.com,zlib\0\0\0\32none,zlib@openssh.com,zlib\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 1152) = 1152
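
The second write above is the client’s SSH_MSG_KEXINIT packet in the SSH binary packet format (RFC 4253, section 6): a 4-byte big-endian packet length, a 1-byte padding length, then the payload, whose first byte is the message type. As a quick sanity check, here is a sketch decoding just the first six bytes shown above:

```python
import struct

# First six bytes of the client's second write: "\0\0\4|\5\24" in strace's
# escaping is 00 00 04 7c 05 14 in hex.
header = b"\x00\x00\x04\x7c\x05\x14"

packet_length = struct.unpack(">I", header[:4])[0]  # excludes the length field itself
padding_length = header[4]
msg_type = header[5]

print(packet_length)   # 1148 (+ 4 length bytes = the 1152 bytes written)
print(padding_length)  # 5
print(msg_type)        # 20 == SSH2_MSG_KEXINIT
```

Knowing this framing helps when inspecting or trimming testcases by hand: mutations only reach interesting parsing code if the length fields stay consistent with the payload.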

We see here that the client is communicating over file descriptor 3. You will have to delete all the writes happening on other file descriptors. Then take the strings and paste them into a Python script, something like:

import sys

for x in [
    b"SSH-2.0-OpenSSH_7.4\r\n",
    b"\0\0\4...",
    ...
]:
    # Write raw bytes; sys.stdout.buffer avoids any text-encoding mangling.
    sys.stdout.buffer.write(x)

When you run this, it will write to stdout a byte-perfect copy of everything that the client sent. Just redirect this to a file; that file will be your first input testcase.
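Instead of deleting lines and pasting strings by hand, the extraction can also be scripted. A rough sketch (the regular expression and helper name are mine, and it only handles strace’s standard C-style escaping):

```python
import codecs
import re

# Match lines like: 2945  write(3, "...", 1152) = 1152
WRITE_RE = re.compile(r'write\((\d+), "(.*)", \d+\)')

def extract_writes(log_text, fd=3):
    """Concatenate the payloads of all write() calls on file descriptor fd."""
    out = b""
    for line in log_text.splitlines():
        m = WRITE_RE.search(line)
        if m and int(m.group(1)) == fd:
            # Decode strace's C-style escapes (\r, \0, octal \325, ...) into raw bytes.
            out += codecs.escape_decode(m.group(2).encode())[0]
    return out

# Usage:
# with open("strace.log") as f:
#     open("testcase", "wb").write(extract_writes(f.read()))
```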

You can do a test run (without AFL) by passing the same data to the server again (this time using -i):

./sshd -d -e -p 2200 -r -f sshd_config -i < testcase 2>&1 > /dev/null

Hopefully it will show that your testcase replay was able to log in successfully.

Before starting the fuzzer you can also double check that the instrumentation works as expected using afl-analyze:

~/afl-2.39b/afl-analyze -i testcase -- ./sshd -d -e -p 2200 -r -f sshd_config -i

This may take a few seconds to run, but should eventually show you a map of the file and what it thinks each byte means. If there is too much red, that’s an indication you were not able to disable checksumming/encryption properly (maybe you have to make clean and rebuild?). You may also see other errors, including that AFL didn’t detect any instrumentation (did you compile sshd with afl-clang-fast?). This is general AFL troubleshooting territory, so I’d recommend checking out the AFL documentation.

Creating an OpenSSH dictionary

I created an AFL “dictionary” for OpenSSH, which is basically just a list of strings with special meaning to the program being fuzzed. I used a few of the strings found by running ssh -Q cipher, etc., to allow the fuzzer to insert these strings without having to discover them on its own (which is pretty unlikely to happen by chance).

s0="3des-cbc"
s1="aes128-cbc"
s2="aes128-ctr"
s3="aes128-gcm@openssh.com"
s4="aes192-cbc"
s5="aes192-ctr"
s6="aes256-cbc"
s7="aes256-ctr"
s8="aes256-gcm@openssh.com"
s9="arcfour"
s10="arcfour128"
s11="arcfour256"
s12="blowfish-cbc"
s13="cast128-cbc"
s14="chacha20-poly1305@openssh.com"
s15="curve25519-sha256@libssh.org"
s16="diffie-hellman-group14-sha1"
s17="diffie-hellman-group1-sha1"
s18="diffie-hellman-group-exchange-sha1"
s19="diffie-hellman-group-exchange-sha256"
s20="ecdh-sha2-nistp256"
s21="ecdh-sha2-nistp384"
s22="ecdh-sha2-nistp521"
s23="ecdsa-sha2-nistp256"
s24="ecdsa-sha2-nistp256-cert-v01@openssh.com"
s25="ecdsa-sha2-nistp384"
s26="ecdsa-sha2-nistp384-cert-v01@openssh.com"
s27="ecdsa-sha2-nistp521"
s28="ecdsa-sha2-nistp521-cert-v01@openssh.com"
s29="hmac-md5"
s30="hmac-md5-96"
s31="hmac-md5-96-etm@openssh.com"
s32="hmac-md5-etm@openssh.com"
s33="hmac-ripemd160"
s34="hmac-ripemd160-etm@openssh.com"
s35="hmac-ripemd160@openssh.com"
s36="hmac-sha1"
s37="hmac-sha1-96"
s38="hmac-sha1-96-etm@openssh.com"
s39="hmac-sha1-etm@openssh.com"
s40="hmac-sha2-256"
s41="hmac-sha2-256-etm@openssh.com"
s42="hmac-sha2-512"
s43="hmac-sha2-512-etm@openssh.com"
s44="rijndael-cbc@lysator.liu.se"
s45="ssh-dss"
s46="ssh-dss-cert-v01@openssh.com"
s47="ssh-ed25519"
s48="ssh-ed25519-cert-v01@openssh.com"
s49="ssh-rsa"
s50="ssh-rsa-cert-v01@openssh.com"
s51="umac-128-etm@openssh.com"
s52="umac-128@openssh.com"
s53="umac-64-etm@openssh.com"
s54="umac-64@openssh.com"

Just save it as openssh.dict; to use it, we will pass the filename to the -x option of afl-fuzz.
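
These entries are easy to regenerate for a different OpenSSH version. A sketch (the helper names are mine; the commented-out line assumes the ssh binary you built earlier):

```python
import subprocess

def make_dictionary(strings):
    """Format strings as AFL dictionary entries: s0="...", s1="...", ..."""
    return "\n".join('s%d="%s"' % (i, s)
                     for i, s in enumerate(sorted(set(strings))))

def collect_ssh_strings(ssh="./ssh"):
    """Collect algorithm names via ssh -Q (ciphers, MACs, key types, kex)."""
    names = []
    for query in ("cipher", "mac", "key", "kex"):
        out = subprocess.run([ssh, "-Q", query], capture_output=True, text=True)
        names += out.stdout.split()
    return names

# print(make_dictionary(collect_ssh_strings()))  # redirect to openssh.dict
```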

Running AFL

Whew, it’s finally time to start the fuzzing!

First, create two directories, input and output. Place your initial testcase in the input directory. Then, for the output directory, we’re going to use a little hack that I’ve found to speed up the fuzzing process and keep AFL from hitting the disk all the time: mount a tmpfs RAM-disk on output with:

sudo mount -t tmpfs none output/

Of course, if you shut down (or crash) your machine without copying the data out of this directory, it will be gone, so you should make a backup of it every once in a while. I personally just use a bash one-liner that just tars it up to the real on-disk filesystem every few hours.
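
The bash one-liner itself is trivial; for illustration, here is an equivalent periodic snapshot sketched in Python (the file names are arbitrary):

```python
import tarfile
import time

def backup_output(src="output", dest="output-backup.tar.gz"):
    """Snapshot the tmpfs-backed output directory to the on-disk filesystem."""
    with tarfile.open(dest, "w:gz") as tar:
        tar.add(src)

# Run in a spare terminal or screen window; snapshots every few hours.
# while True:
#     backup_output()
#     time.sleep(4 * 3600)
```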

To start a single fuzzer, you can use something like:

~/afl-2.39b/afl-fuzz -x openssh.dict -i input -o output -M 0 -- ./sshd -d -e -p 2100 -r -f sshd_config -i

Again, see the AFL docs on how to do parallel fuzzing. I have a simple bash script that just launches a bunch of the line above (with different values to the -M or -S option) in different screen windows.
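
For reference, a sketch of such launcher logic (the fuzzer names are my own choices; this version only prints the command lines instead of spawning screen windows):

```python
def afl_commands(n, afl="~/afl-2.39b/afl-fuzz", dictionary="openssh.dict"):
    """Build one master (-M) and n-1 secondary (-S) afl-fuzz command lines."""
    target = "./sshd -d -e -p 2100 -r -f sshd_config -i"
    cmds = []
    for i in range(n):
        role_flag = "-M" if i == 0 else "-S"  # one master, rest secondaries
        cmds.append("%s -x %s -i input -o output %s fuzzer%02d -- %s"
                    % (afl, dictionary, role_flag, i, target))
    return cmds

for cmd in afl_commands(4):
    print(cmd)
```

All instances share the same output directory, which is how parallel AFL fuzzers sync their findings with each other.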

Hopefully you should see something like this:

                         american fuzzy lop 2.39b (31)

┌─ process timing ─────────────────────────────────────┬─ overall results ─────┐
│        run time : 0 days, 13 hrs, 22 min, 40 sec     │  cycles done : 152    │
│   last new path : 0 days, 0 hrs, 14 min, 57 sec      │  total paths : 1577   │
│ last uniq crash : none seen yet                      │ uniq crashes : 0      │
│  last uniq hang : none seen yet                      │   uniq hangs : 0      │
├─ cycle progress ───────────────────┬─ map coverage ──┴───────────────────────┤
│  now processing : 717* (45.47%)    │    map density : 3.98% / 6.67%          │
│ paths timed out : 0 (0.00%)        │ count coverage : 3.80 bits/tuple        │
├─ stage progress ───────────────────┼─ findings in depth ─────────────────────┤
│  now trying : splice 4             │ favored paths : 117 (7.42%)             │
│ stage execs : 74/128 (57.81%)      │  new edges on : 178 (11.29%)            │
│ total execs : 74.3M                │ total crashes : 0 (0 unique)            │
│  exec speed : 1888/sec             │   total hangs : 0 (0 unique)            │
├─ fuzzing strategy yields ──────────┴─────────────────┬─ path geometry ───────┤
│   bit flips : n/a, n/a, n/a                          │    levels : 7         │
│  byte flips : n/a, n/a, n/a                          │   pending : 2         │
│ arithmetics : n/a, n/a, n/a                          │  pend fav : 0         │
│  known ints : n/a, n/a, n/a                          │ own finds : 59        │
│  dictionary : n/a, n/a, n/a                          │  imported : 245       │
│       havoc : 39/25.3M, 20/47.2M                     │ stability : 97.55%    │
│        trim : 2.81%/1.84M, n/a                       ├───────────────────────┘
└─────────────────────────────────────────────────────┘         [cpu015: 62%]

Crashes found

In about a day of fuzzing (even before disabling encryption), I found a couple of NULL pointer dereferences during key exchange. Fortunately, these crashes are not harmful in practice because of OpenSSH’s privilege separation code: at most we’re crashing an unprivileged child process and leaving a scary segfault message in the system log. The fix made it into CVS here: http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/ssh/kex.c?rev=1.131&content-type=text/x-cvsweb-markup.

Conclusion

Apart from the two harmless NULL pointer dereferences I found, I haven’t been able to find anything else yet, which seems to indicate that OpenSSH is fairly robust (which is good!).

I hope some of the techniques and patches I used here will help more people get into fuzzing OpenSSH.

Other things to do from here include doing some fuzzing rounds using ASAN or running the corpus through valgrind, although it’s probably easier to do this once you already have a good-sized corpus found without them, as both ASAN and valgrind carry a performance penalty.

It could also be useful to look into ./configure options to configure the build more like a typical distro build; I haven’t done anything here except to get it to build in a minimal environment.

Please let me know in the comments if you have other ideas on how to expand coverage or make fuzzing OpenSSH faster!

Thanks

I’d like to thank Oracle (my employer) for providing the hardware on which to run lots of AFL instances in parallel :-)


  1. Well, we can’t fix up signatures we don’t have the private key for, so in those cases we’ll just assume the attacker does have the private key. You can still do damage, e.g. in an otherwise locked-down environment; as an example, GitHub uses the SSH protocol to allow pushing to your repositories. These SSH accounts are heavily locked down, in that you can’t run arbitrary commands on them. In this case, however, we do have the secret key used to authenticate and sign messages.

March 26, 2017 10:07 AM