Kernel Planet

March 21, 2017

Matthew Garrett: Announcing the Shim review process

Shim has been hugely successful, to the point of being used by the majority of significant Linux distributions and many other third party products (even, apparently, Solaris). The aim was to ensure that it would remain possible to install free operating systems on UEFI Secure Boot platforms while still allowing machine owners to replace their bootloaders and kernels, and it's achieved this goal.

However, a legitimate criticism has been that there's very little transparency in Microsoft's signing process. Some people have waited for significant periods of time before receiving a response. A large part of this is simply that demand has been greater than expected, and Microsoft aren't in the best position to review code that they didn't write in the first place.

To that end, we're adopting a new model. A mailing list has been created at shim-review@lists.freedesktop.org, and members of this list will review submissions and provide a recommendation to Microsoft on whether these should be signed or not. The current set of expectations around binaries to be signed is documented here and the current process here - it is expected that this will evolve slightly as we get used to the process, and we'll provide a more formal set of documentation once things have settled down.

This is a new initiative and one that will probably take a little while to get working smoothly, but we hope it'll make it much easier to get signed releases of Shim out without compromising security in the process.

March 21, 2017 08:29 PM

Kernel Podcast: Linux Kernel Podcast for 2017/03/21

Audio: http://traffic.libsyn.com/jcm/20170321.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc3, this week’s exciting installment of “5-level paging weekly”, the 2038 doomsday compliance “statx” system call, and heterogeneous memory management. Also a summary of all ongoing active kernel development toward 4.12 onwards.

Linus Torvalds announced Linux 4.11-rc3. In his announcement, Linus noted that “rc3 is larger than rc2, but this is hopefully the point where things start to shrink and calm down. We had a late typo in rc2 that affected arm and powerpc (the prep code for the 5-level page tables [on x86 systems]), and hopefully there are no similar brown-paper-bugs in rc3.”

Announcements

Kent Overstreet announced the latest developments in Bcachefs, in a post entitled “Bcachefs – encryption, fsck, and more”. One of the key new features is that “We now have whole filesystem encryption – and this is modern authenticated encryption”. He notes that they can’t currently encrypt only part of the filesystem (as is the case, for example, with ext4 – as used on Android devices, and of course with Apple’s multi-layered iOS filesystem implementation) but “it’s more of a better dm-crypt” in removing the layers between the filesystem and the underlying hardware. He also notes that there’s a “New inode format”, and many other changes. Further details at: https://bcache.evilpiepirate.org/Bcachefs/

Hongbo Wang (Intel) announced the 2016-Q4 release of XenGT and 2016-Q4 release of KVMGT. These are both “full GPU virtualization solution[s] with mediated pass-through”…of the hardware graphics resources into guest virtual machines. Further information is available from Intel’s github: https://github.com/01org/ (igvtg-xen for the Xen tree, and igvtg-kernel, and igvtg-qemu for the pieces needed for KVM support)

Julia Cartwright announced the Linux preempt-rt (Real Time) stable kernel release 4.1.39-rt47.

Junio C Hamano announced Git v2.12.1. In his announcement, he noted that the tarballs “are NOT YET found at” the typical URL since “I am having trouble reaching there”. It’s unclear if this is due to recent changes in the architecture of kernel.org and its mirroring, or a local issue.

Intel 5-level paging

In this week’s episode of “merging Intel 5-level paging support” the fun but unexpected plot twist resulting in a “will it merge or not” cliffhanger comes from Linus. Kirill A. Shutemov (Intel) has been diligently posting this series for some time, and if you recall from last week’s episode, the foundational pieces needed to land this in 4.12 were merged after the closure of the 4.11 merge window following a special request from Linus. Kirill has since posted “x86: 5-level paging enabling for v4.12, Part 1”. In response to a comment from Kirill that “Let’s see if I’m on the right track addressing Ingo’s [Molnar’s] feedback”, Linus stated, “Considering the bug we just had with the HAVE_GENERIC_RCU_GUP code, I’m wondering if people would be willing to look at what it would take to make x86 use the generic version?”, and “The x86 version of __get_user_pages_fast() seems to be quite similar to the generic one. And it would be lovely if all the main architectures shared the same core gup code”.
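For context on what the extra level buys: on x86-64 with 4 KiB pages, each page-table level indexes 512 entries (9 bits of the virtual address) and the page offset takes 12 bits. The Python sketch below only does that arithmetic. The function names are made up for illustration and are not kernel identifiers.

```python
PAGE_OFFSET_BITS = 12   # 4 KiB pages
BITS_PER_LEVEL = 9      # 512 eight-byte entries fit in one 4 KiB table


def virtual_address_bits(levels):
    """Number of virtual address bits translated by `levels` page-table levels."""
    return PAGE_OFFSET_BITS + BITS_PER_LEVEL * levels


def table_indices(vaddr, levels):
    """Split an address into per-level table indices (top level first) and the page offset."""
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    indices = []
    for level in reversed(range(levels)):
        shift = PAGE_OFFSET_BITS + BITS_PER_LEVEL * level
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset
```

With four levels this gives 48 bits (256 TiB of virtual address space), and with five levels 57 bits (128 PiB).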

The Linux kernel implements a set of code functions for pinning of usermode (userspace) pages (the smallest granule size upon which contemporary hardware operates via a Memory Management Unit under the control of software provided and (co-)maintained “page tables”, and the size tracked by the Operating System in its page table management code) whenever they must be shared between userspace (which has dynamically pageable memory that can come and go as the kernel needs to free up RAM temporarily for other tasks by “paging” those pages out to “swap”) and code running within a kernel driver (the Linux kernel does not have pageable memory). GUP (get_user_pages) handles this operation, which takes a set of pointers to the individual pages that should be present and marked as in use. It has a variant usually referred to as “fast GUP” which aims to perform this operation without taking an expensive lock in the corresponding userspace processes’ “mm” struct (an object that forms part of a task’s – the in-kernel term for a process – metadata, and linked from the corresponding task_struct). Fast GUP doesn’t always work, but when it doesn’t need to fallback to an expensive slow path, it can save considerable time. So Linus was expressing a desire for x86 to share the same generic code as used by other architectures for this operation.

Linus further added three “subtle issues” that he saw with switching over x86 to the generic GUP code:

“(a) we need to make sure that x86 actually matches the required semantics for the generic GUP.

(b) we need to make sure the atomicity of the page table reads is ok.

(c) need to verify the maximum VM address properly”

He said “I _think_ (a) is ok”. But he wanted to see “real work to make sure” that (b) is “ok on 32-bit PAE”. PAE means Physical Address Extension, a mechanism used on certain 32-bit Intel x86 systems to address greater than a 32-bit physical address space by leveraging the fact that many individual applications don’t need larger than a 32-bit address space but that an overall system might in aggregate use multiple such 32-bit applications. It was a hack that bought time before the widespread adoption of the 64-bit architecture, and one that others (such as ARM) have implemented in a similar sense of end purpose in “LPAE” and friends as well. PAE moved the x86 architecture from 32-bit PTE (Page Table Entries) to 64-bit hardware entries, which means that on 32-bit systems there are real concerns around atomicity of updates to these structures without very careful handling. And as this author can attest, you don’t want to have to debug that situation.
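The atomicity concern can be illustrated with a toy model in Python. This is a simulation sketch, not kernel code: the class and function names are hypothetical, the "writer" is simply a callback that runs between the two 32-bit loads, and the re-check only helps under the assumption that every update changes the low word.

```python
MASK32 = 0xFFFFFFFF


class Pte64:
    """A 64-bit entry stored as two 32-bit words, as a 32-bit CPU would see it."""

    def __init__(self, value):
        self.low = value & MASK32
        self.high = (value >> 32) & MASK32


def one_shot_writer(pte, new_value):
    """Return a callback that updates the entry the first time it is called."""
    pending = [new_value]

    def run():
        if pending:
            value = pending.pop()
            pte.low = value & MASK32
            pte.high = value >> 32

    return run


def torn_read(pte, between_halves):
    """Naive read: two separate 32-bit loads with no consistency check."""
    low = pte.low
    between_halves()          # a concurrent writer may run here
    high = pte.high
    return (high << 32) | low


def checked_read(pte, between_halves):
    """Re-read the low word and retry if it changed while the high word was loaded."""
    while True:
        low = pte.low
        between_halves()
        high = pte.high
        if pte.low == low:
            return (high << 32) | low
```

In this model the naive read can return a value that is neither the old entry nor the new one, while the checked read retries until both halves belong to the same update.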

This discussion led Kirill to point out that there were some obvious looking bugs in the existing x86 GUP code that needed fixing for PAE anyway. The thread is ongoing, and Kirill is certain to be enjoying this week’s episode of “so you thought you were only adding 5-level paging?”. Michal Hocko noted that he had pulled the current version of the 5-level paging patch series into the mmotm (mm of the moment) VM (Virtual Memory) subsystem development tree as co-maintained with Andrew Morton and others.

Borislav Petkov posted “x86/mce: Handle broadcasted MCE gracefully with kexec” which (as we covered previously) seeks to handle the unfortunate case of an MCE (Machine Check Exception) on Intel x86 systems arriving during the process of handoff from the crash kernel into “purgatory” prior to the new kernel beginning. At this phase, the old kernel’s MCE handler is running and will never complete a synchronization with other cores in the system that are waiting in a holding spinloop (probably MWAIT one would assume) for the new kernel to take over.

statx

Various subsystems gained support for the new “statx” system call, which is part of the ongoing “Year 2038” doomsday avoidance work to prevent a Y2K style disaster when 32-bit Unix time wraps in 2038 (this being an actual potential “disaster” in the making, unlike the much hyped Y2K nonsense). Many of us have aspirations to be retired and living on boats by then, but this is neither assured, nor a prudent means to guarantee we won’t have to deal with this later (but presumably with at least some kind of lucrative consulting contract to bring us out of our early or late retirements).
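The 2038 date falls out of simple arithmetic on a signed 32-bit count of seconds since the Unix epoch. The Python sketch below works it out; the helper names are illustrative only and have nothing to do with the statx interface itself.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def to_datetime(seconds):
    """Convert seconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


def wrap_int32(seconds):
    """Reduce a second count to the range of a signed 32-bit integer."""
    return (seconds + 2**31) % 2**32 - 2**31


last_good = to_datetime(INT32_MAX)                    # 2038-01-19 03:14:07 UTC
one_later = to_datetime(wrap_int32(INT32_MAX + 1))    # wraps back to December 1901
```

A 64-bit second count, by contrast, has no such limit on any timescale that matters here.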

The “statx” call adds 64-bit timestamps and replaces “stat”. It also does a lot more than just “make large” (David Howells’s words) the various fields in the previous stat structures. The overall system call was covered much more generally by Linux Weekly News (which you should support as a purveyor of actual in-depth journalism on such topics) as recently as last week. Stafford Horne posted one example of the patches we refer to here, for the “asm-generic” reference includes used by emerging architectures, such as the OpenRISC architecture that he is maintaining. Another statx patch came from David Howells, for the ext4 filesystem, which led to a longer discussion of how to implement various underlying flag changes required to ext4.

Eric Biggers noted that David used the ext4_get_inode_flags function “to sync the generic inode flags (inode->i_flags) to the ext4-specific inode flags (ei->i_flags)” but that a problem can exist when doing this without holding an underlying lock due to “flag syncs…in both directions concurrently” which could “cause an update to be lost”. He walked through an example of how this could occur, and then suggested that for ->getattr() it might be easier to skip the call to the offending function and “instead populating the generic attributes like STATX_ATTR_APPEND and STATX_ATTR_IMMUTABLE from the generic inode flags, rather than from the ext4-specific flags?”. Andreas Dilger suggested the other way around, pulling the flags directly from the ext4 flags rather than the generic ones. He also raised the general question of “when/where are the VFS inode flags changed that they need to be propagated into the ext4 disk inode?”.
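The kind of lost update described here can be shown with a deliberately simplified Python model. This is not the ext4 code path: the class, the flag values, and the function names are all hypothetical, and the "other task" is a callback injected between the read and the write-back to make the interleaving deterministic.

```python
APPEND = 0x1
IMMUTABLE = 0x2


class Inode:
    """Toy inode holding two copies of the same flags."""

    def __init__(self, flags=0):
        self.generic_flags = flags   # stands in for the generic inode flags
        self.fs_flags = flags        # stands in for the filesystem-specific flags


def set_flag_everywhere(inode, flag):
    """Another task setting a flag in both copies."""
    inode.fs_flags |= flag
    inode.generic_flags |= flag


def sync_generic_to_fs_unlocked(inode, interleave=lambda: None):
    """Copy generic flags to the fs-specific copy without holding any lock."""
    snapshot = inode.generic_flags   # step 1: read
    interleave()                     # another task runs between the steps
    inode.fs_flags = snapshot        # step 2: write back a possibly stale value


inode = Inode(APPEND)
sync_generic_to_fs_unlocked(inode, lambda: set_flag_everywhere(inode, IMMUTABLE))
```

After this run the generic copy has both flags set, but the filesystem-specific copy has lost IMMUTABLE, because the stale snapshot overwrote it.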

Jan Kara replied that “you seem to be right. And actually I have checked and XFS does not bother to copy inode->i_flags to its on-disk flags so it seems generally we are not expected to reflect inode->i_flags in on-disk state”. Jan suggested to Andreas that it might be “better…to have ext4_quota_on() and ext4_quota_off() just update the flags as needed and avoid doing it anywhere else…I’ll have a look into it”.

Heterogeneous Memory Management

Jérôme Glisse posted version 18 of his patch series entitled “HMM (Heterogeneous Memory Management)” which aims to serve two generic use cases: “First it allows to use device memory transparently inside any process without modifications to process program code. Second it allows to mirror process address space on a device”. His intro described these summaries as a “Cliff node” (a brand of examination-time study materials often used by students for preparation), which led to an objection from Andrew Morton that “Cliff’s notes” “isn’t appropriate for a large feature such as this. Where’s the long-form description? One which permits readers to fully understand the requirements, design, alternative designs, the implementation, the interface(s), etc?”. He also asked for clarification of what was meant by “device memory” since “That’s very vague. What are the characteristics of this memory? Why is it a requirement that userspace code be unaltered? What are the security implications – does the process need particular permissions to access this memory? What is the proposed interface to set up this access?”

In a followup, Jérôme noted that he had previously given a longer form summary, which he attached, in the earlier revisions of the now version 18 patch series. In his summary, he makes clear his intent is to ease the overall management and programming of hybrid systems involving GPUs and other accelerators by introducing “a new kind of ZONE_DEVICE memory that does allow to allocate a struct page for each page of the device memory. Those page are special because the CPU can not map them. They however allow to migrate main memory to device memory using ex[]isting migration mechanism[s] and everything looks like it page was swap[ped] out to disk from CPU point of view. Using a struct page gives the easiest and cleanest integration with existing mm mechanisms”. He notes that he isn’t trying to solve other problems, and in fact one could summarize HMM using the buzzword du jour: “mediated”.

In an HMM world, devices and host-side application software can share what appears to them as a “unified” memory map. One in which pointer addresses from within an application can be dereferenced by code running on a GPU, and vice versa, through cunning use of page tables and a new underlying system framework for the device drivers touching the hardware. It’s not magic, but it does help to treat device memory “like regular memory” and accommodates “Advance in high level language construct (in C++ but others too) gives opportunities to compiler to leverage GPU transparently without programmer knowledge. But for this to happen we need a share[d] address space”.

This means that, if a host application (processor side of the equation) performs an access to part of a process (known as a “task” within the kernel) address space that is currently under control of a device, then the associated page fault will trigger generic framework code to handle handoff of that page back to the host CPU side. On the flip side, the framework still requires device drivers to use a new framework to manage their access to memory since few devices have generic page fault mechanisms today that can be leveraged to make this more transparent, and a lot of other device specific gunk is needed. It’s not a perfect solution, but it does arguably advance the state of the art, and is useful. Jérôme also states that “I do not wish to compete for the patchset with the highest revision count and i would like a clear cut position on w[h]ether it can be merge[d] or not. If not i would like to know why because i am more than willing to address any issues people might have. I just don’t want to keep submitting it over and over until i end up in hell…So please consider applying for 4.12”.

This author’s own personal opinion is that, while HMM is certainly useful, many such shared device/host memory situations can be greatly simplified by introducing coherent shared virtual memory between device and host. That model allows for direct address space sharing without some of the heavy lifting required in this patch set. Yet, as is noted in the posting, few devices today have such features (and there is no reason to presume that all future devices suddenly will implement shared virtual memory, nor that every device will want to expend the energy required to maintain coherent memory for communication). So the HMM patches provide a means of tracking who owns memory shared between device and “host”, and they exploit split device and “host” system page tables as well as associated faults to ensure pages are handed off as cleanly as can be achieved with technology available in the market today.

Ongoing Development

Michal Hocko posted a patch entitled “rework memory hotplug onlining”, which seeks to rework the semantics for memory hotplug since the current implementation is “awkward and hard/impossible to use from the udev to online memory as movable. The main problem is that only the last memblock or the adjacent to highest movable memblock can be onlined as movable”. He posted a number of examples showing how things fall down today, as well as a patch (“just for x86 now but I will address other arches once there is an agreement this is the right approach”) removing “all the zone specific operations from __add_pages (aka arch_add_memory) path. Instead we do page->zone association from move_pfn_range which is called from online_pages. This criterion for movable/normal zone association is really simple now. We just have to guarantee that zone Normal is always lower than zone Movable”. This led to a lengthy discussion around the ideal longer term approach and is likely to be a topic at the LSF/MM conference this week (one assumes?). [ It’s happening down the street from me…I’ll smile and wave at you 😉 ]

Gustavo Padovan posted “V4L2 explicit synchronization support”, an RFC (Request For Comments) that “adds support for Explicit Synchronization of shared buffers in V4L2” (Video For Linux 2, the general purpose video framework API used on Linux machines for certain multimedia purposes). This new RFC leverages the “Sync File Framework” as a means to “communicate the fences between kernel and userspace”. In English, what this means is that it’s often necessary to communicate using shared buffers between userspace, kernel, and hardware. And some (most) hardware might not guarantee that these buffers are fully coherent (observed identically between multiple concurrently operating agents that are manipulating it). The use of “fences” (barriers) enables explicit communication of certain points in time during which the state of a buffer is consistent and ready for access to be handed off between different parts of the system. The RFC is quite interesting and has a lot more detail, including the observation that it is intended to be a PoC (Proof of Concept) to get the conversation moving more than the eventual end result of that conversation that might actually be merged.

Wei Wang (Intel) posted a patch series entitled “Extend virtio-balloon for fast (de)inflating & fast live migration”. Balloons aren’t just helium filled goodies that all of us love to play with from a young age. Well, they are that, but, they’re also a concept applied to the memory management of virtual machines, which “inflate” the amount of memory available to them by requesting more from a hypervisor during their lifetime (that they might also return). In Linux, the same concept is applied to the migration of virtual machines, which can use the virtio-balloon abstraction over the virtio bus (a hypervisor communications channel) to transfer “guest unused pages to the host so that they can be skipped to migrate in live migration”. One of the patches in his version 3 series (patch number 3 of 4), entitled “mm: add in[t]erface to offer info about unused pages” had some detailed discussion with Michael S. Tsirkin commenting on better documentation and Andrew Morton suggesting that it might be better for the code to live in the virtio-balloon driver rather than being made too generic as its use case is very targeted.

Elena Reshetova continued her work toward conversion of Linux kernel subsystems to her newer “refcount” explicit reference counting API with a posting entitled “net subsystem refcount conversions”.

Suzuki K Poulose posted a bunch of patches implementing support for detection and reporting of new ARMv8.3 architecture features, including one patch that was entitled “arm64: v8.3: Support for Javascript conversion instruction” (which really means a new double precision float to integer conversion instruction that will likely be used by high performance JavaScript JITs…). He also posted “arm64: v8.3: Support for weaker release consistency”. The new revision of the architecture adds new instructions to “support Release Consistent processor consistent (RCpc) model, which is weaker than the RCsc [Release Consistent sequential consistency] model”. Listeners are encouraged to read the C++ memory model and other fascinating bedtime literature for much more detail on the available RC options.

Markus Mayer (Broadcom) posted “Basic divider clock”, an RFC which aims to provide a generic means of expressing clock dividers that can be leveraged in an embedded system’s “DeviceTree”, for which he also posted bindings (descriptions to be used in creating these textual description “trees”). Stephen Boyd pushed back that the community had so far avoided generic implementations but instead preferred to keep things at the level of having drivers that target certain hardware IP from certain vendors based upon the compatible matching strings.

Michael S. Tsirkin posted “kvm: better MWAIT emulation for guests”. We have previously explained this patchset and the dynamics of MWAIT implementations. His goal for this patch is to handle guests that assume the presence of the (x86) MWAIT feature, which isn’t present on all x86 CPUs. If you were running (for example) MacOS inside a VM on an x86 machine, it would generally assume the presence of MWAIT without checking for it, because it’s present in all x86-based Apple Macs. Emulating MWAIT is useful in such situations.

Romain Perier posted “Replace PCI pool by DMA pool API”. As he notes in his posting, “The current PCI pool API are simple macro functions direct expanded to the appropriate dma pool functions. The prototypes are almost the same and semantically, they are very similar. I propose to use the DMA pool API directly and get rid of the old API”.

Daeseok Youn posted “staging: atomisp: use k{v}zalloc instead of k{v}alloc and memset”. Alan Cox replied “…please don’t apply this. There are about five other layers of indirection for memory allocators that want removing first so that the driver just uses the correct kmalloc/kzalloc/kv* functions in the right places”. Now does seem like a good time not to add more layers.

Peter Zijlstra posted various “x86 optimizations” that aimed to “shrink the kernel and generate better code”.

March 21, 2017 04:22 PM

March 20, 2017

Matthew Garrett: Buying a Utah teapot

The Utah teapot was one of the early 3D reference objects. It's canonically a Melitta but hasn't been part of their range in a long time, so I'd been watching Ebay in the hope of one turning up. Until last week, when I discovered that a company called Friesland had apparently bought a chunk of Melitta's range some years ago and sell the original teapot[1]. I've just ordered one, and am utterly unreasonably excited about this.

[1] They have them in 0.35, 0.85 and 1.4 litre sizes. I believe (based on the measurements here) that the 1.4 litre one matches the Utah teapot.

March 20, 2017 08:45 PM

Dave Airlie: how close to conformant is radv?

I spent some time staring into the results of the VK-GL-CTS test suite on radv, which contains the Vulkan 1.0 conformance tests.

In order to be conformant you have to pass all the tests on the mustpass list for the Vulkan version you want to conform to, from the branch of the test suite for that version.

The latest CTS tests for 1.0 are on the vulkan-cts-1.0.2 branch, and the mustpass list is in external/vulkancts/mustpass/1.0.2/vk-default.txt

Using some WIP radv patches in my github radv-wip-conform branch and the 1.0.2 test suite, today's results are on my Tonga GPU:

Test run totals:
Passed: 82551/150950 (54.7%)
Failed: 0/150950 (0.0%)
Not supported: 68397/150950 (45.3%)
Warnings: 2/150950 (0.0%)

That is pretty conformant (in fact it would pass as-is). However I need to clean up the patches in the branch and maybe figure out how to do some bits properly without hacks (particularly some semaphore wait tweaks), but that is most of the work done.
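The totals above can be sanity-checked with a few lines of Python. The `would_pass` rule is an assumption drawn from the post's own reading of the numbers (no failures, with "not supported" results not counting against the run). It is not taken from the CTS documentation.

```python
# Counts copied from the test run totals above.
totals = {"passed": 82551, "failed": 0, "not_supported": 68397, "warnings": 2}


def percentages(counts):
    """Percentage of the whole run for each result category, to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * count / total, 1) for name, count in counts.items()}


def would_pass(counts):
    """Assumed rule: a run passes when no test on the mustpass list failed."""
    return counts["failed"] == 0
```

The four categories sum to the 150950 tests reported, and the rounded percentages match the figures in the run output.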

Thanks again to Bas and all other radv contributors.

March 20, 2017 07:26 AM

March 19, 2017

Pete Zaitcev: Standards for ARM computers in 2017

I wrote this 7 years ago, in 2009:

Until ARM comes up with a full computer instead of just a CPU, it's no contender in Linux server space.

So, how are things nowadays?

The final question had to do with cross-platform drivers. There is an interpreted executable format known as EFI Byte Code (EBC); drivers compiled to that format can run on multiple architectures. [...]

Graf asked whether drivers could, instead, be shipped as a multiple-architecture binary. Progress is being made in this direction, and EFI supports multiple binary formats.

Hopeless.

P.S. Jon reminded me about SBSA, and indeed it's a solid advancement. But using EFI drivers is an idea so monstrously dumb that I don't even know what to say.

March 19, 2017 04:34 PM

Pete Zaitcev: Standards for ARM computers and Linaro

A year ago I posted the following comment at LWN in the context of excitement about Yet Another Little Linux Server:

If you buy an Atom or Geode barebones, it's guaranteed to boot a normal Fedora or Ubuntu. If you buy ARM, you're a hostage of vendor support. If that fails (which is the default), you're all alone in the bizarre maze of incompatible bootloaders and out-of-tree patches which quickly become obsolete. Until ARM comes up with a full computer instead of just a CPU, it's no contender in Linux server space. Instead, the vicious cycle of "make a product, patch something to boot on it, ship it, forget about it immediately" will continue forever, with publications hyping up the next wonderful widget and the platform going nowhere.

It looks like someone else figured it out, ergo Linaro. Unfortunately, they do not seem to be eager to create a real platform, but rather slap a veneer of something OpenFirmware-like on top of existing systems. Also, they are buddying with Ubuntu. So, a half-hearted effort and a top-down deal. But it's a step in the right direction.

March 19, 2017 04:27 PM

March 17, 2017

Andi Kleen: Intel Processor Trace resources

Intel Processor Trace (PT) can be used on modern Intel CPUs to trace execution. This page contains references for learning about and using Intel PT.

Basic information:

Implementations

JTAG support

Other presentations

Usages

Research papers using PT (subset):

 

March 17, 2017 04:46 PM

Vegard Nossum: Fuzzing the OpenSSH daemon using AFL

American Fuzzy Lop is a great tool. It does take a little bit of extra setup and tweaking if you want to go into advanced usage, but mostly it just works out of the box.

In this post, I’ll detail some of the steps you need to get started with fuzzing the OpenSSH daemon (sshd) and show you some tricks that will help get results more quickly.

The AFL home page already displays 4 OpenSSH bugs in its trophy case; these were found by Hanno Böck who used an approach similar to that outlined by Jonathan Foote on how to fuzz servers with AFL.

I take a slightly different approach, which I think is simpler: instead of intercepting system calls to fake network activity, we just run the daemon in “inetd mode”. The inet daemon is not used very much anymore on modern Linux distributions, but the short story is that it sets up the listening network socket for you and launches a new process to handle each new incoming connection. inetd then passes the network socket to the target program as stdin/stdout. Thus, when sshd is started in inet mode, it communicates with a single client over stdin/stdout, which is exactly what we need for AFL.
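To make the inetd hand-off concrete, here is a minimal Python sketch (POSIX only) of the core trick: the connected socket becomes the child's stdin and stdout. It is an illustration rather than how inetd or sshd are implemented. `handle_connection` and `demo` are made-up names, a `socketpair()` stands in for a connection returned by `accept()`, and the child is a one-line Python program instead of sshd.

```python
import socket
import subprocess
import sys


def handle_connection(conn, argv):
    """Run argv with the connected socket as both its stdin and its stdout."""
    proc = subprocess.Popen(argv, stdin=conn, stdout=conn)
    conn.close()              # the child keeps its own copies of the descriptor
    return proc.wait()


def demo():
    # socketpair() stands in for listener.accept(); server_side plays the accepted connection.
    client, server_side = socket.socketpair()
    child = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.readline().upper())"]
    client.sendall(b"hello\n")
    status = handle_connection(server_side, child)
    reply = client.recv(1024)
    client.close()
    return status, reply
```

The child never touches the network API. It just reads stdin and writes stdout, which is why a program in this mode is easy to drive from a tool that feeds test cases on stdin.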

Configuring and building AFL

If you are just starting out with AFL, you can probably just type make in the top-level AFL directory to compile everything, and it will just work. However, I want to use some more advanced features, in particular I would like to compile sshd using LLVM-based instrumentation (which is slightly faster than the “assembly transformation by sed” that AFL uses by default). Using LLVM also allows us to move the target program’s “fork point” from just before entering main() to an arbitrary location (known as “deferred forkserver mode” in AFL-speak); this means that we can skip some of the setup operations in OpenSSH, most notably reading/parsing configs and loading private keys.

Most of the steps for using LLVM mode are detailed in AFL’s llvm_mode/README.llvm. On Ubuntu, you should install the clang and llvm packages, then run make -C llvm_mode from the top-level AFL directory, and that’s pretty much it. You should get a binary called afl-clang-fast, which is what we’re going to use to compile sshd.

Configuring and building OpenSSH

I’m on Linux so I use the “portable” flavour of OpenSSH, which conveniently also uses git (as opposed to the OpenBSD version which still uses CVS – WTF!?). Go ahead and clone it from git://anongit.mindrot.org/openssh.git.

Run autoreconf to generate the configure script. This is how I run the config script:

./configure \
CC="$PWD/afl-2.39b/afl-clang-fast" \
CFLAGS="-g -O3" \
--prefix=$PWD/install \
--with-privsep-path=$PWD/var-empty \
--with-sandbox=no \
--with-privsep-user=vegard

You obviously need to pass the right path to afl-clang-fast. I’ve also created two directories in the current (top-level OpenSSH) directory, install and var-empty. This is so that we can run make install without being root (although var-empty needs to have mode 700 and be owned by root) and without risking clobbering any system files (which would be extremely bad, as we’re later going to disable authentication and encryption!). We really do need to run make install, even though we’re not going to be running sshd from the installation directory. This is because sshd needs some private keys to run, and that is where it will look for them.

If everything goes well, running make should display the AFL banner as OpenSSH is compiled.

You may need some extra libraries (zlib1g-dev and libssl-dev on Ubuntu) for the build to succeed.

Run make install to install sshd into the install/ subdirectory (and again, please don’t run this as root).

We will have to rebuild OpenSSH a few times as we apply some patches to it, but this gives you the basic ingredients for a build. One particularly annoying thing I’ve noticed is that OpenSSH doesn’t always detect source changes when you run make (and so your changes may not actually make it into the binary). For this reason I just adopted the habit of always running make clean before recompiling anything. Just a heads up!

Running sshd

Before we can actually run sshd under AFL, we need to figure out exactly how to invoke it with all the right flags and options. This is what I use:

./sshd -d -e -p 2200 -r -f sshd_config -i

This is what it means:

-d
“Debug mode”. Keeps the daemon from forking, makes it accept only a single connection, and keeps it from putting itself in the background. All useful things that we need.
-e
This makes it log to stderr instead of syslog; this first of all prevents clobbering your system log with debug messages from our fuzzing instance, and also gives a small speed boost.
-p 2200
The TCP port to listen to. This is not really used in inetd mode (-i), but is useful later on when we want to generate our first input testcase.
-r
This option is not documented (or not in my man page, at least), but you can find it in the source code, which should hopefully also explain what it does: preventing sshd from re-execing itself. I think this is a security feature, since it allows the process to isolate itself from the original environment. In our case, it complicates and slows things down unnecessarily, so we disable it by passing -r.
-f sshd_config
Use the sshd_config from the current directory. This just allows us to customise the config later without having to reinstall it or be unsure about which location it’s really loaded from.
-i
“Inetd mode”. As already mentioned, inetd mode will make the server process a single connection on stdin/stdout, which is a perfect fit for AFL (as it will write testcases on the program’s stdin by default).

Go ahead and run it. It should hopefully print something like this:

$ ./sshd -d -e -p 2200 -r -f sshd_config -i
debug1: sshd version OpenSSH_7.4, OpenSSL 1.0.2g 1 Mar 2016
debug1: private host key #0: ssh-rsa SHA256:f9xyp3dC+9jCajEBOdhjVRAhxp4RU0amQoj0QJAI9J0
debug1: private host key #1: ssh-dss SHA256:sGRlJclqfI2z63JzwjNlHtCmT4D1WkfPmW3Zdof7SGw
debug1: private host key #2: ecdsa-sha2-nistp256 SHA256:02NDjij34MUhDnifUDVESUdJ14jbzkusoerBq1ghS0s
debug1: private host key #3: ssh-ed25519 SHA256:RsHu96ANXZ+Rk3KL8VUu1DBzxwfZAPF9AxhVANkekNE
debug1: setgroups() failed: Operation not permitted
debug1: inetd sockets after dupping: 3, 4
Connection from UNKNOWN port 65535 on UNKNOWN port 65535
SSH-2.0-OpenSSH_7.4

If you type some garbage and press enter, it will probably give you “Protocol mismatch.” and exit. This is good!

Detecting crashes/disabling privilege separation mode

One of the first obstacles I ran into was the fact that I saw sshd crashing in my system logs, but AFL wasn’t detecting them as crashes:

[726976.333225] sshd[29691]: segfault at 0 ip 000055d3f3139890 sp 00007fff21faa268 error 4 in sshd[55d3f30ca000+bf000]
[726984.822798] sshd[29702]: segfault at 4 ip 00007f503b4f3435 sp 00007fff84c05248 error 4 in libc-2.23.so[7f503b3a6000+1bf000]

The problem is that OpenSSH comes with a “privilege separation mode” that forks a child process and runs most of the code inside the child. If the child segfaults, the parent still exits normally, so it masks the segfault from AFL (which only observes the parent process directly).
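The masking effect can be reproduced without sshd at all. The following Python sketch is a toy model, not OpenSSH code: a "monitor" process forks a child that dies from SIGSEGV, reaps it, and still exits with status 0, which is all an outside observer such as AFL gets to see. The names MONITOR_CODE and observe_monitor are made up for this illustration, and it assumes a POSIX system where os.fork() is available.

```python
import signal
import subprocess
import sys

# Toy stand-in for a privilege-separated daemon (hypothetical, not sshd):
# the child crashes, the parent reaps it and exits cleanly.
MONITOR_CODE = r"""
import os, signal
pid = os.fork()
if pid == 0:
    os.kill(os.getpid(), signal.SIGSEGV)   # the child "crashes"
    os._exit(1)                            # not reached
_, status = os.waitpid(pid, 0)
# Report what really happened to the child, then exit normally.
print(os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0)
"""


def observe_monitor():
    """Run the toy monitor; return (monitor exit status, child's fatal signal)."""
    proc = subprocess.run([sys.executable, "-c", MONITOR_CODE],
                          capture_output=True, text=True)
    return proc.returncode, int(proc.stdout.strip())


if __name__ == "__main__":
    exit_status, child_signal = observe_monitor()
    print("observer sees exit status:", exit_status)
    print("child was killed by signal:", child_signal)
```

The observer only has the first value to go on, so a fuzzer watching the parent has no way to tell that anything went wrong in the child.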

In version 7.4 and earlier, privilege separation mode can easily be disabled by adding “UsePrivilegeSeparation no” to sshd_config or passing -o UsePrivilegeSeparation=no on the command line.

Unfortunately it looks like the OpenSSH developers are removing the ability to disable privilege separation mode in version 7.5 and onwards. This is not a big deal, as OpenSSH maintainer Damien Miller writes on Twitter: “the infrastructure will be there for a while and it’s a 1-line change to turn privsep off”. So you may have to dive into sshd.c to disable it in the future.

Reducing randomness

OpenSSH uses random nonces during the handshake to prevent “replay attacks” where you would record somebody’s (encrypted) SSH session and then you feed the same data to the server again to authenticate again. When random numbers are used, the server and the client will calculate a new set of keys and thus thwart the replay attack.

In our case, we explicitly want to be able to replay traffic and obtain the same result two times in a row; otherwise, the fuzzer would not be able to gain any useful data from a single connection attempt (as the testcase it found would not be usable for further fuzzing).
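To see why randomness and replay don’t mix, here is a small Python sketch of a toy handshake (deliberately not the SSH protocol): the server’s reply depends on a nonce, so the same recorded client bytes only give the same result if the nonce source is made predictable. The function names toy_server_reply and zero_random are invented for this example.

```python
import hashlib
import os


def zero_random(n):
    """Stand-in for a patched RNG: always returns n zero bytes."""
    return bytes(n)


def toy_server_reply(client_data, random_bytes=os.urandom):
    """Toy handshake (not SSH): the reply depends on a fresh server nonce."""
    nonce = random_bytes(16)
    return hashlib.sha256(nonce + client_data).hexdigest()


if __name__ == "__main__":
    recorded = b"recorded client traffic"
    # Real randomness: two replays of the same bytes disagree
    # (with overwhelming probability).
    print(toy_server_reply(recorded) == toy_server_reply(recorded))
    # Zeroed randomness: the replay is reproducible.
    print(toy_server_reply(recorded, zero_random) ==
          toy_server_reply(recorded, zero_random))
```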

There’s also the possibility that randomness introduces variabilities in other code paths not related to the handshake, but I don’t really know. In any case, we can easily disable the random number generator. On my system, with the configure line above, all or most random numbers seem to come from arc4random_buf() in openbsd-compat/arc4random.c, so to make random numbers very predictable, we can apply this patch:

diff --git openbsd-compat/arc4random.c openbsd-compat/arc4random.c
--- openbsd-compat/arc4random.c
+++ openbsd-compat/arc4random.c
@@ -242,7 +242,7 @@ void
 arc4random_buf(void *buf, size_t n)
 {
 	_ARC4_LOCK();
-	_rs_random_buf(buf, n);
+	memset(buf, 0, n);
 	_ARC4_UNLOCK();
 }
 # endif /* !HAVE_ARC4RANDOM_BUF */

One way to test whether this patch is effective is to try to packet-capture an SSH session and see if it can be replayed successfully. We’re going to have to do that later anyway in order to create our first input testcase, so skip below if you want to see how that’s done. In any case, AFL would also tell us using its “stability” indicator if something was really off with regards to random numbers (>95% stability is generally good, <90% would indicate that something is off and needs to be fixed).

Increasing coverage

Disabling message CRCs

When fuzzing, we really want to disable as many checksums as we can, as Damien Miller also wrote on Twitter: “fuzzing usually wants other code changes too, like ignoring MAC/signature failures to make more stuff reachable”. This may sound a little strange at first, but makes perfect sense: In a real attack scenario, we can always[1] fix up CRCs and other checksums to match what the program expects.

If we don’t disable checksums (and we don’t try to fix them up), then the fuzzer will make very little progress. A single bit flip in a checksum-protected area will just fail the checksum test and never allow the fuzzer to proceed.

We could of course also fix the checksum up before passing the data to the SSH server, but this is slow and complicated. It’s better to disable the checksum test in the server and then try to fix it up if we do happen to find a testcase which can crash the modified server.
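The effect of a checksum on a mutation-based fuzzer can be shown with a toy packet format in Python. This is only an illustration under made-up assumptions (a payload followed by a 4-byte CRC32, which is not the SSH packet layout), and the helper names are hypothetical: it flips every single bit of a valid packet and counts how many mutants get past the integrity check.

```python
import struct
import zlib


def make_packet(payload):
    """Toy packet: payload followed by its big-endian CRC32."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def reaches_parser(packet, verify_crc=True):
    """Return True if the packet gets past the (optional) CRC check."""
    if len(packet) < 4:
        return False  # too short to even contain a checksum
    payload = packet[:-4]
    stored = struct.unpack(">I", packet[-4:])[0]
    if verify_crc and zlib.crc32(payload) != stored:
        return False
    return True


def count_surviving_bit_flips(packet, verify_crc):
    """Flip each bit of the packet in turn; count mutants that reach the parser."""
    survivors = 0
    for bit in range(len(packet) * 8):
        mutant = bytearray(packet)
        mutant[bit // 8] ^= 1 << (bit % 8)
        survivors += reaches_parser(bytes(mutant), verify_crc)
    return survivors


if __name__ == "__main__":
    packet = make_packet(b"hello")
    print("with CRC check:   ", count_surviving_bit_flips(packet, True))
    print("without CRC check:", count_surviving_bit_flips(packet, False))
```

With the check enabled, no single-bit mutant survives, so the fuzzer never sees new coverage from them; with it disabled, every mutant reaches the parsing code. Note that the length test stays in place in both modes.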

The first thing we can disable is the packet CRC test:

diff --git a/packet.c b/packet.c
--- a/packet.c
+++ b/packet.c
@@ -1635,7 +1635,7 @@ ssh_packet_read_poll1(struct ssh *ssh, u_char *typep)

 	cp = sshbuf_ptr(state->incoming_packet) + len - 4;
 	stored_checksum = PEEK_U32(cp);
-	if (checksum != stored_checksum) {
+	if (0 && checksum != stored_checksum) {
 		error("Corrupted check bytes on input");
 		if ((r = sshpkt_disconnect(ssh, "connection corrupted")) != 0 ||
 		    (r = ssh_packet_write_wait(ssh)) != 0)

As far as I understand, this is a simple (non-cryptographic) integrity check meant just as a sanity check against bit flips or incorrectly encoded data.

Disabling MACs

We can also disable Message Authentication Codes (MACs), which are the cryptographic equivalent of checksums, but which also guarantee that the message came from the expected sender:

diff --git mac.c mac.c
index 5ba7fae1..ced66fe6 100644
--- mac.c
+++ mac.c
@@ -229,8 +229,10 @@ mac_check(struct sshmac *mac, u_int32_t seqno,
 	if ((r = mac_compute(mac, seqno, data, dlen,
 	    ourmac, sizeof(ourmac))) != 0)
 		return r;
+#if 0
 	if (timingsafe_bcmp(ourmac, theirmac, mac->mac_len) != 0)
 		return SSH_ERR_MAC_INVALID;
+#endif
 	return 0;
 }

We do have to be very careful when making these changes. We want to try to preserve the original behaviour of the program as much as we can, in the sense that we have to be very careful not to introduce bugs of our own. For example, we have to be very sure that we don’t accidentally skip the test which checks that the packet is large enough to contain a checksum in the first place. If we had accidentally skipped that, it is possible that the program being fuzzed would try to access memory beyond the end of the buffer, which would be a bug which is not present in the original program.

This is also a good reason to never submit crashing testcases to the developers of a program unless you can show that they also crash a completely unmodified program.

Disabling encryption

The last thing we can do, unless you wish to only fuzz the unencrypted initial protocol handshake and key exchange, is to disable encryption altogether.

The reason for doing this is exactly the same as the reason for disabling checksums and MACs, namely that the fuzzer would have no hope of being able to fuzz the protocol itself if it had to work with the encrypted data (since touching the encrypted data with overwhelming probability will just cause it to decrypt to random and utter garbage).

Making the change is surprisingly simple, as OpenSSH already comes with a pseudo-cipher that just passes data through without actually encrypting/decrypting it. All we have to do is to make it available as a cipher that can be negotiated between the client and the server. We can use this patch:

diff --git a/cipher.c b/cipher.c
index 2def333..64cdadf 100644
--- a/cipher.c
+++ b/cipher.c
@@ -95,7 +95,7 @@ static const struct sshcipher ciphers[] = {
 # endif /* OPENSSL_NO_BF */
 #endif /* WITH_SSH1 */
 #ifdef WITH_OPENSSL
-	{ "none", SSH_CIPHER_NONE, 8, 0, 0, 0, 0, 0, EVP_enc_null },
+	{ "none", SSH_CIPHER_SSH2, 8, 0, 0, 0, 0, 0, EVP_enc_null },
 	{ "3des-cbc", SSH_CIPHER_SSH2, 8, 24, 0, 0, 0, 1, EVP_des_ede3_cbc },
 # ifndef OPENSSL_NO_BF
 	{ "blowfish-cbc",

To use this cipher by default, just put “Ciphers none” in your sshd_config. Of course, the client doesn’t support it out of the box either, so if you make any test connections, you have to use the ssh binary compiled with the patched cipher.c above as well.

You may have to pass -o Ciphers=none from the client as well if it prefers to use a different cipher by default. Use strace or wireshark to verify that communication beyond the initial protocol setup happens in plaintext.

Making it fast

afl-clang-fast/LLVM “deferred forkserver mode”

I mentioned above that using afl-clang-fast (i.e. AFL’s LLVM deferred forkserver mode) allows us to move the “fork point” to skip some of the sshd initialisation steps which are the same for every single testcase we can throw at it.

To make a long story short, we need to put a call to __AFL_INIT() at the right spot in the program, so that everything which doesn’t depend on a specific input happens before it and all the testcase-specific handling happens after it. I’ve used this patch:

diff --git a/sshd.c b/sshd.c
--- a/sshd.c
+++ b/sshd.c
@@ -1840,6 +1840,8 @@ main(int ac, char **av)
 	/* ignore SIGPIPE */
 	signal(SIGPIPE, SIG_IGN);

+	__AFL_INIT();
+
 	/* Get a connection, either from inetd or a listening TCP socket */
 	if (inetd_flag) {
 		server_accept_inetd(&sock_in, &sock_out);

AFL should be able to automatically detect that you no longer wish to start the program from the top of main() every time. However, with only the patch above, I got this scary-looking error message:

Hmm, looks like the target binary terminated before we could complete a
handshake with the injected code. Perhaps there is a horrible bug in the
fuzzer. Poke <lcamtuf@coredump.cx> for troubleshooting tips.

So there is obviously some AFL magic code here to make the fuzzer and the fuzzed program communicate. After poking around in afl-fuzz.c, I found FORKSRV_FD, which is a file descriptor pointing to a pipe used for this purpose. The value is 198 (and the other end of the pipe is 199).

To try to figure out what was going wrong, I ran afl-fuzz under strace, and it showed that file descriptors 198 and 199 were getting closed by sshd. With some more digging, I found the call to closefrom(), which is a function that closes all inherited (and presumed unused) file descriptors starting at a given number. Again, the reason for this code to exist in the first place is probably in order to reduce the attack surface in case an attacker is able to gain control of the process. Anyway, the solution is to protect these special file descriptors using a patch like this:

diff --git openbsd-compat/bsd-closefrom.c openbsd-compat/bsd-closefrom.c
--- openbsd-compat/bsd-closefrom.c
+++ openbsd-compat/bsd-closefrom.c
@@ -81,7 +81,7 @@ closefrom(int lowfd)
 	while ((dent = readdir(dirp)) != NULL) {
 		fd = strtol(dent->d_name, &endp, 10);
 		if (dent->d_name != endp && *endp == '\0' &&
-		    fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp))
+		    fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp) && fd != 198 && fd != 199)
 			(void) close((int) fd);
 	}
 	(void) closedir(dirp);
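The same idea can be tried out in isolation with a short Python sketch. It mimics the /proc-based loop of the C code rather than reusing it, is Linux-only for that reason, and the helper names closefrom_keeping and is_open are invented for the example. The descriptor numbers 198 and 199 are the ones mentioned in the text.

```python
import os

# From the text above: AFL's forkserver pipe lives on fds 198 and 199.
FORKSRV_FDS = (198, 199)


def is_open(fd):
    """Return True if fd currently refers to an open file."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


def closefrom_keeping(lowfd, keep=FORKSRV_FDS):
    """Close every open fd >= lowfd except those in keep (needs Linux /proc)."""
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        if fd >= lowfd and fd not in keep:
            try:
                os.close(fd)
            except OSError:
                pass  # e.g. the fd used by listdir() itself, already closed


if __name__ == "__main__":
    r, w = os.pipe()
    os.dup2(r, 197)  # an ordinary inherited descriptor
    os.dup2(r, 198)  # pretend these two are the forkserver pipe
    os.dup2(w, 199)
    closefrom_keeping(190)
    print({fd: is_open(fd) for fd in (197, 198, 199)})
```

Descriptor 197 is closed like any other stray descriptor, while the two protected ones survive, which is what the forkserver handshake needs.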

Skipping expensive DH/curve and key derivation operations

At this point, I still wasn’t happy with the execution speed: Some testcases were as low as 10 execs/second, which is really slow.

I tried compiling sshd with -pg (for gprof) to try to figure out where the time was going, but there are many obstacles to getting this to work properly: First of all, sshd exits using _exit(255) through its cleanup_exit() function. This is not a “normal” exit and so the gmon.out file (containing the profile data) is not written out at all. Applying a source patch to fix that, sshd will give you a “Permission denied” error as it tries to open the file for writing. The problem now is that sshd does a chdir("/"), meaning that it’s trying to write the profile data in a directory where it doesn’t have access. The solution is again simple, just add another chdir() to a writable location before calling exit(). Even with this in place, the profile came out completely empty for me. Maybe it’s another one of those privilege separation things. In any case, I decided to just use valgrind and its “cachegrind” tool to obtain the profile. It’s much easier and gives me the data I need without hassles of reconfiguring, patching, and recompiling.

The profile showed one very specific hot spot, coming from two different locations: elliptic curve point multiplication.

I don’t really know too much about elliptic curve cryptography, but apparently it’s pretty expensive to calculate. However, we don’t really need to deal with it; we can assume that the key exchange between the server and the client succeeds. Similar to how we increased coverage above by skipping message CRC checks and replacing the encryption with a dummy cipher, we can simply skip the expensive operations and assume they always succeed. This is a trade-off; we are no longer fuzzing all the verification steps, but it allows the fuzzer to concentrate more on the protocol parsing itself. I applied this patch:

diff --git kexc25519.c kexc25519.c
--- kexc25519.c
+++ kexc25519.c
@@ -68,10 +68,13 @@ kexc25519_shared_key(const u_char key[CURVE25519_SIZE],

 	/* Check for all-zero public key */
 	explicit_bzero(shared_key, CURVE25519_SIZE);
+#if 0
 	if (timingsafe_bcmp(pub, shared_key, CURVE25519_SIZE) == 0)
 		return SSH_ERR_KEY_INVALID_EC_VALUE;

 	crypto_scalarmult_curve25519(shared_key, key, pub);
+#endif
+
 #ifdef DEBUG_KEXECDH
 	dump_digest("shared secret", shared_key, CURVE25519_SIZE);
 #endif
diff --git kexc25519s.c kexc25519s.c
--- kexc25519s.c
+++ kexc25519s.c
@@ -67,7 +67,12 @@ input_kex_c25519_init(int type, u_int32_t seq, void *ctxt)
 	int r;

 	/* generate private key */
+#if 0
 	kexc25519_keygen(server_key, server_pubkey);
+#else
+	explicit_bzero(server_key, sizeof(server_key));
+	explicit_bzero(server_pubkey, sizeof(server_pubkey));
+#endif
 #ifdef DEBUG_KEXECDH
 	dump_digest("server private key:", server_key, sizeof(server_key));
 #endif

With this patch in place, execs/second went to ~2,000 per core, which is a much better speed to be fuzzing at.

Creating the first input testcases

Before we can start fuzzing for real, we have to create the first few input testcases. Actually, a single one is enough to get started, but if you know how to create different ones taking different code paths in the server, that may help jumpstart the fuzzing process. A few possibilities I can think of:

The way I created the first testcase was to record the traffic from the client to the server using strace. Start the server without -i:

./sshd -d -e -p 2200 -r -f sshd_config
[...]
Server listening on :: port 2200.

Then start a client (using the ssh binary you’ve just compiled) under strace:

$ strace -e trace=write -o strace.log -f -s 8192 ./ssh -c none -p 2200 localhost

This should hopefully log you in (if not, you may have to fiddle with users, keys, and passwords until you succeed in logging in to the server you just started).

The first few lines of the strace log should read something like this:

2945  write(3, "SSH-2.0-OpenSSH_7.4\r\n", 21) = 21
2945 write(3, "\0\0\4|\5\24\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0010curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,ext-info-c\0\0\1\"ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ssh-ed25519,rsa-sha2-512,rsa-sha2-256,ssh-rsa\0\0\0\4none\0\0\0\4none\0\0\0\325umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\0\0\0\325umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\0\0\0\32none,zlib@openssh.com,zlib\0\0\0\32none,zlib@openssh.com,zlib\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 1152) = 1152

We see here that the client is communicating over file descriptor 3. You will have to delete all the writes happening on other file descriptors. Then take the strings and paste them into a Python script, something like:

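import sys

# Python 3: paste one string per write() call from the strace log.
# The b"" prefix keeps escapes such as \325 as single raw bytes.
for x in [
    b"SSH-2.0-OpenSSH_7.4\r\n",
    b"\0\0\4...",
    # ...
]:
    sys.stdout.buffer.write(x)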

When you run this, it will print a byte-perfect copy of everything that the client sent to stdout. Just redirect this to a file. That file will be your first input testcase.
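If the log is long, the copy-and-paste step can be automated. The following Python sketch is one possible way to do it and not the method used for this article: it assumes the log format shown above (an optional PID column, then write(fd, "...", len) = ret), relies on a Python bytes literal understanding the same C-style escapes that strace prints, and ignores unfinished or resumed calls. The names WRITE_RE and extract_writes are made up here.

```python
import ast
import re

# Matches strace lines such as:  2945  write(3, "SSH-2.0-...\r\n", 21) = 21
WRITE_RE = re.compile(
    r'^(?:\d+\s+)?write\((\d+), "((?:[^"\\]|\\.)*)"(?:\.\.\.)?, \d+\)\s*= (\d+)')


def extract_writes(log_text, fd=3):
    """Concatenate the data of all complete write() calls made on fd."""
    chunks = []
    for line in log_text.splitlines():
        m = WRITE_RE.match(line)
        if m is None or int(m.group(1)) != fd:
            continue
        # strace prints C-style escapes (\r, \n, \0, \325, ...); evaluating
        # the string as a bytes literal turns them back into raw bytes.
        data = ast.literal_eval('b"' + m.group(2) + '"')
        if len(data) != int(m.group(3)):
            # Most likely strace shortened the string; use a larger -s value.
            raise ValueError("incomplete write data: " + line[:40])
        chunks.append(data)
    return b"".join(chunks)
```

Calling extract_writes() on the contents of strace.log and writing the returned bytes to a file opened in binary mode should produce the same testcase as the manual approach. The length check is there so that a string truncated by strace raises an error instead of silently producing a broken testcase.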

You can do a test run (without AFL) by passing the same data to the server again (this time using -i):

./sshd -d -e -p 2200 -r -f sshd_config -i < testcase 2>&1 > /dev/null

Hopefully it will show that your testcase replay was able to log in successfully.

Before starting the fuzzer you can also double check that the instrumentation works as expected using afl-analyze:

~/afl-2.39b/afl-analyze -i testcase -- ./sshd -d -e -p 2200 -r -f sshd_config -i

This may take a few seconds to run, but should eventually show you a map of the file and what it thinks each byte means. If there is too much red, that’s an indication you were not able to disable checksumming/encryption properly (maybe you have to make clean and rebuild?). You may also see other errors, including that AFL didn’t detect any instrumentation (did you compile sshd with afl-clang-fast?). This is general AFL troubleshooting territory, so I’d recommend checking out the AFL documentation.

Creating an OpenSSH dictionary

I created an AFL “dictionary” for OpenSSH, which is basically just a list of strings with special meaning to the program being fuzzed. I just used a few of the strings found by running ssh -Q cipher, etc. to allow the fuzzer to use these strings without having to discover them all at once (which is pretty unlikely to happen by chance).

s0="3des-cbc"
s1="aes128-cbc"
s2="aes128-ctr"
s3="aes128-gcm@openssh.com"
s4="aes192-cbc"
s5="aes192-ctr"
s6="aes256-cbc"
s7="aes256-ctr"
s8="aes256-gcm@openssh.com"
s9="arcfour"
s10="arcfour128"
s11="arcfour256"
s12="blowfish-cbc"
s13="cast128-cbc"
s14="chacha20-poly1305@openssh.com"
s15="curve25519-sha256@libssh.org"
s16="diffie-hellman-group14-sha1"
s17="diffie-hellman-group1-sha1"
s18="diffie-hellman-group-exchange-sha1"
s19="diffie-hellman-group-exchange-sha256"
s20="ecdh-sha2-nistp256"
s21="ecdh-sha2-nistp384"
s22="ecdh-sha2-nistp521"
s23="ecdsa-sha2-nistp256"
s24="ecdsa-sha2-nistp256-cert-v01@openssh.com"
s25="ecdsa-sha2-nistp384"
s26="ecdsa-sha2-nistp384-cert-v01@openssh.com"
s27="ecdsa-sha2-nistp521"
s28="ecdsa-sha2-nistp521-cert-v01@openssh.com"
s29="hmac-md5"
s30="hmac-md5-96"
s31="hmac-md5-96-etm@openssh.com"
s32="hmac-md5-etm@openssh.com"
s33="hmac-ripemd160"
s34="hmac-ripemd160-etm@openssh.com"
s35="hmac-ripemd160@openssh.com"
s36="hmac-sha1"
s37="hmac-sha1-96"
s38="hmac-sha1-96-etm@openssh.com"
s39="hmac-sha1-etm@openssh.com"
s40="hmac-sha2-256"
s41="hmac-sha2-256-etm@openssh.com"
s42="hmac-sha2-512"
s43="hmac-sha2-512-etm@openssh.com"
s44="rijndael-cbc@lysator.liu.se"
s45="ssh-dss"
s46="ssh-dss-cert-v01@openssh.com"
s47="ssh-ed25519"
s48="ssh-ed25519-cert-v01@openssh.com"
s49="ssh-rsa"
s50="ssh-rsa-cert-v01@openssh.com"
s51="umac-128-etm@openssh.com"
s52="umac-128@openssh.com"
s53="umac-64-etm@openssh.com"
s54="umac-64@openssh.com"

Just save it as openssh.dict; to use it, we will pass the filename to the -x option of afl-fuzz.
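A dictionary in this format can also be generated rather than typed. The Python sketch below is only a suggestion: format_dictionary produces the s0="..." lines shown above, and query_ssh collects names by running ssh -Q for a few query types. The particular query list (cipher, mac, kex, key) and the ./ssh path are assumptions on my part, so adjust them to whatever your build supports.

```python
import subprocess


def format_dictionary(names):
    """Format names as AFL dictionary lines: s0="...", s1="...", and so on."""
    return "".join('s%d="%s"\n' % (i, name) for i, name in enumerate(names))


def query_ssh(ssh="./ssh", queries=("cipher", "mac", "kex", "key")):
    """Collect algorithm names by running `ssh -Q <query>` for each query."""
    names = set()
    for query in queries:
        out = subprocess.run([ssh, "-Q", query], capture_output=True,
                             text=True, check=True).stdout
        names.update(out.split())
    return sorted(names)
```

Writing format_dictionary(query_ssh()) to a file then gives a dictionary that matches the algorithms compiled into your own binary.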

Running AFL

Whew, it’s finally time to start the fuzzing!

First, create two directories, input and output. Place your initial testcase in the input directory. Then, for the output directory, we’re going to use a little hack that I’ve found to speed up the fuzzing process and keep AFL from hitting the disk all the time: mount a tmpfs RAM-disk on output with:

sudo mount -t tmpfs none output/

Of course, if you shut down (or crash) your machine without copying the data out of this directory, it will be gone, so you should make a backup of it every once in a while. I personally just use a bash one-liner that just tars it up to the real on-disk filesystem every few hours.
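For those who prefer a script over a one-liner, a backup step could look roughly like this in Python. It is a sketch and not the one-liner mentioned above; the function name, the archive naming scheme, and the directory layout are all assumptions.

```python
import os
import tarfile
import time


def backup_output(output_dir, backup_dir):
    """Write a timestamped .tar.gz of output_dir into backup_dir; return its path."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    path = os.path.join(backup_dir, "afl-output-%s.tar.gz" % stamp)
    # Store the files under the directory's own name inside the archive.
    arcname = os.path.basename(os.path.normpath(output_dir))
    with tarfile.open(path, "w:gz") as tar:
        tar.add(output_dir, arcname=arcname)
    return path
```

Calling backup_output("output", "backups") in a loop with a time.sleep() of a few hours between iterations gives the same effect, as long as the backup directory lives on a real on-disk filesystem.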

To start a single fuzzer, you can use something like:

~/afl-2.39b/afl-fuzz -x openssh.dict -i input -o output -M 0 -- ./sshd -d -e -p 2100 -r -f sshd_config -i

Again, see the AFL docs on how to do parallel fuzzing. I have a simple bash script that just launches a bunch of the line above (with different values to the -M or -S option) in different screen windows.
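As an illustration of what such a launcher has to produce, here is a Python sketch that builds one command line per instance, using -M for the first fuzzer and -S for the rest. It is not the bash script mentioned above; the function name and its default arguments are hypothetical, and it only prints the commands instead of starting them.

```python
import shlex


def afl_commands(count, afl_fuzz="afl-fuzz", dictionary="openssh.dict"):
    """Build argv lists for `count` parallel afl-fuzz instances."""
    target = ["./sshd", "-d", "-e", "-p", "2100", "-r",
              "-f", "sshd_config", "-i"]
    commands = []
    for i in range(count):
        role = "-M" if i == 0 else "-S"  # one master, the rest secondaries
        commands.append([afl_fuzz, "-x", dictionary, "-i", "input",
                         "-o", "output", role, str(i), "--"] + target)
    return commands


if __name__ == "__main__":
    for argv in afl_commands(4):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Each printed line can then be started in its own screen window or terminal.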

Hopefully you should see something like this:

                         american fuzzy lop 2.39b (31)

┌─ process timing ─────────────────────────────────────┬─ overall results ─────┐
│ run time : 0 days, 13 hrs, 22 min, 40 sec │ cycles done : 152 │
│ last new path : 0 days, 0 hrs, 14 min, 57 sec │ total paths : 1577 │
│ last uniq crash : none seen yet │ uniq crashes : 0 │
│ last uniq hang : none seen yet │ uniq hangs : 0 │
├─ cycle progress ────────────────────┬─ map coverage ─┴───────────────────────┤
│ now processing : 717* (45.47%) │ map density : 3.98% / 6.67% │
│ paths timed out : 0 (0.00%) │ count coverage : 3.80 bits/tuple │
├─ stage progress ────────────────────┼─ findings in depth ────────────────────┤
│ now trying : splice 4 │ favored paths : 117 (7.42%) │
│ stage execs : 74/128 (57.81%) │ new edges on : 178 (11.29%) │
│ total execs : 74.3M │ total crashes : 0 (0 unique) │
│ exec speed : 1888/sec │ total hangs : 0 (0 unique) │
├─ fuzzing strategy yields ───────────┴───────────────┬─ path geometry ────────┤
│ bit flips : n/a, n/a, n/a │ levels : 7 │
│ byte flips : n/a, n/a, n/a │ pending : 2 │
│ arithmetics : n/a, n/a, n/a │ pend fav : 0 │
│ known ints : n/a, n/a, n/a │ own finds : 59 │
│ dictionary : n/a, n/a, n/a │ imported : 245 │
│ havoc : 39/25.3M, 20/47.2M │ stability : 97.55% │
│ trim : 2.81%/1.84M, n/a ├────────────────────────┘
└─────────────────────────────────────────────────────┘ [cpu015: 62%]

Crashes found

In about a day of fuzzing (even before disabling encryption), I found a couple of NULL pointer dereferences during key exchange. Fortunately, these crashes are not harmful in practice because of OpenSSH’s privilege separation code, so at most we’re crashing an unprivileged child process and leaving a scary segfault message in the system log. The fix made it in CVS here: http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/ssh/kex.c?rev=1.131&content-type=text/x-cvsweb-markup.

Conclusion

Apart from the two harmless NULL pointer dereferences I found, I haven’t been able to find anything else yet, which seems to indicate that OpenSSH is fairly robust (which is good!).

I hope some of the techniques and patches I used here will help more people get into fuzzing OpenSSH.

Other things to do from here include doing some fuzzing rounds using ASAN or running the corpus through valgrind, although it’s probably easier to do this once you already have a good sized corpus found without them, as both ASAN and valgrind have a performance penalty.

It could also be useful to look into ./configure options to configure the build more like a typical distro build; I haven’t done anything here except to get it to build in a minimal environment.

Please let me know in the comments if you have other ideas on how to expand coverage or make fuzzing OpenSSH faster!

Thanks

I’d like to thank Oracle (my employer) for providing the hardware on which to run lots of AFL instances in parallel :-)


  1. Well, we can’t fix up signatures we don’t have the private key for, so in those cases we’ll just assume the attacker does have the private key. You can still do damage e.g. in an otherwise locked down environment; as an example, GitHub uses the SSH protocol to allow pushing to your repositories. These SSH accounts are heavily locked down, as you can’t run arbitrary commands on them. In this case, however, we do have the secret key used to authenticate and sign messages.

March 17, 2017 12:29 PM

March 15, 2017

Michael Kerrisk (manpages): man-pages-4.10 is released

I've released man-pages-4.10. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from over 40 contributors. This release sees a large number of changes: over 600 commits changing around 160 pages. The changes include the addition of 11 pages, significant rewrites of 3 other pages, and enhancements to many other pages.

Among the more significant changes in man-pages-4.10 are the following:

March 15, 2017 05:09 AM

March 14, 2017

Kernel Podcast: Kernel Podcast for March 13th, 2017

Audio: http://traffic.libsyn.com/jcm/20170313.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc2 (including pre-enablement for Intel 5-level paging), VMA based swap readahead, and ongoing development ahead of the next cycle.

Linus Torvalds announced Linux 4.11-rc2. In his announcement, he said that the past week had been “fairly quiet” because “people are still looking for bugs and taking a breather after the merge window”. But he also noted that “we’ve got a healthy number of fixes in, and there’s some cleanup/prep patches for the upcoming 5-level page table support that I took after the merge window just to make the next merge window easier”.

Various fixes and updates have been posted against the previous rc1, over the past week, including an urgent fix from Matthew (Willy) Wilcox for his idr rewrite in 4.11 (freeing the correct IDA bitmap).

Geert Uytterhoeven posted “Build regressions/improvements in v4.11-rc1”. This compared build error/warning regressions and improvements between v4.11-rc1 and v4.10. According to Geert, the 4.11-rc1 kernel saw an increase of 19 build errors and 1108 warnings when compared to 4.10.

Announcements

Jiri Slaby announced Linux 3.12.71, Greg Kroah-Hartman (KH) announced 4.4.53, 4.9.14, and 4.10.2 (which started a conversation about git tags being stale that we will address in a moment). Greg took the opportunity of various stable kernel work to prod the i915 graphics driver team with a message entitled “The i915 stable patch marking is totally broken”.

Sebastian Andrzej Siewior announced the v4.9.13-rt12 preempt-rt “Real Time” kernel patch set, which has a known issue that “CPU hotplug got a little better but can deadlock”, suggesting you might not want to try that then.

Julia Cartwright announced 4.1.38-rt46.

Steven Rostedt announced the 3.18.48-rt53 stable release of the RT kernel. He also announced the 3.10.105-rt119 and 3.2.86-rt124 releases.

Jari Ruusu announced “loop-AES-v3.7k file/swap crypto package”, which is available on sourceforge at: http://loop-aes.sourceforge.net/

Andy Lutomirski sent out detailed notes (along with a followup with yet more explanation) of the Intel SGX (“Secure Enclave”) feature discussion that occurred at Kernel Summit and Linux Plumbers Conference last fall. The thread is called “SGX notes from KS/LPC”. In the thread, he explains what SGX is (a small region of virtual memory within a Linux process – known as a task inside the kernel – that is not visible to the host OS after the enclave is “launched”) and how it can be used to hide certain data from system administrators or providers – for example, cryptographic keys that one would rather were not compromised. SGX comes with a litany of new requirements upon the Operating System that Andy covers, along with some guidelines for how to expose this feature, and how to make it usable.

Packet.net are now sponsoring the kernel.org project to the tune of various geo-diverse bare metal frontend systems in datacenters around the globe. Each of these (powerful) frontends provides read-only public access to kernel.org git repositories and the public website (git.kernel.org and www.kernel.org). More information, including machine specifications can be found here: https://www.kernel.org/fast-new-frontends-with-packet.html

(this came to light because of a brief outage affecting the Newark, NJ mirror which was lagging behind other mirrors in picking up new git tags pushed, but one hopes that an official announcement and thanks was otherwise forthcoming)

Masahiro Yamada has been added as a Kbuild (co-)maintainer.

Intel 5-level paging

Kirill A. Shutemov posted version 4 of his “5-level paging” patch series that implements support for the la57 (57 bit Virtual Address space for x64 Canonical Addressing) feature on some future CPUs. We covered the underlying patch series before, explaining the benefit of a larger (virtual) address space, as well as the additional complexities required to implement backward compatibility (including new prctls to limit the virtual address space of certain legacy applications), and the lack (so far) of boot time switching between 4-and-5-level support, which is seen as important for the distros.
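As a quick arithmetic check of the address-space sizes involved, the following Python sketch assumes the usual x86-64 layout of 9 index bits per page-table level plus a 12-bit offset for 4 KiB pages. The function name is made up for the example.

```python
def virtual_address_bits(levels, index_bits=9, page_offset_bits=12):
    """Virtual address width for a given number of page-table levels."""
    return levels * index_bits + page_offset_bits


if __name__ == "__main__":
    for levels in (4, 5):
        bits = virtual_address_bits(levels)
        size_tib = 2 ** bits // 2 ** 40
        print("%d levels: %d bits, %d TiB" % (levels, bits, size_tib))
```

Under these assumptions, four levels give 48 bits (256 TiB) and five levels give 57 bits (131072 TiB, i.e. 128 PiB) of virtual address space.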

Linus responded by saying that he thought “we should just aim for this being in 4.12” as he didn’t “see any real reason to delay merging it”. After some discussion about whose tree to merge it through, it was decided (by Thomas Gleixner) that it could come in through the “-tip” x86 tree. Which resulted in Linus pulling a preparatory “5-level paging: prepare generic code” patch series from Kirill into 4.11 (even after the merge window had closed) to lay the groundwork for pulling the main feature into the next (4.12) cycle. This promptly broke PowerPC, which was promptly fixed by a followup patch. Following the merge of enabling support in 4.11, Kirill posted “5-level paging enabling for v4.12” which aims to complete the merge next cycle.

The earlier version 4 iteration of the patch series noted that the Xen hypervisor currently doesn’t support 5-level paging and thus CONFIG_XEN is disabled automatically when building CONFIG_X86_5LEVEL. It was pointed out by Andrew Cooper that runtime (boottime) switching between 4 and 5 level support would be required in order to provide a clean experience, especially until Xen Dom0 support is available. That boottime switching is on the existing todo list and presumably is going to land at some point.

Separately, Dmitry Safonov posted version 6 of a patch series entitled “Fix compatible mmap() return pointer over 4Gb” which has “some minor conflicts with Kirill’s set for 5-table paging”. Dmitry aims to solve a slightly different problem than Kirill’s PR_{SET,GET}_MAX_VADDR calls (which limit the virtual address ranges returned by mmap to avoid legacy programs breaking when suddenly able to receive much larger “Canonical Addresses” – in Intel parlance – than they were compiled with built-in and broken assumptions about once upon a time) insomuch as he is focused on 32-bit legacy syscalls on 64-bit x64 not returning memory above 4GB that cannot be used by older 32-bit code.

VMA based swap readahead

Ying Huang (Intel) posted an RFC (Request For Comments) entitled “mm, swap: VMA based swap readahead” in which he discussed the current kernel paging implementation for Virtual Memory Areas (VMAs) as well as how it could be improved to facilitate greater awareness of the in-memory access patterns of associated data by changing the corresponding readahead algorithm.

“Readahead” as a concept is what it sounds like. Locality (both spatial, in this case, as well as temporal, in other cases) of data means that when a memory access occurs, it is usually more likely than not that an access to a nearby memory location will soon follow (except in the case of pure random access workloads). Thus, the kernel contains support for preloading nearby data when performing various disk and memory operations. Examples include readahead of nearby disk blocks when loading filesystem data, and loading nearby disk blocks when reading pages back in from swap.

VMAs (Virtual Memory Areas) are regions of memory managed by the Linux kernel. A running application (process), known as a “task” by the kernel, contains a large number of different VMAs which form its overall address space. You can see this by inspecting /proc/self/maps (replacing “self” with a process ID that you have access to). The output will show a series of memory regions representing various memory owned by the task. Memory that doesn’t represent files is known as “anonymous memory” and it is what is paged (swapped) out under memory pressure situations.
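To make the /proc/self/maps description concrete, here is a minimal sketch that parses the text of a maps file and separates file-backed regions from anonymous ones. The `Mapping` type and the helper names are made up for illustration, and treating unnamed regions plus `[heap]` and `[stack]` as "anonymous" is only a rough heuristic, not how the kernel itself classifies memory.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    start: int   # first address of the region
    end: int     # one past the last address of the region
    perms: str   # e.g. "r-xp"
    path: str    # backing file or pseudo-name; "" if the region is unnamed


def parse_maps(text: str) -> List[Mapping]:
    """Parse the contents of a /proc/<pid>/maps file into Mapping records."""
    mappings = []
    for line in text.splitlines():
        # Fields: address range, perms, offset, device, inode, optional path
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(int(start_s, 16), int(end_s, 16), fields[1], path))
    return mappings


def is_anonymous(m: Mapping) -> bool:
    """Rough heuristic (an assumption of this sketch): no backing file."""
    return m.path == "" or m.path in ("[heap]", "[stack]")


if __name__ == "__main__":
    # Linux only: inspect the mappings of this Python process itself.
    with open("/proc/self/maps") as f:
        for m in parse_maps(f.read()):
            kind = "anon" if is_anonymous(m) else "file"
            print(f"{kind} {m.end - m.start:>12} bytes {m.perms} {m.path}")
```

Running it on a Linux machine prints one line per region; the regions tagged "anon" are roughly the ones that are candidates for being swapped out.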

As Ying notes in his RFC, the “original swap readahead algorithm does readahead based on the consecutive blocks in [the] swap device” but “the consecutive blocks in [the] swap device just reflect the order of page reclaiming” and not necessarily “the access sequence in RAM”. His patch series aims to change this by teaching the readahead algorithm about VMAs and how to bias the readahead to sequentially walk through the address space of a task (process), reading those parts of the swap space containing this data rather than simply walking through swap sequentially.
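The difference between the two readahead orders can be illustrated with a toy model. This is not the kernel's algorithm or Ying's patch: the function name, the fixed window size, and the idea of counting "faults" are all assumptions made for the sketch. Pages get swap slots in the order they were reclaimed, and the task later touches its pages in address order.

```python
import random


def count_faults(reclaim_order, window, vma_based):
    """Count swap-in faults for a task touching its pages in address order.

    reclaim_order: page numbers in the order they were pushed out to swap,
                   so the page at index i lives in swap slot i.
    window:        how many pages one readahead brings in.
    vma_based:     True  -> read the next `window` pages by virtual address,
                   False -> read the next `window` consecutive swap slots.
    """
    page_in_slot = list(reclaim_order)
    slot_of = {page: slot for slot, page in enumerate(page_in_slot)}
    resident = set()
    faults = 0
    for page in sorted(page_in_slot):
        if page in resident:
            continue  # an earlier readahead already brought this page in
        faults += 1
        if vma_based:
            wanted = range(page, page + window)
            resident.update(p for p in wanted if p in slot_of)
        else:
            first = slot_of[page]
            resident.update(page_in_slot[first:first + window])
    return faults


if __name__ == "__main__":
    order = list(range(64))
    random.Random(0).shuffle(order)  # reclaim order unrelated to address order
    print("swap-slot readahead faults:", count_faults(order, 8, False))
    print("VMA-based readahead faults:", count_faults(order, 8, True))
```

In this model, when reclaim order matches address order the two strategies behave identically; once the orders diverge, reading by swap slot tends to pull in pages the task will not touch next.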

But wait! There’s more! Ying also posted a separate patch series entitled “THP swap: Delay splitting THP during swapping out”, which does what it sounds like it would do. THP (Transparent Huge Pages) is a technology used by the Linux kernel to dynamically allocate “huge” (optionally very large – up to 1GB in size, but in this case 2MB) pages of memory to contiguous regions of virtual memory address space, especially those backing shared large memory data (even including a huge zero page used for virtual machine RAM at boot). THP reduces pressure on limited CPU internal microarchitectural caches known as TLBs (Translation Lookaside Buffers) – as well as uTLBs at a lower level than the TLBs – which cache the translation performed by page table entries to physical or intermediate memory addresses. Reducing the number of TLB entries required to map regions of virtual memory reduces the number of times those entries must be evicted and reused by the underlying architecture during memory access operations.
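A quick back-of-the-envelope calculation shows why larger pages relieve TLB pressure. The sketch below assumes a 4KB base page size (typical on x86, but an assumption here) and counts one translation per page; the helper name is made up.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def translations_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of page translations needed to cover a region (rounded up)."""
    return -(-region_bytes // page_bytes)


if __name__ == "__main__":
    region = 1 * GIB
    for label, size in (("4KB", 4 * KIB), ("2MB", 2 * MIB), ("1GB", 1 * GIB)):
        print(f"{label} pages: {translations_needed(region, size)} translations")
```

Mapping 1GB takes 262144 translations with 4KB pages but only 512 with 2MB pages, which is the gap THP is trying to exploit.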

The existing Linux kernel THP code splits THPs back into smaller pages whenever they are swapped (paged) out to disk. Yet it turns out that this is particularly inefficient on contemporary systems in which secondary disk or NVMe storage has far greater bandwidth than a single high end core can saturate if forced to do this work. Ying’s patch instead delays this split and pushes entire THPs out to swap, allowing for larger writes and reads of contiguous memory out to the backing storage.

Ongoing Development

“David F” inquired about RAID mode support for Intel m.2 chipsets. These devices continue the recent-ish legacy of certain Intel storage devices providing dual modes of operation: as an AHCI device, and as a hardware RAID device operating in a proprietary mode for which no Linux drivers exist. David was quite concerned that the lack of a Linux driver was becoming particularly problematic on newer machines, which might not provide a means to switch into AHCI mode (supported by Linux). Christoph Hellwig was…unsympathetic…suggesting that the RAID mode “provides worse performance”, and that its implementation was questionable. He also had a series of other suggestions for what to do with these devices – those are less family friendly to repeat in this podcast.

Michal Hocko posted “kvmalloc” which is a generic replacement for the many “open coded kmalloc with vmalloc fallback instances in the tree”. k-and-vmalloc are two different means by which kernel code allocates memory. The former is used to obtain small allocations (on the order of a few pages – the minimal granule size operated on by the virtual memory subsystem of Linux on contemporary processors) that are also linearly contiguous in physical memory. The latter is for larger allocations of strictly “virtual” memory – contiguous only when accessed using the underlying Memory Management Unit to perform a translation (this is usually automatic for kernel code, since the kernel runs with virtual memory of its own, just like user processes do, but it can be problematic if a driver would like to use this memory for certain hardware operations, such as DMA transfers). The generic wrapper aims to clean up the common case that kernel code just wants a chunk of memory and will try to allocate it with kmalloc, but fall back to the more generic vmalloc if that fails.

Christian Konig (AMD) posted “PCI: add resizeable BAR infrastructure” (version 2, and later an update with some fixes in a version 3 also), which aims to add support to the kernel for a PCI SIG (Peripheral Component Interconnect Special Interest Group) ECN (Engineering Change Notice) that enables BARs (Base Address Registers) to be resized at runtime. PCI(e) BARs are mapping windows (apertures) in the system memory map that are used to talk to hardware add-on cards (or built-in devices within modern platforms) by determining where the device’s memory will live. Traditionally, BARs were fixed size and so on architectures not relying upon firmware configuration of underlying BARs, Linux would have to determine where to place certain PCI(e) resources at boot/hotplug time by checking how much memory a device needed to expose and programming the BARs. With the new extension comes the possibility to increase the size of a BAR to map larger regions of memory. This is a useful feature for graphics cards, which may want to map very large regions of memory. A subsequent patch wires up the AMD GPU driver to use this.

Javi Merino posted “Documentation/EDID fixes”, which aims to correct some broken assumptions in the kernel documentation for EDID (Extended Display Identification Data – the data provided over e.g. I2C from a VGA monitor when the cable is connected). The examples didn’t build correctly due to existing assumptions. This author is probably one of few people who always thinks of EDID and the interaction with Xorg every time he plugs in an external projector to his laptop.

David Howells posted “net: Work around lockdep limitation in sockets that use sockets” in which he corrected an erroneous assumption in the kernel “lockdep” (lock dependency checker) that prevented it from correctly identifying bad call chains involving TCP sockets when there exists a dependency between sockets created purely in the kernel and sockets created purely in userspace (which the lockdep could not distinguish between due to its use of broad lock classes). The AFS (Andrew File System) was generating a false lockdep warning because it was exposing such an implied dependency.

Charles Keepax posted “genirq: Add support for nested shared IRQs” to address an audio CODEC that also acts as an interrupt controller. The details sounded rather painful. Yet it was “fairly easy” to fix.

Steven Rostedt posted “tracing: Allow function tracing to start earlier in boot up”, which does roughly what it says on the can, “moving tracing up further in the boot process”, “right after memory is initialized”. He noted that his RFC was a start and could be further improved upon.

Matthew (Willy) Wilcox posted an RFC entitled “memset_l and memfill” that provides a generic means for architectures to provide optimized functions that “fill regions of memory with patterns larger than those contained in a single byte”. This is intended to be used by zram as well as other code.
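To illustrate what filling memory “with patterns larger than those contained in a single byte” means, here is a small sketch. It is not the kernel API from the RFC: `fill_pattern32` is a made-up name, and the choice of a 32-bit little-endian pattern is an assumption for the example.

```python
import struct


def fill_pattern32(buf: bytearray, value: int, count: int) -> None:
    """Write `count` copies of a 32-bit little-endian value at the start of buf.

    A plain byte-wise memset can only repeat a single byte; this repeats a
    four-byte pattern instead.
    """
    if 4 * count > len(buf):
        raise ValueError("buffer too small for requested fill")
    pattern = struct.pack("<I", value)
    buf[:4 * count] = pattern * count


if __name__ == "__main__":
    buf = bytearray(16)
    fill_pattern32(buf, 0xDEADBEEF, 3)
    print(buf.hex())
```

An architecture-optimized version would do the same job with wide stores rather than building the repeated pattern first, which is the point of letting each architecture supply its own implementation.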

Paul McKenney noticed some of his RCU torture tests failing during hotplug early in boot due to calls to smp_store_cpu_info during that operation. The call is not safe because it indirectly invokes schedule_work() which wants to use RCU prior to RCU being enabled as a side effect of dealing with an unstable TSC (Time Stamp Counter) on the afflicted CPU. Peter Zijlstra had an opinion on hotplug, and also a patch to handle this situation.

Vlad Zakharov posted “update timer frequencies”, which inquired about the best means to implement a cpufreq driver for ARC CPUs. These have a special property that “ARC timers (including those are used for timekeeping) are driven by the same clock as ARC CPU core(s)”. Yup, they change frequency according to the current CPU frequency. Which as Thomas Gleixner noted in response is “broken by design and you really should go and tell your hardware folks to fix that”. He added that “It’s well known for more than TWO decades that changing the frequency of the timekeeper clocksource is a complete disaster”.

Thomas Gleixner posted “kexec, x86/purgatory: Cleanup the unholy mess”, which aims to address his opinion that “the whole machinery is undocumented and lacks any form of forward declarations” (of variables which were previously global but had been made static). Purgatory is a special piece of code which is provided by the kernel but runs in the interim period between the kernel crashing (or beginning kexec) and the new crash or kexec kernel that is then subsequently loaded – this is what performs the load and exec.

March 14, 2017 06:36 PM

March 13, 2017

James Morris: LSM mailing list archive: this time for sure!

Following various unresolved issues with existing mail archives for the Linux Security Modules mailing list, I’ve set up a new archive here.

It’s a mailman mirror of the vger list.

March 13, 2017 10:20 PM

March 09, 2017

James Morris: Hardening the LSM API

The Linux Security Modules (LSM) API provides security hooks for all security-relevant access control operations within the kernel. It’s a pluggable API, allowing different security models to be configured during compilation, and selected at boot time. LSM has provided enough flexibility to implement several major access control schemes, including SELinux, AppArmor, and Smack.

A downside of this architecture, however, is that the security hooks throughout the kernel (there are hundreds of them) increase the kernel’s attack surface. An attacker with a pointer overwrite vulnerability may be able to overwrite an LSM security hook and redirect execution to other code. This could be as simple as bypassing an access control decision via existing kernel code, or redirecting flow to an arbitrary payload such as a rootkit.

Minimizing the inherent security risk of security features is, I believe, an essential goal.

Recently, as part of the Kernel Self Protection Project, support for marking kernel pages as read-only after init (ro_after_init) was merged, based on grsecurity/pax code. (You can read more about this in Kees Cook’s blog here). In cases where kernel pages are not modified after the kernel is initialized, hardware RO page protections are set on those pages at the end of the kernel initialization process. This is currently supported on several architectures (including x86 and ARM), with more architectures in progress.

It turns out that the LSM hook operations make an ideal candidate for ro_after_init marking, as these hooks are populated during kernel initialization and then do not change (except in one case, explained below). I’ve implemented support for ro_after_init hardening for LSM hooks in the security-next tree, aiming to merge it to Linus for v4.11.

Note that there is one existing case where hooks need to be updated, for runtime SELinux disabling via the ‘disable’ selinuxfs node. Normally, to disable SELinux, you would use selinux=0 at the kernel command line. The runtime disable feature was requested by Fedora folk to handle platforms where the kernel command line is problematic. I’m not sure if this is still the case anywhere. I strongly suggest migrating away from runtime disablement, as configuring support for it in the kernel (via CONFIG_SECURITY_SELINUX_DISABLE) will cause the ro_after_init protection for LSM to be disabled. Use selinux=0 instead, if you need to disable SELinux.

It should be noted, of course, that an attacker with enough control over the kernel could directly change hardware page protections. We are not trying to mitigate that threat here — rather, the goal is to harden the security hooks against being used to gain that level of control.

March 09, 2017 10:52 AM

Rusty Russell: Quick Stats on zstandard (zstd) Performance

Was looking at using zstd for backup, and wanted to see the effect of different compression levels. I backed up my (built) bitcoin source, which is a decent representation of my home directory, but only weighs in at 2.3GB. zstd -1 compressed it 71.3%, zstd -22 compressed it 78.6%, and here’s a graph showing runtime (on my laptop) and the resulting size:

zstandard compression (bitcoin source code, object files and binaries) times and sizes

For this corpus, sweet spots are 3 (the default), 6 (2.5x slower, 7% smaller), 14 (10x slower, 13% smaller) and 20 (46x slower, 22% smaller). Spreadsheet with results here.

March 09, 2017 12:53 AM

March 08, 2017

Matthew Garrett: The Internet of Microphones

So the CIA has tools to snoop on you via your TV and your Echo is testifying in a murder case and yet people are still buying connected devices with microphones in and why are they doing that the world is on fire surely this is terrible?

You're right that the world is terrible, but this isn't really a contributing factor to it. There's a few reasons why. The first is that there's really not any indication that the CIA and MI5 ever turned this into an actual deployable exploit. The development reports[1] describe a project that still didn't know what would happen to their exploit over firmware updates and a "fake off" mode that left a lit LED which wouldn't be there if the TV were actually off, so there's a potential for failed updates and people noticing that there's something wrong. It's certainly possible that development continued and it was turned into a polished and usable exploit, but it really just comes across as a bunch of nerds wanting to show off a neat demo.

But let's say it did get to the stage of being deployable - there's still not a great deal to worry about. No remote infection mechanism is described, so they'd need to do it locally. If someone is in a position to reflash your TV without you noticing, they're also in a position to, uh, just leave an internet connected microphone of their own. So how would they infect you remotely? TVs don't actually consume a huge amount of untrusted content from arbitrary sources[2], so that's much harder than it sounds and probably not worth it because:

YOU ARE CARRYING AN INTERNET CONNECTED MICROPHONE THAT CONSUMES VAST QUANTITIES OF UNTRUSTED CONTENT FROM ARBITRARY SOURCES

Seriously your phone is like eleven billion times easier to infect than your TV is and you carry it everywhere. If the CIA want to spy on you, they'll do it via your phone. If you're paranoid enough to take the battery out of your phone before certain conversations, don't have those conversations in front of a TV with a microphone in it. But, uh, it's actually worse than that.

These days audio hardware usually consists of a very generic codec containing a bunch of digital→analogue converters, some analogue→digital converters and a bunch of io pins that can basically be wired up in arbitrary ways. Hardcoding the roles of these pins makes board layout more annoying and some people want more inputs than outputs and some people vice versa, so it's not uncommon for it to be possible to reconfigure an input as an output or vice versa. From software.

Anyone who's ever plugged a microphone into a speaker jack probably knows where I'm going with this. An attacker can "turn off" your TV, reconfigure the internal speaker output as an input and listen to you on your "microphoneless" TV. Have a nice day, and stop telling people that putting glue in their laptop microphone is any use unless you're telling them to disconnect the internal speakers as well.

If you're in a situation where you have to worry about an intelligence agency monitoring you, your TV is the least of your concerns - any device with speakers is just as bad. So what about Alexa? The summary here is, again, it's probably easier and more practical to just break your phone - it's probably near you whenever you're using an Echo anyway, and they also get to record you the rest of the time. The Echo platform is very restricted in terms of where it gets data[3], so it'd be incredibly hard to compromise without Amazon's cooperation. Amazon's not going to give their cooperation unless someone turns up with a warrant, and then we're back to you already being screwed enough that you should have got rid of all your electronics way earlier in this process. There are reasons to be worried about always listening devices, but intelligence agencies monitoring you shouldn't generally be one of them.

tl;dr: The CIA probably isn't listening to you through your TV, and if they are then you're almost certainly going to have a bad time anyway.

[1] Which I have obviously not read
[2] I look forward to the first person demonstrating code execution through malformed MPEG over terrestrial broadcast TV
[3] You'd need a vulnerability in its compressed audio codecs, and you'd need to convince the target to install a skill that played content from your servers

March 08, 2017 01:30 AM

March 06, 2017

Kernel Podcast: Kernel Podcast for March 6th, 2017

Audio: http://traffic.libsyn.com/jcm/20170306.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc1, rants about folks not correctly leveraging linux-next, the remainder of this cycle’s merge window pulls, and announcements concerning end of life for some features.

Linus Torvalds announced Linux 4.11-rc1, noting that “two weeks have passed, the merge window is over, and 4.11 has been tagged and pushed out.” He notes that the latest kernel cycle is set to be “on the smallish side”, but that is only in comparison with the most recent two cycles, which have been significantly larger than typical. He notes that 4.11 has a similar number of commits to 4.1, 4.3, 4.5, and 4.7 before it. With the release of 4.11-rc1 comes the closing of the “merge window” (defined by it, the period of time during which disruptive changes are allowed into the kernel prior to RC).

We covered most of the major pulls for 4.11 in last week’s podcast. But there were a few more stragglers. Here’s a sample of those:

J. Bruce Fields posted “nfsd changes for 4.11” which included two semantic changes: NFS security labels are “now off by default” and a “new security_label export flag reenables it per export” since this “only makes sense if all your clients and servers have similar enough selinux policies”. Secondly, NFSv4/UDP support is off because “It was never really supported, and the spec explicitly forbids it. We only ever left it on out of laziness; thanks to Jeff Layton for finally fixing that.”

Anna Schumaker followed up a little later with “Please pull NFS client changes for Linux 4.11”, which includes a fix for a memory leak in “_nfs4_open_and_get_state”, as well as various other fixes and new features.

Matthew (Willy) Wilcox posted “Please pull IDR rewrite” which seeks to harmonize the IDR (“Small id to pointer translation service avoiding fixed sized tables”) and in-kernel radix tree code. According to Willy, merging the two codebases “lets us share the memory allocation pools, and results in a net deletion of 500 lines of code. It also opens up the possibility of exposing more of the features of the radix tree to users of the IDR”.

Will Deacon posted “arm64 fixes for -rc1” of which the “main fix here addresses a kernel panic triggered on Qualcomm QDF2400 due to incorrect register usage in an erratum workaround introduced during the merge window”.

Michael S. Tsirkin posted “vhost: cleanups and fixes”, of which there were very few for this kernel cycle.

Nicholas A. Bellinger posted “target updates for v4.11-rc1”, which includes support for “dual mode (initiator + target) qla2xxx operation”, and a number of other fixes and improvements. He pre-warns that things are “shaping up to be a busy cycle for v4.12 with a new fabric driver (efct) in flight, and a number of other patches on the list being discussed”.

Rafael J. Wysocki posted “Additional ACPI update for v4.11-rc1”, which includes a fix for “an apparent, but actually artificial, resource conflict between the ACPI NVS memory region and the ACPI BERT (Boot Error Record Table)”.

Jens Axboe posted “Block fixes for 4.11-rc1”, which includes a “collection of fixes for this merge window, either fixes for existing issues, or parts that were waiting for acks to come in”. These include a performance fix for the allocation of nvme queues on the right node, along with others.

Miklos Szeredi posted “fuse update for 4.11” and “overlayfs update for 4.11”. The latter “allows concurrent copy up of regular files eliminating [the] potential problem” of (previously) serialized copy ups taking a long time.

Bjorn Helgaas posted “PCI fixes for v4.11”, including a couple of fixes for bugs introduced during code refactoring.

Dan Williams posted “libnvdimm fixes for 4.11-rc1”, which includes a fix for the generation of “nvdimm namespace label” (metadata) checksums that “Linux was not calculating correctly leading to other environments rejecting the Linux label”.

Helge Deller posted “parisc updates for 4.11”, noting that there was “nothing really important” in this particular cycle to pull in.

James Bottomley posted “final round of SCSI updates for the 4.10+ merge window”, which “is the set of stuff that didn’t quite make the initial pull and a set of fixes for stuff which did”.

Radim Krcmar posted “Second batch of KVM changes for 4.11 merge window”, which includes a number of fixes for PPC and x86.

David Miller posted “Networking”, including many fixes.

A linux-next rant

In his 4.11-rc1 announcement, Linus noted that “it *does* feel like there was more stuff that I was asked to pull than was in linux-next. That always happens, but seems to have happened more now than usually. Comparing to the linux-next tree at the time of the 4.10 release, almost 18% of the non-merge commits were not in Linux-next. That seems higher than usual, although I guess Stephen Rothwell has actual numbers from past merges.” Let’s break down what Linus said a little. Stephen Rothwell is an (overworked) kernel hacker based in Australia who produces a (daily, outside of the merge window) kernel tree (and accompanying test infrastructure, patch tracking, and announcement mechanisms) known as “linux-next”. Its raison d’etre is to be the proving ground for new features before they are sent to Linus for merging.

Typically, major new features soak in linux-next for a cycle prior to the one in which they are actually merged (so features landing in 4.11 would have been largely complete and tested via -next during 4.10). Linux kernel development cycles are generally on the order of about two months, so this isn’t an unreasonably long period of time for disruptive changes to languish. Contrast this with the multi-year wait that used to happen back when Linux had an odd/even minor version cycle in which even numbers (2.2, 2.4, 2.6) were the “supported” releases and the odd numbers (2.1, 2.3, 2.5) were development ones. That seems like ancient history now, but it’s really only in the past decade of git that kernel development tooling and community has reached a level of sophistication that the ship can keep moving while the engine is replaced.

Linus noted that there are a “few different classes” of changes that didn’t come to him following a previous test in linux-next. Those include fixes (which is “obviously ok and inevitable”), a specific example (statx) for a longstanding issue that has been ongoing for years (to which he said, “Yeah, I’ll allow this one too”), and the “quite noticeable <linux/sched.h> split up series” which “had real reasons for late inclusion”. Finally, he includes the class of subsystems such as “drm, Infiniband, watchdog and btrfs”, which he “found rather annoying this merge window”. He reminded folks of the “linux-next sanity checks” and that if folks ignore them “you had better have your own sanity checks that you replaced them with” rather than “screw all the rules and processes we have in place to verify things”.

The bottom line? Linus says “You people know who you are. Next merge window I will not accept anything even remotely like that. Things that haven’t been in linux-next will be rejected, and since you’re already on my sh*t-list you’ll get shouted at again”. And nobody enjoys being shouted at by Linus. Well, almost nobody. There do seem to be a few people who perversely enjoy it.

Announcements

A couple of questions of code maintenance arose this week. The first was from Natale Patriciello, who asked whether UML (User Mode Linux) is “not maintained anymore?” by citing a few bugs that haven’t been resolved in some time. There were no followups at the time of this recording. The second question came in the form of an RFC (Request For Comments) patch entitled “remove support for AVR32 architecture” from Hans-Christian Noren Egtvedt. He noted that AVR32 is “not keeping up with the development of the kernel”, “shares so much of the drivers with Atmel ARM SoC”, and “all AVR32 AP7 SoC processors are end of lifed from Atmel (now Microchip)”. This did seem like a fairly compelling set of reasons to kill it, which others agreed with also. This means that unless someone comes forward soon to maintain AVR32 (along with the associated GCC toolchain and other distribution pieces), its days in the upstream Linux kernel are numbered, and it will probably be removed in 4.12.

Sebastian Andrzej Siewior announced Linux v4.9.13-rt11, which includes a fix for a previous fix (allowing the previous lockdep fix to compile on UP).

Drivers

Logan Gunthorpe posted “New Microsemi PCI Switch Management Driver”, which is in its 7th revision. The RFC (Request for Comments) “proposes a management driver for Microsemi’s Switchtec line of PCI switches. This hardware is still looking to be used in the Open Compute Platform”. Logan notes that “Switchtec products are compliant with the PCI specifications and are supported today with the standard in-kernel driver. However, these devices also expose a management endpoint on a separate PCI function address which can be used to perform some advanced operations”.

Ongoing Development

Michael S. Tsirkin continued his work on “vfio error recovery: kernel support” with version 4 of the patch series which seeks to do more than simply ignoring non-fatal PCIe AER (Advanced Error Reporting) errors that hit assigned devices passed using VFIO into a guest Virtual Machine. Currently, only fatal errors (which cause a PCIe link reset) are reported – they stop the guest. In his summary email, Michael notes that his goal is to handle non-fatal errors by reporting them to the guest and having it handle them. And rather than surprising existing code, he calls out under “issues” that “this behavior should only be enabled with new userspace, old userspace should work without changes”. By “userspace” he means the code driving VFIO, which might be a QEMU process that is backing a KVM virtual machine context, or a container, or merely a bare metal userspace process that is using VFIO directly.

Johannes Weiner posted “mm: kswapd spinning on unreclaimable nodes – fixes and cleanups” in which he notes a previous posting from Jia He that he (and the team at Facebook) have reproduced. In the problem scenario, the kernel’s kswapd (swap space daemon) for a given (memory) node spins indefinitely at 100% CPU usage when there are absolutely no reclaimable pages (granules of the smallest size of memory that can be managed by Linux and the underlying hardware), because the “condition for backing off is never met”. This results in kswapd busy-looping forever. In his patches, Johannes changes reclaim behavior so that kswapd will eventually really back off after failing 16 times (which is the same magic number of times we try during an OOM “Out Of Memory” situation) as defined by MAX_RECLAIM_RETRIES. He includes various examples.
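The back-off idea can be sketched as a bounded retry loop. This is a toy model rather than the kernel code: only the constant name and its value of 16 come from the description above, while the function, its return shape, and the choice to reset the failure count after a successful reclaim are assumptions of this sketch.

```python
MAX_RECLAIM_RETRIES = 16  # value quoted in the patch description


def run_reclaim(pages_reclaimed_per_attempt):
    """Walk a sequence of reclaim attempts and stop once we should back off.

    Returns (attempts_made, backed_off). A run of MAX_RECLAIM_RETRIES
    consecutive attempts that reclaim nothing triggers the back-off.
    Resetting the counter on success is an assumption of this sketch.
    """
    failures = 0
    attempts = 0
    for reclaimed in pages_reclaimed_per_attempt:
        attempts += 1
        if reclaimed > 0:
            failures = 0
            continue
        failures += 1
        if failures >= MAX_RECLAIM_RETRIES:
            return attempts, True
    return attempts, False


if __name__ == "__main__":
    # A node with nothing reclaimable: give up after 16 tries, not never.
    print(run_reclaim([0] * 1000))
```

Without the failure counter the loop over an unreclaimable node would simply never exit, which is the busy-loop described in the report.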

Len Brown posted “cpufreq: Add the "cpufreq.off=1" cmdline option”. This is a corollary to “cpuidle.off=1” and comes about for similar reasons, for the purpose of testing. This author wonders aloud whether this will allow for buggy platforms that don’t support CPPC (Collaborative Processor Performance Control) to easily disable this at runtime too.

Aleksey Makarov posted “printk: fix double printing with earlycon”. On ACPI compliant platforms (including ARM servers), the SPCR (“Serial Port Console Redirection”) table provides information about the serial console UART that the kernel should be using, rather than having the user provide memory register addresses and baud rates on the kernel command line. This is a feature which is generally useful beyond ARM systems (although most x86 systems follow the traditional “PC” UART design). Prior to this fix, the kernel would double print output if given a “console=” and “earlycon”.

Minchan Kim posted “make try_to_unmap simple” which aims to remove some of the (apparently somewhat gratuitous) complexity in the return value of this function. Currently it can return SWAP_SUCCESS, SWAP_FAIL, SWAP_AGAIN, SWAP_DIRTY, and SWAP_MLOCK. But Minchan feels that it can be simply a boolean return by removing the latter three of those return values.

Matthew Gerlach (Intel) posted “Altera Partial Reconfiguration IP”, which adds support to the kernel’s (Alan Tull’s) “fpga-mgr” driver for the “Altera Partial Reconfiguration IP”. Partial Reconfiguration (sometimes known as “PR” in the reconfigurable logic community) allows an FPGA (Field Programmable Gate Array)’s logic fabric to be reconfigured in smaller than whole regions. This (for example) would allow a closely coupled datacenter (Xeon) processor to continue to drive certain FPGA contained IP while other IP were being replaced dynamically. If one were to couple this with support in OpenStack Nomad or Kubernetes for dynamic reconfiguration at VM/container setup it would begin to enable various use cases for the mainstream datacenter around FPGA acceleration.

Andi Kleen posted “pci: Allow lockless access path to PCI mmconfig”. “mmconfig” refers to the memory mapped configuration region used by contemporary PCIe devices during enumeration and configuration. This is a kind of out-of-band mechanism by which the kernel can talk to PCIe devices in a fully standards compliant means prior to having configured them. Intel processors include many “PCIe” devices that are in fact a logical means of expressing so called “uncore” non-compute features on the processor SoC. They’re not real PCIe devices but appear to the kernel as such. This wonderful abstraction comes with some overhead cost, especially when the kernel spends time grabbing the “pci_cfg_lock” which it actually doesn’t need to hold, according to Andi.

Jarkko Sakkinen posted version 3 of “in-kernel resource manager”, which adds support to the kernel for “TPM spaces that provide an isolated execution context for transient objects and HMAC policy sessions”.

Tomas Winkler posted a question about what the community considered to be the “correct usage of arrays of variable length within [the] Linux kernel”. The replies generally included language to the effect of “don’t”, both for reasons of general language ugliness and because (especially in the case of local variables) the Linux kernel’s fixed (and also small) size stack raises serious potential for stack overflow if one is not careful. There was a suggestion that the kernel should be built with a compiler option to disallow VLAs, but that this would require various code to be fixed first.

March 06, 2017 12:53 AM

March 02, 2017

Pete Zaitcev: PTG, Sheraton, Atlanta

Serious reports trickle in (one, two), but on the lighter side, how was the venue? It's on the other side of the downtown, you know.

Bum incursions were very mild. Most went to the coffee area on the 1st floor, under the lobby level of Sheraton. One was remarkable though. I saw him practicing an unusual style of fighting on the sidewalk, with deep squats and wild swings - probably one of them prison styles. Very impressive, and also somewhat disturbing since all I have for him is checking with feet as well as I could, then move in for grappling phase, and then it's luck... Once done with his routine, he proceeded right past the coffee into the 2 rooms where some other org was meeting (not OpenStack) and started begging food off the hotel workers setting up tables. I heard him claiming that he was very hungry. Seemed super energetic and powerful a few minutes prior lol. But as far as I know, no laptops were stolen at the PTG (unlike e.g. OLS and UDS). Only goes to show that the main hazard is the venue staff and bums are more of an amusement, unless it's a really rough area.

March 02, 2017 01:51 AM

February 28, 2017

Kernel Podcast: Kernel Podcast for Feb 27th, 2017

Audio: http://traffic.libsyn.com/jcm/20170228.mp3

In this week’s kernel podcast: the merge window for kernel 4.11 is open and patches are flying into Linus’s inbox, fixing NUMA node determination at runtime, Virtual Machine Aware Caches, Advisory Memory Allocations, and a non-fixed TASK_SIZE to bring excitement to your life. We will have this, and a summary of ongoing development in this week’s Linux Kernel podcast.

The merge window (period of time during which disruptive changes are allowed to be “merged” – incorporated into Linus’s official git tree – prior to a multi-week stabilization and Release Candidate cycle) for Linux 4.11 is currently open. This means that the most recent official kernel remains Linux 4.10. Meanwhile, many “pull requests” and merges are in flight for various kernel subsystems planning updates in 4.11. These include:

For a detailed summary of current merge window pulls and patches, consult this week’s Linux Weekly News at LWN.net (Thursday).

Geert Uytterhoeven posted a summary of “Build regressions/improvements in v4.10”. These show an increase in build errors and warnings vs the previous 4.9 kernel cycle. He posted a list of configs used, the error and warning messages, and thanked the “linux-next team for providing the build service”.

Pavel Machek has been posting about various problems running 4.10 kernels. In one instance, he saw a corrupted stack that implied a double call to “startup_32_smp” (the secondary CPU boot method on Intel x64 Architecture). This led Josh Poimboeuf to ponder whether the GCC in use was somehow bad.

Announcements

Greg Kroah-Hartman announced Linux 4.4.52, 4.9.13, and 4.10.1. Ben Hutchings announced Linux 3.16.41, and 3.2.86.

Stephen Hemminger announced iproute2-4.10, including support for “new features in Linux 4.10”. Amongst those new features are “enhanced support for BPF [Berkeley Packet Filter], VRF [Virtual Routing and Forwarding], and Flow based classifier (flower)”. The latest version is available here: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.10.0.tar.gz

Karel Zak announced util-linux v2.29.2, including a fix for a (nasty) “su” security issue, otherwise documented in CVE-2017-2616. According to Karel, it is “possible for any local user to send SIGKILL to other processes with root privileges. To exploit this, the user must be able to perform su with a successful login. SIGKILL can only be send to processes which were executed after the su process. It is not possible to send SIGKILL to processes which were already running”. A fix entitled “properly clear child PID” against “su” is included among the fixes listed.

Lucas De Marchi announced kmod 24, which includes enhanced support for kernel module dependency loop detection: ftp://ftp.kernel.org/pub/linux/utils/kernel/kmod/kmod-24.tar.xz

Junio C Hamano announced git version 2.12.0: https://www.kernel.org/pub/software/scm/git/

Con Kolivas announced his Linux-4.10-ck1 MuQSS (Multiple Queue Skiplist Scheduler) version 0.152. More details at: http://ck.kolivas.org/patches/4.0/4.10/4.10-ck1/

Ove Kent Karlsen has been performing various Linux gaming experiments. They posted links to YouTube videos showing results with “Doom 3”, which can be found here: https://www.youtube.com/watch?v=xDct6vVvFxA

NUMA node determination

Dou Liyang (Fujitsu) posted several revisions of a patch series entitled “Revert works for the mapping of cpuid <-> nodeid”. This is intended to clean up the process by which (Intel x64 Architecture) systems enumerate the mapping of physical processor IDs to NUMA (Non-Uniform Memory Architecture) multi-socket “node” IDs. Conventionally, Linux uses the MADT (Multiple APIC Description Table – otherwise known as the “APIC” table for legacy reasons) ACPI table to map processors to their “Local APIC ID” (the ID of the core connected to the Intel APIC interrupt controller’s LAPIC CPU interface). It then maps these to NUMA nodes using the _PXM node ID in the ACPI DSDT (Differentiated System Description Table) and determines NUMA topology using the SRAT (Static Resource Affinity Table) and SLIT (System Locality Information Table). But this is fragile. Firmware developers are known to make mistakes on occasion, and these have included “duplicated processor IDs in DSDT”, and having the “_PXM in DSDT…inconsistent with the one in [the] MADT”. For this reason, Dou seeks to move the proximity discovery into the system’s hotplug path by reverting two previous commits. Xiaolong Ye (Intel) said he would test these and follow up.

As a footnote, it’s worth adding that modern processors have a very loose notion of a “physical” core, since they usually (internally) support dynamic remapping of true physical cores to the IDs exposed even to system programmers. This affords the illusion of contiguously numbered processors, and prevents an easy analysis of binning and yield characteristics. It’s one of the reasons that processors such as Intel’s use various mapping schemes in order to determine NUMA node proximity. But one should never assume that any information given about a processor in any table reflects reality other than as a microprocessor company wanted you to perceive it.
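Whatever mapping the kernel ends up with is visible from userspace through sysfs, where each /sys/devices/system/node/nodeN/cpulist file holds a range list such as 0-3,8-11. The following is a minimal sketch for inspecting it, assuming that standard sysfs layout; the helper names parse_cpulist and cpu_to_node are made up for illustration and are not part of Dou's series.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel range list such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means the node has no CPUs
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def cpu_to_node(sysfs_root="/sys/devices/system/node"):
    """Return {cpu: node} as reported by sysfs (empty if nothing is found)."""
    mapping = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        node = int(re.search(r"(\d+)$", path).group(1))
        try:
            with open(os.path.join(path, "cpulist")) as f:
                for cpu in parse_cpulist(f.read()):
                    mapping[cpu] = node
        except OSError:
            continue  # node directory without a readable cpulist
    return mapping
```

Comparing this output before and after a CPU hotplug event is one simple way to spot a firmware-induced mismatch of the kind described above.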

Virtual Machine Aware Caches

Shanker Donthineni (Codeaurora) posted “arm64: Add support for VMID aware PIPT instruction cache”. Caches on the ARMv8 architecture are defined to be PIPT (Physically Indexed, Physically Tagged) from a software perspective (although the underlying implementation might be different – for example, you could index virtually with VIPT underneath a PIPT facade if you implemented expensive logic for automatic homonym detection). The ARMv8.2 specification allows “VMID aware PIPT” which means a cache is PIPT but aware of the existence of Virtual Machine IDs (VMIDs), which might form part of the cache entry. Will Deacon responded that the approach “may well cause problems for KVM with non-VHE [Virtualization Host Extensions – the ability to run “type 2” hypervisors with split page tables for the kernel and userspace, as opposed to non-VHE implemented on original ARMv8.0 machines in which a shim running with its own page tables is required for KVM] because the host VMID is different from the guest VMID, yet we assume that I-cache invalidation by the host *will* affect the guest when, for example, invalidating the I-cache for pages holding the guest kernel Image”. He noted that he had some other patches in flight that he would post soon (for 4.12).

Advisory Memory Allocations in real life

Shaohua Li (Facebook) posted “mm: fix some MADV_FREE issues”. MADV_FREE is part of relatively recent(ish) kernel infrastructure to support advisory mmaps that the kernel may need to arbitrarily reclaim later when low on available memory. It’s the kind of thing that other Operating Systems (such as Windows) have done for many years (Windows will even dynamically enlarge its swap (paging) file on low memory situations). Facebook apparently like to use the (alternative) “jemalloc” userspace memory allocator and have found a number of issues when attempting to combine this with the MADV_FREE advice to madvise(). Shaohua notes that MADV_FREE cannot be used on a machine without swap enabled, that it actually increases memory pressure (due to page reclaim being biased against anonymous pages), and that it lacks global accounting. The patches aim to address these.
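For readers who have not used it, MADV_FREE is passed as advice to madvise() on an existing private anonymous mapping, which is roughly what an allocator such as jemalloc does with freed runs of pages. Here is a minimal sketch using Python's mmap module (madvise() is available from Python 3.8, and the MADV_FREE constant only where the platform defines it); the advise_free and demo helpers are hypothetical names for illustration.

```python
import mmap

# The constant is only exposed where the platform defines it.
MADV_FREE = getattr(mmap, "MADV_FREE", None)


def advise_free(region):
    """Mark the region's pages as disposable. Returns True if the advice was accepted."""
    if MADV_FREE is None or not hasattr(region, "madvise"):
        return False
    try:
        region.madvise(MADV_FREE)
    except OSError:
        return False  # e.g. a kernel that predates MADV_FREE
    return True


def demo(pages=16):
    # MADV_FREE applies to private anonymous memory, hence these flags.
    region = mmap.mmap(-1, pages * mmap.PAGESIZE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        region[:5] = b"stale"
        advised = advise_free(region)
        # From here the kernel may drop the pages whenever it likes; until
        # they are written again, a read may see old data or zeroes.
        region[:5] = b"fresh"  # writing makes the page "live" again
        return advised, region[:5]
    finally:
        region.close()
```

The point of the sketch is the contract rather than the call: the application keeps the address range but gives up any guarantee about its contents until the next write.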

Non-fixed TASK_SIZE

Martin Schwidefsky and Linus Torvalds had a back and forth discussion about “Using TASK_SIZE for kernel threads”. As kernel programmers know, kernel threads (“tasks”, or “kernel processes” – these show up in brackets in “ps” and “top”) don’t have an associated “mm” struct (they have no userspace). On s390, just to be different, TASK_SIZE is not fixed. It can actually be one of several values that are determined by reading a field in a task’s mm struct (context.asce_limit). This was causing very subtle breakage as the kernel indirected into a null structure which happened to contain a value very close to zero that kinda worked. Martin has a fix queued up but had some suggestions for changes to make to the kernel to avoid such a subtle issue in future. Linus was more convinced that s390 was just doing something that needed fixing.

Ongoing Development

Elena Reshetova (Intel) posted many patches converting various uses of the kernel’s “atomic_t” datatype as a reference counter over to the new “refcount_t”. As she notes, “[b]y doing this we prevent intentional or accidental underflows or overflows that can le[a]d to use-after-free vulnerabilities”. Examples include architecture and VM code fixes.
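The difference between the two types can be illustrated with a toy model. This is only a sketch of the idea (saturate instead of wrapping, and refuse to resurrect a zero count), not the kernel's implementation: the real refcount_t is C, uses atomic operations, and WARNs where this model raises. The names wrapping_inc and RefCount are made up for illustration.

```python
UINT_MAX = 2**32 - 1


def wrapping_inc(value):
    """atomic_t-style increment: silently wraps from UINT_MAX back to 0."""
    return (value + 1) & UINT_MAX


class RefCount:
    """Toy model of a saturating reference counter."""

    def __init__(self, value=1):
        self.value = value

    def inc(self):
        if self.value == 0:
            # A zero count means the object may already have been freed.
            raise RuntimeError("increment on zero: possible use-after-free")
        if self.value < UINT_MAX:
            self.value += 1
        # At UINT_MAX the counter stays pinned: the object is leaked, not freed.

    def dec_and_test(self):
        """Drop one reference; return True when the caller should free the object."""
        if self.value == UINT_MAX:
            return False  # saturated counters never reach zero
        if self.value == 0:
            raise RuntimeError("decrement on zero: underflow")
        self.value -= 1
        return self.value == 0
```

With the wrapping counter, an attacker who can force enough increments brings the count back to zero while references still exist; the saturating version trades that for a memory leak, which is the safer failure.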

Xunlei Pang (Red Hat) posted version 2 of a patch entitled “x86/mce: Don’t participate in rendezvous process once nmi_shootdown_cpus() was made”. This aims to juggle a post-crash conundrum: system errors severe enough to generate an MCE (Machine Check Exception) should not be ignored (and thus the machine check handler should run in the kernel) but they might be generated during the process of actively taking a crash/kdump. The existing code might instead cause a panic on exit from the (old kernel provided) MCE handler. Borislav Petkov didn’t like some of the details of the patch. He wanted to also see explicit documentation as to the handling of MCEs.

Andy Lutomirski posted “KVM TSS cleanups and speedups”, which aims to refactor how the kernel handles the guest TSS (Task State Segment) on Intel x64 Architecture systems. These are layered upon a series from Thomas Gleixner aimed at cleaning up GDT (Global Descriptor Table) use. He notes that there “may be a slight speedup, too, because they remove an STR [Store Task Register] instruction from the VMX [Virtual Machine Extensions] entry path”.

Heikki Krogerus posted version 17 of a patch series implementing “USB Type-C Connector class” support. This is “meant to provide [a] unified interface to…userspace to present the USB Type-C ports in a system”. Your author is looking forward to trying this on his Dell XPS Skylake with USB-C.

Rob Herring posted a patch “Add SPDX license tag check for dts files and headers” to the kernel’s “checkpatch.pl” patch submission checking tool.

Finally this week, Lorenzo Pieralisi posted “PCI: fix config and I/O Address space memory mappings” intended to address the inconvenient fact that “ioremap” on 32-bit and 64-bit ARM platforms was failing to strictly comply with the PCI local bus specification’s “Transaction Ordering and Posting” requirements. These mandate that PCI configuration cycles (during startup or hotplug) and I/O address space accesses must be “non-posted” (in other words, they must always receive a write notification response and not be buffered arbitrarily). Lorenzo addresses this with a 20 part patch series that cleans this up.

February 28, 2017 07:53 AM

Kees Cook: security things in Linux v4.10

Previously: v4.9.

Here’s a quick summary of some of the interesting security things in last week’s v4.10 release of the Linux kernel:

PAN emulation on arm64

Catalin Marinas introduced ARM64_SW_TTBR0_PAN, which is functionally the arm64 equivalent of arm’s CONFIG_CPU_SW_DOMAIN_PAN. While Privileged eXecute Never (PXN) has been available in ARM hardware for a while now, Privileged Access Never (PAN) will only be available in hardware once vendors start manufacturing ARMv8.1 or later CPUs. Right now, everything is still ARMv8.0, which left a bit of a gap in security flaw mitigations on ARM since CONFIG_CPU_SW_DOMAIN_PAN can only provide PAN coverage on ARMv7 systems, but nothing existed on ARMv8.0. This solves that problem and closes a common exploitation method for arm64 systems.

thread_info relocation on arm64

As done earlier for x86, Mark Rutland has moved thread_info off the kernel stack on arm64. With thread_info no longer on the stack, it’s more difficult for attackers to find it, which makes it harder to subvert the very sensitive addr_limit field.

linked list hardening

I added CONFIG_BUG_ON_DATA_CORRUPTION to restore the original CONFIG_DEBUG_LIST behavior that existed prior to v2.6.27 (9 years ago): if list metadata corruption is detected, the kernel refuses to perform the operation, rather than just WARNing and continuing with the corrupted operation anyway. Since linked list corruption (usually via heap overflows) is a common method for attackers to gain a write-what-where primitive, it’s important to stop the list add/del operation if the metadata is obviously corrupted.

seeding kernel RNG from UEFI

A problem for many architectures is finding a viable source of early boot entropy to initialize the kernel Random Number Generator. For x86, this is mainly solved with the RDRAND instruction. On ARM, however, the solutions continue to be very vendor-specific. As it turns out, UEFI is supposed to hide various vendor-specific things behind a common set of APIs. The EFI_RNG_PROTOCOL call is designed to provide entropy, but it can’t be called when the kernel is running. To get entropy into the kernel, Ard Biesheuvel created a UEFI config table (LINUX_EFI_RANDOM_SEED_TABLE_GUID) that is populated during the UEFI boot stub and fed into the kernel entropy pool during early boot.

arm64 W^X detection

As done earlier for x86, Laura Abbott implemented CONFIG_DEBUG_WX on arm64. Now any dangerous arm64 kernel memory protections will be loudly reported at boot time.

64-bit get_user() zeroing fix on arm

While the fix itself is pretty minor, I like that this bug was found through a combined improvement to the usercopy test code in lib/test_user_copy.c. Hoeun Ryu added zeroing-on-failure testing, and I expanded the get_user()/put_user() tests to include all sizes. Neither improvement alone would have found the ARM bug, but together they uncovered a typo in a corner case.

no-new-privs visible in /proc/$pid/status

This is a tiny change, but I like being able to introspect processes externally. Prior to this, I wasn’t able to trivially answer the question “is that process setting the no-new-privs flag?” To address this, I exposed the flag in /proc/$pid/status, as NoNewPrivs.
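Reading the new field from a script is straightforward. A minimal sketch, assuming the usual "Key:<tab>value" layout of /proc/$pid/status; the function names are made up for illustration:

```python
def no_new_privs(status_text):
    """Return True/False from a /proc/<pid>/status dump, or None if the field is absent."""
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key == "NoNewPrivs":
            return value.strip() == "1"
    return None  # kernels before v4.10 do not expose the field


def no_new_privs_for(pid="self"):
    """Read and parse the status file of the given process."""
    with open(f"/proc/{pid}/status") as f:
        return no_new_privs(f.read())
```

Calling no_new_privs_for(1234) on a v4.10 or later kernel answers the question above for process 1234; a None result means the kernel is too old to report the flag.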

That’s all for now! Please let me know if you saw anything else you think needs to be called out. :) I’m already excited about the v4.11 merge window opening…

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

February 28, 2017 06:31 AM

February 27, 2017

Dave Airlie: radv + steamvr

If anyone wants to run SteamVR on top of radv, the code is all public now.

https://github.com/airlied/mesa/tree/radv-wip-steamvr

The external memory code will be going upstream to master once I clean it up a bit, the semaphore hack is waiting on kernel changes, and the NIR shader hack is waiting on a new SteamVR build that removes the bad use of SPIR-V.

I've run Serious Sam TFE in VR mode on this branch.

February 27, 2017 07:42 PM

Matthew Garrett: The Fantasyland Code of Professionalism is an abuser's fantasy

The Fantasyland Institute of Learning is the organisation behind Lambdaconf, a functional programming conference perhaps best known for standing behind a racist they had invited as a speaker. The fallout of that has resulted in them trying to band together events in order to reduce disruption caused by sponsors or speakers declining to be associated with conferences that think inviting racists is more important than the comfort of non-racists, which is weird in all sorts of ways but not what I'm talking about here because they've also written a "Code of Professionalism" which is like a Code of Conduct except it protects abusers rather than minorities and no really it is genuinely as bad as it sounds.

The first thing you need to know is that the document uses its own jargon. Important here are the concepts of active and inactive participation - active participation is anything that you do within the community covered by a specific instance of the Code, inactive participation is anything that happens anywhere ever (ie, active participation is a subset of inactive participation). The restrictions based around active participation are broadly those that you'd expect in a very weak code of conduct - it's basically "Don't be mean", but with some quirks. The most significant is that there's a "Don't moralise" provision, which as written means saying "I think people who support slavery are bad" in a community setting is a violation of the code, but the description of discrimination means saying "I volunteer to mentor anybody from a minority background" could also result in any community member not from a minority background complaining that you've discriminated against them. It's just not very good.

Inactive participation is where things go badly wrong. If you engage in community or professional sabotage, or if you shame a member based on their behaviour inside the community, that's a violation. Community sabotage isn't defined and so basically allows a community to throw out whoever they want to. Professional sabotage means doing anything that can hurt a member's professional career. Shaming is saying anything negative about a member to a non-member if that information was obtained from within the community.

So, what does that mean? Here are some things that you are forbidden from doing:


Now, clearly, some of these are unintentional - I don't think the authors of this policy would want to defend the idea that you can't report something to the police, and I'm sure they'd be willing to modify the document to permit this. But it's indicative of the mindset behind it. This policy has been written to protect people who are accused of doing something bad, not to protect people who have something bad done to them.

There are other examples of this. For instance, violations are not publicised unless the verdict is that they deserve banishment. If a member harasses another member but is merely given a warning, the victim is still not permitted to tell anyone else that this happened. The perpetrator is then free to repeat their behaviour in other communities, and the victim has to choose between either staying silent or warning them and risk being banished from the community for shaming.

If you're an abuser then this is perfect. You're in a position where your victims have to choose between their career (which will be harmed if they're unable to function in the community) and preventing the same thing from happening to others. Many will choose the former, which gives you far more freedom to continue abusing others. Which means that communities adopting the Fantasyland code will be more attractive to abusers, and become disproportionately populated by them.

I don't believe this is the intent, but it's an inevitable consequence of the priorities inherent in this code. No matter how many corner cases are cleaned up, if a code prevents you from saying bad things about people or communities it prevents people from being able to make informed choices about whether that community and its members are people they wish to associate with. When there are greater consequences to saying someone's racist than them being racist, you're fucking up badly.

February 27, 2017 01:40 AM

February 26, 2017

Paul E. McKenney: Stupid RCU Tricks: What if I Knew Then What I Know Now?

During my keynote at the 2017 Multicore World, Mark Moir asked what I would have done differently if I knew then what I know now, with the “then” presumably being the beginning of the RCU effort back in the early 1990s. Because I got the feeling that my admittedly glib response did not fully satisfy Mark, I figured I should try again. So imagine that you traveled back in time to the very end of the year 1993, not long after Jack Slingwine and I came up with read-copy lock (now read-copy update, or just RCU), and tried to pass on a few facts about my younger self's future. The conversation might have gone something like this:

You  By the year 2017, RCU will be part of the concurrency curriculum at numerous universities and will be very well-regarded in some circles.
Me  Nice! That must mean that DYNIX/ptx will also be doing well!

You  Well, no. DYNIX/ptx will disappear by 2005, being replaced by the combination of IBM's AIX and another operating system kernel started as a hobby.
Me  AIX??? Surely you mean Solaris, HP-UX or Ultrix! And I wouldn't say that BSD started as a hobby! It was after all fully funded research.

You  No, Sun Microsystems was acquired by Oracle in 2010, and Solaris was already in decline by that time. IBM's AIX was by then the last proprietary UNIX operating system standing. A new open-source kernel called "Linux" became the dominant OS.
Me  IBM??? But they are currently laying off more people each month than Sequent employs worldwide!!! Why would they even still be in business in 2010?

You  True. But their new CEO, Louis Gerstner, will turn IBM around.
Me  Well, yes, he did just become IBM's CEO, but before that he was CEO of RJR Nabisco. That should work about as well as John Sculley's tenure as CEO of Apple. What does Gerstner know about computers, anyway?

You  He apparently knew enough to get IBM back on its feet. In fact, IBM will buy Sequent, so that you will become an IBM employee on April 1, 2000.
Me  April Fools day? Now I know you are joking!!!

You  No joke. You will become an IBM employee on April 1, 2000, seven years to the day after Louis Gerstner became an IBM employee.
Me  OK, I guess that explains why DYNIX/ptx doesn't make it past 2005. That is really annoying! So the teaching of RCU in universities is some sort of pity play, then?

You  No. Dipankar Sarma will get RCU accepted into Linux in 2002.
Me  I could easily believe that—he is very capable. So what do I do instead?

You  You will take over maintainership of RCU in 2005.
Me  Is Dipankar going to be OK?

You  Of course! He will just move on to other projects. It is just that there will be a lot more work needed on RCU, which you will take on.
Me  What more work could there be? It is a pretty simple mechanism, way simpler than a memory allocator, for example.

You  Well, there will be quite a bit of scalability work needed. For example, you will receive a scalability bug report involving a 512-CPU shared-memory system.
Me  Hmmm... It took Sequent from 1985 to 1997 to get from 30 to 64 CPUs, so that is doubling every 12 years, so I am guessing that I received this bug report somewhere near the year 2019. So what did I do in the meantime?

You  No, you will receive this bug report in 2004.
Me  512-CPU system in 2004??? Well, suspending disbelief, this must be why I will start maintaining RCU in 2005.

You  No, a quick fix will be supplied by a guy named Manfred Spraul, who writes concurrent Linux-kernel code as a hobby. So you didn't do the scalability work until 2008.
Me  Concurrent Linux-kernel coding as a hobby? That sounds unlikely. But never mind. So what did I do between 2005 and 2008? Surely it didn't take me three years to create a highly scalable RCU implementation!

You  You will work with a large group of people adding real-time capabilities to the Linux kernel. You will create an RCU implementation that allowed readers to be preempted.
Me  That makes absolutely no sense! A context switch is a quiescent state, so preempting an RCU read-side critical section would result in a too-short grace period. That most certainly isn't going to help anything, given that a crashed kernel isn't going to offer much in the way of real-time response!

You  I don't know the details, but you will make it work. And this work will be absolutely necessary for the Linux kernel to achieve 20-microsecond interrupt and scheduling latencies.
Me  Given that this is a general-purpose OS, you obviously meant 20 milliseconds!!! But what could RCU possibly be doing that would contribute significantly to a 20-millisecond interrupt/scheduling delay???

You  No, I really did mean sub-20-microsecond latencies. By 2010 or so, even vanilla non-realtime Linux kernel will easily meet 20-millisecond latencies, assuming the hardware and software is properly configured.
Me  Ah, got it! CPU core clock rates should be somewhere around 50GHz by 2010, which might well make those sorts of latencies achievable.

You  No, power-consumption and heat-dissipation constraints will cap CPU core clock frequencies at about 5GHz in 2003. Most systems will run in the 1-3GHz range even as late as in 2017.
Me  Then I don't see how a general-purpose OS could possibly achieve sub-20-microsecond latencies, even on a single-CPU system, which wouldn't have all that much use for RCU.

You  No, this will be on SMP systems. In fact, in 2012, you will receive a bug report complaining of excessively long 200-microsecond latencies on a system running 4096 CPUs.
Me  Come on! I believe that Amdahl's Law has something to say about lock contention on such large systems, which would rule out reasonable latencies, let alone 200-microsecond latencies! And there would be horrible reliability problems with that many CPUs! You wouldn't be able to keep the system running long enough to measure the latency!!!

You  Hey, I am just telling you what will happen.
Me  OK, so after I get RCU to handle insane scalability and real-time response, there cannot be anything left to do, right?

You  Actually, wrong. Energy efficiency becomes extremely important, and you will rewrite the energy-efficiency RCU code more than eight times before you get it right.
Me  Eight times??? You must be joking!!! Seems like it would be better to just waste a little energy. After all, computers don't consume all that much energy, especially compared to industrial and transportation systems.

You  No, that would not work. By 2005, there are quite a few datacenters that are limited by electrical power rather than by floor space. So much so that large data centers open in Eastern Oregon, on the sites of the old aluminum smelters. When you have that many servers, even a few percent of energy savings translates to millions of dollars a year, which is well worth spending some development effort on.
Me  That is an insanely large number of servers!!! How many Linux instances are running by that time, anyway?

You  By the mid-2010s, the number of Linux instances is well in excess of one billion, but no one knows the exact number.
Me  One billion??? That is almost one server for every family in the world! No way!!!

You  Well, most of the Linux instances are not servers. There are a lot of household appliances running Linux, to say nothing of battery-powered hand-held smartphones. By 2017, most of the smartphones will have multiple CPUs.
Me  Why on earth would you need multiple CPUs to make a phone call? And how would you fit multiple CPUs into a hand-held device? And where do you put the battery, in a large backpack or something???

You  No, the entire device, batteries, CPUs and all, will fit easily into your shirt pocket. And these smartphones can take pictures, record video, do video conference calls, find precise locations using GPS, translate among multiple languages, and much else besides. They are really full-fledged computers that fit in your pocket.
Me  A pocket-sized supercomputer??? And how would I possibly go about testing RCU code sufficiently for your claimed billion instances???

You  Interesting question. You will give a keynote at the 2017 Multicore World in February 2017 at Wellington, New Zealand describing some of your plans. These plans include the use of formal verification in your regression test suite.
Me  Formal verification of highly concurrent code in a regression test suite??? OK, now I know for sure that you are pulling my leg! It has been an interesting conversation, but I must get back to reality!!!


My 1993 self did not have a very accurate view of 2017, did he? As the old saying goes, predictions are hard, especially about the future! So it is quite wise to take such predictions with a considerable supply of salt.

February 26, 2017 11:09 PM

Pavel Machek: Using Linux notebook as an alarm clock

Is anyone using a notebook as an alarm clock? Yes, it would be easy if I did not suspend the machine overnight, but that would waste power and produce noise from fans. I'd like a version that suspends the machine...

February 26, 2017 10:32 PM

February 21, 2017

Pavel Machek: X220 to play with

Nice machine. Slightly bigger than X60, bezel around display way too big, but quite powerful. Biggest problem seems to be that it does not accept 9.5mm high drives...

I tried 4.10 there, and got two nasty messages during bootup. Am I the last one running 32 bit kernels?

I was hoping to get three-monitor configuration on my desk, but apparently X220 can not do that. xrandr reports 8 outputs (!), but it physically only has 3: LVDS, displayport and VGA. Unfortunately, it seems to only have 2 CRTCs, so only 2 outputs can be active at a time. Is there a way around that?

February 21, 2017 10:21 PM

Gustavo F. Padovan: Collabora Contributions to Linux Kernel 4.10

Linux Kernel v4.10 is out and this time Collabora contributed a total of 39 patches by 10 different developers. You can read more about the v4.10 merge window on LWN.net: part 1, part 2 and part 3.

Now here is a look at the changes made by Collaborans. To begin with Daniel Stone fixed an issue when waiting for fences on the i915 driver, while Emil Velikov added support to read the PCI revision for sysfs to improve the starting time in some applications.

Emilio López added a set of selftests for the Sync File Framework and Enric Balletbo i Serra added support for the ChromeOS Embedded Controller Sensor Hub. Fabien Lahoudere added support for the NVD9128 simple panel and enabled ULPI phy for USB on i.MX.

Gabriel Krisman fixed spurious CARD_INT interrupts for SD cards that were preventing one of our kernelCI machines from booting. On the graphics side Gustavo Padovan added Explicit Synchronization support to DRM/KMS.

Martyn Welch added GPIO support for the CP2105 USB serial device while Nicolas Dufresne fixed Exynos4 FIMC to round up imagesize to row size for tiled formats, otherwise there would not be enough space to fit the last row of the image. Last but not least, Tomeu Vizoso added a debugfs interface to capture frame CRCs, which is quite helpful for debugging and automated graphics testing.

And now the complete list of Collabora contributions:

Daniel Stone (1):

Emil Velikov (1):

Emilio López (7):

Enric Balletbo i Serra (3):

Fabien Lahoudere (4):

Gabriel Krisman Bertazi (1):

Gustavo Padovan (18):

Martyn Welch (1):

Nicolas Dufresne (1):

Tomeu Vizoso (2):

February 21, 2017 04:02 PM

February 20, 2017

Kernel Podcast: Kernel Podcast for Feb 20th, 2017

UPDATE: Thanks to LWN for the mention. This podcast is in “alpha”. It will start to show up on iTunes and Google Play (which didn’t exist last time I did this thing!) stores within the next day or two. You can also subscribe (for the moment) by using this link: kernel podcast audio rss feed. This podcast format will be tweaked, and the format/layout will very likely change a bit as I figure out what works, and what does not. Equipment just started to arrive at home (Zoom H4N Pro, condenser mics, etc.), a new content publishing platform needs to get built (I intend ultimately for listeners to help to create summaries by annotating threads as they happen). And yes, my former girlfriend will once again be reprising her role as author of another catchy intro jingle…soon 😉

Audio: Kernel Podcast 20170220

Support for this podcast comes from Jon Masters, trying to bring back the Kernel Podcast since 2012.

In this week’s edition: Linus Torvalds announces Linux 4.10, Alan Tull updates his FPGA manager framework, and Intel’s latest 5-level paging patch series is posted for review. We will have this, and a summary of ongoing development in the first of the newly revived Linux Kernel Podcast.

Linux 4.10

Linus Torvalds announced the release of 4.10 final, noting that “it’s been quiet since rc8, but we did end up fixing several small issues, so the extra week was all good”. Linus added a (relatively rare) additional “RC8” (Release Candidate 8) to this kernel cycle due to the timing – many of us were attending the “Open Source Leadership Summit” (OSLS, formerly “Linux Foundation Collaboration Summit”, or “Collab”) over the past week. The 4.10 kernel contains about 13,000 commits, which used to seem large but somehow now…isn’t. Kernelnewbies.org has the usual summary of new features and fixes: https://kernelnewbies.org/Linux_4.10

With the announcement of 4.10 comes the opening of the merge window for Linux 4.11 (the period of up to two weeks at the beginning of a development cycle, during which new features and disruptive changes are “pulled” into Linus’s kernel (git) tree). The 4.11 merge window begins today.

FPGA Manager Updates

Alan Tull posted a patch series implementing “FPGA Region enhancements and fixes”, which “intends to enable expanding the user of FPGA regions beyond device tree overlays”. Alan’s FPGA manager framework allows the kernel to manage regions within FPGAs (Field Programmable Gate Arrays) known as “partial reconfigurable” regions – areas of the logic fabric that can be loaded with new bitstream configs. Part of the discussion around the latest patches centered on their providing a new sysfs interface for loading FPGA images, and in particular the need to ensure that this ABI handle FPGA bitstream metadata in a standard and portable fashion across different OSes.

Intel 5-level paging

Kirill A. Shutemov posted version 3 of Intel’s 5 level paging patch series that expands the supportable VA (Virtual Address) space on Intel Architecture from 256TiB (64TiB physical) to 128PiB (4PiB physical). Channeling his inner Bill Gates, he suggests that this “ought to be enough for anybody”. Key among the TODO items remains “boot-time switch between 4 and 5-level paging” to avoid the need for custom kernels. The latest patches introduce two new prctl calls to manage the maximum virtual address space available to userspace processes during mmap calls (PR_SET_MAX_VADDR and PR_GET_MAX_VADDR). This is intended to aid in compatibility by preventing certain legacy programs from breaking when confronted with a 56-bit address space they weren’t expecting. In particular, some JITs use high order “canonical” bits in existing x86 addresses to encode pointer tags and other information (that they should not per a strict interpretation of Intel’s “Canonical Addressing”).

Announcements

Steven Rostedt announced various preempt-rt (“Real Time”) kernel trees (4.4.47-rt59, 4.1.38-rt45, 3.18.47-rt52, 3.12.70-rt94, and 3.10.104-rt118). Sebastian Andrzej Siewior also announced version v4.9.9-rt6 of the preempt-rt “Real Time” Linux patch series. It includes fixes for a spurious softirq wakeup, and a GPL symbol issue. A known issue is that CPU hotplug can still deadlock.

Junio C Hamano announced version v2.12.0-rc2 of git.

Bugfixes

Hoeun Ryu posted version 6 of a patch that takes care to properly free up virtually mapped (vmapped) stacks that might be in the kernel’s stack cache when cpus are offlined (otherwise the kernel was leaking these during offline/online operations).

New Drivers

Mahipal Challa posted version 2 of a patch series implementing a compression driver for the Cavium ThunderX “ZIP” IP on their 64-bit ARM server SoC (System-on-Chip) to plumb into the kernel cryptoapi.

Anup Patel posted version 3 of a patch implementing RAID offload support for the Broadcom “SBA” RAID device on their SoCs.

Ongoing Development

Andi Kleen posted various perf vendor events for Intel uncore devices, Kan Liang posted new core events for Intel Goldmont, and Srinivas Pandruvada posted perf events for Intel Kaby Lake.

Velibor Markovski (Broadcom) posted a patch implementing ARM Cache Coherent Network (CCN) 502 support.

Sven Schmidt posted version 7 of a patch series updating the LZ4 compression module to support a mode known as “LZ4 fast”, in particular for the benefit of its use by the lustre filesystem.

Zhou Xianrong posted a patch (for the ARM Architecture) that attempts to save kernel memory by freeing parts of the linear memmap for physical PFNs (page frame numbers) that are marked reserved in a DeviceTree. This had some pushback. The argument is that it saves memory on resource constrained machines – 6MB of RAM in the example.

Jessica Yu (who took over maintaining the in-kernel module loader infrastructure from Rusty Russell some time back) posted a link to her module-next tree in the kernel MAINTAINERS document.

Bhupesh Sharma posted a patch moving in-kernel handling of ACPI BGRT (Boot(time) Graphics Resource) tables out of the x86 architecture tree and into drivers/firmware/efi (so that it can be shared with the 64-bit ARM Architecture).

Jarkko Sakkinen posted version 2 of a patch series implementing a new in-kernel resource manager for “TPM spaces” (these are “isolated execution context(s) for transient objects and HMAC and policy sessions”). Various test scripts were also provided.

That’s all for this week. Tune in next time for the latest happenings in the Linux kernel community. Don’t forget to follow us @kernelpodcast

February 20, 2017 07:50 AM

Kernel Podcast: Hello world!

Welcome to WordPress. This is your first post. Edit or delete it, then start writing!

February 20, 2017 06:43 AM

February 12, 2017

James Bottomley: Using letsencrypt certificates with DANE

If, like me, you run your own cloud server, at some point you need TLS certificates to export secure services.  Like a lot of people I object to paying a so called X.509 authority a yearly fee just to get a certificate, so I’ve been using a free startcom one for a while.  With web browsers delisting startcom, I’m unable to get a new usable certificate from them, so I’ve been investigating letsencrypt instead (if you’re into fun ironies, by the way, you can observe that currently the letsencrypt site isn’t using a letsencrypt certificate, perhaps indicating the administrative difficulty of doing so).

The problem with letsencrypt certificates is that they currently have a 90 day expiry, which means you really need to use an automated tool to keep your TLS certificate current.  Fortunately the EFF has developed such a tool: certbot (they use a letsencrypt certificate for their site, indicating you can have trust that they do know what they’re doing).  However, one of the problems with certbot is that, by default, it generates a new key each time the certificate is renewed.  This isn’t a problem for most people, but if you use DANE records, it causes significant issues.

Why use both DANE and letsencrypt?

The beauty of DANE, as I’ve written before, is that it gives you a much more secure way of identifying your TLS certificate (provided you run DNSSEC).  People verifying your certificate may use DANE as the only verification mechanism (perhaps because they also distrust the X.509 authorities) which means the best practice is to publish a DANE TLSA record for each service and also use an X.509 authority rooted certificate.  That way your site just works for everyone.

The problem here is that being DNS based, DANE records can be cached for a while, so it can take a few days for DANE certificate updates to propagate through the DNS infrastructure. DANE records have an answer for this: they have a mode where the record identifies only the hash of the public key used by the website, not the certificate itself, which means you can change your certificate as much as you want provided you keep the same public/private key pair.  And here’s the rub: if certbot is going to try to give you a new key on each renewal, this isn’t going to work.

The internet society also has posts about this.

Making certbot work with DANE

Fortunately, there is a solution: the certbot manual mode (certonly) takes a --csr flag which allows you to construct your own certificate request to send to letsencrypt, meaning you can keep a fixed key … at the cost of not using most of the certbot automation.  So, how do you construct a correct csr for letsencrypt?  Like most free certificates, letsencrypt will only allow you to specify the certificate commonName, which must be a DNS pointer to the actual website.  If you need a certificate that covers multiple sites, all the other sites must be enumerated in the x509 v3 extensions field subjectAltName.  Let’s look at how openssl can generate such a certificate request.  One of the slight problems is that openssl, being a cranky tool, does not allow you to specify a subjectAltName on the command line, so you have to construct a special configuration file for it.  I called mine letsencrypt.conf

[req]
prompt = no
distinguished_name = req_dn
req_extensions = req_ext

[req_dn]
commonName = bedivere.hansenpartnership.com

[req_ext]
subjectAltName=@alt_names

[alt_names]
DNS.1=bedivere.hansenpartnership.com
DNS.2=www.hansenpartnership.com
DNS.3=hansenpartnership.com
DNS.4=blog.hansenpartnership.com

As you can see, I’ve given my canonical server (bedivere) as the common name and then four subject alt names (the first of which repeats the common name).  Once you have this config file tailored to your needs, you merely run

openssl req -new -key <mykey.key> -config letsencrypt.conf -out letsencrypt.csr

Where mykey.key is the path to your private key (you need a private key because even though the CSR only contains the public key, it is also signed).  However, once you’ve produced this letsencrypt.csr, you no longer need the private key and, because it’s undated, it will now work forever, meaning the infrastructure you put into place with certbot doesn’t need to be privileged enough to access your private key.  Once this is done, you make sure you have TLSA 3 1 1 records pointing to the hash of your public key (here’s a handy website to generate them for you) and you never need to alter your DANE record again.  Note, by the way that letsencrypt certificates can be used for non-web purposes (I use mine for encrypted SMTP as well), so you’ll need one DANE record for every service you use them for.
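As a rough sketch of what the TLSA 3 1 1 data actually is: it is the SHA-256 hash of the DER-encoded public key (SubjectPublicKeyInfo), so it can be computed from either the private key or the CSR, and both must agree. The function names below are made up for illustration; only the openssl subcommands are real.

```shell
#!/bin/bash
# Sketch only: print the association data for a "3 1 1" TLSA record,
# i.e. the SHA-256 of the DER-encoded SubjectPublicKeyInfo.
# The function names are hypothetical helpers, not part of any tool.

tlsa_311_from_key() {
    # $1: path to the PEM private key
    openssl pkey -in "$1" -pubout -outform DER |
        openssl dgst -sha256 -r | cut -d' ' -f1
}

tlsa_311_from_csr() {
    # $1: path to the PEM certificate request
    openssl req -in "$1" -noout -pubkey |
        openssl pkey -pubin -outform DER |
        openssl dgst -sha256 -r | cut -d' ' -f1
}
```

Running tlsa_311_from_csr letsencrypt.csr should print the same hex string as tlsa_311_from_key on your key, and that string is what goes after "3 1 1" in the TLSA record.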

Putting it all together

Now that you have your certificate request, depending on what version of certbot you have, you may need it in DER format

openssl req -in letsencrypt.csr -out letsencrypt.der -outform DER

And you’re ready to run the following script from cron

#!/bin/bash
date=$(date +%Y-%m-%d)
dir=/etc/ssl/certs
cert="${dir}/letsencrypt-${date}.crt"
fullchain="${dir}/letsencrypt-${date}.pem"
chain="${dir}/letsencrypt-chain-${date}.pem"
csr=/etc/ssl/letsencrypt.der
out=/tmp/certbot.out

##
# certbot handling
#
# first, it cannot replace certs, so use a new location (date suffix)
# each time to make sure the certificate path is unique.  Next, it's
# really chatty, so the only way to tell if there was a failure is to
# check whether the certificates got updated and then get cron to
# email the log
##

certbot certonly --webroot --csr "${csr}" --preferred-challenges http-01 -w /var/www --fullchain-path "${fullchain}" --chain-path "${chain}" --cert-path "${cert}" > "${out}" 2>&1

if [ ! -f "${fullchain}" ] || [ ! -f "${chain}" ] || [ ! -f "${cert}" ]; then
    cat "${out}"
    exit 1
fi

# link into place

# cert only (apache needs)
ln -sf ${cert} ${dir}/letsencrypt.crt
# cert with chain (stunnel needs)
ln -sf ${fullchain} ${dir}/letsencrypt.pem
# chain only (apache needs)
ln -sf ${chain} ${dir}/letsencrypt-chain.pem

# reload the services
sudo systemctl reload apache2
sudo systemctl restart stunnel4
sudo systemctl reload postfix

Note that this script needs the ability to write files and create links in /etc/ssl/certs (can be done by group permission) and the systemctl reloads need the following in /etc/sudoers

%LimitedAdmins ALL=NOPASSWD: /bin/systemctl reload apache2
%LimitedAdmins ALL=NOPASSWD: /bin/systemctl reload postfix
%LimitedAdmins ALL=NOPASSWD: /bin/systemctl restart stunnel4

And finally you can run this as a cron script under whichever user you’ve chosen to have sufficient privilege to write the certificates.  I run this every month, so I know that if anything goes wrong I have at least two months to fix it.
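If you want an explicit check on how much time is left, rather than relying on the monthly schedule alone, something like the following sketch could be run alongside the renewal. The function name is invented for illustration; the -checkend option of openssl x509 takes a number of seconds and exits non-zero if the certificate expires within that time.

```shell
#!/bin/bash
# Sketch only: succeed if the certificate is still valid N days from now.
# cert_valid_for_days is a hypothetical helper name.

cert_valid_for_days() {
    # $1: path to the PEM certificate, $2: number of days
    openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" > /dev/null
}
```

For example, cert_valid_for_days /etc/ssl/certs/letsencrypt.crt 30 || echo "renewal is overdue" would make cron send mail once less than a month of validity remains.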

Oh, and just in case you need proof that I got this all working, here you are!

February 12, 2017 06:55 PM

February 03, 2017

Daniel Vetter: LCA Hobart: Maintainers Don't Scale

Seems that there was a rift in the spacetime that sucked away the video of my LCA talk, but the awesome NextDayVideo team managed to pull it back out. And there’s still the writeup and slides available.

February 03, 2017 12:00 AM

February 02, 2017

Pete Zaitcev: Richard Feynman on Gerrit reviews

Here's a somewhat romanticized vision of what a code review is:

What am I going to do? I get an idea. Maybe it's a valve. I take my finger and put it down on one of the mysterious little crosses in the middle of one of the blueprints on page three, and I say, ``What happens if this valve gets stuck?'' --figuring they're going to say, ``That's not a valve, sir, that's a window.''

So one looks at the other and says, ``Well, if that valve gets stuck--'' and he goes up and down on the blueprint, up and down, the other guy goes up and down, back and forth, back and forth, and then both look at each other. They turn around to me and they open their mouths like astonished fish and say, ``You're absolutely right, sir.''

Quoted from "Surely You're Joking, Mr. Feynman!".

February 02, 2017 04:37 PM

January 27, 2017

Michael Kerrisk (manpages): Next Linux/UNIX System Programming course in Munich: 15-19 May, 2017

I've scheduled another 5-day Linux/UNIX System Programming course to take place in Munich, Germany, for the week of 15-19 May 2017.

The course is intended for programmers developing system-level, embedded, or network applications for Linux and UNIX systems, or programmers porting such applications from other operating systems (e.g., Windows) to Linux or UNIX. The course is based on my book, The Linux Programming Interface (TLPI), and covers topics such as low-level file I/O; signals and timers; creating processes and executing programs; POSIX threads programming; interprocess communication (pipes, FIFOs, message queues, semaphores, shared memory), and network programming (sockets).
     
The course has a lecture+lab format, and devotes substantial time to working on some carefully chosen programming exercises that put the "theory" into practice. Students receive printed and electronic copies of TLPI, along with a 600-page course book that includes all slides and exercises presented in the course. A reading knowledge of C is assumed; no previous system programming experience is needed.

Some useful links for anyone interested in the course:

Questions about the course? Email me via training@man7.org.

January 27, 2017 01:11 AM

January 26, 2017

Gustavo F. Padovan: Mainline Explicit Fencing – part 3

In the last two articles we talked about how Explicit Fencing can help the graphics pipeline in general and what happened on the effort to upstream the Android Sync Framework. Now on the third post of this series we will go through the Explicit Fencing implementation on DRM and other elements of the graphics stack.

The DRM implementation is built on top of two kernel infrastructures: struct dma_fence, which represents the fence, and struct sync_file, which provides the file descriptors to be shared with userspace (as was discussed in the previous articles). With fencing the display infrastructure needs to wait for a signal on that fence before displaying the buffer on the screen. In an Explicit Fencing implementation that fence is sent from userspace to the kernel. The display infrastructure also sends back to userspace a fence, encapsulated in a struct sync_file, that will be signalled when the buffer is scanned out on the screen. The same process happens on the rendering side.

It is mandatory to use Atomic Modesetting, and there is no plan to support legacy APIs. The fence that DRM will wait on needs to be passed via the IN_FENCE_FD property for each DRM plane, which means it will receive one sync_file fd containing one or more dma_fence per plane. Remember that in DRM a plane directly relates to a framebuffer so one can also say that there is one sync_file per framebuffer.

On the other hand for the fences created by the kernel that are sent back to userspace the OUT_FENCE_PTR property is used. It is a DRM CRTC property because we only create one dma_fence per CRTC as all the buffers on it will be scanned out at the same time. The kernel sends this fence back to userspace by writing the fd number to the pointer provided in the OUT_FENCE_PTR property. Note that, unlike what Android did, when the fence signals it means the previous buffer – the buffer removed from the screen – is free for reuse. On Android when the signal was raised it meant the current buffer was freed. However, the Android folks have patched SurfaceFlinger already to support the Mainline semantics when using Explicit Fencing!

Nonetheless, that is only one side of the equation and to have the full graphics pipeline running with Explicit Fencing we need to support it on the rendering side as well. As every rendering driver has its own userspace API we need to add Explicit Fencing support to every single driver there. The freedreno driver already has its Explicit Fencing support in mainline and there is work in progress to add support to i915 and virtio_gpu.

On the userspace side Mesa already has support for the EGL_ANDROID_native_fence_sync needed to use Explicit Fencing on Android. Libdrm incorporated the headers to access the sync file IOCTL wrappers. On Android, libsync now has support for both the old Android Sync and Mainline Sync File APIs. And finally, on drm_hwcomposer, patches to use Atomic Modesetting and Explicit Fencing are available but they are not upstream yet.

Validation tests for both Sync Files and fences on the Atomic API were written and added to IGT.

January 26, 2017 03:23 PM

January 23, 2017

Matthew Garrett: Android permissions and hypocrisy

I wrote a piece a few days ago about how the Meitu app asked for a bunch of permissions in ways that might concern people, but which were not actually any worse than many other apps. The fact that Android makes it so easy for apps to obtain data that's personally identifiable is of concern, but in the absence of another stable device identifier this is the sort of thing that capitalism is inherently going to end up making use of. Fundamentally, this is Google's problem to fix.

Around the same time, Kaspersky, the Russian anti-virus company, wrote a blog post that warned people about this specific app. It was framed somewhat misleadingly - "reading, deleting and modifying the data in your phone's memory" would probably be interpreted by most people as something other than "the ability to modify data on your phone's external storage", although it ends with some reasonable advice that users should ask why an app requires some permissions.

So, to that end, here are the permissions that Kaspersky request on Android:


Every single permission that Kaspersky mention Meitu having? They require it as well. And a lot more. Why does Kaspersky want the ability to record audio? Why does it want to be able to send SMSes? Why does it want to read my contacts? Why does it need my fine-grained location? Why is it able to modify my settings?

There's no reason to assume that they're being malicious here. The reasons that these permissions exist at all is that there are legitimate reasons to use them, and Kaspersky may well have good reason to request them. But they don't explain that, and they do literally everything that their blog post criticises (including explicitly requesting the phone's IMEI). Why should we trust a Russian company more than a Chinese one?

The moral here isn't that Kaspersky are evil or that Meitu are virtuous. It's that talking about application permissions is difficult and we don't have the language to explain to users what our apps are doing and why they're doing it, and Google are still falling far short of where they should be in terms of making this transparent to users. But the other moral is that you shouldn't complain about the permissions an app requires when you're asking for even more of them because it just makes you look stupid and bad at your job.

January 23, 2017 07:58 AM

January 20, 2017

Daniel Vetter: Maintainers Don't Scale

This is the write-up of my talk at LCA 2017 in Hobart. It’s not exactly the same, because this is a blog and not a talk, but the same contents. The slides for the talk are here, and I will link to the video as soon as it is available. Update: Video is now uploaded.

Linux Kernel Maintainers

First let’s look at how the kernel community works, and how a change gets merged into Linus Torvalds’ repository. Changes are submitted as patches to a mailing list, then get some review and eventually get applied by a maintainer to that maintainer’s git tree. Each maintainer then sends a pull request, often directly to Linus. With a few big subsystems (networking, graphics and ARM-SoC are the major ones) there’s a second or third level of sub-maintainers in between. 80% of the patches get merged this way, only 20% are committed by a maintainer directly.

Most maintainers are just that, a single person, and often responsible for a bunch of different areas in the kernel with corresponding different git branches and repositories. To my knowledge there are only three subsystems that have embraced group maintainership models of different kinds: TIP (x86 and core kernel), ARM-SoC and the graphics subsystem (DRM).

The radical change, at least for the kernel community, that we implemented over a year ago for the Intel graphics driver is to hand out commit rights to all regular contributors. Currently there are 19 people with commit rights to the drm-intel repository. In the first year of ramp-up 70% of all patches are now committed directly by their authors, a big change compared to how things worked before, and still work everywhere else outside of the graphics subsystem. More recently we also started to manage the drm-misc tree for subsystem wide refactorings and core changes in the same way.

I’ve covered the details of the new process in my Kernel Recipes talk “Maintainers Don’t Scale”, and LWN has covered that, and a few other talks, in their article on linux kernel maintainer scalability. I also covered this topic at the kernel summit, again LWN covered the group maintainership discussion. I don’t want to go into more detail here, mostly because we’re still learning, too, and not really experts on commit rights for everyone and what it takes to make this work well. If you want to see what a community that really has this all figured out does, watch Emily Dunham’s talk “Life is better with Rust’s community automation” from last year’s LCA.

What we are experts on is the Linux Kernel’s maintainer model - we’ve run things for years with the traditional model, both as single maintainers and small groups, and now gained the outside perspective by switching to something completely different. Personally, I’ve come to believe that the maintainer model as implemented by the kernel community just doesn’t scale. Not in the technical sense of big-O scalability, because obviously the kernel community scales to a rather big size. Much larger organizations, even entire states, are organized in a hierarchical way, so the kernel maintainer hierarchy is not anything special. Besides that, git was developed specifically to support the Linux maintainer hierarchy, and git won. Clearly, the linux maintainer model scales to big numbers of contributors. Where I think it falls short is the constant factor of how efficiently contributions are reviewed and merged, especially for non-maintainer contributors, who write 80% of all patches.

Cult of Busy

The first issue that routinely comes up when talking about maintainer topics is that everyone is overloaded. There’s a pervasive spirit in our industry (especially in the US) hailing overworked engineers as heroes, with an entire “cult of busy” around it. If you have time, you’re a slacker and probably not worth it. Of course this doesn’t help when being a maintainer, but I don’t believe it’s a cause of why the Linux maintainer model doesn’t work. This cult of busy leads to burnout, which is in my opinion a prime risk when you’re an open source person. Personally I’ve gone through a few difficult phases until I understood my limits and respected them. When you start as a maintainer for 2-3 people, and it increases to a few dozen within a couple of years, then getting a bit overloaded is rather natural - it’s a new job, with a different set of responsibilities and I had no clue about a lot of things. That’s no different from suddenly being a leader of a much bigger team anywhere else. A great talk on this topic is “What part of “… for life” don’t you understand?” from Jacob Kaplan-Moss since it’s by a former maintainer. It also contains a bunch of links to talks on burnout specifically. Ignoring burnout, or not knowing about the early warning signs, is not healthy, and it is rampant in our communities, but for now I’ll leave it at that.

Boutique Trees and Bus Factors

The first issue I see is how maintainers usually are made: You scratch an itch somewhere, write a bit of code, suddenly a few more people find it useful, and “tag”, you’re the maintainer. On top, you often end up being stuck in that position “for life”. If the community keeps growing, or your maintainer becomes otherwise busy with work&life, you have your standard-issue overloaded bottleneck.

That’s the point where I think the kernel community goes wrong. When other projects reach this point they start to build up a more formal community structure, with specialized roles, boards for review and other bits and pieces. One of the oldest, and probably most notorious, is Debian with its constitution. Of course a small project doesn’t need such elaborate structures. But if the goal is world domination, or at least creating something lasting, it helps when there’s solid institutions that cope with people turnover. At first just documenting processes and roles properly goes a long way, long before bylaws and codified decision processes are needed.

The kernel community, at least on the maintainer side, entirely lacks this.

What instead most often happens is that a new set of ad-hoc, chosen-by-default maintainers start to crop up in a new level of the hierarchy, below your overload bottleneck. Because becoming your own maintainer is the only way to help out and to get your own features merged. That only perpetuates the problem, since the new maintainers are as likely to be otherwise busy, or occupied with plenty of other kernel parts already. If things go well that area becomes big, and you have another git tree with another overloaded maintainer. More often than not people move around, and accumulate small bits all over under their maintainership. And then the cycle repeats.

The end result is a forest of boutique trees, each covering a tiny part of the project, maintained by a bunch of notoriously overloaded people. The resulting cross-tree coordination issues are pretty impressive - in the graphics subsystem we fairly often end up with simple drivers that somehow need prep patches in 5 different trees before you can even land that simple driver in the graphics tree.

Unfortunately that’s not the bad part. Because these maintainers are all busy with other trees, or their work, or life in general, you’re guaranteed that one of them is not available at any given time. Worse, because their tree has relatively little activity because it covers a small area, many only pick up patches once per kernel release, which means a built-in 3 month delay. That’s all because each tree and area has just one maintainer. In the end you don’t even need the proverbial bus to hit anyone to feel the pain of having a single point of failure in your organization - there’s so many maintainer trees around that that absence always happens, and constantly.

Of course people get fed up trying to get features merged, and often the fix is trying to become a maintainer yourself. That takes a while and isn’t easy - only 20% of all patches are authored by maintainers - and after the new code landed it makes it all worse: Now there’s one more semi-absent maintainer with one more boutique tree, adding to all the existing troubles.

Checks and Balances

All patches merged into the Linux kernel are supposed to be reviewed, and rather often that review is only done by the maintainer who merges the patch. When maintainers send out pull requests the next level of maintainers then reviews those patch piles, until they land in Linus’ tree. That’s an organization where control flows entirely top-down, with no checks and balances to rein in maintainers who are not serving their contributors well. The history of dictatorships tells us that despite best intentions, the end result tends to heavily favour the few over the many. As a crude measure for how much maintainers subject themselves to some checks&balances by their peers and contributors I looked at how many patches authored and committed by the same person (probably a maintainer) do not also carry a reviewed or acked tag. For the Intel driver that’s less than 3%. But even within the core graphics code it’s only 5%, and that covers the time before we started to experiment with commit rights for that area. And for the graphics subsystem overall the ratio is still only about 25%, including a lot of drivers with essentially just one contributor, who is always volunteered as the maintainer, so it is somewhat natural that those maintainers lack reviewers.

Outside of graphics only roughly 25% of all patches written by maintainers are reviewed by their peers - 75% of all maintainer patches lack any kind of recorded peer review, compared to just 25% for graphics alone. And even looking at core areas like kernel/ or mm/ the ratio is only marginally better at about 30%. In short, in the kernel at large, peer review of maintainers isn’t the norm.
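For readers who want to poke at numbers like these themselves, here is a minimal sketch of one way to approximate the measure with plain git. It is not the script used for the figures above. It treats "same author and committer email" as a stand-in for "probably a maintainer", and the function name is made up for illustration.

```shell
#!/bin/bash
# Sketch only: among non-merge commits whose author email equals the
# committer email, count how many carry no Reviewed-by:/Acked-by: tag.
# self_commit_review_stats is a hypothetical helper name.

self_commit_review_stats() {
    # $1: revision range, remaining arguments: optional paths
    range=$1
    shift
    git log --no-merges --format='%H %ae %ce' "$range" -- "$@" | {
        total=0
        unreviewed=0
        while read -r hash author committer; do
            # only look at commits applied by their own author
            [ "$author" = "$committer" ] || continue
            total=$((total + 1))
            if ! git show -s --format=%B "$hash" |
                    grep -qiE '^(Reviewed-by|Acked-by):'; then
                unreviewed=$((unreviewed + 1))
            fi
        done
        echo "self_committed=${total} unreviewed=${unreviewed}"
    }
}
```

In a kernel checkout, a call such as self_commit_review_stats v4.9..v4.10 drivers/gpu/drm/i915 prints both counts for that release and directory, and the ratio of the two is the crude measure. Expect the exact percentages to differ from the ones quoted here depending on the range and matching rules chosen.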

And there’s nothing outside of the maintainer hierarchy that could provide some checks and balance either. The only way to escalate disagreement is by starting a revolution, and revolutions tend to be long, drawn-out struggles and generally not worth it. Even Debian only recently learned that they lack a way to depose maintainers, and that maybe going maintainerless would be easier (again, LWN has you covered).

Of course the kernel is not the only hierarchy where there are no meaningful checks and balances. Professors at universities and managers at work are in a fairly similar position, with minimal options for students or employees to meaningfully appeal decisions. But that’s a recognized problem, and at least somewhat countered by providing ways to give anonymous feedback, often through regular surveys. The results tend to not be all that significant, but at least provide some control and accountability to the wider masses of first-level dwellers in the hierarchy. In the kernel that amounts to about 80% of all contributions, but there’s no such survey. On the contrary, feedback sessions about maintainer happiness only reinforce the control structure, with e.g. the kernel summit featuring an “Is Linus happy?” session each year.

Another closely related aspect to all this is how a project handles personal conflicts between contributors. For a very long time Linux didn’t have any formal structures in this area either, with the only options available to unhappy people being to either take it or leave it. Well, or usurping a maintainer with a small revolution, but that’s not really an option. For two years we’ve now had the “Code of Conflict”, which de facto just throws up its hands and declares that conflicts are the normal outcome, essentially just encoding the status quo. Refusing to handle conflicts in a project with thousands of contributors just doesn’t work, except that it results in lots of frustration and ultimately people trying to get away. Again, the lack of a poised board to enforce a strong code of conduct, independent of the maintainer hierarchy, is in line with the kernel community’s unwillingness to accept checks and balances.

Mesh vs. Hierarchy

The last big issue I see with the Linux kernel model, featuring lots of boutique trees and overloaded maintainers, is that it seems to harm collaboration and integration of new contributors. In the Intel graphics driver, maintainers only ever reviewed a small minority of all patches over the last few years, with the goal to foster direct collaboration between contributors. Still, when a patch was stuck, maintainers were the first point of contact, especially, but not only, for newer contributors. No amount of explaining that only the lack of agreement with the reviewer was the gating factor could persuade people to fully collaborate on code reviews and rework the code, tests and documentation as needed. This was especially true when they came with previous experience where code review is more of a rubber-stamp step compared to the distributed and asynchronous pair-programming it often resembles in open-source. Instead, new contributors often just ended up falling back to pinging maintainers to make a decision or just merge the patches as-is.

Giving all regular contributors commit rights and fully trusting them to do the right thing entirely fixed that: If the reviewer or author have commit rights there’s no easy excuse anymore to involve maintainers when the author and reviewer can’t reach agreement. Of course that requires a lot of work in mentoring people, making sure requirements for merging are understood and documented, and automating as much as possible to avoid screw ups. I think maintainers who lament their lack of review bandwidth, but also state they can’t trust anyone else aren’t really doing their jobs.

At least for me, review isn’t just about ensuring good code quality, but also about diffusing knowledge and improving understanding. At first there’s maybe one person, the author (and that’s not a given), understanding the code. After good review there should be at least two people who fully understand it, including corner cases. And that’s also why I think that group maintainership is the only way to run any project with more than one regular contributor.

On the topic of patch review and maintainers, there’s also the habit of wholesale rewrites of patches written by others. If you want others to contribute to your project, then that means you need to accept other styles and can’t enforce your own all the time. Merging first and polishing later recognizes new contributions, and if you engage newcomers for the polish work they tend to stick around more often. And even when a patch really needs to be reworked before merging it’s better to ask the author to do it: Worst case they don’t have time, best case you’ve improved your documentation and training procedure and maybe gained a new regular contributor on top.

A great take on the consequences of having fixed roles instead of trying to spread responsibilities more evenly is Alice Goldfuss’ talk “Rock Stars, Builders, and Janitors: You’re doing it wrong”. I also think that rigid roles present a bigger bar for people with different backgrounds, hampering diversity efforts, and in the spirit of Sarah Sharp’s post on what makes a good community, need to be fixed first.

Towards a Maintainer’s Manifest

I think what’s needed in the end is some guidelines and discussions about what a maintainer is, and what a maintainer does. We have ready-made licenses to avoid havoc, there are codes of conduct to copypaste and implement, handbooks for building communities, and for all of these things, lots of conferences. A maintainer, on the other hand, you become by accident, as a default. And then everyone gets to learn how to do it on their own, while hopefully not burning too many bridges - at least I myself was rather lost on that journey at times. I’d like to conclude with a draft of a maintainer’s manifest.

It’s About the People

If you’re maintainer of a project or code area with a bunch of full time contributors (or even a lot of drive-by contributions) then primarily you deal with people. Insisting that you’re only a technical leader just means you don’t acknowledge what your true role really is.

And then, trust them to do a good job, and recognize them for the work they’re doing. The important part is to trust people just a bit more than what they’re ready for, as the occasional challenge, but not too much that they’re bound to fail. In short, give them the keys and hope they don’t wreck the car too badly, but in all cases have insurance ready. And insurance for software is dirt cheap, generally a git revert and the maintainer profusely apologizing to everyone and taking the blame is all it takes.

Recognize Your Power

You’re a maintainer, and you have essentially absolute power over what happens to your code. For successful projects that means you can unleash a lot of harm on people who for better or worse are employed to deal with you. One of the things that annoy me the most is when maintainers engage in petty status fights against subordinates, thinly veiled as technical discussions - you end up looking silly, and it just pisses everyone off. Instead recognize your powers, try to stay on the good side of the force and make sure you share it sufficiently with the contributors of your project.

Accept Your Limits

At the beginning you’re responsible for everything, and for a one-person project that’s all fine. But eventually the project grows too much and you’ll just become a dictator, and then failure is all but assured because we’re all human. Recognize what you don’t do well, build institutions to replace you. Recognize that the responsibility you initially took on might not be the same as that which you’ll end up with and either accept it, or move on. And do all that before you start burning out.

Be a Steward, Not a Lord

I think one of the key advantages of open source is that people stick around for a very long time. Even when they switch jobs or move around. Maybe the usual “for life” qualifier isn’t really a great choice, since it sounds more like a mandatory sentence than something done by choice. What I object to is the “dictator” part, since if your goal is to grow a great community and maybe reach world domination, then you as the maintainer need to serve that community. And not that the community serves you.

Thanks a lot to Ben Widawsky, Daniel Stone, Eric Anholt, Jani Nikula, Karen Sandler, Kimmo Nikkanen and Laurent Pinchart for reading and commenting on drafts of this text.

January 20, 2017 12:00 AM

January 19, 2017

Matthew Garrett: Android apps, IMEIs and privacy

There's been a sudden wave of people concerned about the Meitu selfie app's use of unique phone IDs. Here's what we know: the app will transmit your phone's IMEI (a unique per-phone identifier that can't be altered under normal circumstances) to servers in China. It's able to obtain this value because it asks for a permission called READ_PHONE_STATE, which (if granted) means that the app can obtain various bits of information about your phone including those unique IDs and whether you're currently on a call.

Why would anybody want these IDs? The simple answer is that app authors mostly make money by selling advertising, and advertisers like to know who's seeing their advertisements. The more app views they can tie to a single individual, the more they can track that user's response to different kinds of adverts and the more targeted (and, they hope, more profitable) the advertising towards that user. Using the same ID between multiple apps makes this easier, and so using a device-level ID rather than an app-level one is preferred. The IMEI is the most stable ID on Android devices, persisting even across factory resets.

The downside of using a device-level ID is, well, whoever has that data knows a lot about what you're running. That lets them tailor adverts to your tastes, but there are certainly circumstances where that could be embarrassing or even compromising. Using the IMEI for this is even worse, since it's also used for fundamental telephony functions - for instance, when a phone is reported stolen, its IMEI is added to a blacklist and networks will refuse to allow it to join. A sufficiently malicious person could potentially report your phone stolen and get it blocked by providing your IMEI. And phone networks are obviously able to track devices using them, so someone with enough access could figure out who you are from your app usage and then track you via your IMEI. But realistically, anyone with that level of access to the phone network could just identify you via other means. There's no reason to believe that this is part of a nefarious Chinese plot.

Is there anything you can do about this? On Android 6 and later, yes. Go to settings, hit apps, hit the gear menu in the top right, choose "App permissions" and scroll down to phone. Under there you'll see all apps that have permission to obtain this information, and you can turn them off. Doing so may cause some apps to crash or otherwise misbehave, whereas newer apps may simply ask for you to grant the permission again and refuse to run if you don't.

Meitu isn't especially rare in this respect. Over 50% of the Android apps I have handy request your IMEI, although I haven't tracked what they all do with it. It's certainly something to be concerned about, but Meitu isn't especially rare here - there are big-name apps that do exactly the same thing. There's a legitimate question over whether Android should be making it so easy for apps to obtain this level of identifying information without more explicit informed consent from the user, but until Google do anything to make it more difficult, apps will continue making use of this information. Let's turn this into a conversation about user privacy online rather than blaming one specific example.


January 19, 2017 11:36 PM

January 12, 2017

Pete Zaitcev: git-codereview

Not content with a legacy git-review, Google developed another Gerrit front-end, the git-codereview. They use it for contributions to Go. I have to admit, that was a bit less of a special move than Facebook's git-review that uses the same name but does something entirely different.

P.S. There used to be a post about creating a truly distributed github, which used blockchain in order to vote on globally unique names. Can't find a link though.

January 12, 2017 07:56 PM

January 08, 2017

Pete Zaitcev: Mirantis and the business of OpenStack

It seems that only in November we heard about massive layoffs at Mirantis, "The #1 Pure Play OpenStack Company" (per <title>). Now they are teaching us thus:

And what about companies like Mirantis adding Kubernetes and other container technologies to their slate? Is that a sign of the OpenStack Apocalypse?

In a word, “no”.

Gee, thanks. I'm sure they know what it's like.

January 08, 2017 06:07 PM

January 03, 2017

James Bottomley: TPM2 and Linux

Recently Microsoft started mandating TPM2 as a hardware requirement for all platforms running recent versions of Windows.  This means that eventually all shipping systems (starting with laptops first) will have a TPM2 chip.  The reason this impacts Linux is that TPM2 is radically different from its predecessor TPM1.2; so different, in fact, that none of the existing TPM1.2 software on Linux (trousers, the libtpm.so plug in for openssl, even my gnome keyring enhancements) will work with TPM2.  The purpose of this blog is to explore the differences and how we can make ready for the transition.

What are the Main 1.2 vs 2.0 Differences?

The big one is termed Algorithm Agility.  TPM1.2 had SHA1 and RSA2048 only.  TPM2 is designed to have many possible algorithms, including support for elliptic curve and a host of government mandated (Russian and Chinese) crypto systems.  There’s no requirement for any shipping TPM2 to support any particular algorithms, so you actually have to ask your TPM what it supports.  The bedrock for TPM2 in the West seems to be RSA1024-2048, ECC and AES for crypto and SHA1 and SHA256 for hashes.

What algorithm agility means is that you can no longer have root keys (EK and SRK see here for details) like TPM1.2 did, because a key requires a specific crypto algorithm.  Instead TPM2 has primary “seeds” and a Key Derivation Function (KDF).  The way this works is that a seed is simply a long string of random numbers, but it is used as input to the KDF along with the key parameters and the algorithm and out pops a real key based on the seed.  The KDF is deterministic, so if you input the same algorithm and the same parameters you get the same key again.  There are four primary seeds in the TPM2: Three permanent ones which only change when the TPM2 is cleared: endorsement (EPS), Platform (PPS) and Storage (SPS).  There’s also a Null seed, which is used for ephemeral keys and changes every reboot.  A key derived from the SPS can be regarded as the SRK and a key derived from the EPS can be regarded as the EK. Objects descending from these keys are called members of hierarchies. One of the interesting aspects of the TPM is that the root of a hierarchy is a key not a seed (because you need to exchange secret information with the TPM), and that there can be multiple of these roots with different key algorithms and parameters.
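To make the seed-plus-KDF idea concrete, here is a toy sketch.  It is not the derivation a real TPM2 performs: HMAC-SHA256 in a simple counter loop stands in for the KDF, and the function name, the parameter strings and the 32 byte output length are all made up for illustration.  The only point is the determinism: same seed, same algorithm and same parameters give the same key material, while changing any of them gives something different.

```python
import hashlib
import hmac
import os


def derive_key_material(seed: bytes, algorithm: str, params: str, length: int = 32) -> bytes:
    """Toy KDF: identical inputs always produce identical output bytes."""
    out = b""
    counter = 1
    while len(out) < length:
        # Mix a block counter, the algorithm name and its parameters into each round.
        msg = counter.to_bytes(4, "big") + algorithm.encode() + b"\x00" + params.encode()
        out += hmac.new(seed, msg, hashlib.sha256).digest()
        counter += 1
    return out[:length]


seed = os.urandom(32)  # stands in for a primary seed such as the SPS

k1 = derive_key_material(seed, "rsa", "2048")
k2 = derive_key_material(seed, "rsa", "2048")
k3 = derive_key_material(seed, "ecc", "nistp256")  # hypothetical parameter label

print(k1 == k2)  # same seed, algorithm and parameters
print(k1 == k3)  # same seed, different algorithm
```

Replacing the seed (which is what clearing the TPM2 does to the permanent seeds) changes every derived key at once, which is why several roots with different algorithms can hang off one seed.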

Additionally, the mechanism for making use of keys has changed slightly.  In TPM 1.2 to import a secret key you wrapped it asymmetrically to the SRK and then called LoadKeyByBlob to get a use handle.  In TPM2 this is a two stage operation, firstly you import a wrapped (or otherwise protected) private key with TPM2_Import, but that returns a private key structure encrypted with the parent key’s internal symmetric key.  This symmetrically encrypted key is then loaded (using TPM2_Load) to obtain a use handle whenever needed.  The philosophical change is from online keys in TPM 1.2 (keys which were resident inside the TPM) to offline keys in TPM2 (keys which now can be loaded when needed).  This philosophy has been reinforced by reducing the space available to keep keys loaded in TPM2 (see later).

Playing with TPM2

If you have a recent laptop, chances are you either have or can software upgrade to a TPM2.  I have a Dell XPS13 (the Skylake version) which comes with a software upgradeable Nuvoton TPM.  Dell kindly provides a 1.2->2 switching program here, which seems to work under FreeDOS (odin boot) so I have a physical TPM2 based system.  For those of you who aren’t so lucky, you can still play along, but you need a TPM2 emulator.  The best one is here; simply download and untar it then type make in the src directory and run it as ./tpm_server.  It listens on two TCP ports, 2321 and 2322, for TPM commands so there’s no need to install it anywhere, it can be run directly from the source directory.

After that, you need the interface software called tss2.  The source is here, but Fedora 25 and recent Ubuntu already package it.  I’ve also built openSUSE packages here.  The configuration of tss2 is controlled by environment variables.  The most important one is TPM_INTERFACE_TYPE which tells it how to connect to the TPM2.  If you’re using a simulator, you set this to “socsim” and if you have a real TPM2 device you set it to “dev”.  One final thing about direct device connection: in tss2 there’s no daemon like trousers had to broker the connection, all your users connect directly to the TPM2 device /dev/tpm0.  To do this, the device has to support read and write by arbitrary users, so its permissions need to be 0666.  I’ve got a udev script to achieve this:

# tpm 2 devices need to be world readable
SUBSYSTEM=="tpm", ACTION=="add", MODE="0666"

Which goes in /etc/udev/rules.d/80-tpm-2.rules on openSUSE.  The next thing you need to do, if you’re running the simulator, is power it on and start it up (for a real device, this is done by the bios):

tsspowerup
tssstartup

The simulator will now create a NVChip file wherever you started it to store NV ram based objects, which it will read on next start up.  The first thing you need to do is create an SRK and store it in NV memory.  Microsoft uses the well known key handle 81000001 for this, so we’ll do the same.  The reason for doing this is that a real TPM takes ages to run the KDF for RSA keys because it has to look for prime numbers:

jejb@jarvis:~> TPM_INTERFACE_TYPE=socsim time tsscreateprimary -hi o -st -rsa
Handle 80000000
0.03 user 0.00 system 0:00.06 elapsed

jejb@jarvis:~> TPM_INTERFACE_TYPE=dev time tsscreateprimary -hi o -st -rsa
Handle 80000000
0.04 user 0.00 system 0:20.51 elapsed

As you can see: the simulator created a primary storage key (the SRK) in a few milliseconds, but it took my real TPM2 20 seconds to do it … not something you want to wait for, hence the need to store this permanently under a well known key handle and get rid of the temporary copy:

tssevictcontrol -hi o -ho 80000000 -hp 81000001
tssflushcontext -ha 80000000

tssevictcontrol tells the TPM to copy the key at transient handle 80000000 to permanent NV handle 81000001 and tssflushcontext erases the transient key.  Flushing transient objects is very important, because TPM2 has a lot less transient storage space than TPM1.2 did; usually only about three handles worth.  You can tell how much you have by doing:

tssgetcapability -cap 6|grep -i transient
TPM_PT 0000010e value 00000003 TPM_PT_HR_TRANSIENT_MIN - the minimum number of transient objects that can be held in TPM RAM

Where the value (00000003) tells me that the TPM can store at least 3 transient objects.  After that you’ll start getting out of space errors from it.
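If you want to check this from a script rather than by eye, the property line can be picked apart mechanically.  The sketch below parses the exact line shown above; the regular expression and the function name are my own, and it assumes the value column is hexadecimal like the property number next to it.

```python
import re

# Sample line as printed by `tssgetcapability -cap 6`, copied from the output above.
SAMPLE = ("TPM_PT 0000010e value 00000003 TPM_PT_HR_TRANSIENT_MIN - "
          "the minimum number of transient objects that can be held in TPM RAM")

# Property number, value (assumed hexadecimal) and symbolic property name.
LINE_RE = re.compile(r"^TPM_PT\s+([0-9a-fA-F]+)\s+value\s+([0-9a-fA-F]+)\s+(\S+)")


def parse_property(line):
    """Return (name, value) for one TPM_PT line, or None if it doesn't match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group(3), int(m.group(2), 16)


name, value = parse_property(SAMPLE)
print(name, value)
```

An application could use the parsed value to decide how many keys it dares to keep loaded before flushing.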

The final step in taking ownership of a TPM2 is to set the authorization passwords.  Each of the four hierarchies (Null, Owner, Endorsement, Platform) and the Lockout has a possible authority password.  The Platform authority is cleared on startup, so there’s not much point setting it (it’s used by the BIOS or Firmware to perform TPM functions).  Of the other four, you really only need to set Owner, Endorsement and Lockout (I use the same password for all of them).

tsshierarchychangeauth -hi l -pwdn <your password>
tsshierarchychangeauth -hi e -pwdn <your password>
tsshierarchychangeauth -hi o -pwdn <your password>

After this is done, you’re all set.  Note that as well as these authorizations, each object can have its own authorization (or even policy), so the SRK you created earlier still has no password, allowing it to be used by anyone.  Note also that the owner authorization controls access to the NV memory, so you’ll need to supply it now to make other objects persistent.
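As a recap of the handles used so far: 80000000 was a transient handle and 81000001 a persistent one.  The helper below classifies a handle by its most significant byte, which is how I understand the TPM2 handle ranges to be laid out; the function name is made up and anything outside these two ranges is simply reported as "other".

```python
def handle_kind(handle: int) -> str:
    """Classify a TPM2 handle by its most significant byte."""
    top = (handle >> 24) & 0xFF
    if top == 0x80:
        return "transient"   # e.g. the 80000000 returned by tsscreateprimary
    if top == 0x81:
        return "persistent"  # e.g. the well known SRK handle 81000001
    return "other"


print(handle_kind(0x80000000))
print(handle_kind(0x81000001))
```

A check like this is handy before calling tssflushcontext, since only transient handles need flushing to free up space.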

An Aside about Real TPM2 devices and the Resource Manager

Although I’m using the code below to store my keys in the TPM2, there are a couple of practical limitations which mean it won’t work for you if you have multiple TPM2-using applications without a kernel update.  The two problems are:

  1. The Linux Kernel TPM2 device /dev/tpm0 only allows one user at once.  If a second application tries to open the device it will get an EBUSY which causes TSS_Create() to fail.
  2. Because most applications make use of transient key slots and most TPM2s have only a couple of these, simultaneous users can end up running out of these and getting unexpected out of space errors.

The solution to both of these is something called a Resource Manager (RM).  What the RM does is effectively swap transient objects in and out of the TPM as needed to prevent it from running out of space.  Linux has an issue in that both the kernel and userspace are potential users of TPM keys so the resource manager has to live inside the kernel.  Jarkko Sakkinen has preliminary resource manager patches here, and they will likely make it into kernel 4.11 or 4.12.  I’m currently running my laptop with the RM patches applied, so multiple applications work for me, but since these are preliminary patches, I wouldn’t currently advise others to do this.  The way these patches work is that once you declare to the kernel via an ioctl that you want to use the RM, every time you send a command to the TPM, your context gets swapped in, the command is executed, the context is swapped out and the response sent meaning that no other user of the TPM sees your transient objects.  The moment you send the ioctl, the TPM device allows another user to open it as well.
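The swap-in, execute, swap-out cycle is easier to see in a toy model.  The sketch below is not the kernel code and doesn't talk to any TPM: ToyTPM and ToyResourceManager are invented classes, the slot count of three just mirrors the capability output earlier, and real context save/load is reduced to moving entries between two dictionaries.

```python
class ToyTPM:
    """Stand-in for a TPM with a small, fixed number of transient slots."""

    def __init__(self, slots=3):
        self.slots = slots
        self.loaded = {}  # handle -> object currently inside the "TPM"

    def load(self, handle, obj):
        if len(self.loaded) >= self.slots:
            raise MemoryError("out of transient space")
        self.loaded[handle] = obj

    def flush(self, handle):
        return self.loaded.pop(handle)


class ToyResourceManager:
    """Swaps a client's objects in around each command and out afterwards."""

    def __init__(self, tpm):
        self.tpm = tpm
        self.saved = {}  # client -> {handle: object}, held outside the TPM

    def create(self, client, handle, obj):
        self.saved.setdefault(client, {})[handle] = obj

    def run(self, client, command):
        context = self.saved.get(client, {})
        swapped_in = []
        try:
            for handle, obj in context.items():  # swap in
                self.tpm.load(handle, obj)
                swapped_in.append(handle)
            return command(dict(self.tpm.loaded))  # execute
        finally:
            for handle in swapped_in:  # swap out
                self.tpm.flush(handle)


tpm = ToyTPM(slots=3)
rm = ToyResourceManager(tpm)
for i in range(3):
    rm.create("app-a", f"a{i}", f"key-a{i}")
    rm.create("app-b", f"b{i}", f"key-b{i}")

seen_by_a = rm.run("app-a", lambda loaded: sorted(loaded))
seen_by_b = rm.run("app-b", lambda loaded: sorted(loaded))
print(seen_by_a, seen_by_b, len(tpm.loaded))
```

Two clients hold six objects between them on a three slot device, neither sees the other's objects, and nothing stays loaded between commands.  Loading the same six objects directly into ToyTPM would fail on the fourth.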

Using TPM2 as a keystore

Once the implementation is sorted out, openssl and gnome-keyring patches can be produced for TPM2.  The only slight wrinkle is that for create_tpm2_key you require a parent key to exist in the NV storage (at the 81000001 handle we talked about previously).  So to convert from a password protected openssh RSA key to a TPM2 based one, you do

create_tpm2_key -a -p 81000001 -w id_rsa id_rsa.tpm
mv id_rsa.tpm id_rsa

And then gnome keyring manager will work nicely (make sure you keep a copy of your original private key in case you want to move to a new laptop or reseed the TPM2).  If you use the same TPM2 password as your original key password, you won’t even need to update the gnome loginkeyring for the new password.

Conclusions

Because of the lack of an in-kernel Resource Manager, TPM2 is ready for experimentation in Linux but definitely not ready for prime time yet (unless you’re willing to patch your kernel).  Hopefully this will change in the 4.11 or 4.12 kernel when the Resource Manager finally goes upstream.

Looking forwards to the new stack, the lack of a central daemon is really a nice feature: tcsd crashing used to kill all of my TPM key based applications, but with tss2 having no central daemon, everything has just worked(tm) so far.  A kernel based RM also means that the kernel can happily use the TPM (for its trusted keys and disk encryption) without interfering with whatever userspace is doing.

January 03, 2017 12:55 AM

January 02, 2017

Paul E. Mc Kenney: Parallel Programming: January 2017 Update

Another year, another release of Is Parallel Programming Hard, And, If So, What Can You Do About It?!

Updates include:



  1. More formatting and build-system improvements, along with many bibliography updates, courtesy of Akira Yokosawa.
  2. A great many grammar and typo fixes from Akira and SeongJae Park.
  3. Numerous changes and fixes from Balbir Singh, Boqun Feng, Mike Rapoport, Praveen Kumar, and Tobias Klauser.
  4. Added code for concurrent skiplists, with the hope for added text in a later release.
  5. Added a running example to the deferred-processing chapter.
  6. Merged “Synchronization Primitives” into “Tools of Trade” section.
  7. Updated control-dependency discussion in memory-barriers section.
As always, git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git will be updated in real time.

January 02, 2017 05:23 PM

December 27, 2016

Pete Zaitcev: The idea of ARM has gone mainstream

We still don't have any usable servers on which I could install Fedora and have it supported for more than 3 releases, but gamers already debate the merits of ARM. The idea of SPEC-per-Watt has completely gone mainstream, like Marxism.

<sage> http://www.bitsandchips.it/english/52-english-news/7854-rumor-even-intel-is-studying-a-new-x86-uarch new uarch? it's about time
<sage> they just can't make x86 as power efficient as arm
<JTFish> What is the point
<JTFish> it's not like ARM will replace x86 in real servers any time soon
<sage> what is "real" servers?
<JTFish> anything that does REAL WORLD shit
<sage> what is "real world"?
<JTFish> serving internet content etc
<JTFish> database servers
<JTFish> I dunno
<JTFish> mass encoding of files
<sage> lots of startups and established companies are already betting on ARM for their cloud server offerings
<sage> database and mass encoding, ok
<sage> what else
<JTFish> are you saying
<JTFish> i'm 2 to 1
<JTFish> for x86
<JTFish> also I should just go full retard and say minecraft servers
<sage> the power savings are big, if they can run part of their operation on ARM and make it financially viable, they will do it

QUICK UPDATE: In the linked article:

The next Intel uArch will be very similar to the approach used by AMD with Zen – perfect balance of power consumption/performance/price – but with a huge news: in order to save physical space (Smaller Die) and to improve the power consumption/performance ratio, Intel will throw away some old SIMD and old hardware remainders.

The 100% backward hardware x86 compatibility will not guaranteed anymore, but could be not a handicap (Some SIMD today are useless, and also we can use emulators or cloud systems). Nowadays a lot of software house have to develop code for ARM and for x86, but ARM is lacking useful SIMD. So, frequently, these software are a watered-down compromise.

Intel will be able to develop a thin and fast x86 uArch, and ICC will be able to optimize the code both for ARM and for x86 as well.

This new uArch will be ready in 2019-2020.

Curious. Well, as long as they don't go full Transmeta on us, it may be fine.

December 27, 2016 06:51 PM

December 23, 2016

Dave Airlie: radv and doom - kinda

Yesterday Valve gave me a copy of DOOM for Christmas (not really for Christmas), and I got the wine bits in place from Fedora, then I spent today trying to get DOOM to render on radv.



Thanks to ParkerR on #radeon for taking the picture from his machine, I'm too lazy.

So it runs kinda, it hangs the GPU a fair bit, it misrenders some colors in some scenes, but you can see most of it. I'm not sure if I'll get back to this before next year (I'll try), but I'm pretty happy to have gotten it this far in a day, though I'm sure the next few things will be much more difficult to debug.

The branch is here:
https://github.com/airlied/mesa/commits/radv-wip-doom-wine

December 23, 2016 07:26 AM

December 22, 2016

Daniel Vetter: How do you do docs?

The fancy new Sphinx-based documentation landed a while ago in upstream. Jani Nikula has written a nice overview on LWN (part 2). And it is getting used a lot. But judging by how often I type it in replies on the mailing list, what’s missing is a super-short howto. To build the documentation, run:

$ make DOCBOOKS="" htmldocs

The output can then be found in Documentation/output/. When typing documentation please always check that your new text does get rendered. The output also contains documentation about kernel-doc and the toolchain itself. Since the build is incremental it is recommended that you first run it before touching anything. That way you’ll only see warnings in areas you’ve touched, not all of them - the build is unfortunately somewhat noisy.

December 22, 2016 06:00 AM

December 20, 2016

Gustavo F. Padovan: Collabora Contributions to Linux Kernel 4.9

Linux Kernel 4.9 was released this week and once more Collabora developers took part in the kernel development cycle. This time we contributed 37 patches by 11 different developers, our highest number of single contributors in a kernel release ever. Remember that in the previous release we had our highest total number of contributions. The numbers show how Collabora has been increasing its commitment to contributing to the upstream kernel community.

For those who want to see an overall report of what happened in the 4.9 kernel, take a look at the always good LWN articles: part 1, 2 and 3.

As for Collabora contributions, most of our work was in the DRM and DMABUF subsystems. Andrew Shadura and Daniel Stone added fixes to the AMD and i915 drivers respectively. Emilio López added the missing install of sync_file.h uapi.

Gustavo Padovan advanced a few more steps on the goal to add explicit fencing to the DRM subsystem, besides a few improvements to Sync File and the virtio_gpu driver he also de-staged the SW_SYNC validation framework that helps with Sync File testing.

Peter Senna added drm_bridge support to imx-ldb device while Tomeu Vizoso improved drm_bridge support on RockChip’s analogic-dp and added documentation about validation of the DRM subsystem.

Outside of the Graphics world we had Enric Balletbo i Serra adding support to upload firmware on the ziirave watchdog device. Fabien Lahoudere and Martyn Welch enabled and improved DMA support for i.MX53 UARTs, allowing the device tree to decide whether DMA is used or not. Martyn also added a fake VMEbus (Versa Module Europa bus) to help with VME driver development.

On the Bluetooth subsystem, Frédéric Dalleau fixed an error code for SCO connections that was causing long timeouts and failures on SCO connection requests. Finally Robert Foss worked to clear the pipeline on errors for cdc-wdm USB devices.

Andrew Shadura (1):

Daniel Stone (1):

Emilio López (2):

Enric Balletbo i Serra (1):

Fabien Lahoudere (3):

Frédéric Dalleau (1):

Gustavo Padovan (14):

Martyn Welch (4):

Peter Senna Tschudin (1):

Robert Foss (2):

Tomeu Vizoso (7):

December 20, 2016 04:46 PM

December 14, 2016

Daniel Vetter: Midlayers, Once More With Feelings!

The collective internet troll fest had its fun recently discussing AMD’s DAL. Hacker News discussed the rejection and some of the reactions, reddit had some fun and of course everyone on phoronix forums was going totally nuts. Luckily reason seems to finally prevail with LWN covering things too. I don’t want to spill more bits over the topic itself (read the LWN coverage and mailing list threads for that), but I think it’s worth looking at the fundamental underlying problem a bit more.

Discussing midlayers seems to be one of the recurring topics in the Linux kernel. There’s the original midlayer-mistake article from Neil Brown that seems to have started it all. But LWN gained more articles in the years since, covering the iscsi driver as a study in avoiding OS abstraction layers, or a similar case in wireless with the Broadcom driver. The dismissal of midlayers and hailing of helper libraries has become so prevalent that calling your shiny new subsystem libfoo (viz. libnvdimm) seems to be a powerful trick to speed up its acceptance.

It seems common knowledge and accepted fact, but still there’s a constant stream of drivers that come packaged with huge abstraction layers - mostly to abstract OS internals, but very often also just a few midlayers between different components within the driver. A major reason for this is certainly that submissions by former proprietary teams, or just generally code developed internally behind closed doors, suffer from the platform problem - again LWN has you covered. If your driver is not open and part of upstream (or even for an open source OS), then the provided services and interfaces are fixed. And there’s no way to improve things; worse, when change does happen the driver team generally doesn’t have any influence at all over what changes and how. Hence it makes sense to insert a big abstraction layer to isolate the driver from the outside madness.

But that does not explain why big drivers (and more so, subsystems) come with some nice abstraction layers wedged in-between different parts. I believe, without any real proof, that this is because company planners are extremely risk averse: Sure, working together and sharing code across the entire project has long-term benefits. But the more people are involved the bigger the risk for a bikeshed fest or some other delay, and you never want to be the one team that delayed a release by a few months. Adding isolation in the form of lots of fixed abstraction layers helps with that. But long term, and for really big projects, sharing code and working together has clear benefits, even if there’s routinely a hiccup - the neck-breaking speed of Linux kernel development overall is testament enough for that, I think.

All that is just technicalities really, because in the end upstream and open source is about collaboratively developing software. That requires shared control, and interfaces between different components need to be a lot more permeable. In my experience that core idea of handing control over development to outsiders is the really scary part of joining upstream, since followed through to its conclusions it means you need to completely rethink how products are planned and developed: The entire organisation, from individual developers to teams and including the management chain, has to change. And that freaks people out.

In summary I think code submissions with lots of midlayers are bound to stay with us, and that’s good: Because new midlayers mean new teams and companies start to take upstream seriously. And new midlayers getting cleaned up means new teams and new companies are going through the painful changes necessary to adjust to an upstream first model. The code itself is just the canary for the real shifts happening.

In other news: World domination is still progressing according to plan.

December 14, 2016 10:00 AM

December 12, 2016

Kees Cook: security things in Linux v4.9

Previously: v4.8.

Here are a bunch of security things I’m excited about in the newly released Linux v4.9:

Latent Entropy GCC plugin

Building on her earlier work to bring GCC plugin support to the Linux kernel, Emese Revfy ported PaX’s Latent Entropy GCC plugin to upstream. This plugin is significantly more complex than the others that have already been ported, and performs extensive instrumentation of functions marked with __latent_entropy. These functions have their branches and loops adjusted to mix random values (selected at build time) into a global entropy gathering variable. Since the branch and loop ordering is very specific to boot conditions, CPU quirks, memory layout, etc, this provides some additional uncertainty to the kernel’s entropy pool. Since the entropy actually gathered is hard to measure, no entropy is “credited”, but rather used to mix the existing pool further. Probably the best place to enable this plugin is on small devices without other strong sources of entropy.

vmapped kernel stack and thread_info relocation on x86

Normally, kernel stacks are mapped together in memory. This meant that attackers could use forms of stack exhaustion (or stack buffer overflows) to reach past the end of a stack and start writing over another process’s stack. This is bad, and one way to stop it is to provide guard pages between stacks, which is provided by vmalloced memory. Andy Lutomirski did a bunch of work to move to vmapped kernel stack via CONFIG_VMAP_STACK on x86_64. Now when writing past the end of the stack, the kernel will immediately fault instead of just continuing to blindly write.

Related to this, the kernel was storing thread_info (which contained sensitive values like addr_limit) at the bottom of the kernel stack, which was an easy target for attackers to hit. Between a combination of explicitly moving targets out of thread_info, removing needless fields, and entirely moving thread_info off the stack, Andy Lutomirski and Linus Torvalds created CONFIG_THREAD_INFO_IN_TASK for x86.

CONFIG_DEBUG_RODATA mandatory on arm64

As recently done for x86, Mark Rutland made CONFIG_DEBUG_RODATA mandatory on arm64. This feature controls whether the kernel enforces proper memory protections on its own memory regions (code memory is executable and read-only, read-only data is actually read-only and non-executable, and writable data is non-executable). This protection is a fundamental security primitive for kernel self-protection, so there’s no reason to make the protection optional.

randomize_page() cleanup

Cleaning up the code around the userspace ASLR implementations makes them easier to reason about. This has been happening for things like the recent consolidation on arch_mmap_rnd() for ET_DYN and during the addition of the entropy sysctl. Both uncovered some awkward uses of get_random_int() (or similar) in and around arch_mmap_rnd() (which is used for mmap (and therefore shared library) and PIE ASLR), as well as in randomize_stack_top() (which is used for stack ASLR). Jason Cooper cleaned things up further by doing away with randomize_range() entirely and replacing it with the saner randomize_page(), making the per-architecture arch_randomize_brk() (responsible for brk ASLR) much easier to understand.
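To make the idea of a "saner" helper concrete, here is a minimal Python sketch of the arithmetic involved in picking a random page-aligned address inside a range. It is an illustration only, not the kernel's code: the function names, the 4 KiB page size and the use of Python's secrets module are all assumptions made for the example.

```python
import secrets

PAGE_SHIFT = 12                 # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def page_align(addr):
    """Round addr up to the next page boundary."""
    return (addr + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)


def random_page_in(start, length):
    """Return a page-aligned address in [start, start + length).

    Hypothetical helper: align the start upwards, shrink the range by the
    bytes skipped, then pick a whole number of pages as the offset.
    """
    aligned = page_align(start)
    length -= aligned - start
    pages = length >> PAGE_SHIFT
    if pages <= 0:
        # Range too small to hold a full page: fall back to the aligned start.
        return aligned
    return aligned + (secrets.randbelow(pages) << PAGE_SHIFT)
```

The caller only has to supply a start and a length; alignment and the degenerate small-range case are handled in one place, which is the property that makes per-architecture callers easier to read.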

That’s it for now! Let me know if there are other fun things to call attention to in v4.10.

© 2016 – 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

December 12, 2016 07:05 PM

Michael Kerrisk (manpages): man-pages-4.09 is released

I've released man-pages-4.09. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from 44 contributors. This is one of the more substantial releases in recent times, with more than 500 commits changing around 190 pages. The changes include the addition of eight new pages and significant enhancements or rewrites to many existing pages.

Among the more significant changes in man-pages-4.09 are the following:

In addition to the above, substantial changes were also made to the close(2), getpriority(2), nice(2), timer_create(2), timerfd_create(2), random(4), and proc(5) pages.

December 12, 2016 02:22 PM

December 10, 2016

James Bottomley: TPM enabling gnome-keyring

One of the questions about the previous post on using your TPM as a secure key store was “could the TPM be used to protect ssh keys?”  The answer is yes, because openssh uses openssl (so you can simply convert an openssh private key to a TPM private key) but the ssh-agent wouldn’t work because ssh-add passes in the private keys by their component primes, which is not possible when the private key is guarded by the TPM.  However,  that made me actually look at gnome-keyring to figure out how it worked and whether it could be used with the TPM.  The answer is yes, but there are also some interesting side effects of TPM enabling gnome-keyring.

Gnome-keyring Architecture

Gnome-keyring consists of essentially three components: a pluggable store backend, a secure passphrase holder (which is implemented as a backend store) and an agent frontend.  The frontend and backend talk to each other using the pkcs11 protocol.  This also means that the backend can serve anything that also speaks pkcs11, which most encryption systems do.  The stores consist of a variety of file or directory backed keys (for instance the ssh-store simply loads all the ssh keys in your $HOME/.ssh; the secret-store uses the gnome default keyring store, $HOME/.local/share/keyrings/, to store collections of passwords).  The frontends act as bridges between a variety of external protocols and the keyring daemon.  They take whatever external protocol they’re speaking in, convert the request to pkcs11 and query the backends for the information.  The most important frontend is the login one which is called by gnome-keyring-daemon at start of day to unlock the secret-store which contains all your other key passwords using your login password as the key.  The ssh-agent frontend speaks the ssh agent protocol and converts key and signing requests to pkcs11 which is mostly served by the ssh-store.  The gpg-agent frontend speaks a very cut down version of the gpg agent protocol: basically all it does is allow you to store gpg key passwords in the secret-store; it doesn’t do any cryptographic operations.

Pkcs11 Essentials

Pkcs11 is a highly complex protocol designed for opening sessions which query or operate on tokens, which roughly speaking represent a bundle of objects (If you have a USB crypto key, that’s usually represented as a token and your stored keys as objects).  There is some intermediate stuff about slots, but that mostly applies to tokens which may be offline and need insertion, which isn’t relevant to the gnome keyring.  The objects may have several levels of visibility, but the most common is public (always visible) and private (must be logged in to the token to see them).  Objects themselves have a variety of attributes, some of which depend on what type of object they are and some of which are universal (like id and label).  For instance, an object representing an RSA public key would have the public exponent and the public modulus as queryable attributes.

The pkcs11 protocol also has a reasonably comprehensive object finding protocol.  An arbitrary list of attributes and values can be passed to the query and it will return all objects that fully match.  The token identifier is a query attribute which may be present but doesn’t have to be, so if you omit it, you end up searching over every token the pkcs11 subsystem knows about.
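The matching rule described above can be sketched in a few lines of Python. This is a toy in-memory model, not a binding to a real pkcs11 library (the real interface does this through C_FindObjectsInit and C_FindObjects with attribute templates); the object list, attribute names and the logged_in parameter are assumptions made for illustration.

```python
# Hypothetical objects spread over two tokens.
OBJECTS = [
    {"token": "ssh-store", "class": "public-key", "id": b"\x01",
     "label": "work", "modulus": 0xC0FFEE, "public_exponent": 65537},
    {"token": "ssh-store", "class": "private-key", "id": b"\x01",
     "label": "work", "private": True},
    {"token": "usb-key", "class": "public-key", "id": b"\x02",
     "label": "home", "modulus": 0xBEEF, "public_exponent": 65537},
]


def find_objects(objects, template, logged_in=()):
    """Return every object whose attributes match all entries in template.

    Private objects are only visible on tokens named in logged_in.
    Leaving "token" out of the template searches across every token.
    """
    matches = []
    for obj in objects:
        if obj.get("private") and obj["token"] not in logged_in:
            continue
        if all(obj.get(name) == value for name, value in template.items()):
            matches.append(obj)
    return matches
```

Calling find_objects(OBJECTS, {"class": "public-key"}) searches both tokens, while adding "token": "usb-key" to the template narrows it to one.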

The other operation that pkcs11 understands is logging into a token with a pin (which is pkcs11 speak for a passphrase).  The pin doesn’t have to be supplied by the entity performing the login, it may be supplied to the token via an out of band mechanism (for instance the little button on the yubikey, or even a real keypad).  The important thing for gnome keyring here is that logging into a token may be as simple as sending the login command and letting the token sort out the authorization, which is the way gnome keyring operates.

You also need to understand the conventions about searching for keys.  The pkcs11 standard recommends (but does not require) that public and private key pairs, which exist as two separate objects, should have the same id attribute.  This means that if you want to find an rsa private key, the way you do this is by searching the public objects for the exponent and modulus.  Once this is returned, you log into the token and retrieve the private key object by the public key id.

The final thing to understand is that once you have the private key object, it merely acts as an entitlement to have the token perform certain private key operations for you (like encryption, decryption or signing); it doesn’t mean you have access to the private key itself.
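The lookup convention and the "entitlement" idea can be put together in a small sketch. Everything here is hypothetical: ToyToken and PrivateKeyHandle are invented names, and the key is textbook RSA with tiny primes, which is insecure and only there so the example runs without any crypto library.

```python
class PrivateKeyHandle:
    """Stands in for a private key object: it can sign, but never reveals the key."""

    def __init__(self, token, key_id):
        self._token = token
        self.key_id = key_id

    def sign(self, digest):
        # The operation is performed by the token, not by the caller.
        return self._token._sign(self.key_id, digest)


class ToyToken:
    def __init__(self):
        # Textbook RSA: n = 61 * 53, e = 17, d = 2753 (illustration only).
        self._private = {b"\x01": {"n": 3233, "d": 2753}}
        self.public = [{"id": b"\x01", "modulus": 3233, "public_exponent": 17}]
        self._logged_in = False

    def login(self):
        # The token sorts out authorization itself; no pin is passed in.
        self._logged_in = True

    def find_public(self, modulus, exponent):
        for obj in self.public:
            if obj["modulus"] == modulus and obj["public_exponent"] == exponent:
                return obj
        return None

    def get_private(self, key_id):
        if not self._logged_in:
            raise PermissionError("log in to the token first")
        if key_id not in self._private:
            raise KeyError(key_id)
        return PrivateKeyHandle(self, key_id)

    def _sign(self, key_id, digest):
        key = self._private[key_id]
        return pow(digest, key["d"], key["n"])


token = ToyToken()
pub = token.find_public(3233, 17)        # 1. search public objects
token.login()                            # 2. log in to the token
handle = token.get_private(pub["id"])    # 3. fetch private object by shared id
signature = handle.sign(65)              # 4. the token signs on our behalf
```

The caller ends up holding a handle and a signature, never the private exponent, which is the property that later makes it possible to put a TPM behind the same interface.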

Gnome Keyring handling of ssh keys

Once the ssh-agent side of gnome keyring receives a challenge, it must respond by returning the private key signature of the challenge.  To do this, it searches the pkcs11 tokens for the key used for the challenge (for RSA keys, it searches by modulus and exponent, for DSA keys it searches by signature primes, etc).  One interesting point to note is the search isn’t limited to the gnome keyring ssh token store, so if the key is found anywhere, in any pkcs11 token, it will be used.  The expectation is that the key id attribute will be the ssh fingerprint and the key label attribute will be the ssh comment, but these aren’t used in the actual search.  Once the public key is found, the agent logs into the token, retrieves the private key by label and id and proceeds to get the private key to sign the challenge.

Adding TPM key handling to the ssh store

Gnome keyring is based on GNU libgcrypt which handles all cryptographic objects via s-expressions (which are basically lisp like strings).  Gcrypt itself doesn’t seem to have an s-expression for a token, so the actual signing will be done inside the keyring code.  The s-expression I chose to represent a TPM based private key is

(private-key
  (rsa
    (tpm
      (blob <binary key blob>)
      (auth <authorization>))))

The rsa is necessary for it to be recognised for RSA signatures.  Now the ssh-store is modified to recognise the PEM guards for TPM keys, load the key blob up into the s-expression and ask the secret-store for the authorization passphrase which is also loaded.  In theory, it might be safer to ask the secret store for the authorization at key use time, but this method mirrors what happens to private keys, which are decrypted at load time and stored as s-expressions containing the component primes.

Now the RSA signing code is hooked to check for TPM s-expressions and divert the signature to the TPM if they’re found.  Once this is done, gnome keyring is fully enabled for TPM based ssh keys.  The initial email thread about this is here, and an openSUSE build repository of the modified code is here.
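As a rough illustration of the "check for TPM s-expressions and divert" step, the sketch below models an s-expression as nested Python lists. It does not use libgcrypt; the helper names and the two signing callbacks are hypothetical stand-ins for the keyring's real code paths.

```python
def child(sexp, name):
    """Return the first sub-list of sexp whose head is name, or None."""
    for item in sexp[1:]:
        if isinstance(item, list) and item and item[0] == name:
            return item
    return None


def tpm_key_parts(sexp):
    """Return (blob, auth) for a (private-key (rsa (tpm ...))) expression, else None."""
    if not sexp or sexp[0] != "private-key":
        return None
    rsa = child(sexp, "rsa")
    tpm = child(rsa, "tpm") if rsa else None
    if tpm is None:
        return None
    blob, auth = child(tpm, "blob"), child(tpm, "auth")
    if blob is None or auth is None:
        return None
    return blob[1], auth[1]


def sign(sexp, digest, tpm_sign, software_sign):
    """Divert to the TPM when the key is TPM based, otherwise sign in software."""
    parts = tpm_key_parts(sexp)
    if parts is not None:
        return tpm_sign(parts[0], parts[1], digest)
    return software_sign(sexp, digest)
```

An ordinary key such as ["private-key", ["rsa", ["n", 3233], ["d", 2753]]] has no tpm element, so it falls through to the software path unchanged.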

One important design constraint for this is that people (well, OK me) have a lot of ssh keys, sometimes more than could be effectively stored by the TPM in its internal shielded memory (plus if you’re like me, you’re using the TPM for other things like VPN), so the design of the keyring TPM additions is not to burden that memory further, thus TPM keys are only loaded into the TPM whenever they’re needed for an operation and are unloaded otherwise.  This means the TPM can scale to hundreds of keys, but at the expense of taking longer: Instead of simply asking the TPM to sign something, you have to first ask the TPM to load and unwrap (which is an RSA operation) the key, then use it to sign.  Effectively it’s two expensive TPM operations per real cryptographic function.
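The cost trade-off can be seen in a tiny model. CountingTpm below is a hypothetical stand-in that only records which expensive operations were requested; it does no cryptography and does not talk to any TPM.

```python
class CountingTpm:
    """Hypothetical TPM stand-in that just counts expensive operations."""

    def __init__(self):
        self.operations = []
        self.loaded = set()

    def load(self, blob):
        # Loading unwraps the key under its parent, which is an RSA operation.
        self.operations.append("load")
        self.loaded.add(blob)
        return blob

    def sign(self, handle, digest):
        if handle not in self.loaded:
            raise RuntimeError("key not loaded")
        self.operations.append("sign")
        return b"signature-of-" + digest  # placeholder, not a real signature

    def unload(self, handle):
        # Frees the slot in the TPM's limited internal memory.
        self.loaded.discard(handle)


def sign_on_demand(tpm, blob, digest):
    """Load the key only for the duration of one operation."""
    handle = tpm.load(blob)
    try:
        return tpm.sign(handle, digest)
    finally:
        tpm.unload(handle)
```

After one call to sign_on_demand the operation log holds a load and a sign, and nothing is left resident, which is why the number of keys is bounded by disk space and not by TPM memory.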

Using TPM based ssh keys

Once you’ve installed the modified gnome-keyring package, you’re ready actually to make use of it.  Since we’re going to replace all your ssh private keys with TPM equivalents, which are keyed to a given storage root key (SRK) which changes every time the TPM is cleared, it is prudent to take a backup of all your ssh keys on some offline encrypted USB stick, just in case you ever need to restore them (or transfer them to a new laptop).

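cd ~/.ssh/
# Convert every RSA private key that has a matching .pub file
for pub in *rsa*.pub; do
    priv=$(basename "$pub" .pub)
    echo "$priv"
    # -a sets a key authorization password, -w wraps the existing PEM key
    create_tpm_key -m -a -w "$priv" "${priv}.tpm"
    # Overwrites the original PEM private key with the TPM key file
    mv "${priv}.tpm" "$priv"
done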

You’ll be prompted first to create a TPM authorization password for the key, then to verify it, then to give the PEM password for the ssh key (for each key).  Note, this only transfers RSA keys (the only algorithm the TPM can handle) and also note the script above is overwriting the PEM private key, so make sure you have a backup.  The create_tpm_key command comes from the openssl_tpm_engine package, which I’ve patched here to support random migration authority and well known SRK authority.

All you have to do now is log out and back in (to restart the gnome-keyring daemon with the new ssh keystore) and you’re using TPM based keys for all your ssh operations.  I’ve noticed that this adds a couple of hundred milliseconds per login, so if you batch stuff over ssh, this is why your scripts are slower.

December 10, 2016 07:48 PM

December 06, 2016

LPC 2016: Linux Plumbers Conference 2017

It’s our pleasure to announce that Linux Plumbers Conference 2017 will take place on September 13-15 2017, in Los Angeles, California, USA. The conference will be co-located with the Linux Foundation Open Source Summit North America.

Stay tuned for more information as the Linux Plumbers Conference committee is starting to plan for the 2017 edition.

We hope you’ll join us in 2017.

The LPC Planning Team.

December 06, 2016 07:51 PM

December 05, 2016

James Bottomley: Using Your TPM as a Secure Key Store

One of the new features of Linux Plumbers Conference this year was the TPM Microconference, which facilitated great discussions both in the session itself and in the hallways.  Quite a bit of discussion was generated by the Beginner’s Guide to the TPM talk I gave, mostly because I blamed the Trusted Computing Group for the abject failure to adopt TPMs for anything citing the incredible complexity of their stack.

The main thing that came out of this discussion was that a lot of this stack complexity can be hidden from users and we should concentrate on making the TPM “just work” for all cryptographic functions where we have parallels in the existing security layers (like the keystore).  One of the great advantages of the TPM, instead of messing about with USB pkcs11 tokens, is that it has a file format for TPM keys (I’ll explain this later) which can be used directly in place of standard private key files.  However, before we get there, let’s discuss some of the basics of how your TPM works and how to make use of it.

TPM Basics

Note that all of what I’m saying below applies to a 1.2 TPM (the type most people have in their laptops).  2.0 TPMs are now appearing on the market, but chances are you have a 1.2.

A TPM is traditionally delivered in your laptop in an uninitialised state.  In older laptops, the TPM is traditionally disabled and you usually have to find an entry in the BIOS menu to enable it.  In more modern laptops (thanks to Windows 10) the TPM is enabled in the BIOS and ready for the OS install to make use of it.  All TPMs are delivered with one manufacturer set key called the Endorsement Key (EK).  This key is unique to your TPM (like an identifying label) and is used as part of the attestation protocol.  Because the EK is a unique label, the attestation protocol is rather complex, involving a so-called privacy CA to protect your identity, but because it isn’t necessary to use the TPM as a secure keystore, I won’t cover it further.

The other important key, which you have to generate, is called the Storage Root Key.  This key is generated internally within the TPM once somebody takes ownership of it.  The package you need to begin using the tpm is tpm-tools, which is packaged by most distros. You must also have the Linux TSS stack trousers installed (just installing tpm-tools will often pull this in) and have the tcsd part of trousers running (usually systemctl start tcsd; systemctl enable tcsd). I tend to configure my TPM with an owner password (for things like resetting dictionary attacks) but a well known storage root key authority.  To do this from a fully cleared and enabled TPM, execute

tpm_takeownership -z

And give your chosen owner password when prompted.  If you get an error, chances are you need to go back to the BIOS menu and actively clear and reset the TPM (usually under the security options).

Aside about Authority and the Trusted Security Stack

To the TPM, an “authority” is a 20 byte number you use to prove you’re allowed to manipulate whatever object you’re trying to use.  The TPM typically has a well known way of converting typed passwords into these 20 byte codes.  The way you prove you know the authority is to add a Hashed Message Authentication Code (HMAC) to your TPM command.  This means that the hash can only be generated by someone who knows the authority for the object, but anyone seeing the hash cannot derive the authority from it.  The utility of this is that the trousers library (tspi) generates the HMAC before the TPM command is passed to the central daemon (tcsd) meaning that nothing except you and the TPM know the authority.

The final thing about authority you need to know is that the TPM has a concept of “well known authority” which simply means supply 20 bytes of zeros.  It’s kind of paradoxical to have a secret everyone knows, however, there are reasons for this:  For most objects in the TPM whether you require authority to use them is optional, but for some it is mandatory.  For objects (like the SRK) where authority is mandatory, using the well known authority is equivalent to saying actually I don’t need authorization for this object.
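To make the authority mechanics a little more tangible, here is a minimal sketch using only Python's standard library. It assumes the password is turned into 20 bytes by a plain SHA-1 hash of its UTF-8 encoding and it uses a deliberately simplified HMAC input (a digest of the command plus a nonce); the real TPM 1.2 authorization sessions have a more involved layout, and all function names here are made up for the example.

```python
import hashlib
import hmac

WELL_KNOWN_SECRET = bytes(20)  # "well known authority": 20 bytes of zeros


def authority_from_password(password):
    """Assumed conversion: SHA-1 of the password gives a 20 byte authority."""
    return hashlib.sha1(password.encode("utf-8")).digest()


def authorize(authority, command, nonce):
    """Compute the HMAC that accompanies a command (simplified layout)."""
    message = hashlib.sha1(command).digest() + nonce
    return hmac.new(authority, message, hashlib.sha1).digest()


def tpm_accepts(stored_authority, command, nonce, mac):
    """What the TPM side would check: recompute the HMAC and compare."""
    expected = authorize(stored_authority, command, nonce)
    return hmac.compare_digest(expected, mac)
```

Only the HMAC travels with the command, so a daemon relaying it sees neither the password nor the 20 byte authority. With WELL_KNOWN_SECRET as the authority, anyone can compute a valid HMAC, which is exactly the "no authorization needed" behaviour described above.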

The Storage Root Key (SRK)

Once you’ve generated this above, the TPM keeps the secret part permanently hidden, but can be persuaded to give anyone the public part.  In TPM 1.2, the SRK is an RSA 2048 key.  On most modern TPMs, you have to tell the TPM you want anyone to be able to read the public part of the storage root key, which you do with this command

tpm_restrictsrk -a

You’ll get prompted for the owner password.  Once you execute this command, anyone who knows the SRK authority (which you’ve set to be well known) is allowed to read the public part.

Why all this fuss about SRK authorization?  Well, traditionally, the TPM is designed for use in a hostile multi-user environment.  In the relaxed, no authorization, environment I’ve advised you to set up, anyone who knows the SRK can upload any storage object (like a key or protected blob) into the TPM.  This means, since the TPM has very limited storage, that they could in theory do a DoS attack against the TPM simply by filling it with objects.  On a laptop where there’s only one user (you) this is not usually a concern, hence the advice to use a well known authority, which makes the TPM much easier to use.

The way external objects (like keys or data blobs) are uploaded into the TPM is that they all have a parent (which must be a storage key) and they are encrypted to the public part of this key (in TPM parlance, this is called wrapping).  The TPM can have deep key hierarchies (all eventually parented to the SRK), but for a laptop, it makes sense simply to use the SRK as the only storage key and wrap everything for it as the parent.  Now here’s the reason for the well known authority: to upload an object into the TPM, it not only needs to be wrapped to the parent key, you also need to use the parent key authority to perform the upload.  The object you’re using also has a separate authority.  This means that when you upload and use a key, if you’ve set a SRK password, you’ll end up having to type both the SRK password and the key password pretty much every time you use it, which is a bit of a pain.

The tools used to create wrapped keys are found in the openssl_tpm_engine package.  I’ve done a few patches to make it easier to use (mostly by trying well known authority first before asking for the SRK password), so you can see my patched version here.  The first thing you can do is take any PEM key file you have and wrap it for your tpm

create_tpm_key -m -w test.key test.tpm.key

This creates a TPM key file test.tpm.key containing a wrapped key for your TPM with no authority (to add an authority password, use the -a option).  If you cat the test.tpm.key file, you’ll see it looks like a standard PEM file, except the guards are now

-----BEGIN TSS KEY BLOB-----
-----END TSS KEY BLOB-----

This key is now wrapped for your TPM’s SRK and would only be usable on your laptop.  If you’re fortunate enough to be using an application linked with gnutls, you can simply use this key with the URI  tpmkey:file=<path to test.tpm.key>.  If you’re using openssl, you need to patch it to get it to use TPM keys easily (see below).

The ideal, however, would be that, since these are PEM files with unique guards, any ssl provider should simply recognise the guards and load the key into the TPM.  This means that in order to use a TPM key, you take a standard PEM private key file, transform it into a TPM key file and then simply copy it back to where the original key file was being used from and voila! you’re using a TPM based key.  This is what the openssl patches below do.
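Recognising a key by its guards is simple enough to sketch. The function below is a hypothetical illustration of the dispatch idea only; it is not how openssl or gnutls implement it, and the return values are invented labels.

```python
TPM_GUARD = "-----BEGIN TSS KEY BLOB-----"


def key_kind(pem_text):
    """Classify a PEM file by its first BEGIN guard line."""
    for line in pem_text.splitlines():
        line = line.strip()
        if line.startswith("-----BEGIN "):
            # A TSS key blob goes to the TPM; anything else stays in software.
            return "tpm" if line == TPM_GUARD else "software"
    return "unknown"
```

A loader built this way can accept whatever file sits at the configured key path and pick the right backend without the application being told which kind of key it has.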

Getting TPM keys to “just work” with openssl

In openssl, external encryption processors, like the TPM or USB keys are used by things called engines.  The engine you need for the TPM is also in the openssl_tpm_engine package, so once you’ve installed that package, the engine is available.  Unfortunately, openssl doesn’t naturally use a particular engine unless told to do so (most of the openssl tools have a -engine option for this).  However, having to specify the engine in every application somewhat spoils the “just works” aspect we’re looking for, so the openssl patches here allow an engine to specify that it knows how to parse a PEM file and can load a key from it.  This allows you simply to replace the original key file with a TPM protected key file and have your application continue working with it.

As a demo of the usefulness, I’m using it on my current laptop with all my VPN keys.  It is also possible to use it with openssh keys, since they’re standard PEM files.  However, the way openssh works with agents means that the agent cannot handle the keys and you have to type the password (if you set one on the key) each time you use it.

It should be noted that the idea of having PEM based TPM keys just work in openssl is encountering resistance.  However, it does just work in gnutls (provided you change the file name to be a tpmkey:file= URL).

Conclusions (or How Well is it Working?)

As I said above, I’m currently using this scheme for my openvpn and ssh keys.  I have to confess, since I use openssh a lot, I got very tired of having to type the password on every ssh operation, so I’ve gone back to using non-TPM based keys which can be handled by the agent.  Fixing this is on my list of things to look at.  However, I still am using TPM based keys for my openvpn.

Even for openvpn, though, there are hiccoughs: the trousers daemon, tcsd, crashes periodically on my platform.  When it does, the vpn goes down (because the VPN needs a key based authentication transaction every hour to rotate the symmetric encryption keys).  Unfortunately, just restarting tcsd isn’t enough because the design of trousers doesn’t seem to be robust to this failure (even though the tspi part linked with the application could recreate all the keys), so the VPN itself must be restarted when this happens, which makes it rather user unfriendly.  Fixing trousers to cope with tcsd failure is also on my list of things to fix …

December 05, 2016 04:41 PM

December 02, 2016

Matthew Garrett: Ubuntu still isn't free software

Mark Shuttleworth just blogged about their stance against unofficial Ubuntu images. The assertion is that a cloud hoster is providing unofficial and modified Ubuntu images, and that these images are meaningfully different from upstream Ubuntu in terms of their functionality and security. Users are attempting to make use of these images, are finding that they don't work properly and are assuming that Ubuntu is a shoddy product. This is an entirely legitimate concern, and if Canonical are acting to reduce user confusion then they should be commended for that.

The appropriate means to handle this kind of issue is trademark law. If someone claims that something is Ubuntu when it isn't, that's probably an infringement of the trademark and it's entirely reasonable for the trademark owner to take action to protect the value associated with their trademark. But Canonical's IP policy goes much further than that - it can be interpreted as meaning[1] that you can't distribute works based on Ubuntu without paying Canonical for the privilege, even if you call it something other than Ubuntu.

This remains incompatible with the principles of free software. The freedom to take someone else's work and redistribute it is a vital part of the four freedoms. It's legitimate for Canonical to insist that you not pass it off as their work when doing so, but their IP policy continues to insist that you remove all references to Canonical's trademarks even if their use would not infringe trademark law.

If you ask a copyright holder if you can give a copy of their work to someone else (assuming it doesn't infringe trademark law), and they say no or insist you need an additional contract, it's not free software. If they insist that you recompile source code before you can give copies to someone else, it's not free software. Asking that you remove trademarks that would otherwise infringe trademark law is fine, but if you can't use their trademarks in non-infringing ways, that's still not free software.

Canonical's IP policy continues to impose restrictions on all of these things, and therefore Ubuntu is not free software.

[1] And by "interpreted as meaning" I mean that's what it says and Canonical refuse to say otherwise

December 02, 2016 09:37 AM

November 16, 2016

Pavel Machek: Linux did not win, yet

http://www.cio.com/article/3141918/linux/linux-has-won-microsoft-joins-the-linux-foundation.html

Yes, Linux won on servers. Unfortunately... servers are not that important, and Linux still did not win on desktops (and is not much closer now than it was in 1998, AFAICT). We kind-of won on phones, but are not getting any benefits from that. Android is incompatible with X applications. Kernels on phones are so patched that updating kernel on phone is impossible... :-(.

This means that Microsoft sponsors Linux Foundation. Well, nice, but not a big deal. Has Microsoft promised not to use their patents against Linux? Does their kernel actually contain vfat code? Can I even get source for "their" Linux kernel? [Searching for Linux on microsoft.com does not reveal anything interesting; might be switching to English would help...]

November 16, 2016 11:49 PM

November 15, 2016

Paul E. Mc Kenney: Another great Linux Plumbers Conference!

A big “thank you” to the program committee, to the microconference leads, to the refereed-track speakers, and, most of all, to the attendees! We had a great Linux Plumbers Conference this year, and we could not have done it without all of you!!!

November 15, 2016 10:47 PM

November 14, 2016

Pavel Machek: foxtrotgps: not suitable for spacecraft navigation

Subject: foxtrotgps: not suitable for spacecraft navigation
Package: foxtrotgps
Version: 1.2.0-1
Severity: normal
Dear Maintainer,
Trying to use foxtrotgps in the spacecraft leads to some interesting
glitches.
When date line is reached, "track traveled" jumps over the whole
world, and "your position" gets de-synchronized from point when the
red line is painted.
Reproduced with Vostok-1 spacecraft.

November 14, 2016 10:22 AM

November 10, 2016

Matthew Garrett: Tor, TPMs and service integrity attestation

One of the most powerful (and most scary) features of TPM-based measured boot is the ability for remote systems to request that clients attest to their boot state, allowing the remote system to determine whether the client has booted in the correct state. This involves each component in the boot process writing a hash of the next component into the TPM and logging it. When attestation is requested, the remote site gives the client a nonce and asks for an attestation, the client OS passes the nonce to the TPM and asks it to provide a signed copy of the hashes and the nonce and sends them (and the log) to the remote site. The remote site then replays the log to ensure it matches the signed hash values, and can examine the log to determine whether the system is trustworthy (whatever trustworthy means in this context).
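The "replay the log" step is easier to follow with a small example. The sketch below models a single SHA-1 based register where each new measurement is folded into the old value by hashing the two together, which is how TPM 1.2 PCR extension works; signature checking and the nonce are left out, and the function names are chosen for illustration.

```python
import hashlib

PCR_SIZE = 20  # SHA-1 digest size


def measure(component):
    """Hash of a boot component, as it would be written to the log."""
    return hashlib.sha1(component).digest()


def extend(pcr, measurement):
    """Extend: new value = SHA1(old value || measurement)."""
    return hashlib.sha1(pcr + measurement).digest()


def replay(log):
    """Recompute the final register value from an event log."""
    pcr = bytes(PCR_SIZE)  # the register starts out as all zeros
    for measurement in log:
        pcr = extend(pcr, measurement)
    return pcr


def log_matches(log, quoted_pcr):
    """The remote site's check: does the log explain the signed value?"""
    return replay(log) == quoted_pcr
```

Because each value depends on every measurement before it, a client cannot drop, reorder or alter a log entry without the replayed value diverging from the one the TPM signed.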

When this was first proposed people were (justifiably!) scared that remote services would start refusing to work for users who weren't running (for instance) an approved version of Windows with a verifiable DRM stack. Various practical matters made this impossible. The first was that, until fairly recently, there was no way to demonstrate that the key used to sign the hashes actually came from a TPM[1], so anyone could simply generate a set of valid hashes, sign them with a random key and provide that. The second is that even if you have a signature from a TPM, you have no way of proving that it's from the TPM that the client booted with (you can MITM the request and either pass it to a client that did boot the appropriate OS or to an external TPM that you've plugged into your system after boot and then programmed appropriately). The third is that, well, systems and configurations vary so much that outside very controlled circumstances it's impossible to know what a "legitimate" set of hashes even is.

As a result, so far remote attestation has tended to be restricted to internal deployments. Some enterprises use it as part of their VPN login process, and we've been working on it at CoreOS to enable Kubernetes clusters to verify that workers are in a trustworthy state before running jobs on them. While useful, this isn't terribly exciting for most people. Can we do better?

Remote attestation has generally been thought of in terms of remote systems requiring that clients attest. But there's nothing that requires things to be done in that direction. There's nothing stopping clients from being able to request that a server attest to its state, allowing clients to make informed decisions about whether they should provide confidential data. But the problems that apply to clients apply equally well to servers. Let's work through them in reverse order.

We have no idea what expected "good" values are

Yes, and this is a problem. CoreOS ships with an expected set of good values, and we had general agreement at the Linux Plumbers Conference that other distributions would start looking at what it would take to do the same. But how do we know that those values are themselves trustworthy? In an ideal world this would involve reproducible builds, allowing anybody to grab the source code for the OS, build it locally and verify that they have the same hashes.

Ok. So we're able to verify that the booted OS was good. But how about the services? The rkt container runtime supports measuring each container into the TPM, which means we can verify which container images were started. If container images are also built in such a way that they're reproducible, users can grab the source code, rebuild the container locally and again verify that it has the same hashes. Users can then be sure that the remote site is running the code they're looking at.

Or can they? Not really - a general purpose OS has all kinds of ways to inject code into containers, so an admin could simply replace the binaries inside the container after it's been measured, or ptrace() the server, or modify rkt so it generates correct measurements regardless of the image or, well, there's lots they could do. So a general purpose OS is probably a bad idea here. Instead, let's imagine an immutable OS that does nothing other than bring up networking and then reads a config file that tells it which container images to download and run. This reduces the amount of code that needs to support reproducible builds, making it easier for a client to verify that the source corresponds to the code the remote system is actually running.

Is this sufficient? Eh sadly no. Even if we know the valid values for the entire OS and every container, we don't know the legitimate values for the system firmware. Any modified firmware could tamper with the rest of the trust chain, making it possible for you to get valid OS values even if the OS has been subverted. This isn't a solved problem yet, and really requires hardware vendor support. Let's handwave this for now, or assert that we'll have some sidechannel for distributing valid firmware values.

Avoiding TPM MITMing

This one's more interesting. If I ask the server to attest to its state, it can simply pass that through to a TPM running on another system that's running a trusted stack and happily serve me content from a compromised stack. Suboptimal. We need some way to tie the TPM identity and the service identity to each other.

Thankfully, we have one. Tor supports running services in the .onion TLD. The key used to identify the service to the Tor network is also used to create the "hostname" of the system. I wrote a pretty hacky implementation that generates that key on the TPM, tying the service identity to the TPM. You can ask the TPM to prove that it generated a key, and that allows you to tie both the key used to run the Tor service and the key used to sign the attestation hashes to the same TPM. You now know that the attestation values came from the same system that's running the service, and that means you know the TPM hasn't been MITMed.
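To make the link between key and hostname concrete, here is a rough sketch of how the older (v2) style of .onion name is derived from a service's public key, as far as I understand that scheme: hash the DER-encoded RSA public key with SHA-1, keep the first 80 bits and base32-encode them. The function name is mine, and the TPM side (generating the key on the TPM and proving it did so) is not shown.

```python
import base64
import hashlib


def onion_hostname(public_key_der: bytes) -> str:
    """Derive a v2-style .onion name from a DER-encoded RSA public key."""
    digest = hashlib.sha1(public_key_der).digest()
    # First 80 bits (10 bytes) of the digest, base32-encoded, lowercase
    label = base64.b32encode(digest[:10]).decode("ascii").lower()
    return label + ".onion"
```

Since the name is a pure function of the public key, anyone who can show that the key lives in a given TPM has also tied the hostname to that TPM.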

How do you know it's a TPM at all?

This is much easier. See [1].



There's still various problems around this, including the fact that we don't have this immutable minimal container OS, that we don't have the infrastructure to ensure that container builds are reproducible, that we don't have any known good firmware values and that we don't have a mechanism for allowing a user to perform any of this validation. But these are all solvable, and it seems like an interesting project.

"Interesting" isn't necessarily the right metric, though. "Useful" is. And I think this is very useful. If I'm about to upload documents to a SecureDrop instance, it seems pretty important that I be able to verify that it is a SecureDrop instance rather than something pretending to be one. This gives us a mechanism.

The next few years seem likely to raise interest in ensuring that people have secure mechanisms to communicate. I'm not emotionally invested in this one, but if people have better ideas about how to solve this problem then this seems like a good time to talk about them.

[1] More modern TPMs have a certificate that chains from the TPM's root key back to the TPM manufacturer, so as long as you trust the TPM manufacturer to have kept control of that you can prove that the signature came from a real TPM


November 10, 2016 08:48 PM

November 04, 2016

LPC 2016: Closing party at the compound

We have a map here for you to locate the compound with

November 04, 2016 10:18 PM

October 28, 2016

Matthew Garrett: Of course smart homes are targets for hackers

The Wirecutter, an in-depth comparative review site for various electrical and electronic devices, just published an opinion piece on whether users should be worried about security issues in IoT devices. The summary: avoid devices that don't require passwords (or don't force you to change a default) and devices that want you to disable security, follow general network security best practices, but otherwise don't worry - criminals aren't likely to target you.

This is terrible, irresponsible advice. It's true that most users aren't likely to be individually targeted by random criminals, but that's a poor threat model. As I've mentioned before, you need to worry about people with an interest in you. Making purchasing decisions based on the assumption that you'll never end up dating someone with enough knowledge to compromise a cheap IoT device (or even meeting an especially creepy one in a bar) is not safe, and giving advice that doesn't take that into account is a huge disservice to many potentially vulnerable users.

Of course, there's also the larger question raised by the last week's problems. Insecure IoT devices still pose a threat to the wider internet, even if the owner's data isn't at risk. I may not be optimistic about the ease of fixing this problem, but that doesn't mean we should just give up. It is important that we improve the security of devices, and many vendors are just bad at that.

So, here's a few things that should be a minimum when considering an IoT device:

  • Does the vendor publish a security contact? (If not, they don't care about security)
  • Does the vendor provide frequent software updates, even for devices that are several years old? (If not, they don't care about security)
  • Has the vendor ever denied a security issue that turned out to be real? (If so, they care more about PR than security)
  • Is the vendor able to provide the source code to any open source components they use? (If not, they don't know which software is in their own product and so don't care about security, and also they're probably infringing my copyright)
  • Do they mark updates as fixing security bugs? (If not, they care more about hiding security issues than fixing them)
  • Has the vendor ever threatened to prosecute a security researcher? (If so, again, they care more about PR than security)
  • Does the vendor provide a public minimum support period for the device? (If not, they don't care about security or their users)

    I've worked with big name vendors who did a brilliant job here. I've also worked with big name vendors who responded with hostility when I pointed out that they were selling a device with arbitrary remote code execution. Going with brand names is probably a good proxy for many of these requirements, but it's insufficient.

    So here's my recommendations to The Wirecutter - talk to a wide range of security experts about the issues that users should be concerned about, and figure out how to test these things yourself. Don't just ask vendors whether they care about security, ask them what their processes and procedures look like. Look at their history. And don't assume that just because nobody's interested in you, everybody else's level of risk is equal.



    October 28, 2016 05:23 PM

    October 25, 2016

    Valerie Aurora: Why I won’t be attending Systems We Love

    Systems We Love is a one day event in San Francisco to talk excitedly about systems computing. When I first heard about it, I was thrilled! I love systems so much that I moved from New Mexico to the Bay Area when I was 23 years old purely so that I could talk to more people about them. I’m the author of the Kernel Hacker’s Bookshelf series, in which I enthusiastically described operating systems research papers I loved in the hopes that systems programmers would implement them. The program committee of Systems We Love includes many people I respect and enjoy being around. And the event is so close to me that I could walk to it.

    So why am I not going to Systems We Love? Why am I warning my friends to think twice before attending? And why am I writing a blog post warning other people about attending Systems We Love?

    The answer is that I am afraid that Bryan Cantrill, the lead organizer of Systems We Love, will say cruel and humiliating things to people who attend. Here’s why I’m worried about that.

    I worked with Bryan in the Solaris operating systems group at Sun from 2002 to 2004. We didn’t work on the same projects, but I often talked to him at the weekly Monday night Solaris kernel dinner at Osteria in Palo Alto, participated in the same mailing lists as him, and stopped by his office to ask him questions every week or two. Even 14 years ago, Bryan was one of the best systems programmers, writers, and speakers I have ever met. I admired him and learned a lot from him. At the same time, I was relieved when I left Sun because I knew I’d never have to work with Bryan again.

    Here’s one way to put it: to me, Bryan Cantrill is the opposite of another person I admire in operating systems (whom I will leave unnamed). This person makes me feel excited and welcome and safe to talk about and explore operating systems. I’ve never seen them shame or insult or put down anyone. They enthusiastically and openly talk about learning new systems concepts, even when other people think they should already know them. By doing this, they show others that it’s safe to admit that they don’t know something, which is the first step to learning new things. They are helping create the kind of culture I want in systems programming – the kind of culture promoted by Papers We Love, which Bryan cites as the inspiration for Systems We Love.

    By contrast, when I’m talking to Bryan I feel afraid, cautious, and fearful. Over the years I worked with Bryan, I watched him shame and insult hundreds of people, in public and in private, over email and in person, in papers and talks. Bryan is no Linus Torvalds – Bryan’s insults are usually subtle, insinuating, and beautifully phrased, whereas Linus’ insults tend towards the crude and direct. Even as you are blushing in shame from what Bryan just said about you, you are also admiring his vocabulary, cadence, and command of classical allusion. When I talked to Bryan about any topic, I felt like I was engaging in combat with a much stronger foe who only wanted to win, not help me learn. I always had the nagging fear that I probably wouldn’t even know how cleverly he had insulted me until hours later. I’m sure other people had more positive experiences with Bryan, but my experience matches that of many others. In summary, Bryan is supporting the status quo of the existing culture of systems programming, which is a culture of combat, humiliation, and domination.

    People admire and sometimes hero-worship Bryan because he’s a brilliant technologist, an excellent communicator, and a consummate entertainer. But all that brilliance, sparkle, and wit are often used in the service of mocking and humiliating other people. We often laugh and are entertained by what Bryan says, but most of the time we are laughing at another person, or at a person by proxy through their work. I think we rationalize taking part in this kind of cruelty by saying that the target “deserves” it because they made a short-sighted design decision, or wrote buggy code, or accidentally made themselves appear ridiculous. I argue that no one deserves to be humiliated or laughed at for making an honest mistake, or learning in public, or doing the best they could with the resources they had. And if that means that people like Bryan have to learn how to be entertaining without humiliating people, I’m totally fine with that.

    I stopped working with Bryan in 2004, which was 12 years ago. It’s fair to wonder if Bryan has had a change of heart since then. As far as I can tell, the answer is no. I remember speaking to Bryan in 2010 and 2011 and it was déjà vu all over again. The first time, I had just co-founded a non-profit for women in open technology and culture, and I was astonished when Bryan delivered a monologue to me on the “right” way to get more women involved in computing. The second time I was trying to catch up with a colleague I hadn’t seen in a while and Bryan was invited along. Bryan dominated the conversation and the two of us the entire evening, despite my best efforts. I tried one more time about a month ago: I sent Bryan a private message on Twitter telling him honestly and truthfully what my experience of working with him was like, and asking if he’d had a change of heart since then. His reply: “I don’t know what you’re referring to, and I don’t feel my position on this has meaningfully changed — though I am certainly older and wiser.” Then he told me to google something he’d written about women in computing.

    But you don’t have to trust my word on what Bryan is like today. The blog post Bryan wrote announcing Systems We Love sounds exactly like the Bryan I knew: erudite, witty, self-praising, and full of elegant insults directed at a broad swathe of people. He gaily recounts the time he gave a highly critical keynote speech at USENIX, bashfully links to a video praising him at a Papers We Love event, elegantly puts down most of the existing operating systems research community, and does it all while using the words “ancillary,” “verve,” and “quadrennial.” Once you know the underlying structure – a layer cake of vituperation and braggadocio, frosted with eloquence – you can see the same pattern in most of his writing and talks.

    So when I heard about Systems We Love, my first thought was, “Maybe I can go but just avoid talking to Bryan and leave the room when he is speaking.” Then I thought, “I should warn my friends who are going.” Then I realized that my friends are relatively confident and successful in this field, but the people I should be worried about are the ones just getting started. Based on the reputation of Papers We Love and the members of the Systems We Love program committee, they probably fully expect to be treated respectfully and kindly. I’m old and scarred and know what to expect when Bryan talks, and my stomach roils at the thought of attending this event. How much worse would it be for someone new and open and totally unprepared?

    Bryan is a better programmer than I am. Bryan is a better systems architect than I am. Bryan is a better writer and speaker than I am. The one area I feel confident that I know more about than Bryan is increasing diversity in computing. And I am certain that the environment that Bryan creates and fosters is more likely to discourage and drive off women of all races, people of color, queer and trans folks, and other people from underrepresented groups. We’re already standing closer to the exit; for many of us, it doesn’t take much to make us slip quietly out the door and never return.

    I’m guessing that Bryan will respond to me saying that he humiliates, dominates, and insults people by trying to humiliate, dominate, and insult me. I’m not sure if he’ll criticize my programming ability, my taste in operating systems, or my work on increasing diversity in tech. Maybe he’ll criticize me for humiliating, dominating, and insulting people myself – and I’ll admit, I did my fair share of that when I was trying to emulate leaders in my field such as Bryan Cantrill and Linus Torvalds. It’s gone now, but for years there was a quote from me on a friend’s web site, something like: “I’m an elitist jerk, I fit right in at Sun.” It took me years to detox and unlearn those habits and I hope I’m a kinder, more considerate person now.

    Even if Bryan doesn’t attack me, people who like the current unpleasant culture of systems programming will. I thought long and hard about the friendships, business opportunities, and social capital I would lose over this blog post. I thought about getting harassed and threatened on social media. I thought about a week of cringing whenever I check my email. Then I thought about the people who might attend Systems We Love: young folks, new developers, a trans woman at her first computing event since coming out – people who are looking for a friendly and supportive place to talk about systems at the beginning of their careers. I thought about them being deeply hurt and possibly discouraged for life from a field that gave me so much joy.

    Come at me, Bryan.

    Note: comments are now closed on this post. You can read and possibly comment on the follow-up post, When is naming abuse itself abusive?


    Tagged: conferences, feminism, kernel

    October 25, 2016 03:24 AM

    October 24, 2016

    LPC 2016: Things to remember about the altitude in Santa Fe

    Santa Fe is at an altitude of 7,200 feet (2,200m). There are a few things that attendees who are not used to higher altitudes may want to bear in mind:

    October 24, 2016 12:36 AM

    October 23, 2016

    James Bottomley: Home Automation: Coping with Insecurity in the IoT

    Reading Matthew Garrett’s exposés of home automation IoT devices makes most engineers think “hell no!” or “over my dead body!”.  However, there’s also the siren lure that the ability to program your home, or update its settings from anywhere in the world is phenomenally useful:  for instance, the outside lights in my house used to depend on two timers (located about 50m from each other).  They were old, loud (to the point the neighbours used to wonder what the buzzing was when they visited) and almost always wrongly set for turning the lights on at sunset.  The final precipitating factor for me was the need to replace our thermostat, whose thermistor got so eccentric it started cooling in winter; so away went all the timers and their loud noises and in came a z-wave based home automation system, and the guilty pleasure of having an IoT based home automation system.  Now the lights precisely and quietly turn on at sunset and off at 23:00 (adjusting themselves for daylight savings); the thermostat is accessible from my phone, meaning I can adjust it from wherever I happen to be (including Hong Kong airport when I realised I’d forgotten to set it to energy saving mode before we went on holiday).  Finally, there’s waking up at 3am to realise your wife has fallen asleep over her book again and being able to turn off her reading light from your alarm clock without having to get out of bed … Automation bliss!

    We all want the convenience; the trick is to work around the rampant insecurity that comes with today’s IoT to avoid your home automation system being part of the DDoS bot net that brings down the internet.

    Selecting your network

    For me, choosing nothing IP/Wifi based was partly due to Matthew’s blog and partly because my home Wifi network looks different from everyone else’s: I actually run an internal, secure, home network that is wired and have my Wifi sit unsecured and outside the firewall.  This goes back to the good old days of expecting to find wifi wherever you travelled and returning the courtesy by ensuring your wifi was accessible, but it does mean that any wifi connected device would be outside my firewall and open to all, which, given the general insecurity of the devices, makes this a non-starter.

    The next level down is to use a private network, like zigbee or z-wave.  I chose z-wave because it covers longer distances (which I need) and it doesn’t interfere with wifi (I have a hard time covering the entire house, even with two wifi access points).  Z-wave also looks secure, but, if you dig deeply, you find that there are flaws in the protocol that lay you open to a local attacker.  This, by the way, shows the futility of demanding security from IoT vendors who really don’t understand how to do it: a flawed security implementation is pretty much as bad as no security at all.

    Once this decision is made, the next is to choose a gateway to the internet that does what you want, namely give you remote control without giving up your security.

    Gateway Phone Home?

    A surprising number of z-wave controllers are of the phone home type (this means phone their manufacturer’s home, not you), and almost all of these simply won’t work if they’re not allowed to phone home.  Google comprehensively demonstrated the issues this raises with nest: lots of early adopters now have so much non-functional junk.

    For me, there was also the burned hand experience with Google services: whenever I travel, I invariably get locked out because of some pseudo-security issue and it takes a fight to get back in again. This ultimately precipitated my move away from the Google cloud and on to Owncloud for calendar and contacts, but also means I really don’t want to have to trust another external service for my home automation.

    Given the significantly limited choice of non-phone home z-wave controllers, I chose the HomeSeer Zee S2.  It’s basically a raspberry pi with a z-wave dongle and Linux.  If you’re into Linux on everything, you should be aware that the home automation system is actually written in .net and it uses mono to bridge the gap; an odd choice given that there’s no known windows platform that could actually possibly run this system.

    Secure Internet based Automation

    The ZS2 does actually come with wifi, but given my already listed wifi problems, it’s actually plugged into my secure wired network with all phone home capabilities disabled.  Great, but that means it’s only accessible over a VPN and I want to be able to control it from things like my phone, where running a VPN is cumbersome, so let’s do some magic tricks to make it securely accessible by any member of the family from any device.

    Obviously, since I already run Owncloud, I have a server of my own in a co-located site.  It’s this server I propose to use as my secure gateway.  The obvious way of doing this is simply proxying the ZS2 controller web page, but there are a couple of problems: firstly if I do it globally the ZS2 will be visible to port scans and secondly it only actually has an unencrypted web page with http authentication, meaning the login credentials would go over the internet in clear text … oops!

    The solution to the first of these is to make the web page only accessible to authenticated devices.  My current method is to use firewall whitelisting and a hook to an existing service authentication to open up the port.  So in the firewall mangle table, all the ports which require whitelisting are marked.  Then, in the input firewall, any packet so marked is checked against the whitelist for a matching source IP.  If a match is found, then the packet is permitted, otherwise it is denied.
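The actual firewall rules aren't shown here, but the two-stage decision described above can be modelled roughly as follows. The port number and mark value are hypothetical placeholders, and this is a toy model of the logic rather than anything netfilter runs.

```python
from dataclasses import dataclass

# Hypothetical values; the real ports and mark are not given in the post.
PROTECTED_PORTS = {8443}
WHITELIST_MARK = 0x1


@dataclass
class Packet:
    src_ip: str
    dst_port: int
    mark: int = 0


def mangle(packet: Packet) -> Packet:
    """Mangle stage: tag packets aimed at ports that require whitelisting."""
    if packet.dst_port in PROTECTED_PORTS:
        packet.mark |= WHITELIST_MARK
    return packet


def input_filter(packet: Packet, whitelist: set[str]) -> bool:
    """Input stage: marked packets pass only if the source IP is whitelisted."""
    if packet.mark & WHITELIST_MARK:
        return packet.src_ip in whitelist
    return True  # unmarked traffic is left to the rest of the ruleset


def accept(packet: Packet, whitelist: set[str]) -> bool:
    """Run a packet through both stages and report whether it is admitted."""
    return input_filter(mangle(packet), whitelist)
```

Splitting the marking from the check means a new protected port only needs one extra marking rule; the whitelist test itself stays in a single place.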

    Whitelisting itself is done by a simple pam script

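    #!/usr/bin/perl
    # PAM session hook: add the remote host of a successful login to the
    # xt_recent whitelist so the firewall admits it to the protected ports.
    use strict;
    use warnings;
    use Socket;
    
    my $xt_file = '/proc/net/xt_recent/whitelist';
    
    my $name = $ENV{'PAM_RHOST'};
    # PAM_RHOST may be unset (e.g. local logins); nothing to whitelist then
    exit 0 unless defined $name && length $name;
    
    my $addr;
    if ($name =~ m/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
     # already a dotted-quad IPv4 address
     $addr = $name;
    } else {
     # resolve the hostname; give up quietly if it does not resolve
     my $ip = gethostbyname($name) or exit 0;
     $addr = inet_ntoa($ip);
    }
    
    # writing "+addr" adds the address to the xt_recent list
    open(my $fd, '>', $xt_file) or exit 0;
    print $fd "+$addr\n";
    close($fd);
    
    exit 0;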

    And this script is executed from the dovecot pam file as

    # add session to cause ip address of successful login to be whitelisted
    session optional pam_exec.so /etc/pam.d/whitelist.pl

    Meaning that any IP address that gets an authenticated imap connection (which is basically everybody’s internet device, since they all connect to email) is now allowed to access the authenticated ports.  Since imap requires re-authentication after a configurable timeout, the whitelist entry only lasts for just over that timeout and hey presto, we have our secured port system.
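The expiry behaviour can be sketched like this. It is only a model: the class name and the injectable clock are mine, and the firewall rule that actually enforces the timeout isn't shown in this post.

```python
import time


class ExpiringWhitelist:
    """Source IPs that stay whitelisted for `ttl` seconds after a login."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so the behaviour can be tested
        self._last_seen: dict[str, float] = {}

    def add(self, ip: str) -> None:
        """Called on each successful login; refreshes the entry."""
        self._last_seen[ip] = self.clock()

    def allows(self, ip: str) -> bool:
        """True while the most recent login is younger than the TTL."""
        seen = self._last_seen.get(ip)
        return seen is not None and self.clock() - seen <= self.ttl
```

Setting the TTL slightly above the imap re-authentication interval keeps an active mail client continuously whitelisted, while an address that stops authenticating drops off on its own.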

    Obviously, this isn’t foolproof: in particular whitelisting by external IP means that anyone sharing the same ip address via nat (like at a hotel) also has access to the secured ports, but it does cut down enormously on generic internet visibility.

    The final thing is to add security to the insecure web page, so anyone in the path to my internet host can’t sniff the password.  This is easily achieved by an stunnel redirect from the secure incoming port to the ZS2 over the VPN that connects to the internal network.  The beauty of this is that stunnel can now use the existing web certificate for my internet host to afford protection from man in the middle attacks as well.

    Last thoughts about Security

    Obviously, the security above isn’t perfect.  Anyone sharing my external IP would be able to run a port scan and (if they’re clever) detect the https port the ZS2 is on.  However, it does require a lot of luck to do this and, obviously, even if they’re in the fortunate position of sharing an IP address, I’ve changed the default password, so the recent Mirai attack wouldn’t have been able to compromise the device.

    Do I think this is good enough security: absolutely.  In security, the bear principle applies: in that when escaping from a ravenous bear, you don’t have to be able to run faster than the bear itself, you merely need to be able to run faster than the slowest other potential food source …  In internet terms, this means that while there are so many completely insecure devices out there, no-one can be bothered to hack a moderately secure system like mine because the customisation makes it quite a bit harder.  It’s also instructive to think that the bear principle is why Linux has such a security reputation: it’s not that we have perfect security against virus and trojan systems, it’s just that Windows was always so much worse …

    Eventually, something like Mirai will look to attack the ZS2 web server itself (it is .net based, after all) rather than simply try a list of default passwords and then I’ll need to be a bit more clever, but while everyone else is so much more insecure, that day will be long delayed.

    October 23, 2016 07:20 PM

    Pete Zaitcev: FAA proposes to ban NavWorx

    Seen a curious piece of news today. As a short preamble, an aircraft in the U.S. may receive useful information from a ground station (TIS-B and FIS-B), but it has to transmit a certain ADS-B packet for that to happen. And all ADS-B packets include a field that specifies the system's claim that it operates according to a certain level of precision and integrity. The idea is, roughly, if you detect that e.g. one of your redundant GPS receivers is off-line, you should broadcast that you're downgraded. The protocol field is called SIL. The maximum level you can claim is determined by how crazily redundant and paranoid your design is. We are talking something on the order of $20,000 worth of cost, most of which is amortization of the FAA certification paperwork, and then you are entitled to claim SIL of 2. I lied about this explanation being short, BTW.

    So, apparently, NavWorx shipped cheap ADS-B boxes, which were made with a Raspberry Pi and a cellphone GPS chip (or such). They honestly transmitted a SIL of 0. Who cares, right? Well, FAA decided that TIS should stop replying to airplanes flying around with SIL Zero ADS-B boxes, because fuck the citizens, they should pay their $20k. Pilots called NavWorx and complained that their iPads hooked to ADS600 do not display the weather reliably anymore. NavWorx issued a software update that programmed their boxes to transmit SIL of 2. No other change: the actual transmitted positions remained exactly as before, only the claimed reliability was faked. When FAA got wind of this happening, they went nuclear on NavWorx users' asses. The proposed emergency directive orders owners to remove the offending equipment from their aircraft. They are grounded until compliance.

    Now the good thing is, the ADS-B mandate comes in 2020. They still have 3 years to find a more compliant (and expensive) supplier, before they are prohibited from the vicinity of a major city. So it's only money.

    I don't have a dog in this fight, personally, so I can sympathize with both the bureaucrats who saw cheaters and threw a book at them, and the company that employed a workaround against a meaningless and capricious rule. However, here's a couple of observations.

    First, note how FAA maintains a database of individual (not aggregate) protocol compliance for each ADS-B ID. They will even helpfully send you a report about what they know about you (it's intended so you can test the performance of your ADS-B equipment). Imagine if the government saved every query that your browser made, and could tell if your Chrome were not compliant with a certain RFC. This detailed tracking of everything is actually very necessary because the protocol has no encryption whatsoever and is trivially spoofed. Nothing stops a bad actor from using your ID in ADS-B. The only recourse is for the government to investigate reported issues and find the culprit. And they need the absolute tracking for it.

    Second, about the 2020 mandate. The airspace prohibition amounts to not letting someone into a city if the battery is flat in their EZ-pass transponder. Only in this case, the government sent you a letter saying that your transponder is banned, and you must buy a new one before you can get to work. In theory, your freedom of travel is not limited - you can take a bus. In practice though, not everyone has $20k, and the waiting list for the installer is 6 months.

    UPDATE 2016/12/19: NavWorx posted the following explanation on their website (no permalink, idiots):

    Our version 4.0.6 made our 12/13 products transmit SIL 3, which the FAA ground stations would recognize as sufficient to resume sending TIS-B traffic to our customers.

    Fortunately from product inception our internal GPS met SIL 3 performance. The FAA approved our internal GPS as SIL 3. During the TSO certification process, the FAA accepted our “compliance matrix” – which is the FAA’s primary means of compliance - showing our internal GPS integrity was 1x10^-7, which translates to SIL of 3. However, FAA policy at that time was that ADS-B GPS must have its own separate TSO – our internal GPS was certified under TSO-C154c, the same as the UAT OUT/IN transceiver. It’s important to note that the FAA authorized us to certify our internal GPS in this manner, and that they know that our internal GPS is safe – applicants for TSO certification must present a project plan and the FAA reviews and approves this project plan before the FAA ever allows an applicant to proceed with TSO certification of any product. Although they approved our internal GPS to be SIL of 3 (integrity of 1x10^-7), based on FAA policy at the time they made us transmit SIL 0, with the explanation that “uncertified GPS must transmit SIL 0”. This really is a misnomer, as our GPS is “certified” (under TSO-C154c), but the FAA refers to it as “uncertified”. The FAA AD states that “uncertified” GPS must transmit SIL of 0.

    So, basically, they never bothered to certify their GPS properly and used a fig leaf of TSO-C154c.

    The letter then goes on about how unfair it is that all the shitty experimentals are allowed to signal SIL 3 if only they use a proper GPS.
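
    To make the "integrity translates to SIL" claim concrete, here is a minimal sketch of the mapping. The thresholds are my understanding of the usual SIL definition (probability, per hour or per sample, of exceeding the containment radius without an alert), so treat them as an assumption rather than a quote from the rule. The function name and the certified flag are hypothetical; the flag only models the policy quoted above that an "uncertified" GPS must transmit SIL 0.

```python
# Assumed SIL thresholds: (maximum integrity probability, SIL value).
# These reflect the commonly cited definition, not text from this post.
SIL_THRESHOLDS = [
    (1e-7, 3),
    (1e-5, 2),
    (1e-3, 1),
]


def sil_from_integrity(probability, certified=True):
    """Return the SIL a unit would transmit for a given integrity figure.

    `certified` is a hypothetical flag modelling the quoted FAA policy:
    a position source treated as uncertified transmits SIL 0 no matter
    how good its measured integrity is.
    """
    if not certified:
        return 0
    for limit, sil in SIL_THRESHOLDS:
        if probability <= limit:
            return sil
    # Worse than 1e-3 (or unknown) maps to SIL 0.
    return 0


if __name__ == "__main__":
    # Same integrity figure, different outcome depending on paperwork.
    print(sil_from_integrity(1e-7))
    print(sil_from_integrity(1e-7, certified=False))
```

    The point of the sketch is that the dispute is about the flag and not the number: the same 1x10^-7 figure yields SIL 3 or SIL 0 depending on whether the FAA treats the source as certified.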

    UPDATE 2016/12/20: AOPA weighs in with a comment on the NPRM:

    Specifically, AOPA recommends the FAA address the confusion over whether the internal position source meets the applicable performance requirements, the existence of an unsafe condition, and why the proposed AD applies to NavWorx’s experimental UAT model.

    The FAA requires a position source to meet the performance requirements in appendix B to AC 20-165B for the position source to be included in the ADS-B Out system and for an aircraft to meet the § 91.227(c) performance requirements (e.g., SIL = 3). The FAA does not require the position source be compliant with a specific TSO. Any person may demonstrate to the FAA that its new (uncertified) position source meets the requirements of appendix B to AC 20-165B, thereby qualifying that position source to be used in an ADS-B Out system. However, integrating a TSO-certified position source into a UAT means that a person will have fewer requirements to satisfy in AC 20-165B appendix B during the STC process for the ADS-B Out system.

    Around May 2014, the FAA issued NavWorx an STC for its ADS600-B UAT with part numbers 200-0012 and 200-0013 (Certified UATs). The STC allowed for the installation of those UATs into any type-certificated aircraft identified in the approved model list. The Certified UATs were compliant with TSO-C154c, but had internal, non-compliant GPS receivers. (ADS600-B Installation Manual 240-0008-00-36 (IM -36), at 17, 21, 28.) Specifically, section 2.3 of NavWorx’s March 2015 installation manual states:

    “For ADS600-B part numbers 200-0012 and 200-0013, the internal GPS WAAS receiver does not meet 14 CFR 91 FAA-2007-29305 for GPS position source. If the ADS600-B is configured to use the internal GPS as the position source the ADS-B messages transmitted by the unit reports: A Source Integrity Limit (SIL) of 0 indicating that the GPS position source does not meet the 14 CFR 91 FAA-2007-29305 rule.” (IM -36, at 19.)

    Hoo, boy. Per the above quote by AOPA, NavWorx previously admitted in writing that their internal GPS is not good enough, but they are trying to walk that back with the talk about "GPS integrity 1x10^-7".

    Later in the same comment letter, Justin T. Barkowski recommends minimizing the economic impact of the rulemaking and not forcing owners to pull NavWorx boxes out of their aircraft immediately.

    October 23, 2016 04:27 AM

    October 22, 2016

    Matthew Garrett: Microsoft aren't forcing Lenovo to block free operating systems

    Update: Patches to fix this have been posted

    There's a story going round that Lenovo have signed an agreement with Microsoft that prevents installing free operating systems. This is sensationalist, untrue and distracts from a genuine problem.

    The background is straightforward. Intel platforms allow the storage to be configured in two different ways - "standard" (normal AHCI on SATA systems, normal NVMe on NVMe systems) or "RAID". "RAID" mode is typically just changing the PCI IDs so that the normal drivers won't bind, ensuring that drivers that support the software RAID mode are used. Intel have not submitted any patches to Linux to support the "RAID" mode.

    In this specific case, Lenovo's firmware defaults to "RAID" mode and doesn't allow you to change that. Since Linux has no support for the hardware when configured this way, you can't install Linux (distribution installers will boot, but won't find any storage device to install the OS to).
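
    As a rough illustration of why the installer finds no storage, here is a simplified model of PCI driver binding. It assumes the firmware advertises the controller with the generic RAID class code in "RAID" mode, and it matches on class code only; real drivers also match on vendor and device IDs, and the table and function names here are hypothetical.

```python
# 24-bit PCI class codes: base class, subclass, programming interface.
PCI_CLASS_STORAGE_SATA_AHCI = 0x010601
PCI_CLASS_STORAGE_NVME = 0x010802
PCI_CLASS_STORAGE_RAID = 0x010400

# Hypothetical, heavily simplified match table: driver name -> class codes
# it is willing to bind to. There is deliberately no entry for RAID.
DRIVER_TABLE = {
    "ahci": {PCI_CLASS_STORAGE_SATA_AHCI},
    "nvme": {PCI_CLASS_STORAGE_NVME},
}


def find_driver(class_code, drivers=DRIVER_TABLE):
    """Return the name of the first driver that claims this class code."""
    for name, classes in drivers.items():
        if class_code in classes:
            return name
    # No driver binds, so the OS never sees a disk behind the controller.
    return None


if __name__ == "__main__":
    print(find_driver(PCI_CLASS_STORAGE_SATA_AHCI))  # "standard" mode
    print(find_driver(PCI_CLASS_STORAGE_RAID))       # "RAID" mode
```

    In this model the hardware is identical in both modes; only the identifier the firmware presents differs, which is enough to leave the device without a driver.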

    Why would Lenovo do this? I don't know for sure, but it's potentially related to something I've written about before - recent Intel hardware needs special setup for good power management. The storage driver that Microsoft ship doesn't do that setup. The Intel-provided driver does. "RAID" mode prevents the Microsoft driver from binding and forces the user to use the Intel driver, which means they get the correct power management configuration, battery life is better and the machine doesn't melt.

    (Why not offer the option to disable it? A user who does would end up with a machine that doesn't boot, and if they managed to figure that out they'd have worse power management. That increases support costs. For a consumer device, why would you want to? The number of people buying these laptops to run anything other than Windows is minuscule)

    Things are somewhat obfuscated due to a statement from a Lenovo rep: "This system has a Signature Edition of Windows 10 Home installed. It is locked per our agreement with Microsoft." It's unclear what this is meant to mean. Microsoft could be insisting that Signature Edition systems ship in "RAID" mode in order to ensure that users get a good power management experience. Or it could be a misunderstanding regarding UEFI Secure Boot - Microsoft do require that Secure Boot be enabled on all Windows 10 systems, but (a) the user must be able to manage the key database and (b) there are several free operating systems that support UEFI Secure Boot and have appropriate signatures. Neither interpretation indicates that there's a deliberate attempt to prevent users from installing their choice of operating system.

    The real problem here is that Intel do very little to ensure that free operating systems work well on their consumer hardware - we still have no information from Intel on how to configure systems to ensure good power management, we have no support for storage devices in "RAID" mode and we have no indication that this is going to get better in future. If Intel had provided that support, this issue would never have occurred. Rather than be angry at Lenovo, let's put pressure on Intel to provide support for their hardware.

    October 22, 2016 05:51 AM