Kernel Planet

August 05, 2020

Linux Plumbers Conference: Power Management and Thermal Control Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Power Management and Thermal Control Microconference has been accepted into the 2020 Linux Plumbers Conference!

Power management and thermal control is an important area in the Linux ecosystem to help with the global environment. Optimizing the amount of work that is achieved while having long battery life and keeping the box from overheating is critical in today’s world. This meeting will focus on continuing to have Linux be an efficient operating system while still lowering the cost of running a data center.

Last year’s meetup at Linux Plumbers resulted in the introduction of thermal pressure support into the CPU scheduler as well as several improvements to the thermal framework, such as a netlink implementation of thermal notification and improvements to CPU cooling. Discussions from last year also helped to improve system-wide suspend testing tools.

This year’s topics to be discussed include:

Come and join us in the discussion about extending the battery life of your laptop and keeping it cool.

We hope to see you there!

August 05, 2020 03:15 AM

August 02, 2020

Linux Plumbers Conference: VFIO/IOMMU/PCI Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the VFIO/IOMMU/PCI Microconference has been accepted into the 2020 Linux Plumbers Conference!

The PCI interconnect specification, the devices implementing it, and the system IOMMUs providing memory/access control to them are incorporating more and more features aimed at high-performance systems, e.g. PCI ATS (Address Translation Service) and PRI (Page Request Interface), which enable Shared Virtual Addressing (SVA) between devices and CPUs. These features require the kernel to coordinate the PCI devices, the IOMMUs they are connected to, and the VFIO layer used to manage them (for userspace access and device passthrough), with related kernel interfaces that have to be designed in sync for all three subsystems.

The kernel code that enables these new system features requires coordination between VFIO/IOMMU/PCI subsystems, so that kernel interfaces and userspace APIs can be designed in a clean way.

The following was a result of last year’s successful Linux Plumbers microconference:

Last year’s Plumbers resulted in a write-up justifying the dual-stage SMMUv3 integration, but more work is needed to persuade the relevant maintainers.

Topics for this year include (but are not limited to):

Come and join us in the discussion in helping Linux keep up with the new features being added to the PCI interconnect specification.

We hope to see you there!

August 02, 2020 02:04 PM

August 01, 2020

Linux Plumbers Conference: RISC-V Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the RISC-V Microconference has been accepted into the 2020 Linux Plumbers Conference!

The RISC-V ecosystem is gaining momentum at such an astounding speed that it wouldn’t be unfair to compare it to the early days of the Linux ecosystem’s growth. A plethora of Linux kernel features have been added to RISC-V and many more are waiting to be reviewed on the mailing list. Some of them resulted from direct discussions during last year’s RISC-V microconference. For example, RISC-V has a standard boot process along with a well-defined Supervisor Binary Interface (SBI) specification and a CPU hotplug feature. KVM support is very close to being merged and is just waiting for official ratification of the H extension. NoMMU support for the Linux kernel has already been merged.

Here are a few of the expected topics and current problems in RISC-V Linux land that we would like to cover.

Come join us and participate in the discussion on how we can improve the support for RISC-V in the Linux kernel.

We hope to see you there!

August 01, 2020 04:48 PM

Linux Plumbers Conference: You, Me, and IoT Two Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the You, Me, and IoT Microconference has been accepted into the 2020 Linux Plumbers Conference!

As everyday devices start to become more connected to the internet, the infrastructure around them constantly needs to be developed. The Internet of Things (IoT) in the Linux ecosystem is looking brighter every day. The development rate of the Zephyr RTOS in particular is accelerating dramatically and we are now up to 2 commits per hour[1]! LoRaWAN made it into Zephyr release 2.2 as well.

The principles for IoT are still the same: data-driven controls for remote endpoints such as

A large focus of industry heavyweights continues to be interoperability; we are seeing a growing trend in moving toward IP-centric network communications. Using IP natively ensures that it is extremely easy for end-nodes and edge devices to communicate to The Cloud but it also means that IoT device security is more important than ever.

Last year’s successful microconference has brought about several changes in the IoT space. The Linux + Zephyr + Greybus solution now works over nearly all physical layers (#exactsteps for IEEE 802.15.4 and BLE). is also now preparing a next-gen hardware revision of the BeagleConnect to provide both a hobbyist and professional-friendly IoT platform. BlueZ has begun making quarterly releases, much to the delight of last year’s attendees, and members of the linux-wpan / netdev community have implemented RPL, an IPv6 routing protocol for lossy networks.

This year’s topics to be discussed include:

Come and join us in some heated but productive discussions in making your everyday devices communicate with the world around them.

[1] For reference, Linux receives approximately 9 commits per hour.

We hope to see you there!


August 01, 2020 01:18 AM

July 30, 2020

Linux Plumbers Conference: LLVM Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the LLVM Microconference has been accepted into the 2020 Linux Plumbers Conference!

The LLVM toolchain has made significant progress over the years and many kernel developers are now using it to build their kernels. It is still the one toolchain that can natively compile C into BPF byte code. Clang (the C frontend to LLVM) is used to build Android and ChromeOS kernels, and others are in the process of testing Clang for building their kernels.

Many topics still need to be resolved, and are planned to be discussed here. These include (but are not limited to):

Come and join us in the discussion of improving this new toolchain to make it the most usable for everyone!

We hope to see you there!

July 30, 2020 07:49 PM

Paul E. McKenney: Stupid RCU Tricks: Failure Probability and CPU Count

So rcutorture found a bug, whether in RCU or elsewhere, and it is now time to reproduce that bug, whether to make good use of git bisect or to verify an alleged fix. One problem is that, rcutorture being what it is, that bug is likely a race condition and it likely takes longer than you would like to reproduce. Assuming that it reproduces at all.

How to make it reproduce faster? Or at all, as the case may be?

One approach is to tweak the Kconfig options and maybe even the code to make the failure more probable. Another is to find a “near miss” that is related to and more probable than the actual failure.

But given that we are trying to make a race condition happen more frequently, it is only natural to try tweaking the number of CPUs. After all, one would hope that increasing the number of CPUs would increase the probability of hitting the race condition. So the straightforward answer is to use all available CPUs.

But how to use them? Run a single rcutorture scenario covering all the CPUs, give or take the limitations imposed by qemu and KVM? Or run many instances of that same scenario, with each instance using a small fraction of the available CPUs?

As is so often the case, the answer is: “It depends!”

If the race condition happens randomly between any pair of CPUs, then bigger is better. To see this, consider the following old-school ASCII-art comparison:

|        N * M        |
| N | N | N | ... | N |

If there are n CPUs that can participate in the race condition, then at any given time there are n(n-1)/2 possible races. The upper row has N*M CPUs, and thus N*M*(N*M-1)/2 possible races. The lower row has M sets of N CPUs, and thus M*N*(N-1)/2, which is almost a factor of M smaller. For this type of race condition, you should therefore run a small number of scenarios with each using as many CPUs as possible, and preferably only one scenario that uses all of the CPUs. For example, to make the TREE03 scenario run on 64 CPUs, edit the tools/testing/selftests/rcutorture/configs/rcu/TREE03 file so as to set CONFIG_NR_CPUS=64.
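The comparison above can be checked with a little arithmetic. The following is a minimal sketch, not part of rcutorture; the function names are made up for illustration, and it assumes the idealized case in which any pair of CPUs is equally likely to race.

```python
def possible_races(n_cpus):
    """Number of distinct CPU pairs among n_cpus CPUs, that is, n(n-1)/2."""
    return n_cpus * (n_cpus - 1) // 2


def compare_layouts(n, m):
    """Return (pairs in one N*M-CPU scenario, pairs across M separate N-CPU scenarios)."""
    one_large = possible_races(n * m)
    many_small = m * possible_races(n)
    return one_large, many_small


if __name__ == "__main__":
    # 64 CPUs used either as one 64-CPU scenario or as four 16-CPU scenarios.
    one_large, many_small = compare_layouts(n=16, m=4)
    print(one_large, many_small, one_large / many_small)
```

With N=16 and M=4, the single large scenario has 2016 possible pairs against 480 for the four small ones, a ratio of 4.2, which is in line with the "almost a factor of M" estimate.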

But there is no guarantee that the race condition will be such that all CPUs participate with equal probability. For example, suppose that the bug was due to a race between RCU's grace-period kthread (named either rcu_preempt or rcu_sched, depending on your Kconfig options) and its expedited grace period, which at any given time will be running on at most one workqueue kthread.

In this case, no matter how many CPUs were available to a given rcutorture scenario, at most two of them could be participating in this race. In this case, it is instead best to run as many two-CPU rcutorture scenarios as possible, give or take the memory footprint of that many guest OSes (one per rcutorture scenario). For example, to make 32 TREE03 scenarios run on 64 CPUs, edit the tools/testing/selftests/rcutorture/configs/rcu/TREE03 file so as to set CONFIG_NR_CPUS=2 and remember to pass either the --allcpus or the --cpus 64 argument to

What happens in real life?

For a race condition that rcutorture uncovered during the v5.8 merge window, running one large rcutorture instance instead of 14 smaller ones (very) roughly doubled the probability of locating the race condition.

In other words, real life is completely capable of lying somewhere between the two theoretical extremes outlined above.

July 30, 2020 12:30 AM

July 27, 2020

Matthew Garrett: Filesystem deduplication is a sidechannel

First off - nothing I'm going to talk about in this post is novel or overly surprising, I just haven't found a clear writeup of it before. I'm not criticising any design decisions or claiming this is an important issue, just raising something that people might otherwise be unaware of.

With that out of the way: Automatic deduplication of data is a feature of modern filesystems like zfs and btrfs. It takes two forms - inline, where the filesystem detects that data being written to disk is identical to data that already exists on disk and simply references the existing copy rather than writing a new one, and offline, where tooling retroactively identifies duplicated data and removes the duplicate copies (zfs supports inline deduplication, btrfs only currently supports offline). In a world where disks end up with multiple copies of cloud or container images, deduplication can free up significant amounts of disk space.

What's the security implication? The problem is that deduplication doesn't recognise ownership - if two users have copies of the same file, only one copy of the file will be stored[1]. So, if user a stores a file, the amount of free space will decrease. If user b stores another copy of the same file, the amount of free space will remain the same. If user b is able to check how much free space is available, user b can determine whether the file already exists.
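To make the free-space observation concrete, here is a rough sketch of what user b's check might look like. Everything in it is an assumption for illustration: the function names are invented, the 50% slack threshold is arbitrary, and it presumes a Unix-like system with inline deduplication enabled and free-space accounting that updates promptly. On a real system, other activity on the filesystem and delayed accounting would add noise.

```python
import os


def free_bytes(path):
    """Free space available to unprivileged users on the filesystem holding path."""
    st = os.statvfs(path)  # Unix-only
    return st.f_bavail * st.f_frsize


def looks_deduplicated(free_before, free_after, written, slack=0.5):
    """True if free space dropped by much less than the number of bytes written."""
    consumed = free_before - free_after
    return consumed < written * slack


def probe(directory, candidate):
    """Write candidate bytes into directory and report whether they seem deduplicated."""
    target = os.path.join(directory, "probe.tmp")
    before = free_bytes(directory)
    try:
        with open(target, "wb") as f:
            f.write(candidate)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before measuring again
        after = free_bytes(directory)
    finally:
        if os.path.exists(target):
            os.remove(target)
    return looks_deduplicated(before, after, len(candidate))
```

On a filesystem without deduplication, probe() should simply return False for any reasonably large candidate, since the write consumes roughly its own size in free space.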

This doesn't seem like a huge deal in most cases, but it is a violation of expected behaviour (if user b doesn't have permission to read user a's files, user b shouldn't be able to determine whether user a has a specific file). But we can come up with some convoluted cases where it becomes more relevant, such as law enforcement gaining unprivileged access to a system and then being able to demonstrate that a specific file already exists on that system. Perhaps more interestingly, it's been demonstrated that free space isn't the only sidechannel exposed by deduplication - deduplication has an impact on access timing, and can be used to infer the existence of data across virtual machine boundaries.

As I said, this is almost certainly not something that matters in most real world scenarios. But with so much discussion of CPU sidechannels over the past couple of years, it's interesting to think about what other features also end up leaking information in ways that may not be obvious.

(Edit to add: deduplication isn't enabled on zfs by default and is explicitly triggered on btrfs, so unless it's something you've enabled then this isn't something that affects you)

[1] Deduplication is usually done at the block level rather than the file level, but given zfs's support for variable sized blocks, identical files should be deduplicated even if they're smaller than the maximum record size


July 27, 2020 10:22 PM

July 17, 2020

Linux Plumbers Conference: Open Printing Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Open Printing Microconference has been accepted into the 2020 Linux Plumbers Conference!

Building on the work already done in driverless printing, driverless scanning has emerged as an active new topic since last year’s microconference session. We’re seeing many new printer application projects emerge that will benefit 3D printing as well. With driverless scanning and printing making good progress and improvements, now is the time to talk about driverless/IPP fax as well.

Topics to discuss include

Come join us and participate in the discussion to bring a better experience to Linux printing, scanning and fax.

If you want to start the discussion right now or tell us something before the conference starts, do so in the comments sections of the linked pages.

We hope to see you there!

July 17, 2020 09:04 PM

July 15, 2020

Paul E. McKenney: Stupid RCU Tricks: So rcutorture is Not Aggressive Enough For You?

So you read the previous post, but simply running rcutorture did not completely vent your frustration. What can you do?

One thing you can do is to tweak a number of rcutorture settings to adjust the manner and type of torture that your testing inflicts.

RCU CPU Stall Warnings

If you are not averse to a quick act of vandalism, then you might wish to induce an RCU CPU stall warning. The --bootargs argument can be used for this, for example as follows:

tools/testing/selftests/rcutorture/bin/ --allcpus --duration 3 --trust-make \
    --bootargs "rcutorture.stall_cpu=22 rcutorture.fwd_progress=0"

The rcutorture.stall_cpu=22 says to stall a CPU for 22 seconds, that is, one second longer than the default RCU CPU stall timeout in mainline. If you are instead using a distribution kernel, you might need to specify 61 seconds (as in “rcutorture.stall_cpu=61”) in order to allow for the typical 60-second RCU CPU stall timeout. The rcutorture.fwd_progress=0 has no effect except to suppress a warning message (with stack trace included free of charge) that questions the wisdom of running both RCU-callback forward-progress tests and RCU CPU stall tests at the same time. In fact, the code not only emits the warning message, it also automatically suppresses the forward-progress tests. If you prefer living dangerously and don't mind the occasional out-of-memory (OOM) lockup accompanying your RCU CPU stall warnings, feel free to edit kernel/rcu/rcutorture.c to remove this automatic suppression.

If you are running on a large system that takes more than ten seconds to boot, you might need to increase the RCU CPU stall holdoff interval. For example, adding rcutorture.stall_cpu_holdoff=120 to the --bootargs list would wait for two minutes before stalling a CPU instead of the default holdoff of 10 seconds. If simply spinning a CPU with preemption disabled does not fully vent your ire, you could undertake a more profound act of vandalism by adding rcutorture.stall_cpu_irqsoff=1 so as to cause interrupts to be disabled on the spinning CPU.
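For those who would rather not do the stall arithmetic by hand, the string passed to --bootargs can be assembled by a small helper. This is only a sketch: the helper name stall_bootargs is hypothetical, and the only kernel boot parameters it uses are the ones named above. It assumes that you already know your kernel's RCU CPU stall timeout.

```python
def stall_bootargs(stall_timeout_s, holdoff_s=None, irqs_off=False):
    """Build a --bootargs string that stalls a CPU one second past the stall timeout."""
    args = [
        "rcutorture.stall_cpu={}".format(stall_timeout_s + 1),
        # Suppress the warning about mixing forward-progress and stall tests.
        "rcutorture.fwd_progress=0",
    ]
    if holdoff_s is not None:
        # Wait this many seconds after boot before stalling.
        args.append("rcutorture.stall_cpu_holdoff={}".format(holdoff_s))
    if irqs_off:
        # Also disable interrupts on the spinning CPU.
        args.append("rcutorture.stall_cpu_irqsoff=1")
    return " ".join(args)


if __name__ == "__main__":
    print(stall_bootargs(21))
    print(stall_bootargs(60, holdoff_s=120, irqs_off=True))
```

The first call reproduces the mainline example above (a 21-second timeout giving stall_cpu=22), and the second covers a distribution kernel with a 60-second timeout on a slow-booting system.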

Some flavors of RCU such as SRCU permit general blocking within their read-side critical sections, and you can exercise this capability by adding rcutorture.stall_cpu_block=1 to the --bootargs list. Better yet, you can use this kernel-boot parameter to torture flavors of RCU that forbid blocking within read-side critical sections, which allows you to see how they complain about such mistreatment.

The vanilla flavor of RCU has a grace-period kthread, and stalling this kthread is another good way to torture RCU. Simply add rcutorture.stall_gp_kthread=22 to the --bootargs list, which delays the grace-period kthread for 22 seconds. Doing this will normally elicit strident protests from mainline kernels.

Finally, you could starve rcutorture of CPU time by running a large number of them concurrently (each in its own Linux-kernel source tree), thereby overcommitting the CPUs.

But maybe you would prefer to deprive RCU of memory. If so, read on!

Running rcutorture Out of Memory

By default, each rcutorture guest OS is allotted 512MB of memory. But perhaps you would like to have it make do with only 128MB:

tools/testing/selftests/rcutorture/bin/ --allcpus --trust-make --memory 128M

You could go further by making the RCU need-resched testing more aggressive, for example, by increasing the duration of this testing from the default three-quarters of the RCU CPU stall timeout to (say) seven-eighths:

tools/testing/selftests/rcutorture/bin/ --allcpus --trust-make --memory 128M \
    --bootargs "rcutorture.fwd_progress_div=8"

More to the point, you might make the RCU callback-flooding tests more aggressive, for example by adjusting the values of the MAX_FWD_CB_JIFFIES, MIN_FWD_CB_LAUNDERS, or MIN_FWD_CBS_LAUNDERED macros and rebuilding the kernel. Alternatively, you could use kill -STOP on one of the vCPUs in the middle of an rcutorture run. Either way, if you break it, you buy it!

Or perhaps you would rather attempt to drown rcutorture in memory, perhaps forcing a full 16GB onto each guest OS:

tools/testing/selftests/rcutorture/bin/ --allcpus --trust-make --memory 16G

Another productive torture method involves unusual combinations of Kconfig options, a topic taken up by the next section.

Confused Kconfig Options

The Kconfig options for a given rcutorture scenario are specified by the corresponding file in the tools/testing/selftests/rcutorture/configs/rcu directory. For example, the Kconfig options for the infamous TREE03 scenario may be found in tools/testing/selftests/rcutorture/configs/rcu/TREE03.

But why not just use the --kconfig argument and be happy, as described previously?

One reason is that there are a few Kconfig options that the rcutorture scripting refers to early in the process, before the --kconfig parameter's additions have been processed. For example, changing CONFIG_NR_CPUS should be done in the file rather than via the --kconfig parameter. Another reason is to avoid having to keep supplying a --kconfig argument for each of many repeated rcutorture runs. But perhaps most important, if you want some scenarios to be built with one Kconfig option and others built with some other Kconfig option, modifying each scenario's file avoids the need for multiple rcutorture runs.

For example, you could edit the tools/testing/selftests/rcutorture/configs/rcu/TREE03 file to change the CONFIG_NR_CPUS=16 to instead read CONFIG_NR_CPUS=4, and then run the following on a 12-CPU system:

tools/testing/selftests/rcutorture/bin/ --allcpus --trust-make --configs "3*TREE03"

This would run three concurrent copies of TREE03, but with each guest OS restricted to only 4 CPUs.
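The arithmetic behind that "3*TREE03" string can be sketched in a few lines. The helper name configs_for is hypothetical and is not part of the rcutorture scripts; it simply divides the available CPUs by the CONFIG_NR_CPUS value chosen for the scenario.

```python
def configs_for(total_cpus, cpus_per_scenario, scenario):
    """Build a --configs value that fills total_cpus with copies of one scenario."""
    if cpus_per_scenario <= 0:
        raise ValueError("cpus_per_scenario must be positive")
    copies = total_cpus // cpus_per_scenario
    if copies < 1:
        raise ValueError("not enough CPUs for even one copy of the scenario")
    return "{}*{}".format(copies, scenario)


if __name__ == "__main__":
    print(configs_for(12, 4, "TREE03"))  # the 12-CPU example above
    print(configs_for(64, 2, "TREE03"))  # many two-CPU scenarios on 64 CPUs
```

Any CPUs left over after the integer division go unused, so it is worth picking a per-scenario count that divides the system's CPU count evenly.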

Finally, if a given Kconfig option applies to all rcutorture runs and you are tired of repeatedly entering --kconfig arguments, you can instead add that option to the tools/testing/selftests/rcutorture/configs/rcu/CFcommon file.

But sometimes Kconfig options just aren't enough. And that is why we have kernel boot parameters, the subject of the next section.

Boisterous Boot Parameters

We have supplied kernel boot parameters using the --bootargs parameter, but sometimes ordering considerations or sheer laziness motivate greater permanence. Either way, the scenario's .boot file may be brought to bear. For example, the TREE03 scenario's file is located here: tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot.

As of the v5.7 Linux kernel, this file contains the following:

rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30

For example, the probability of RCU's grace period processing overlapping with CPU-hotplug operations may be adjusted by decreasing the value of the rcutorture.onoff_interval from its default of 200 milliseconds or by adjusting the various grace-period delays specified by the rcutree.gp_preinit_delay, rcutree.gp_init_delay, and rcutree.gp_cleanup_delay parameters. In fact, chasing bugs involving races between RCU grace periods and CPU-hotplug operations often involves tuning these four parameters to maximize race probability, thus decreasing the required rcutorture run durations.
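If you find yourself repeatedly hand-editing .boot files while tuning these parameters, a small script can do the editing. The following is a sketch under simple assumptions: the .boot contents are whitespace-separated name=value tokens with no quoting, the function names are invented for illustration, and the new value of 100 is an arbitrary example rather than a recommendation.

```python
def parse_boot_params(text):
    """Split .boot file contents into an ordered {name: value} dictionary."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        # Parameters without "=" are kept with a value of None.
        params[name] = value if sep else None
    return params


def format_boot_params(params):
    """Turn the dictionary back into a single line of boot parameters."""
    return " ".join(
        name if value is None else "{}={}".format(name, value)
        for name, value in params.items()
    )


if __name__ == "__main__":
    params = parse_boot_params(
        "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30"
    )
    params["rcutorture.onoff_interval"] = "100"  # arbitrary illustrative value
    print(format_boot_params(params))
```

The same approach extends to adding the rcutree.gp_preinit_delay, rcutree.gp_init_delay, and rcutree.gp_cleanup_delay parameters, with values chosen to suit the race being chased.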

The possibilities for the .boot file contents are limited only by the extent of the Documentation/admin-guide/kernel-parameters.txt. And actually not even by that, given the all-too-real possibility of undocumented kernel boot parameters.

You can also create your own rcutorture scenarios by creating a new set of files in the tools/testing/selftests/rcutorture/configs/rcu directory. You can make them run by default (or in response to the CFLIST string to the --configs parameter) by adding their names to the tools/testing/selftests/rcutorture/configs/rcu/CFLIST file. For example, you could create a MYSCENARIO file containing Kconfig options and (optionally) a MYSCENARIO.boot file containing kernel boot parameters in the tools/testing/selftests/rcutorture/configs/rcu directory, and make them run by default by adding a line reading MYSCENARIO to the tools/testing/selftests/rcutorture/configs/rcu/CFLIST file.
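The file shuffling for a new scenario can be sketched as follows. This is an illustration only: the add_scenario helper is hypothetical, the demonstration runs against a temporary directory standing in for a Linux-kernel source tree, and the single CONFIG_NR_CPUS=4 line is a placeholder, since a real scenario file would need a fuller set of Kconfig options.

```python
import tempfile
from pathlib import Path

CONFIG_DIR = "tools/testing/selftests/rcutorture/configs/rcu"


def add_scenario(tree_root, name, kconfig_lines, boot_params=None):
    """Create a scenario file, an optional .boot file, and a CFLIST entry."""
    cfg = Path(tree_root) / CONFIG_DIR
    cfg.mkdir(parents=True, exist_ok=True)
    (cfg / name).write_text("\n".join(kconfig_lines) + "\n")
    if boot_params:
        (cfg / (name + ".boot")).write_text(boot_params + "\n")
    cflist = cfg / "CFLIST"
    text = cflist.read_text() if cflist.exists() else ""
    if name not in text.split():
        # Append the scenario name on its own line, exactly once.
        if text and not text.endswith("\n"):
            text += "\n"
        cflist.write_text(text + name + "\n")
    return cfg


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:  # stand-in for a kernel tree
        cfg = add_scenario(root, "MYSCENARIO", ["CONFIG_NR_CPUS=4"],
                           "rcutorture.onoff_interval=200")
        print(sorted(p.name for p in cfg.iterdir()))
```

Running it a second time with the same name leaves CFLIST unchanged, so the scenario is not listed twice.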


This post discussed enhancing rcutorture through use of stall warnings, memory limitations, Kconfig options, and kernel boot parameters. The special case of adjusting CONFIG_NR_CPUS deserves more attention, and that is the topic of the next post.

July 15, 2020 09:13 PM

Pete Zaitcev: Cries of the vanquished

The post at roguelazer's is so juicy from every side that I'd need to quote it whole to do it justice (h/t ~avg). But its ostensible meat is etcd.[1] In that, he's building a narrative of the package being elegant at first, and bloating later.

This tool was originally written in 2013 for a ... project called CoreOS. ... etcd was greater than its original use-case. Etcd provided a convenient and simple set of primitives (set a key, get a key, set-only-if-unchanged, watch-for-changes) with a drop-dead simple HTTP API on top of them.

Kubernetes was quickly changed to use etcd as its state store. Thus began the rapid decline of etcd.

... a large number of Xooglers who decided to infect etcd with Google technologies .... Etcd's simple HTTP API was replaced by a "gRPC" version; the simple internal data model was replaced by a dense and non-orthogonal data model with different types for leases, locks, transactions, and plain-old-keys.

Completely omitted from this tale is that etcd was created as a clone of Google Chubby, which did not use HTTP. The HTTP interface was implemented in etcd for expediency. So, the nostalgic image of early etcd he's projecting is in fact a primitive early draft.

It's interesting that he only mentions leases and locks in passing, painting them as a late addition, whereas the concept of coarse locking was more important for Chubby than the registry.

[1] Other matters are taken up in the footnotes, at length. You'd think that it would be a simple matter to create a separate post to decry the evils of HTTP/2, but not for this guy! I may write another entry on the evils of bloat and how sympathetic I am to his cause later.

July 15, 2020 05:41 PM

Brendan Gregg: Systems Performance: Enterprise and the Cloud, 2nd Edition

Eight years ago I wrote Systems Performance: Enterprise and the Cloud (aka the "sysperf" book) on the performance of computing systems, and this year I'm excited to be releasing the second edition. The first edition was successful, selling over 10k copies and becoming required or recommended reading at many companies (and even mentioned in job descriptions). Thanks to everyone for their support. I've received feedback that it is useful, not just for learning performance, but also for showing how computers work internally: essential knowledge for all engineers.

The second edition adds content on BPF, BCC, bpftrace, perf, and Ftrace, mostly removes Solaris, makes numerous updates to Linux and cloud computing, and includes general improvements and additions. It is written by a more experienced version of myself than I was for the first edition, including my six years of experience as a senior performance engineer at Netflix. This edition has also been improved by a new technical review team of over 30 engineers.

How much has changed since the first edition? It's hard to say, but easy to visualize. As an example, the following shows Chapter 6, CPUs, where black text is from the first edition and colored text shows the updates (this is a color scheme I use to show reviewers when text was changed; from oldest changes to newest: yellow, green, aqua, blue, purple, red):

[Figure: Chapter 6, CPUs, changes colored]

Here is the entire book as a 3.1 Mbyte jpg. (Note that these visualizations are not final as I'm still making updates. And this doesn't highlight figure and copy-edit changes.)

The book will be released in November 2020 by Addison Wesley, and will be around 800 pages. It's already listed on

A year ago I announced BPF Performance Tools: Linux System and Application Observability. In a way, Systems Performance is volume 1 and BPF Performance Tools is volume 2. Sysperf provides balanced coverage of models, theory, architecture, observability tools (traditional and tracing), experimental tools, and tuning. The BPF tools book focuses on BPF tracing tools only, with brief summaries of architecture and traditional tools.

Which book should you buy? Both, of course. :-) Since they are both performance books there is a little overlap between them, but not much. I think sysperf has a wider audience: it is a handbook for anyone to learn performance and computer internals. The BPF tools book will satisfy those wishing to jump ahead and run advanced tools for some quick wins.

For more information, including links showing where to buy the book, please see its website: Systems Performance: Enterprise and the Cloud, 2nd Edition.

July 15, 2020 07:00 AM

July 14, 2020

Linux Plumbers Conference: Reminder for LPC 2020 Town Hall: The Kernel Report

Thursday is approaching!

On July 16th at 8am PST / 11am EST / 3pm GMT the Kernel Report talk by Jon Corbet of LWN will take place on the LPC Big Blue Button platform! It will also be available on a YouTube Live stream.

Please join us at this URL:

The Linux kernel is at the core of any Linux system; the performance and capabilities of the kernel will, in the end, place an upper bound on what the system as a whole can do. This talk will review recent events in the kernel development community, discuss the current state of the kernel and the challenges it faces, and look forward to how the kernel may address those challenges. Attendees of any technical ability should gain a better understanding of how the kernel got to its current state and what can be expected in the near future.

The Plumbers Code of Conduct will be in effect for this event. The event will be recorded.

July 14, 2020 10:59 PM

Linux Plumbers Conference: linux/arch/* Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the linux/arch/* Microconference has been accepted into the 2020 Linux Plumbers Conference!

Linux supports over twenty architectures.

Each architecture has its own sub-directory within the Linux-kernel arch/ directory containing code specific for that architecture. But that code is not always unique to the architecture.

In many cases, code in one architecture was copy-pasted from another, leading to a lot of unnecessary code duplication. This makes it harder to fix, update and maintain functionality relying on the architecture-specific code.

There’s room to improve, consolidate and generalize the code in these directories, and that is the goal of this microconference.

Topics to discuss include:

Come join us and participate in the discussion to bring Linux architectures closer together.

We hope to see you there!

July 14, 2020 03:10 PM

July 13, 2020

Linux Plumbers Conference: Android Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Android Microconference has been accepted into the 2020 Linux Plumbers Conference!

A few years ago the Android team announced their desire to set a path toward a Generic Kernel Image (GKI), which would enable the decoupling of Android kernel releases from hardware enablement. Since then, much work has been done by many parties to make this vision a reality. Last year’s Linux Plumbers Android microconference brought about work on monitoring and stabilizing the Android in-kernel ABI; solutions to issues associated with modules and supplier-consumer dependencies have landed in the upstream Linux kernel; and vendors have started migrating from the ION driver to the DMA-BUF heaps that are now supported in upstream Linux. For a report on progress made since last year’s MC, see here.

This year several devices now work with GKI making their kernel upgradable without requiring porting efforts, but this work exposed several additional issues. Thus the topics for this year’s Android microconference include:

Come and join us in helping to make the upstream Linux kernel work out of the box on your Android device!

We hope to see you there!

July 13, 2020 02:42 AM

July 11, 2020

Linux Plumbers Conference: GNU Tools Track Added to Linux Plumbers Conference 2020

We are pleased to announce that we have added an additional track to LPC 2020: the GNU Tools track. The track will run for the 5 days of the conference.
For more information please see the track wiki page.
The call for papers is now open and will close on July 31, 2020. To submit a proposal please refer to the wiki page above.

July 11, 2020 03:11 PM

Linux Plumbers Conference: Systems Boot and Security Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Systems Boot and Security Microconference has been accepted into the 2020 Linux Plumbers Conference!

Computer-system security is an important topic to many. Maintaining data security and system integrity is crucial for businesses and individuals. Computer security is paramount even at system boot up, as firmware attacks can compromise the system before the operating system starts. In order to keep the integrity of the system intact, both the firmware as well as the rest of the system must be vigilant in monitoring and preventing malware intrusion.

As a result of last year’s microconference, Oracle sent out patches to support TrenchBoot in the Linux kernel and in GRUB2. An agreement was also reached on problems with the TPM 2.0 Linux sysfs interface.

Over the past year, 3mdeb has been working on various open-source contributions to LandingZone, as well as to GRUB2 and the Linux kernel, to improve TrenchBoot support.

This year’s topics to be discussed include:

Come and join us in the discussion about how to keep your system secure even at bootup. We hope to see you there!

July 11, 2020 12:16 AM

July 06, 2020

Linux Plumbers Conference: Testing and Fuzzing Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Testing and Fuzzing Microconference has been accepted into the 2020 Linux Plumbers Conference!

Testing and fuzzing are crucial to the stability the Linux kernel demands. Last year’s meetup helped bring about KernelCI becoming a Linux Foundation hosted project, as well as collaboration between Red Hat CKI and KernelCI. On the more technical side, KUnit was merged upstream and KernelCI integration is underway, syzkaller reproducers are being included in the Linux Test Project, and Clang is integrated in KernelCI.

This year’s topics to be discussed include:

Come and join us in the discussion of keeping Linux the fastest moving, reliable piece of software in the world!

We hope to see you there!

July 06, 2020 03:12 PM

July 03, 2020

Linux Plumbers Conference: Linux Plumbers Conference is Not Sold Out

We’re really sorry, but apparently the Cvent registration site we use has suffered a bug which is causing it to mark the conference as “Sold Out” and, unfortunately, since today is the beginning of the American Independence day weekend, we can’t get anyone to fix it until Monday. However, rest assured there are plenty of places still available, so if you can wait until Monday, you should be able to register for the conference as soon as the site is fixed.

Again, we’re really sorry for the problem and the fact that fixing it will take a further three days.

July 03, 2020 05:32 PM

July 01, 2020

Linux Plumbers Conference: Networking and BPF Summit CfP Now Open

We are pleased to announce that the Call for Proposals for the Networking and BPF Summit at Linux Plumbers Conference 2020 is now open.

Please submit your proposals here.

Looking forward to seeing your great contributions!

July 01, 2020 10:41 PM

Linux Plumbers Conference: Announcing Town Hall #2: The Kernel Weather Report

Thank you to everyone who attended the Linux Plumbers town hall on June 25th. It was successful thanks to your participation. We’re pleased to announce another town hall on July 16th at 8am PDT / 11am EDT / 3pm GMT. This town hall will feature Jon Corbet of LWN giving “The Kernel Weather Report”.

The Linux kernel is at the core of any Linux system; the performance and capabilities of the kernel will, in the end, place an upper bound on what the system as a whole can do. This talk will review recent events in the kernel development community, discuss the current state of the kernel and the challenges it faces, and look forward to how the kernel may address those challenges. Attendees of any technical ability should gain a better understanding of how the kernel got to its current state and what can be expected in the near future.

Please note that the Plumbers Code of Conduct will be in effect for this event. We also plan to record this event. We will post the URL for the town hall on the LPC blog prior to the event. We hope to see you there and help make Plumbers the best conference for everyone.

July 01, 2020 10:02 PM

June 25, 2020

Linux Plumbers Conference: How to Join the LPC Town Hall

Please use the following link on Thursday June 25 2020 at 8am PDT/ 11am EDT/ 3pm GMT to join the LPC Town Hall:
Note that no account is necessary!

Please refer to the previous post about the Town Hall to get more info.
See you there!

June 25, 2020 12:41 AM

June 24, 2020

Matthew Garrett: Making my doorbell work

I recently moved house, and the new building has a Doorbird to act as a doorbell and open the entrance gate for people. There's a documented local control API (no cloud dependency!) and a Home Assistant integration, so this seemed pretty straightforward.

Unfortunately not. The Doorbird is on a separate network that's shared across the building, provided by Monkeybrains. We're also a Monkeybrains customer, so our network connection is plugged into the same router and antenna as the Doorbird one. And, as is common, there's port isolation between the networks in order to avoid leakage of information between customers. Rather perversely, we are the only people with an internet connection who are unable to ping my doorbell.

I spent most of the past few weeks digging myself out from under a pile of boxes, but we'd finally reached the point where spending some time figuring out a solution to this seemed reasonable. I spent a while playing with port forwarding, but that wasn't ideal - the only server I run is in the UK, and having packets round trip almost 11,000 miles so I could speak to something a few metres away seemed like a bad plan. Then I tried tethering an old Android device with a data-only SIM, which worked fine but only in one direction (I could see what the doorbell could see, but I couldn't get notifications that someone had pushed a button, which was kind of the point here).

So I went with the obvious solution - I added a wifi access point to the doorbell network, and my home automation machine now exists on two networks simultaneously (nmcli device modify wlan0 ipv4.never-default true is the magic for "ignore the gateway that the DHCP server gives you" if you want to avoid this), and I could now do link local service discovery to find the doorbell if it changed addresses after a power cut or anything. And then, like magic, everything worked - I got notifications from the doorbell when someone hit our button.

But knowing that an event occurred without actually doing something in response seems fairly unhelpful. I have a bunch of Chromecast targets around the house (a mixture of Google Home devices and Chromecast Audios), so just pushing a message to them seemed like the easiest approach. Home Assistant has a text to speech integration that can call out to various services to turn some text into a sample, and then push that to a media player on the local network. You can group multiple Chromecast audio sinks into a group that then presents as a separate device on the network, so I could then write an automation to push audio to the speaker group in response to the button being pressed.

That's nice, but it'd also be nice to do something in response. The Doorbird exposes API control of the gate latch, and Home Assistant exposes that as a switch. I'm using Home Assistant's Google Assistant integration to expose devices Home Assistant knows about to voice control. Which means when I get a house-wide notification that someone's at the door I can just ask Google to open the door for them.

So. Someone pushes the doorbell. That sends a signal to a machine that's bridged onto that network via an access point. That machine then sends a protobuf command to speakers on a separate network, asking them to stream a sample it's providing. Those speakers call back to that machine, grab the sample and play it. At this point, multiple speakers in the house say "Someone is at the door". I then say "Hey Google, activate the front gate" - the device I'm closest to picks this up and sends it to Google, where something turns my speech back into text. It then looks at my home structure data and realises that the "Front Gate" device is associated with my Home Assistant integration. It then calls out to the home automation machine that received the notification in the first place, asking it to trigger the front gate relay. That device calls out to the Doorbird and asks it to open the gate. And now I have functionality equivalent to a doorbell that completes a circuit and rings a bell inside my home, and a button inside my home that completes a circuit and opens the gate, except it involves two networks inside my building, callouts to the cloud, at least 7 devices inside my home that are running Linux and I really don't want to know how many computational cycles.

The future is wonderful.

(I work for Google. I do not work on any of the products described in this post. Please god do not ask me how to integrate your IoT into any of this)


June 24, 2020 08:25 AM

June 23, 2020

Linux Plumbers Conference: Registration for Linux Plumbers Conference 2020 is now open

Registration is now open for the 2020 edition of the Linux Plumbers Conference (LPC). It will be held August 24 – 28, virtually. Go to the attend page for more information.

Note that the CFPs for microconferences, refereed track talks, and BoFs are still open, please see this page for more information.

As always, please contact the organizing committee if you have questions.

June 23, 2020 09:30 PM

June 22, 2020

Linux Plumbers Conference: Kernel Dependability and Assurance Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Kernel Dependability & Assurance Microconference has been accepted into the 2020 Linux Plumbers Conference!

Linux is now being used in applications that are going to require a high degree of confidence that the kernel is going to behave as expected. Some of the key areas where we’re seeing Linux start to be used are medical devices, civil infrastructure, caregiving robots, automotives, etc. This brings up a number of concerns that must be addressed. What sort of uptime can we count on? Should safety analysis be reevaluated after a bug fix has been made? Are all the system requirements being satisfied by Linux? What tooling is there to answer these questions?

This microconference is the place where the kernel community can come together and discuss these major issues. Topics to be discussed include:

Come and join us in making the most popular operating system the most dependable as well. We hope to see you there!

June 22, 2020 01:52 PM

June 19, 2020

Linux Plumbers Conference: Announcing a Linux Plumbers Virtual Town Hall

The Linux Plumbers Committee is pleased to announce a Town Hall meeting on June 25 at 8am PDT/ 11am EDT/ 3pm GMT. This meeting serves two purposes. The first purpose is to test our remote conference set up. This is the first time we are holding Linux Plumbers virtually and while we can run simulated tests, it’s much more effective to test our setup with actual participants with differing hardware set ups around the world. The second purpose is to present on our planning and give everyone a little bit of an idea of what to expect when we hold Plumbers at the end of August. We plan to have time for questions.

Given this is a test, the number of participants will be capped at 250 people. The purpose of this test is to examine the scale to which the infrastructure can handle the expected demand for a virtual Linux Plumbers Conference. If you can’t make this day or are blocked by the participation cap from joining, we expect to be running more tests in the days to come.

Please note that the Plumbers Code of Conduct will be in effect for this event. We also plan to record this event as we will be recording sessions at the actual conference. We will post the URL for the town hall on the LPC blog prior to the event. We hope to see you there and help make Plumbers the best conference for everyone.

June 19, 2020 04:03 PM

June 16, 2020

Paul E. Mc Kenney: Stupid RCU Tricks: So you want to torture RCU?

Let's face it, using synchronization primitives such as RCU can be frustrating. And it is only natural to wish to get back, somehow, at the source of such frustration. In short, it is quite understandable to want to torture RCU. (And other synchronization primitives as well, but you have to start somewhere!) Another benefit of torturing RCU is that doing so sometimes uncovers bugs in other parts of the kernel. You see, RCU is not always willing to suffer alone.

One long-standing RCU-torture approach is to use modprobe and rmmod to install and remove the rcutorture module, as described in the torture-test documentation. However, this approach requires considerable manual work to check for errors.

On the other hand, this approach avoids any concern about the underlying architecture or virtualization technology. This means that use of modprobe and rmmod is the method of choice if you wish to torture RCU on (say) SPARC or when running on Hyper-V (this last according to people actually doing this). This method is also necessary when you want to torture RCU on a very specific kernel configuration or when you need to torture RCU on bare metal.

But for those of us running mainline kernels on x86 systems supporting virtualization, the approach described in the remainder of this document will usually be more convenient.

Running rcutorture in a Guest OS

If you have an x86 system (or, with luck, an ARMv8 or PowerPC system) set up to run qemu and KVM, you can instead use the rcutorture scripting, which automates running rcutorture over a full set of configurations, as well as automating analysis of the build products and console output. Running this can be as simple as:


As of v5.8-rc1, this will build and run each of nineteen combinations of Kconfig options, with each run taking 30 minutes for a total of 9.5 hours, not including the time required to build the kernel, boot the guest OS, and analyze the test results. Given that a number of the scenarios use only a single CPU, this approach can be quite wasteful, especially on the well-endowed systems of the year 2020.

This waste can be avoided by using the --cpus argument. For example, on the 12-hardware-thread laptop on which I am typing this, you could do the following:

tools/testing/selftests/rcutorture/bin/ --cpus 12

This command would run up to 12 CPUs worth of rcutorture scenarios concurrently, so that the nineteen combinations would be run in eight batches. Because TREE03 and TREE07 each want 16 CPUs, rcutorture will complain in its run summary as follows:

 --- Mon Jun 15 10:23:02 PDT 2020 Test summary:
Results directory: /home/git/linux/tools/testing/selftests/rcutorture/res/2020.06.15-10.23.02
tools/testing/selftests/rcutorture/bin/ --cpus 12 --duration 5 --trust-make
RUDE01 ------- 2102 GPs (7.00667/s) [tasks-rude: g0 f0x0 ]
SRCU-N ------- 42229 GPs (140.763/s) [srcu: g549860 f0x0 ]
SRCU-P ------- 11887 GPs (39.6233/s) [srcud: g110444 f0x0 ]
SRCU-t ------- 59641 GPs (198.803/s) [srcu: g1 f0x0 ]
SRCU-u ------- 59209 GPs (197.363/s) [srcud: g1 f0x0 ]
TASKS01 ------- 1029 GPs (3.43/s) [tasks: g0 f0x0 ]
TASKS02 ------- 1043 GPs (3.47667/s) [tasks: g0 f0x0 ]
TASKS03 ------- 1019 GPs (3.39667/s) [tasks: g0 f0x0 ]
TINY01 ------- 43373 GPs (144.577/s) [rcu: g0 f0x0 ] n_max_cbs: 34463
TINY02 ------- 46519 GPs (155.063/s) [rcu: g0 f0x0 ] n_max_cbs: 2197
TRACE01 ------- 756 GPs (2.52/s) [tasks-tracing: g0 f0x0 ]
TRACE02 ------- 559 GPs (1.86333/s) [tasks-tracing: g0 f0x0 ]
TREE01 ------- 8930 GPs (29.7667/s) [rcu: g64765 f0x0 ]
TREE02 ------- 17514 GPs (58.38/s) [rcu: g138645 f0x0 ] n_max_cbs: 18010
TREE03 ------- 15920 GPs (53.0667/s) [rcu: g159973 f0x0 ] n_max_cbs: 1025308
CPU count limited from 16 to 12
TREE04 ------- 10821 GPs (36.07/s) [rcu: g70293 f0x0 ] n_max_cbs: 81293
TREE05 ------- 16942 GPs (56.4733/s) [rcu: g123745 f0x0 ] n_max_cbs: 99796
TREE07 ------- 8248 GPs (27.4933/s) [rcu: g52933 f0x0 ] n_max_cbs: 183589
CPU count limited from 16 to 12
TREE09 ------- 39903 GPs (133.01/s) [rcu: g717745 f0x0 ] n_max_cbs: 83002

However, other than these two complaints, this is what the summary of an uneventful rcutorture run looks like.
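One way to get a feel for the summary lines is to parse them. The following Python sketch is not part of the rcutorture scripting; it assumes only the line layout visible in the summary above (scenario name, a count of GPs, and a per-second rate in parentheses), and the names LINE_RE and parse_summary are made up for illustration. Dividing the count by the rate gives roughly 300 seconds for each sampled scenario, which is consistent with the --duration 5 shown in the summary's command line.

```python
import re

# Pattern for one scenario line of the run summary, for example:
# "RUDE01 ------- 2102 GPs (7.00667/s) [tasks-rude: g0 f0x0 ]"
# (Hypothetical helper; the layout is inferred from the sample output only.)
LINE_RE = re.compile(
    r"^(?P<name>\S+) -+ (?P<gps>\d+) GPs \((?P<rate>[\d.]+)/s\)"
)


def parse_summary(text):
    """Return {scenario: (gp_count, gps_per_second)} for matching lines."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            results[m["name"]] = (int(m["gps"]), float(m["rate"]))
    return results


# A few lines copied from the summary above.
SAMPLE = """\
RUDE01 ------- 2102 GPs (7.00667/s) [tasks-rude: g0 f0x0 ]
SRCU-N ------- 42229 GPs (140.763/s) [srcu: g549860 f0x0 ]
TREE03 ------- 15920 GPs (53.0667/s) [rcu: g159973 f0x0 ] n_max_cbs: 1025308
CPU count limited from 16 to 12
"""

runs = parse_summary(SAMPLE)
for name, (gps, rate) in runs.items():
    # Count divided by rate recovers the approximate run time in seconds.
    print(f"{name}: {gps} GPs at {rate}/s -> about {gps / rate:.0f} s")
```

Lines that do not follow the scenario layout, such as the "CPU count limited" complaint, are simply skipped by this sketch.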

Whatever is the meaning of all those numbers in the summary???

The console output for each run and much else besides may be found in the /home/git/linux/tools/testing/selftests/rcutorture/res/2020.06.15-10.23.02 directory called out above.

The more CPUs you have, the fewer batches are required:


If you specify more CPUs than your system actually has, the script will ignore your fantasies in favor of your system's reality.

Specifying Specific Scenarios

Sometimes it is useful to take one's ire out on a specific type of RCU, for example, SRCU. You can use the --configs argument to select specific scenarios:

tools/testing/selftests/rcutorture/bin/ --cpus 12 \
    --configs "SRCU-N SRCU-P SRCU-t SRCU-u"

This runs in two batches, but the second batch uses only two CPUs, which is again wasteful. Given that SRCU-P requires eight CPUs, SRCU-N four CPUs, and SRCU-t and SRCU-u one each, it would cost nothing to run two instances of each of these scenarios other than SRCU-N as follows:

tools/testing/selftests/rcutorture/bin/ --cpus 12 \
    --configs "SRCU-N 2*SRCU-P 2*SRCU-t 2*SRCU-u"
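The batch arithmetic above can be checked with a small model. The Python sketch below is only an illustration: it assumes a first-fit-decreasing packing, which may well differ from what the rcutorture scripting actually does, and the helper names expand and pack are hypothetical. The per-scenario CPU counts are the ones stated above.

```python
def expand(configs):
    """Expand a --configs style string such as "SRCU-N 2*SRCU-P" into a list."""
    names = []
    for token in configs.split():
        count, star, name = token.partition("*")
        if star:
            names.extend([name] * int(count))
        else:
            names.append(token)
    return names


def pack(configs, cpus, need):
    """Pack scenarios into batches of at most `cpus` CPUs.

    Assumption: first-fit decreasing, chosen for illustration only.
    """
    batches = []  # each batch is a list of scenario names
    for name in sorted(expand(configs), key=lambda n: -need[n]):
        for batch in batches:
            if sum(need[n] for n in batch) + need[name] <= cpus:
                batch.append(name)
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([name])
    return batches


# CPU requirements as stated in the text above.
NEED = {"SRCU-P": 8, "SRCU-N": 4, "SRCU-t": 1, "SRCU-u": 1}

for cfg in ("SRCU-N SRCU-P SRCU-t SRCU-u",
            "SRCU-N 2*SRCU-P 2*SRCU-t 2*SRCU-u"):
    used = [sum(NEED[n] for n in batch) for batch in pack(cfg, 12, NEED)]
    print(cfg, "->", used)
```

Under these assumptions, the first list needs two batches using 12 and 2 CPUs, while the doubled-up list still needs only two batches, each using all 12 CPUs. The sketch does not model the CPU-count limiting that applies to scenarios wanting more CPUs than are available.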

This same notation can be used to run multiple copies of the entire list of scenarios. For example (again, in v5.7), a system with 384 CPUs can use --configs 4*CFLIST to run four copies of the full set of scenarios as follows:

tools/testing/selftests/rcutorture/bin/ --cpus 384 --configs "4*CFLIST"

Mixing and matching is permissible, for example:

tools/testing/selftests/rcutorture/bin/ --cpus 384 --configs "3*CFLIST 12*TREE02"

A script that is to run on a wide variety of systems can benefit from --allcpus (expected to appear in v5.9), which acts like --cpus N, where N is the number of CPUs on the current system:

tools/testing/selftests/rcutorture/bin/ --allcpus --configs "3*CFLIST 12*TREE02"
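The relationship between --cpus, --allcpus, and the system's real CPU count can be summarized in a few lines. This Python sketch is a model of the behavior described in this post, not the script's own logic; the function name effective_cpus and its parameters are hypothetical.

```python
import os


def effective_cpus(requested=None, available=None):
    """Model of the CPU count a run would use.

    requested=None plays the role of --allcpus; an integer plays --cpus N.
    `available` defaults to what the OS reports and is a parameter only so
    that the function can be exercised with fixed numbers.
    """
    if available is None:
        available = os.cpu_count() or 1
    if requested is None:
        return available
    # Requests beyond the system's real CPU count are clamped.
    return min(requested, available)


print(effective_cpus(384, available=12))   # request exceeds the system
print(effective_cpus(None, available=12))  # --allcpus style
print(effective_cpus(4, available=12))     # modest request is honored
```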

Build time can dominate when running a large number of short-duration runs, for example, when chasing down a low-probability non-deterministic boot-time failure. Use of --trust-make can be very helpful in this case:

tools/testing/selftests/rcutorture/bin/ --cpus 384 --duration 2 \
    --configs "1000*TINY01" --trust-make

Without --trust-make, rcutorture will play it safe by forcing your source tree to a known state between each build. In addition to --trust-make, there are a number of tools such as ccache that can also greatly reduce build times.

Locating Test Failures

Although the ability to automatically run many tens of scenarios can be very convenient, it can also cause significant eyestrain staring through a long “summary” checking for test failures. Therefore, if there are failures, this is noted at the end of the summary, for example, as shown in the following abbreviated output from a --configs "28*TREE03" run:

TREE03.8 ------- 1195094 GPs (55.3284/s) [rcu: g11475633 f0x0 ] n_max_cbs: 1449125
TREE03.9 ------- 1202936 GPs (55.6915/s) [rcu: g11572377 f0x0 ] n_max_cbs: 1514561
3 runs with runtime errors.

Of course, picking the three errors out of the 28 runs can also cause eyestrain, so there is yet another useful little script:

tools/testing/selftests/rcutorture/bin/ \

This will run your editor on the make output for each build error and on the console output for each runtime failure, greatly reducing eyestrain. Users of vi can also edit a summary of the runtime errors from each failing run as follows:

vi /home/git/linux/tools/testing/selftests/rcutorture/res/2020.06.15-10.23.02/*/console.log.diags
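For those who would rather get a list than open an editor, the same glob can be walked programmatically. The Python sketch below assumes that a console.log.diags file is present only in the directories of runs that had something to report, which is an assumption on my part rather than something stated above; the function name failing_runs and the directory names in the demonstration are made up.

```python
import tempfile
from pathlib import Path


def failing_runs(results_dir):
    """Return names of scenario subdirectories containing console.log.diags."""
    root = Path(results_dir)
    return sorted(p.parent.name for p in root.glob("*/console.log.diags"))


# Demonstration on a throwaway directory tree with made-up run names.
with tempfile.TemporaryDirectory() as tmp:
    for name, failed in (("TREE03.1", False), ("TREE03.2", True)):
        run = Path(tmp) / name
        run.mkdir()
        (run / "console.log").write_text("console output\n")
        if failed:
            (run / "console.log.diags").write_text("diagnostic summary\n")
    print(failing_runs(tmp))
```

Pointing failing_runs() at a real results directory, such as the one named in the test summary, would list the per-scenario subdirectories worth a closer look.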

Enlisting Torture Assistance

If rcutorture produces a failure-free run, that is a failure on the part of rcutorture. After all, there are bugs in there somewhere, and rcutorture failed to find them!

One approach is to increase the duration, for example, to 12 hours (also known as 720 minutes):

tools/testing/selftests/rcutorture/bin/ --cpus 12 --duration 720

Another approach is to enlist the help of other in-kernel torture features, for example, lockdep. The --kconfig parameter can be used to this end:

tools/testing/selftests/rcutorture/bin/ --cpus 12 --configs "TREE03" \

The aid of the kernel address sanitizer (KASAN) can be enlisted using the --kasan argument:

tools/testing/selftests/rcutorture/bin/ --cpus 12 --kasan

The kernel concurrency sanitizer (KCSAN) can also be brought to bear, but proper use of KCSAN requires some thought (see part 1 and part 2 of the LWN “Concurrency bugs should fear the big bad data-race detector” article) and also version 11 or later of Clang/LLVM (and a patch for GCC has been accepted). Once you have all of that in place, the --kcsan argument invokes KCSAN and also generates a summary as described in part 1 of the aforementioned LWN article. Note again that only very recent compiler versions (such as Clang-11) support KCSAN, so a --kmake "CC=clang-11" or similar argument might also be necessary.

Selective Torturing

Sometimes enlisting debugging aid is the best approach, but other times greater selectivity is the best way forward.

Sometimes simply building a kernel is torture enough, especially when building with unusual Kconfig options (see the discussion of --kconfig above). In this case, specifying the --buildonly argument will build the kernels, but refrain from running them. This approach can also be useful for running multiple copies of the resulting binaries on multiple systems: You can use --buildonly to build the kernels and qemu-cmd scripts, and then run these files on the other systems, given suitable adjustments to the qemu-cmd scripts.

Other times it is useful to torture some specific portion of RCU. For example, one wishing to vent their ire solely on expedited grace periods could add --bootargs "rcutorture.gp_exp=1" to the command line. This argument causes rcutorture to run a stress test using only expedited RCU grace periods, which can be helpful when attempting to work out whether a too-short RCU grace period is due to a bug in the normal or the expedited grace-period code. Similarly, the callback-flooding aspects of rcutorture stress testing can be disabled using --bootargs "rcutorture.fwd_progress=0". It is possible to specify both in one run using --bootargs "rcutorture.gp_exp=1 rcutorture.fwd_progress=0".

Enlisting Debugging Assistance

Still other times, it is helpful to enable event tracing. For example, if the rcu_barrier() event traces are of interest, use --bootargs "trace_event=rcu:rcu_barrier". The trace buffer will be dumped automatically upon specific rcutorture failures. If the failure mode is instead a non-rcutorture-specific oops, use this: --bootargs "trace_event=rcu:rcu_barrier ftrace_dump_on_oops". If it is also necessary to dump the trace buffers on warnings, a (heavy handed) way to achieve this is to use --bootargs "trace_event=rcu:rcu_barrier ftrace_dump_on_oops panic_on_warn".

If you have many tens of rcutorture instances that all decide to flush their trace buffers at about the same time, the combined flushing operations can take considerable time, especially if the underlying system features rotating rust. If only the most recent activity is of interest, specifying a small trace buffer can help: --bootargs "trace_event=rcu:rcu_barrier ftrace_dump_on_oops panic_on_warn trace_buf_size=3k".

If only the oopsing/warning CPU's traces are relevant, the orig_cpu modifier can be helpful: --bootargs "trace_event=rcu:rcu_barrier ftrace_dump_on_oops=orig_cpu panic_on_warn trace_buf_size=3k".
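Because these boot-parameter strings grow one piece at a time, a wrapper script might assemble them from a list. The Python sketch below is only an illustration of that idea; the helper name bootargs_option is hypothetical, and the parameters are the ones discussed above.

```python
import shlex


def bootargs_option(params):
    """Return a --bootargs option with the parameters joined by spaces."""
    # shlex.quote() adds shell quoting only when the joined string needs it.
    return "--bootargs " + shlex.quote(" ".join(params))


params = ["trace_event=rcu:rcu_barrier"]        # trace rcu_barrier() events
params.append("ftrace_dump_on_oops=orig_cpu")   # dump only the oopsing CPU's buffer
params.append("panic_on_warn")                  # also dump on warnings (heavy handed)
params.append("trace_buf_size=3k")              # keep only the most recent activity
print(bootargs_option(params))
```

The result uses single quotes where the examples above use double quotes, which the shell treats equivalently for these strings.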

More information on tracing can be found in Documentation/trace, and more on kernel boot parameters in general may be found in kernel-parameters.txt. Given the multi-thousand-line heft of this latter, there is clearly great scope for tweaking your torturing of RCU!

Why Stop at Torturing RCU?

After all, locking can sometimes be almost as annoying as RCU. And it is possible to torture locking:

tools/testing/selftests/rcutorture/bin/ --allcpus --torture lock

This locktorture stress test does not get as much love and attention as does rcutorture, but it is at least a start.

There are also a couple of RCU performance tests and an upcoming smp_call_function*() stress test that use this same torture-test infrastructure. Please note that the details of the summary output vary from test to test.

In short, you can do some serious torturing of RCU, and much else besides! So show them no mercy!!! :-)

June 16, 2020 08:42 PM

James Morris: Linux Security Summit North America 2020: Online Schedule

Just a quick update on the Linux Security Summit North America (LSS-NA) for 2020.

The event will take place over two days as an online event, due to COVID-19.  The dates are now July 1-2, and the full schedule details may be found here.

The main talks are:

There are also short (30 minute) topics:

This year we will also have a Q&A panel at the end of each day, moderated by Elena Reshetova. The panel speakers are:

LSS-NA this year is included with OSS+ELC registration, which is USD $50 all up.  Register here.

Hope to see you soon!

June 16, 2020 01:28 AM

June 15, 2020

Linux Plumbers Conference: Linux Plumbers Conference Registration Opening Postponed

The committee is relentlessly working on recreating online the Linux Plumbers Conference (LPC) experience that we have all come to appreciate, and take for granted, over the past few years.

We had initially planned to open registration on June 15th. While travel planning is not one of them, there are still very many aspects of the conference being worked on. We are now aiming to open registration for Linux Plumbers Conference (LPC) on June 23rd.

Right now we have shortlisted BigBlueButton as our online conferencing solution. One of our objectives is to run LPC 2020 online on a full open software stack.

We anticipate running our usual set of parallel tracks, including microconferences per day. With our globally distributed participants, identifying the timezone most convenient is still work in progress. There will be a timezone question on our registration form, please make sure to answer it.

To help us test part of the online platform, and offer transparency about where things stand with LPC 2020 preparation, the committee is currently planning the first ever “LPC Town Hall Meeting”. We hope to host it very soon. More information will be made available very soon.

As previously announced, we are reducing the conference registration fee to US$50. Registration availability has been an issue in past years. We have no way to anticipate what the uptake will be for LPC 2020 registration. The committee will try its best to meet registration demand. Also, several Call for Proposals are open and awaiting your contributions.

We will be sharing more information with everyone here soon. Looking forward to LPC 2020 together with you.

June 15, 2020 05:20 PM

Linux Plumbers Conference: Real-time Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Real-time Microconference has been accepted into the 2020 Linux Plumbers Conference!

After another successful Real-time microconference at LPC last year, there’s still more work to be done. The PREEMPT_RT patch set (aka “The Real-Time Patch”) was created in 2004 in the effort to make Linux into a hard real-time designed operating system. Over the years much of the RT patch has made it into mainline Linux, which includes: mutexes, lockdep, high resolution timers, Ftrace, RCU_PREEMPT, priority inheritance, threaded interrupts and much more. There’s just a little left to get RT fully into mainline, and the light at the end of the tunnel is finally in view. It is expected that the RT patch will be in mainline within a year (and possibly before Plumbers begins!), which changes the topics of discussion. Once it is in Linus’s tree, a whole new set of issues must be handled.

The focus of this year’s Plumbers events will include:

June 15, 2020 02:19 PM

June 11, 2020

Linux Plumbers Conference: Scheduler Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Scheduler Microconference has been accepted into the 2020 Linux Plumbers Conference!

The scheduler is an important part of the Linux kernel, as it decides what gets to run, when, and for how long. With different topologies and workloads, giving the user the best experience possible is no easy task. During the Scheduler microconference at LPC last year, we started the work to make SCHED_DEADLINE safe for kthreads and to improve load balancing. This year, we continue working on core scheduling, unifying the interface for TurboSched and task latency nice, and continue the discussion on proxy execution.

Topics to be discussed include:

Come and join us in the discussion of controlling what tasks get to run on your machine and when. We hope to see you there!

June 11, 2020 02:11 PM

June 09, 2020

Michael Kerrisk (manpages): man-pages-5.07 is released

I've released man-pages-5.07. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports, reviews, and comments from more than 80 contributors. The release includes more than 380 commits that change more than 380 pages. One new page was added in this release, and one page was removed.

The most notable of the changes in man-pages-5.07 are the following:

June 09, 2020 05:09 PM

May 27, 2020

Kees Cook: security things in Linux v5.5

Previously: v5.4.

I got a bit behind on this blog post series! Let’s get caught up. Here are a bunch of security things I found interesting in the Linux kernel v5.5 release:

restrict perf_event_open() from LSM
Given the recurring flaws in the perf subsystem, there has been a strong desire to be able to entirely disable the interface. While the kernel.perf_event_paranoid sysctl knob has existed for a while, attempts to extend its control to “block all perf_event_open() calls” have failed in the past. Distribution kernels have carried the rejected sysctl patch for many years, but now Joel Fernandes has implemented a solution that was deemed acceptable: instead of extending the sysctl, add LSM hooks so that LSMs (e.g. SELinux, AppArmor, etc) can make these choices as part of their overall system policy.

generic fast full refcount_t
Will Deacon took the recent refcount_t hardening work for both x86 and arm64 and distilled the implementations into a single architecture-agnostic C version. The result was almost as fast as the x86 assembly version, but it covered more cases (e.g. increment-from-zero), and is now available by default for all architectures. (There is no longer any Kconfig associated with refcount_t; the use of the primitive provides full coverage.)

linker script cleanup for exception tables
When Rick Edgecombe presented his work on building Execute-Only memory under a hypervisor, he noted a region of memory that the kernel was attempting to read directly (instead of execute). He rearranged things for his x86-only patch series to work around the issue. Since I’d just been working in this area, I realized the root cause of this problem was the location of the exception table (which is strictly a lookup table and is never executed) and built a fix for the issue and applied it to all architectures, since it turns out the exception tables for almost all architectures are just a data table. Hopefully this will help clear the path for more Execute-Only memory work on all architectures. In the process of this, I also updated the section fill bytes on x86 to be a trap (0xCC, int3), instead of a NOP instruction so functions would need to be targeted more precisely by attacks.

KASLR for 32-bit PowerPC
Joining many other architectures, Jason Yan added kernel text base-address offset randomization (KASLR) to 32-bit PowerPC.

seccomp for RISC-V
After a bit of a long road, David Abdurachmanov has added seccomp support to the RISC-V architecture. The series uncovered some more corner cases in the seccomp self tests code, which is always nice since then we get to make it more robust for the future!

seccomp USER_NOTIF continuation
When the seccomp SECCOMP_RET_USER_NOTIF interface was added, it seemed like it would only be used in very limited conditions, so the idea of needing to handle “normal” requests didn’t seem very onerous. However, since then, it has become clear that having a monitor process perform lots of “normal” open() calls on behalf of the monitored process is slow and fragile. To deal with this, there needed to be a way for the USER_NOTIF interface to indicate that seccomp should just continue as normal and allow the syscall without any special handling. Christian Brauner implemented SECCOMP_USER_NOTIF_FLAG_CONTINUE to get this done. It comes with a bit of a disclaimer due to the chance that monitors may use it in places where ToCToU is a risk, and for possible conflicts with SECCOMP_RET_TRACE. But overall, this is a net win for container monitoring tools.

Some EFI systems provide a Random Number Generator interface, which is useful for gaining some entropy in the kernel during very early boot. The arm64 boot stub has been using this for a while now, but Dominik Brodowski has now added support for x86 to do the same. This entropy is useful for kernel subsystems performing very early initialization where random numbers are needed (like randomizing aspects of the SLUB memory allocator).

As has been enabled on many other architectures, Dmitry Korotin got MIPS building with CONFIG_FORTIFY_SOURCE, so compile-time (and some run-time) buffer overflows during calls to the memcpy() and strcpy() families of functions will be detected.

limit copy_{to,from}_user() size to INT_MAX
As done for VFS, vsnprintf(), and strscpy(), I went ahead and limited the size of copy_to_user() and copy_from_user() calls to INT_MAX in order to catch any weird overflows in size calculations.
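
To show the kind of bug this catches, here is a small Python model of the check (not the kernel code itself). It assumes a 64-bit size_t, and the helper names as_size_t and copy_size_allowed are made up for illustration.

```python
INT_MAX = 2**31 - 1
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t


def as_size_t(value: int) -> int:
    """Mimic C's conversion of a signed result to an unsigned 64-bit size_t."""
    return value & SIZE_MAX


def copy_size_allowed(nbytes: int) -> bool:
    """Model of the limit: refuse any copy larger than INT_MAX bytes."""
    return as_size_t(nbytes) <= INT_MAX


# A length computed as "end - start" with the operands swapped goes negative,
# which as a size_t is an enormous value that the limit rejects.
buggy_len = as_size_t(100 - 4096)
```

No legitimate user copy comes anywhere near 2 GiB in one call, so a size above INT_MAX is almost certainly the result of arithmetic gone wrong rather than a real request.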

Other things
Alexander Popov pointed out some more v5.5 features that I missed in this blog post. I’m repeating them here, with some minor edits/clarifications. Thank you Alexander!

Edit: added Alexander Popov’s notes

That’s it for v5.5! Let me know if there’s anything else that I should call out here. Next up: Linux v5.6.

© 2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

May 27, 2020 08:04 PM

Rusty Russell: 57 Varieties of Pyrite: Exchanges Are Now The Enemy of Bitcoin

TL;DR: exchanges are casinos and don’t want to onboard anyone into bitcoin. Avoid.

There’s a classic scam in the “crypto” space: advertise Bitcoin to get people in, then sell suckers something else entirely. Over the last few years, this bait-and-switch has become the core competency of “bitcoin” exchanges.

I recently visited the homepage of an Australian exchange: what a mess. There was a list of dozens of identical-looking “cryptos”, with bitcoin second after something called “XRP”; seems like it was sorted by volume?

Incentives have driven exchanges to become casinos, and they’re doing exactly what you’d expect unregulated casinos to do. This is no place you ever want to send anyone.

Incentives For Exchanges

Exchanges make money on trading, not on buying and holding. Despite the fact that bitcoin is the only real attempt to create an open source money, scams with no future are given false equivalence, because more assets means more trading. Worse than that, they are paid directly to list new scams (the crappier, the more money they can charge!) and have recently taken the logical step of introducing and promoting their own crapcoins directly.

It’s like a gold dealer who also sells 57 varieties of pyrite, which give more margin than selling actual gold.

For a long time, I thought exchanges were merely incompetent. Most can’t even give out fresh addresses for deposits, batch their outgoing transactions, pay competent fee rates, perform RBF or use segwit.

But I misunderstood: they don’t want to sell bitcoin. They use bitcoin to get you in the door, but they want you to gamble. This matters: you’ll find subtle and not-so-subtle blockers to simply buying bitcoin on an exchange. If you send a friend off to buy their first bitcoin, they’re likely to come back with something else. That’s no accident.

Looking Deeper, It Gets Worse.

Regrettably, looking harder at specific exchanges makes the picture even bleaker.

Consider Binance: this mainland China backed exchange pretending to be a Hong Kong exchange appeared out of nowhere with fake volume and demonstrated the gullibility of the entire industry by being treated as if it were a respected member. They lost at least 40,000 bitcoin in a known hack, and they also lost all the personal information people sent them to KYC. They aggressively market their own coin. But basically, they’re just MtGox without Mark Karpelès’ PHP skills or moral scruples and much better marketing.

Coinbase is more interesting: an MBA-run “bitcoin” company which really dislikes bitcoin. They got where they are by spending big on regulatory compliance in the US so they could operate in (almost?) every US state. (They don’t do much to dispel the wide belief that this regulation protects their users, when in practice it seems only USD deposits have any guarantee). Their natural interest is in increasing regulation to maintain that moat, and their biggest problem is Bitcoin.

They have much more affinity for the centralized coins (Ethereum) where they can have influence and control. The anarchic nature of a genuine open source community (not to mention the developers’ oft-stated aim to improve privacy over time) is not culturally compatible with a top-down company run by the Big Dog. It’s a running joke that their CEO can’t say the word “Bitcoin”, but their recent “what will happen to cryptocurrencies in the 2020s” article is breathtaking in its boldness: innovation is mainly happening on altcoins, and they’re going to overtake bitcoin any day now. Those scaling problems which the Bitcoin developers say they don’t know how to solve? This non-technical CEO knows better.

So, don’t send anyone to an exchange, especially not a “market leading” one. Find some service that actually wants to sell them bitcoin, like CashApp or Swan Bitcoin.

May 27, 2020 12:49 AM

May 20, 2020

Dave Airlie (blogspot): DirectX on Linux - what it is/isn't

This morning I saw two things that were Microsoft and Linux graphics related.

a) DirectX on Linux for compute workloads
b) Linux GUI apps on Windows

At first I thought these were related, but it appears at least presently these are quite orthogonal projects.

First up, to clarify for the people who jump to insane conclusions:

The DX on Linux is a WSL2 only thing. Microsoft are not in any way bringing DX12 to Linux outside of the Windows environment. They are also in no way open sourcing any of the DX12 driver code. They are recompiling the DX12 userspace drivers (from GPU vendors) into Linux shared libraries, and running them on a kernel driver shim that transfers the kernel interface up to the closed source Windows kernel driver. This is in no way useful for having DX12 on Linux baremetal or anywhere other than in a WSL2 environment. It is not useful for Linux gaming.

Microsoft have submitted to the upstream kernel the shim driver to support this. This driver exposes their D3DKMT kernel interface from Windows over virtual channels into a Linux driver that provides an ioctl interface. The kernel drivers are still all running on the Windows side.

Now I read the Linux GUI apps bit and assumed that these things were the same, but it turns out the DX12 stuff doesn't address presentation at all. It's currently only for compute/ML workloads using CUDA/DirectML. There isn't a way to put the results of DX12 rendering from the Linux guest applications onto the screen at all. The other project is a wayland/RDP integration server that connects Linux apps via wayland to an RDP client on the Windows display. Integrating that with DX12 will be a tricky project, and then integrating that upstream with the Linux stack is another step completely.

Now I'm sure this will be resolved, but it has certain implications on how the driver architecture works and how much of the rest of the Linux graphics ecosystem you have to interact with, and that means that the current driver might not be a great fit in the long run and upstreaming it prematurely might be a bad idea.

From my point of view the kernel shim driver doesn't really bring anything to Linux, it's just a tunnel for some binary data between a host windows kernel binary and a guest linux userspace binary. It doesn't enhance the Linux graphics ecosystem in any useful direction, and as such I'm questioning why we'd want this upstream at all.

May 20, 2020 12:01 AM

May 19, 2020

Linux Plumbers Conference: Containers and Checkpoint/Restore Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Containers and Checkpoint/Restore Microconference has been accepted into the 2020 Linux Plumbers Conference!

After another successful Containers Microconference last year, there’s still a lot more work to be done. Last year we discussed the intersection between the new mount api and containers, various new vfs features including a strong and fruitful discussion about id shifting, several new security hardening aspects, and improvements when restarting syscalls during checkpoint/restore. Last year’s microconference topics led to quite a few patches that have since landed in the upstream kernel with others actively being discussed. This includes various improvements to seccomp syscall interceptions, the implementation of a new process creation syscall, the implementation of pidfds, and the addition of time namespaces.

This year’s topics include:

Come join us and participate in the discussion with what holds “The Cloud” together.

We hope to see you there!

Christian, Mike, Stéphane

May 19, 2020 04:19 PM

May 15, 2020

Linux Plumbers Conference: Linux Plumbers Conference 2020 Goes Virtual

As previously promised, we are announcing today that we have decided to hold the Linux Plumbers Conference 2020 virtually instead of in person. We value the safety and health of our community and do not wish to expose anyone to unnecessary risks.

We do appreciate that it is the in-person aspect of plumbers (the hallway track) which attendees find the most valuable. An online Linux Plumbers Conference will clearly be different from past events. We are working hard to find ways to preserve as much of the LPC experience as we can while also taking advantage of any new opportunities that the online setting offers us. Since we no longer have many of the fixed expenses of an in-person conference, we are able to reduce the registration fee to $50. In addition we are pushing back the opening of registration to June 15 2020.

We’ll provide more details as we figure them out, thanks for your patience and support.

Do not forget to send your contribution.

We do have great proposals and if you have submitted, thank you very much. Our microconference capacity is filling up quickly, so if you want your microconference to be considered, act now! We are still looking for proposals for refereed talks as well.


The LPC 2020 Planning Committee

May 15, 2020 04:08 PM

May 14, 2020

Linux Plumbers Conference: Call for Microconferences and Refereed Talks track reopened

We are pleased to announce that we have reopened the call for both refereed talks and microconferences. Due to the current global situation with the Covid-19 pandemic we wanted to give everybody a longer time window to submit proposals.

Submit your proposals here:

Stay tuned for further upcoming communications and updates about Linux Plumbers Conference 2020.


May 14, 2020 07:15 PM

May 08, 2020

Pete Zaitcev: Recruiter spam

Recruitment spam, like conference spam, is a boring part of life. However, it raises an eyebrow sometimes.

A few days ago, a Facebook recruiter, JP Fenn, sent me a form e-mail to an address that I do not give to anyone. It is only visible as a contact for one of my domains, because the registrar does not believe in privacy. I was pondering if I should propose to give him a consideration in exchange for the explanation of just where he obtained the address. Purely out of curiosity.

Today, an Amazon recruiter, Jonte, sent a message to an appropriate address. But he did it with addresses in the To: header, not just the envelope. He used a hosted Exchange of all things, and there were 294 addresses in total. That should give you an idea just how hard these people work to spam and at what level of being disposable I am in their eyes.

It really is pure spam. I think it's likely that JP bought or stole a spam database. He didn't write a Python script that scraped whois information.

I remember a viral story a few years ago how one guy got a message from Google recruiter that combined his LinkedIn interests in amusing ways. It went like "we like people whose strength is Talking Like A Pirate. As for Telling Strangers On The Internet They Were Wrong, that's one of my favorite pastimes as well." You know you made it when you receive that kind of attention. Maybe one day!

May 08, 2020 09:22 PM

April 21, 2020

Matthew Garrett: Linux kernel lockdown, integrity, and confidentiality

The Linux kernel lockdown patches were merged into the 5.4 kernel last year, which means they're now part of multiple distributions. For me this was a 7-year journey, which means it's easy to forget that others aren't as invested in the code as I am. Here's what these patches are intended to achieve, why they're implemented in the current form and what people should take into account when deploying the feature.

Root is a user - a privileged user, but nevertheless a user. Root is not identical to the kernel. Processes running as root still can't dereference addresses that belong to the kernel, are still subject to the whims of the scheduler and so on. But historically that boundary has been very porous. Various interfaces make it straightforward for root to modify kernel code (such as loading modules or using /dev/mem), while others make it less straightforward (being able to load new ACPI tables that can cause the ACPI interpreter to overwrite the kernel, for instance). In the past that wasn't seen as a significant issue, since there were no widely deployed mechanisms for verifying the integrity of the kernel in the first place. But once UEFI secure boot became widely deployed, this was a problem. If you verify your boot chain but allow root to modify that kernel, the benefits of the verified boot chain are significantly reduced. Even if root can't modify the on-disk kernel, root can just hot-patch the kernel and then make this persistent by dropping a binary that repeats the process on system boot.

Lockdown is intended as a mechanism to avoid that, by providing an optional policy that closes off interfaces that allow root to modify the kernel. This was the sole purpose of the original implementation, which maps to the "integrity" mode that's present in the current implementation. Kernels that boot in lockdown integrity mode prevent even root from using these interfaces, increasing assurances that the running kernel corresponds to the booted kernel. But lockdown's functionality has been extended since then. There are some use cases where preventing root from being able to modify the kernel isn't enough - the kernel may hold secret information that even root shouldn't be permitted to see (such as the EVM signing key that can be used to prevent offline file modification), and the integrity mode doesn't prevent that. This is where lockdown's confidentiality mode comes in. Confidentiality mode is a superset of integrity mode, with additional restrictions on root's ability to use features that would allow the inspection of any kernel memory that could contain secrets.
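
To see which mode a running system is in, the kernel exposes the state through securityfs, listing all modes with the active one in square brackets. The following is a minimal Python sketch for reading it; the function names are made up for illustration, and the path assumes securityfs is mounted in the usual place.

```python
import re

LOCKDOWN_PATH = "/sys/kernel/security/lockdown"


def parse_lockdown(text: str) -> str:
    """Return the bracketed (active) mode from the lockdown file's contents."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


def current_lockdown_mode(path: str = LOCKDOWN_PATH) -> str:
    """Read and parse the lockdown file; raises OSError if it is absent."""
    with open(path) as f:
        return parse_lockdown(f.read())
```

On a kernel booted in integrity mode the file reads something like "none [integrity] confidentiality", so parse_lockdown returns "integrity". A kernel built without lockdown support has no such file, and current_lockdown_mode raises OSError.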

Unfortunately right now we don't have strong mechanisms for marking which bits of kernel memory contain secrets, so in order to achieve that we end up blocking access to all kernel memory. Unsurprisingly, this compromises people's ability to inspect the kernel for entirely legitimate reasons, such as using the various mechanisms that allow tracing and probing of the kernel.

How can we solve this? There's a few ways:

  1. Introduce a mechanism to tag memory containing secrets, and only restrict accesses to this. I've tried to do something similar for userland and it turns out to be hard, but this is probably the best long-term solution.
  2. Add support for privileged applications with an appropriate signature that implement policy on the userland side. This is actually possible already, though not straightforward. Lockdown is implemented in the LSM layer, which means the policy can be imposed using any other existing LSM. As an example, we could use SELinux to impose the confidentiality restrictions on most processes but permit processes with a specific SELinux context to use them, and then use EVM to ensure that any process running in that context has a legitimate signature. This is quite a few hoops for a general purpose distribution to jump through.
  3. Don't use confidentiality mode in general purpose distributions. The attacks it protects against are mostly against special-purpose use cases, and they can enable it themselves.

My recommendation is for (3), and I'd encourage general purpose distributions that enable lockdown to do so only in integrity mode rather than confidentiality mode. The cost of confidentiality mode is just too high compared to the benefits it provides. People who need confidentiality mode probably already know that they do, and should be in a position to enable it themselves and handle the consequences.


April 21, 2020 08:21 PM

April 15, 2020

Pete Zaitcev: Seagate and SMR in 2020

Back in 2015, I wrote about Seagate Kinetic and its relation to shingles in Seagate products. Unfortunately, even if Kinetic were a success, it would only support a fraction of workloads. But the rest of Seagate customers demanded density increases. So, to nobody's surprise, Seagate started including shingles in their general purpose disk drives, perhaps only for a part of the surface, or coupled with a flash cache. The company was an enthusiastic early adopter of hybrid drives, as a vendor. Journalists are trying to make a story out of it, because caches are only caches, and once you start spilling, the drive slows down to the shingle speed. But naturally, Seagate neglected to mention in their documentation just how exactly their drive worked. Sacre bleu!

April 15, 2020 07:18 PM

April 13, 2020

Matthew Garrett: Implementing support for advanced DPTF policy in Linux

Intel's Dynamic Platform and Thermal Framework (DPTF) is a feature that's becoming increasingly common on highly portable Intel-based devices. The adaptive policy it implements is based around the idea that thermal management of a system is becoming increasingly complicated - the appropriate set of cooling constraints to place on a system may differ based on a whole bunch of criteria (eg, if a tablet is being held vertically rather than lying on a table, it's probably going to be able to dissipate heat more effectively, so you should impose different constraints). One way of providing these criteria to the OS is to embed them in the system firmware, allowing an OS-level agent to read that and then incorporate OS-level knowledge into a final policy decision.

Unfortunately, while Intel have released some amount of support for DPTF on Linux, they haven't included support for the adaptive policy. And even more annoyingly, many modern laptops run in a heavily conservative thermal state if the OS doesn't support the adaptive policy, meaning that the CPU throttles down extremely quickly and the laptop runs excessively slowly.

It's been a while since I really got stuck into a laptop reverse engineering project, and I don't have much else to do right now, so I've been working on this. It's been a combination of examining what source Intel have released, reverse engineering the Windows code and staring hard at hex dumps until they made some sort of sense. Here's where I am.

There's two main components to the adaptive policy - the adaptive conditions table (APCT) and the adaptive actions table (APAT). The adaptive conditions table contains a set of condition sets, with up to 10 conditions in each condition set. A condition is something like "is the battery above a certain charge", "is this temperature sensor below a certain value", "is the lid open or closed", "is the machine upright or horizontal" and so on. Each condition set is evaluated in turn - if all the conditions evaluate to true, the condition set's target is implemented. If not, we move onto the next condition set. There will typically be a fallback condition set to catch the case where none of the other condition sets evaluate to true.

The action table contains sets of actions associated with a specific target. Once we've picked a target by evaluating the conditions, we execute the actions that have a corresponding target. Actions are things like "Set the CPU power limit to this value" or "Load a passive policy table". Passive policy tables are simply tables associating sensors with devices and an associated temperature limit. If the limit is exceeded, the associated device should be asked to reduce its heat output until the situation is resolved.
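
The evaluation flow described above can be sketched in a few lines of Python. This is only a model of the logic as described here, not the actual table format or the Thermal Daemon code: the ConditionSet and Action types, the state dictionary, and the sample values are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# A condition is modelled as a predicate over a snapshot of system state.
Condition = Callable[[Dict[str, float]], bool]


@dataclass
class ConditionSet:
    target: int
    conditions: List[Condition]  # the tables allow up to 10 per set


@dataclass
class Action:
    target: int
    description: str


def pick_target(sets: List[ConditionSet], state: Dict[str, float]) -> Optional[int]:
    """Return the target of the first set whose conditions all hold."""
    for cs in sets:
        if all(cond(state) for cond in cs.conditions):
            return cs.target
    return None


def actions_for(actions: List[Action], target: int) -> List[Action]:
    """Select the actions associated with the chosen target."""
    return [a for a in actions if a.target == target]


# Hypothetical sample tables: one real condition set plus a fallback.
sets = [
    ConditionSet(1, [lambda s: s["battery"] > 50, lambda s: s["temp"] < 60]),
    ConditionSet(2, []),  # fallback: no conditions, so it always matches
]
actions = [
    Action(1, "set CPU power limit to 25 W"),
    Action(2, "set CPU power limit to 10 W"),
]
```

With these sample tables, a state of 80% battery and 45 degrees selects target 1, while a state of 20% battery falls through to the fallback and selects target 2. Modelling the fallback as an empty condition set works because a set with no conditions is trivially true.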

There's a couple of twists. The first is the OEM conditions. These are conditions that refer to values that are exposed by the firmware and are otherwise entirely opaque - the firmware knows what these mean, but we don't, so conditions that rely on these values are magical. They could be temperature, they could be power consumption, they could be SKU variations. We just don't know. The other is that older versions of the APCT table didn't include a reference to a device - ie, if you specified a condition based on a temperature, you had no way to express which temperature sensor to use. So, instead, you specified a condition that's greater than 0x10000, which tells the agent to look at the APPC table to extract the device and the appropriate actual condition.

Intel already have a Linux app called Thermal Daemon that implements a subset of this - you're supposed to run the binary-only dptfxtract against your firmware to parse a few bits of the DPTF tables, and it writes out an XML file that Thermal Daemon makes use of. Unfortunately it doesn't handle most of the more interesting bits of the adaptive performance policy, so I've spent the past couple of days extending it to do so and to remove the proprietary dependency.

My current work is here - it requires a couple of kernel patches (that are in the patches directory), and it only supports a very small subset of the possible conditions. It's also entirely possible that it'll do something inappropriate and cause your computer to melt - none of this is publicly documented, I don't have access to the spec and you're relying on my best guesses in a lot of places. But it seems to behave roughly as expected on the one test machine I have here, so time to get some wider testing?


April 13, 2020 12:28 AM

April 12, 2020

Michael Kerrisk (manpages): man-pages-5.06 is released

I've released man-pages-5.06. The release tarball, the browsable online pages, and the Git repository for man-pages are all available online.

This release resulted from patches, bug reports, reviews, and comments from 39 contributors. The release includes more than 250 commits that change more than 120 pages. Three new pages were added in this release.

The most notable of the changes in man-pages-5.06 are the following:

April 12, 2020 07:25 AM

April 10, 2020

Michael Kerrisk (manpages): man-pages-5.04 is released

I've released man-pages-5.04. The release tarball, the browsable online pages, and the Git repository for man-pages are all available online.

This release resulted from patches, bug reports, reviews, and comments from 15 contributors. The release includes approximately 80 commits that change just under 30 pages.

The most notable of the changes in man-pages-5.04 are the following:

Another small but important change is the addition of documentation of the P_PIDFD idtype in the waitid(2) manual page. This feature, added in Linux 5.4, allows a parent process to wait on a child process that is referred to by a PID file descriptor, and constitutes the final cornerstone in the pidfd API.
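
For a feel of how P_PIDFD is used, here is a minimal Python sketch that forks a child and reaps it through a PID file descriptor. It relies on os.pidfd_open() and os.P_PIDFD, which need Python 3.9 or later on a sufficiently new kernel; the function name exit_status_via_pidfd is made up, and the sketch returns None rather than failing when the API is unavailable.

```python
import os
from typing import Optional


def exit_status_via_pidfd(code: int) -> Optional[int]:
    """Fork a child that exits with `code` and reap it through a pidfd.

    Returns the child's exit status, or None when the pidfd API is missing
    (older Python, or a kernel without pidfd_open/P_PIDFD support).
    """
    pid = os.fork()
    if pid == 0:
        os._exit(code)  # child: exit immediately with the requested status

    if not (hasattr(os, "pidfd_open") and hasattr(os, "P_PIDFD")):
        os.waitpid(pid, 0)  # reap the conventional way to avoid a zombie
        return None
    try:
        pidfd = os.pidfd_open(pid)
    except OSError:
        os.waitpid(pid, 0)
        return None
    try:
        # Wait on the child named by the file descriptor, not by its PID.
        info = os.waitid(os.P_PIDFD, pidfd, os.WEXITED)
        return info.si_status
    except OSError:
        os.waitpid(pid, 0)
        return None
    finally:
        os.close(pidfd)
```

The point of waiting on a file descriptor rather than a PID is that the descriptor keeps referring to the same process, so there is no window in which a recycled PID could be confused with the original child.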

April 10, 2020 09:10 AM

April 06, 2020

Linux Plumbers Conference: Update on the Plumbers Covid-19 Situation

We’re still planning to hold Plumbers, but adopting a wait and see attitude to the in-person component. As people have noticed, the global prospect for being able to travel to Halifax in August seems to be getting worse, so we’re posting this to give more transparency to what the Plumbers Conference decision points and options are.

Our first consideration is a go/no-go decision point for the in-person conference. Currently, the date we were planning to put the first batch of tickets on-sale (15 May) represents the ideal date for this because it gives time (another 6 weeks) for more clarity to emerge on the situation, while avoiding people doing early purchases only to be disappointed if the event has to be cancelled at a later date.

Our second consideration is planning now for how we might do a fully on-line version of Plumbers. The primary consideration people should note is that our Internet and AV contracts with the hotel in Halifax don’t give us sufficient bandwidth to do the conference partly in-person and partly on-line because we’d have to do the hosting at the hotel rather than in some high bandwidth cloud location, so our decision will be either fully in-person or fully on-line. Other conferences have already done fully on-line versions, which we’re in the process of evaluating. Over the next few weeks we’ll report back ( too is doing a helpful series of articles on on-line meeting technologies which will be worth a read).

A final thing people should note is that if we do decide to go for the fully on-line version, our scheduling constraints become less severe (not having a time limited physical location) and we could spread the tracks out rather than try to run a three day, six track event. This would allow us both to lower the bandwidth requirements for the hosting (which should reduce latency and communication issues) as well as hold the MCs at a time most convenient to the distributed time-zones of all the participants.

April 06, 2020 11:51 PM

Pete Zaitcev: Another perspective on Swift versus Ceph today

Seen in e-mail today:

From: Mark Kirkwood

There are a number of considerations (disclaimer we run Ceph block and Swift object storage):

Purely on a level of simplicity, Swift is easier to set up.

However, if you are already using Ceph for block storage then it makes sense to keep using it for object too (since you are likely to be expert at Ceph at this point).

On the other hand, if you have multiple Ceph clusters and want a geo replicated object storage solution, then doing this with Swift is much easier than with Ceph (geo replicated RGW still looks to be real complex to set up - a long page of arcane commands).

Finally (this is my 'big deal point'). I'd like my block and object storage to be completely independent - suppose a situation nukes my block storage (Ceph) - if my object storage is Swift then people's backups etc are still viable and when the Ceph cluster is rebuilt we can restore and continue. On the other hand if your object storage is Ceph too then....



Mark's perspective is largely founded in the fault tolerance and administrative overhead. However, let's take a look at "keep using [Ceph] for object too".

Indeed the integration of block, POSIX, and object storage is Ceph's strength, although I should note for the record that Ceph has a large gap: all 3 APIs live in separate namespaces. So, do not expect to be able to copy a disk snapshot through CephFS or RGW. Objects in each namespace are completely invisible to the other two, and the only uniform access layer is RADOS. This is why, for instance, RGW-over-NFS exists. That's right, not CephFS, but NFS. You can mount RGW.

All attempts at this sort of integration that I know in Swift always start with a uniform access first. It's the opposite of Ceph in a way. Because of that, these integrations typically access from the edge inside, like making a pool that a daemon fills/spills with Swift, and mounting that. SwiftStack's ProxyFS is a little more native to Swift, but it starts off with a shared namespace too.

Previously: Swift is faster than any competitor, says an employee of SwiftStack.

April 06, 2020 06:19 PM

March 29, 2020

Paul E. Mc Kenney: The Old Man and His Smartphone, 2020 Spring Break Episode

Complete draining of my smartphone's battery was commonplace while working from home. After all, given laptops and browsers, to say nothing of full-sized keyboards, I rarely used it. So I started doing my daily web browsing on my smartphone at breakfast, thus forcing a daily battery-level check.

This approach has been working, except that it is quite painful to print out articles my wife might be interested in. My current approach is to email the URL to myself, which is a surprisingly ornate process:

  1. Copy the URL.
  2. Start an email.
  3. Click on the triple dot at the upper right-hand side of the keyboard.
  4. Select the text-box icon at the right.
  5. Select “paste” from the resulting menu, then hit “send”.
  6. Read email on a laptop, open the URL, and print it.

The addition of a control key to the virtual keyboard might be useful to those of us otherwise wondering “How on earth do I type control-V???” Or I could take the time required to figure out how to print directly from my smartphone. But I would not recommend holding your breath waiting.

What with COVID-19 and the associated lockdowns, I have not used my smartphone's location services much, helpful though they were in the pre-COVID-19 days. For example, prior to a business trip to Prague, my wife let me know that she wanted additional copies of a particular local craft item that I had brought back on a prior trip almost ten years ago. Unfortunately, I could not remember the name of the shop, nor were the usual search engines any help at all.

Fortunately, some passers-by mentioned Wenceslas Square, which triggered a vague memory. So I used my smartphone to go to Wenceslas Square, and from there used the old-school approach of wandering randomly. Suddenly, I knew where I was, and sure enough, when I turned to my right, there was the shop! And the craft item was even in the same place within the shop that it had been on my earlier visit!

Of course, the minute I completed my purchase, my smartphone and laptops were full of advertisements for that craft item, including listing any number of additional shops offering it for sale. Therefore, although it is quite clear that the “A” in “AI” stands for “artificial”, I am forced to dispute the usual interpretation of the “I”.

My smartphone also took the liberty of autocomposing its first-ever reply to an email, quite likely because I failed to power it off before laying it down on its screen on a not-quite-flat surface. The resulting email was heavy on the letter “b” and contained lots of emo and angst, perhaps because the word "bad" occurred quite frequently. This draft also included an instance of the name “Bob Dylan”. I will leave any discussion of the morals and immorals of this particular AI choice to the great man's many fans and detractors.

I can only be thankful that the phone left its composition in draft mode, as opposed to actually sending it. In fact, I was so shocked by the possibility that it could well have sent it that I immediately deleted it. Of course, now I wish that I had kept it so I could show it off. As they say, haste makes waste!

However, I did find the following prior effort in my “Drafts” folder. This effort is nowhere near as entertaining as the one I so hastily deleted, but it does give some of the flavor of my smartphone's approach to email autocomposition:
But there is no doubt about the way the bldg will do it in this smartphone a while now that the company is still in its position as the world's most profitable competitor to its android smartphone and its android phone in its own right and will continue its search to make its way through its mobile app market and its customers will have to pay attention for their products to the web and other apps for their customers by clicking the button and using a new app BBC to help you get your phone back in your browser and your browser based phone number and the number one you can click to see you in your browser or the other apps that are compatible or the app you use for your browser or a computer and both have or Google and you will have a lot more to say than the one that is not the only way you could not be in a good mood to get the most of your life and the rest you are in for the next two days and the rest is not a bad for you are you in a good place and the best thing you could be doing to help your family and your friends will have a sense that they can help them get their jobs done in a way that's what you are going through with your work in a good place to work and make them work better and better for their job than you can in a long term way and you are a better parent and you are not going through the process and the process is going through a good job of thinking that you're not a teacher and a teacher who believes that the best thing to be is that your browser will have the number and access of the app you can get to the web and the app is available to users for a while to be sure you can use the internet for a while you are still in a position where I have a few more questions to ask you about being able and the app you have on your computer will have to do not use it as an app you have for a
And so I have one small request. Could those of you wishing for digital assistants please consider the option of being more careful what you wish for?

My smartphone also came in handy during a power outage: The cell towers apparently had backup generators, and my smartphone's battery, though low, was not completely drained. I posted a note about my situation and battery state online, which in turn prompted a proud Tesla owner to call attention to the several hundred kilowatt-hours of electrical energy stored in his driveway. Unfortunately for me, his driveway was located the better part of a thousand miles away. However, it did remind me of the single kilowatt-hour stored in my conventional automobile's lead-acid battery. But fortunately, the power outage lasted only a few hours, so my smartphone's much smaller battery was sufficient to the cause.

As you would expect, I checked my smartphone's specifications when I first received it and learned that it has eight CPUs, which is not unusual for today's smartphones.

But it only recently occurred to me that the early 1990s DYNIX/ptx system on which I developed RCU had only four CPUs.

Go figure!!!

March 29, 2020 11:29 PM

March 25, 2020

Linux Plumbers Conference: LPC 2020 Call for Refereed-Track Proposals

Updated May 11th – Changed dates information.

Submissions close: (TBD – open now)
Speakers notified: (TBD)
Slides due: (TBD)

Note: We are still hoping to hold the conference as scheduled, but we are continually monitoring the pandemic situation. For current Covid-19 updates, please see our website.

We are pleased to announce the Call for Refereed-Track Proposals for the 2020 edition of the Linux Plumbers Conference, which will be held in Halifax, Nova Scotia, Canada on August 25-27 in conjunction with the Kernel Summit and Linux Maintainers Summit, which takes place on August 28th.

Refereed track presentations are 50 minutes in length (which includes time for questions and discussion) and should focus on a specific aspect of the “plumbing” in the Linux system. Examples of Linux plumbing include core kernel subsystems, toolchains, container runtimes, core libraries, windowing systems, management tools, device support, media creation/playback, accelerators, hardware interaction, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

As was the case in 2019, and because Plumbers is not co-located with Open Source Summit this year, we are scheduling the refereed-track talks across all three days. This allows attendees to choose between microconferences and refereed-track talks in all time-slots and also provides a conflict-free schedule for the refereed-track talks.

Linux Plumbers Conference Program Committee members will be reviewing all submitted sessions. High-quality submissions that cannot be accepted due to the limited number of slots will be forwarded to the Microconference leads for further consideration. We also encourage submitters to consider BoF sessions.

To submit a refereed track talk proposal follow the instructions at this website:

Submissions were due on or before Wednesday May 7, 2020 at 11:59PM Pacific Time; however, the Call for Refereed-Track Proposals remains open for the time being. Each successful submission gets a free registration, but for only one speaker per presentation.

March 25, 2020 07:15 PM

Linux Plumbers Conference: LPC 2020 Call for Microconference Proposals

Updated May 11th – Changed dates information.

Submissions close: (TBD – open now)
Speakers notified: (TBD)

Note: We are still hoping to hold the conference as scheduled, but we are continually monitoring the pandemic situation. For current Covid-19 updates, please see our website.

We are pleased to announce the Call for Microconferences for the 2020 Linux Plumbers Conference, which will be held in Halifax, Nova Scotia, Canada on August 25-27 in conjunction with Kernel Summit and Linux Maintainers Summit, which takes place on August 28th.

A microconference is a collection of collaborative sessions focused on problems in a particular area of Linux plumbing, which includes the kernel, libraries, utilities, services, UI, and so forth, but can also focus on cross-cutting concerns such as security, scaling, energy efficiency, toolchains, container runtimes, or a particular use case. Good microconferences result in solutions to these problems and concerns, while the best microconferences result in patches that implement those solutions.

For more information on submitting a microconference proposal, visit our CfP page.

The first round of accepted microconferences will be announced soon.

There will also be a refereed track for pure presentations. Call for presentations for that will be coming shortly.

What is a microconference?

What makes Linux Plumbers unique is that it is focused on development, solving problems and bringing about the new features of the future. A microconference is about being productive and solving issues of the day. It is not where one discusses what has already been done, or shows off the latest shiny new product or feature. Although it is OK to have a topic on ideas of what to do with a shiny new product or feature, it should not be about the product or feature itself. A microconference is to get people together face to face to discuss issues that are difficult to solve via email and chat alone.

Topics of a microconference

As stated above, an MC topic is about the future, not the past. It should be something that helps provide a solution for a question. How do we solve foo? I want to implement bar, but there’s these issues. How do we get around them? I have feature X but want to use it for Y, is it feasible?

Please avoid presentations as they tend to take time away from discussions. Presentations may be used to help bring the audience up to speed on what is about to be discussed. Keep it focused on the necessary details to allow people to participate and limit it to 5 to 7 minutes. Slides should only be used to complement the discussion and enable wider participation.

Successful microconference proposals

When proposing a microconference, it is important to state what is expected to be accomplished at the microconference. Remember, the best microconferences are those that solve problems. The abstract of the proposal should describe what the topic is, and then list the various problems that could be discussed at the microconference. Note, what is listed may not be what is actually discussed, but gives the Plumbers planning committee an idea of how productive the microconference will be.

No microconference can be successful if the necessary people who are responsible for the issues are not present. The proposal should list the key contributors who will make sure the results of the discussions are most likely to be implemented. The best proposals will also state that those key contributors have agreed to attend.

March 25, 2020 02:15 PM

March 12, 2020

Linux Plumbers Conference: Plumbers and Covid-19

This is our current Covid-19 Statement:

Plumbers is currently taking place as planned; however, the LPC program committee is actively monitoring the situation with regard to Covid-19.  The World Health Organization (WHO) is currently making no projections for the situation at that date, but there is hope that the spread of the disease will slow or stop entirely during the northern-hemisphere summer.  Given the uncertainty, we are currently adopting a wait-and-see approach.  Rest assured that we’ll be following precautions advised by both the WHO and the local health authorities in Halifax should they still be in effect by the time the conference starts.

We’ll post updates to the plumbers website as they become available.

March 12, 2020 03:25 PM

March 08, 2020

Brendan Gregg: LISA2019 Linux Systems Performance

Systems performance is an effective discipline for performance analysis and tuning, and can help you find performance wins for your applications and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. To serve this need I summarized Linux systems performance in 40 minutes at USENIX LISA 2019, touring six important areas: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. The video is on youtube:

The slides are on slideshare (or PDF):

The different background colors I used for the screenshots have no meaning: I just got bored of using gray.

I've heard many companies use my Systems Performance book as recommended or required reading for new engineers (thank you), and this is an updated talk on the topic. I've been working on Systems Performance 2nd Edition, now that the BPF book is done.

At LISA I also ran a BPF performance tools workshop with over 200 attendees. It was a little rushed for 90 minutes, but I've heard people found it valuable anyway. At (the rescheduled) USENIX SREcon in June in Santa Clara I'll be running it again in a 3-hour window. So far my workshops have not been video recorded, but in the interests of supporting WFH I hope to get a workshop recorded sometime and put online.

March 08, 2020 08:00 AM

March 05, 2020

Pete Zaitcev: Nvidia acquires SwiftStack

In the words of Joe Arnold:

Last year, when we announced SwiftStack 7, we unveiled our focus on the SwiftStack Data Platform for AI, HPC, and accelerated computing. This included SwiftStack 1space as a valuable piece of the puzzle, enabling data acceleration in the core, at the edge, and in the cloud.

To our existing customers — we will continue to maintain, enhance, and support 1space, ProxyFS, Swift, and the Controller. SwiftStack’s technology is already a key part of NVIDIA’s GPU-powered AI infrastructure, and this acquisition will strengthen what we do for you.

Building AI supercomputers is exciting to the entire SwiftStack team. We couldn’t be more thrilled [...]

Highlighting 1space as the centerpiece of the acquisition seems strange. All I knew about it was that it is a cloud-to-cloud data pumping service. Hardly any HPC stuff. I could see how Nvidia might want ProxyFS to replace Hadoop, but not this.

The core Swift continues unchanged for now.

March 05, 2020 11:04 PM

February 27, 2020

James Morris: Linux Security Summit North America 2020: CFP and Registration

The CFP for the 2020 Linux Security Summit North America is currently open, and closes on March 31st.

The CFP details are here:

You can register as an attendee here:

Note that the conference this year has moved from August to June (24-26).  The location is Austin, TX, and we are co-located with the Open Source Summit as usual.

We’ll be holding a 3-day event again, after the success of last year’s expansion, which provides time for tutorials and ad-hoc break out sessions.  Please note that if you intend to submit a tutorial, you should be a core developer of the project or otherwise recognized leader in the field, per this guidance from the CFP:

Tutorial sessions should be focused on advanced Linux security defense topics within areas such as the kernel, compiler, and security-related libraries.  Priority will be given to tutorials created for this conference, and those where the presenter is a leading subject matter expert on the topic.

This will be the 10th anniversary of the Linux Security Summit, which was first held in 2010 in Boston as a one day event.

Get your proposals for 2020 in soon!

February 27, 2020 09:46 PM

February 26, 2020

Linux Plumbers Conference: Videos for microconferences

The videos for all the talks in microconferences at the 2019 edition of Linux Plumbers are now linked to the schedule. Clicking on the link titled “video” will take you to the right spot in the microconference video. Hopefully, watching all of these talks will get you excited for the 2020 edition which we are busy preparing! Watch out for our call for microconferences and for our refereed track both of which are to be released soon. So now’s the time to start thinking about all the exciting problems you want to discuss and solve.

February 26, 2020 03:09 PM

February 20, 2020

Matthew Garrett: What usage restrictions can we place in a free software license?

Growing awareness of the wider social and political impact of software development has led to efforts to write licenses that prevent software being used to engage in acts that are seen as socially harmful, with the Hippocratic License being perhaps the most discussed example (although the JSON license's requirement that the software be used for good, not evil, is arguably an earlier version of the theme). The problem with these licenses is that they're pretty much universally considered to fall outside the definition of free software or open source licenses due to their restrictions on use, and there's a whole bunch of people who have very strong feelings that this is a very important thing. There's also the more fundamental underlying point that it's hard to write a license like this where everyone agrees on whether a specific thing is bad or not (eg, while many people working on a project may feel that it's reasonable to prohibit the software being used to support drone strikes, others may feel that the project shouldn't have a position on the use of the software to support drone strikes and some may even feel that some people should be the victims of drone strikes). This is, it turns out, all quite complicated.

But there is something that many (but not all) people in the free software community agree on - certain restrictions are legitimate if they ultimately provide more freedom. Traditionally this was limited to restrictions on distribution (eg, the GPL requires that your recipient be able to obtain corresponding source code, and for GPLv3 must also be able to obtain the necessary signing keys to be able to replace it in covered devices), but more recently there's been some restrictions that don't require distribution. The best known is probably the clause in the Affero GPL (or AGPL) that requires that users interacting with covered code over a network be able to download the source code, but the Cryptographic Autonomy License (recently approved as an Open Source license) goes further and requires that users be able to obtain their data in order to self-host an equivalent instance.

We can construct examples of where these prevent certain fields of endeavour, but the tradeoff has been deemed worth it - the benefits to user freedom that these licenses provide are greater than the corresponding cost to what you can do. How far can that tradeoff be pushed? So, here's a thought experiment. What if we write a license that's something like the following:

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. All permissions granted by this license must be passed on to all recipients of modified or unmodified versions of this work
2. This work may not be used in any way that impairs any individual's ability to exercise the permissions granted by this license, whether or not they have received a copy of the covered work

This feels like the logical extreme of the argument. Any way you could use the covered work that would restrict someone else's ability to do the same is prohibited. This means that, for example, you couldn't use the software to implement a DRM mechanism that the user couldn't replace (along the lines of GPLv3's anti-Tivoisation clause), but it would also mean that you couldn't use the software to kill someone with a drone (doing so would impair their ability to make use of the software). The net effect is along the lines of the Hippocratic license, but it's framed in a way that is focused on user freedom.

To be clear, I don't think this is a good license - it has a bunch of unfortunate consequences like it being impossible to use covered code in self-defence if doing so would impair your attacker's ability to use the software. I'm not advocating this as a solution to anything. But I am interested in seeing whether the perception of the argument changes when we refocus it on user freedom as opposed to an independent ethical goal.



Rich Felker on Twitter had an interesting thought - if clause 2 above is replaced with:

2. Your rights under this license terminate if you impair any individual's ability to exercise the permissions granted by this license, even if the covered work is not used to do so

how does that change things? My gut feeling is that covering actions that are unrelated to the use of the software might be a reach too far, but it gets away from the idea that it's your use of the software that triggers the clause.


February 20, 2020 01:33 AM

February 19, 2020

Kees Cook: security things in Linux v5.4

Previously: v5.3.

Linux kernel v5.4 was released in late November. The holidays got the best of me, but better late than never! ;) Here are some security-related things I found interesting:

waitid() gains P_PIDFD
Christian Brauner has continued his pidfd work by adding a critical mode to waitid(): P_PIDFD. This makes it possible to reap child processes via a pidfd, and completes the interfaces needed for the bulk of programs performing process lifecycle management. (i.e. a pidfd can come from /proc or clone(), and can be waited on with waitid().)
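
As a rough illustration of the interface described above, the following Python sketch forks a child, opens a pidfd for it, and reaps it with waitid() using P_PIDFD. It assumes Python 3.9 or later running on Linux v5.4 or later. The function name reap_child_via_pidfd is made up for this example, and the pidfd here comes from pidfd_open() rather than from /proc or clone(), only because Python exposes that call directly.

```python
import os


def reap_child_via_pidfd(exit_code):
    """Fork a child that exits with exit_code, then reap it through a pidfd.

    Hypothetical helper for illustration; needs os.pidfd_open and
    os.P_PIDFD (Python 3.9+) plus kernel support for waitid(P_PIDFD).
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately with the requested status.
        os._exit(exit_code)

    # Parent: obtain a file descriptor that refers to the child process.
    pidfd = os.pidfd_open(pid)
    try:
        # Wait on the pidfd instead of on the numeric PID.
        info = os.waitid(os.P_PIDFD, pidfd, os.WEXITED)
    finally:
        os.close(pidfd)
    return info.si_status
```

Waiting on a file descriptor rather than a PID avoids acting on a recycled PID, which is the usual motivation given for pidfds.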

kernel lockdown
After something on the order of 8 years, Linux can now draw a bright line between “ring 0” (kernel memory) and “uid 0” (highest privilege level in userspace). The “kernel lockdown” feature, which has been an out-of-tree patch series in most Linux distros for almost as many years, attempts to enumerate all the intentional ways (i.e. interfaces not flaws) userspace might be able to read or modify kernel memory (or execute in kernel space), and disable them. While Matthew Garrett made the internal details fine-grained controllable, the basic lockdown LSM can be set to either disabled, “integrity” (kernel memory can be read but not written), or “confidentiality” (no kernel memory reads or writes). Beyond closing the many holes between userspace and the kernel, if new interfaces are added to the kernel that might violate kernel integrity or confidentiality, now there is a place to put the access control to make everyone happy and there doesn’t need to be a rehashing of the age old fight between “but root has full kernel access” vs “not in some system configurations”.
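
To see which of the three states is in effect, one option is to read the lockdown file in securityfs. The sketch below is not from the post itself: it assumes the LSM exposes /sys/kernel/security/lockdown and marks the active mode with square brackets (for example "none [integrity] confidentiality"). The helper names are invented for illustration.

```python
def active_lockdown_mode(text):
    """Return the bracketed (active) mode from the lockdown file contents.

    Assumes a format such as "none [integrity] confidentiality".
    Returns None when no bracketed word is present.
    """
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_lockdown_mode(path="/sys/kernel/security/lockdown"):
    """Read the active mode from securityfs, or None if it is unavailable."""
    try:
        with open(path) as f:
            return active_lockdown_mode(f.read())
    except OSError:
        # Lockdown LSM not built in, securityfs not mounted, or no permission.
        return None
```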

tagged memory relaxed syscall ABI
Andrey Konovalov (with Catalin Marinas and others) introduced a way to enable a “relaxed” tagged memory syscall ABI in the kernel. This means programs running on hardware that supports memory tags (or “versioning”, or “coloring”) in the upper (non-VMA) bits of a pointer address can use these addresses with the kernel without things going crazy. This is effectively teaching the kernel to ignore these high bits in places where they make no sense (i.e. mathematical comparisons) and keeping them in place where they have meaning (i.e. pointer dereferences).

As an example, if a userspace memory allocator had returned the address 0x0f00000010000000 (VMA address 0x10000000, with, say, a “high bits” tag of 0x0f), and a program used this range during a syscall that ultimately called copy_from_user() on it, the initial range check would fail if the tag bits were left in place: “that’s not a userspace address; it is greater than TASK_SIZE (0x0000800000000000)!”, so they are stripped for that check. During the actual copy into kernel memory, the tag is left in place so that when the hardware dereferences the pointer, the pointer tag can be checked against the expected tag assigned to the referenced memory region. If there is a mismatch, the hardware will trigger the memory tagging protection.
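
The arithmetic in that example can be sketched in a few lines of Python. This is only a model of the idea, not the kernel's code: it assumes the tag occupies the top byte of a 64-bit address, and the names tag_of, untagged, and range_check are invented here.

```python
TAG_SHIFT = 56                    # assumption: tag lives in the top byte
TAG_MASK = 0xFF << TAG_SHIFT
TASK_SIZE = 0x0000800000000000    # value quoted in the text above


def tag_of(addr):
    """Extract the tag byte from a tagged address."""
    return (addr & TAG_MASK) >> TAG_SHIFT


def untagged(addr):
    """Clear the tag byte, leaving only the VMA address bits."""
    return addr & ~TAG_MASK


def range_check(addr, size):
    """Model of the userspace range check, done on the untagged address."""
    return untagged(addr) + size <= TASK_SIZE


tagged = 0x0F00000010000000
# The tagged value itself is far above TASK_SIZE, so a naive check fails,
# while the check on the stripped address succeeds.
naive_ok = tagged + 4096 <= TASK_SIZE
relaxed_ok = range_check(tagged, 4096)
```

The tag is only dropped for the comparison; the original tagged value is what would still be handed to the hardware for the dereference.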

Right now programs running on Sparc M7 CPUs with ADI (Application Data Integrity) can use this for hardware tagged memory, ARMv8 CPUs can use TBI (Top Byte Ignore) for software memory tagging, and eventually there will be ARMv8.5-A CPUs with MTE (Memory Tagging Extension).

boot entropy improvement
Thomas Gleixner got fed up with poor boot-time entropy and trolled Linus into coming up with a reasonable way to add entropy on modern CPUs, taking advantage of timing noise, cycle counter jitter, and perhaps even the variability of speculative execution. This means that there shouldn’t be mysterious multi-second (or multi-minute!) hangs at boot when some systems don’t have enough entropy to service getrandom() syscalls from systemd or the like.

userspace writes to swap files blocked
From the department of “how did this go unnoticed for so long?”, Darrick J. Wong fixed the kernel to not allow writes from userspace to active swap files. Without this, it was possible for a user (usually root) with write access to a swap file to modify its contents, thereby changing memory contents of a process once it got paged back in. While root normally could just use CAP_SYS_PTRACE to modify a running process directly, this was a loophole that allowed lesser-privileged users (e.g. anyone in the “disk” group) without the needed capabilities to still bypass ptrace restrictions.

limit strscpy() sizes to INT_MAX
Generally speaking, if a size variable ends up larger than INT_MAX, some calculation somewhere has overflowed. And even if not, it’s probably going to hit code somewhere nearby that won’t deal well with the result. As already done in the VFS core, and vsprintf(), I added a check to strscpy() to reject sizes larger than INT_MAX.

ld.gold support removed
Thomas Gleixner removed support for the gold linker. While this isn’t providing a direct security benefit, ld.gold has been a constant source of weird bugs. Specifically where I’ve noticed, it had been a pain while developing KASLR, and has more recently been causing problems while stabilizing building the kernel with Clang. Having this linker support removed makes things much easier going forward. There are enough weird bugs to fix in Clang and ld.lld. ;)

Intel TSX disabled
Given the use of Intel’s Transactional Synchronization Extensions (TSX) CPU feature by attackers to exploit speculation flaws, Pawan Gupta disabled the feature by default on CPUs that support disabling TSX.

That’s all I have for this version. Let me know if I missed anything. :) Next up is Linux v5.5!

© 2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

February 19, 2020 12:37 AM

February 14, 2020

David Sterba: Btrfs hilights in 5.4: tree checker updates

A bit more detailed overview of a btrfs update that I find interesting, see the pull request for the rest.

There’s not much to show in this release. Some users find that good too: a boring release. But there are still some changes of interest. 5.4 is a long-term support stable tree, so stability and core improvements are perhaps more appropriate than features that need a release or two to stabilize.

Which release becomes the long-term stable one is not known in advance, so it is better not to push half-baked features that could later require more intrusive fixups in stable.

The development cycle happened over summer and this slowed down the pace of patch reviews and update turnarounds.

Tree-checker updates

The tree-checker is a sanity checker of metadata that are read from/written to devices. Over time it’s being enhanced by more checks, let’s have a look at two of them.

ROOT_ITEM checks

The item represents root of a b-tree, of the internal or the subvolume trees.

Let’s take an example from btrfs inspect dump-tree:

       item 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
                generation 5 root_dirid 0 bytenr 30523392 level 0 refs 1
                lastsnap 0 byte_limit 0 bytes_used 16384 flags 0x0(none)
                uuid 00000000-0000-0000-0000-000000000000
                drop key (0 UNKNOWN.0 0) level 0

Some of the metadata inside the item allow only simple checks, following commit 259ee7754b6793:

The refs is a reference counter and a sanity check would require reading all the expected reference holders, bytes_used would need to look up the block that it accounts, etc. The subvolume trees have more data like ctime, otime and real uuids. If you wonder what’s byte_limit, this used to be a mechanism to emulate quotas by setting the limit value, but it has been deprecated and unused for a long time. One day we might find another purpose for the bytes.

Many of the tree-checker enhancements are follow-ups to fuzz testing and reports, as it was in this case. The bug report shows that some of the incorrect data were detected and even triggered auto-repair (as this was on a filesystem with DUP metadata) but there was too much damage and it crashed at some point. The crash was not random but a BUG_ON of an unexpected condition, which is a sanity check of last resort. Catching inconsistent data early with graceful error handling is of course desired and ongoing work.

Extent metadata item checks

There are two item types that represent extents and information about sharing. EXTENT_ITEM is older and bigger while METADATA_ITEM is the building block of skinny-metadata feature, smaller and more compact. Both items contain type of reference(s) and the owner (a tree id). Besides the generic checks that also the root item does (alignment, value ranges, generation), there’s a number of allowed combinations of the reference types and extent types. The commit f82d1c7ca8ae1bf implements that, however further explanation is out of scope of the overview as the sharing and references are the fundamental design of btrfs.


        item 170 key (88145920 METADATA_ITEM 0) itemoff 10640 itemsize 33
                refs 1 gen 27 flags TREE_BLOCK
                tree block skinny level 0
                tree block backref root FS_TREE


        item 27 key (20967424 EXTENT_ITEM 4096) itemoff 14895 itemsize 53
                refs 1 gen 499706 flags DATA
                extent data backref root FS_TREE objectid 8626071 offset 0 count 1

This is for a simple case with one reference, tree (for metadata) and ordinary data, so comparing the sizes (53 bytes against 33 bytes) shows 20 bytes saved. On my 20GiB root partition with about 70 snapshots there are XXX EXTENT and YYY METADATA items.

Otherwise there can be more references inside one item (e.g. many snapshots of a file that is randomly updated over time), so the overhead of the item itself is smaller.

February 14, 2020 11:00 PM

February 09, 2020

Michael Kerrisk (manpages): man-pages-5.05 is released

I've released man-pages-5.05. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports, reviews, and comments from more than 40 contributors. The release includes approximately 110 commits that change around 50 pages.

February 09, 2020 04:55 PM

February 01, 2020

David Sterba: Btrfs hilights in 5.5: 3-copy and 4-copy block groups

A bit more detailed overview of a btrfs update that I find interesting, see the pull request for the rest.

New block group profiles RAID1C3 and RAID1C4

There are two new block group profiles enhancing capabilities of the RAID1 types with more copies than 2. A brief overview of the profiles is in the table below; for a table with all profiles see the manual page of mkfs.btrfs, also available on the wiki.

Profile   Copies   Utilization   Min devices
RAID1     2        50%           2
RAID1C3   3        33%           3
RAID1C4   4        25%           4

The way all the RAID1 types work is that there are 2 / 3 / 4 exact copies over all available devices. The terminology is different from linux MD RAID, that can do any number of copies. We decided not to do that in btrfs to keep the implementation simple. Another point for simplicity is from the users’ perspective. That RAID1C3 provides 3 copies is clear from the type. Even after adding a new device and not doing balance, the guarantees about redundancy still hold. Newly written data will use the new device together with 2 devices from the original set.
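
The numbers in the table follow directly from the copy count. The short Python sketch below reproduces them under a simplifying assumption: every block is stored exactly N times on N distinct devices, so utilization is 1/N and losing up to N-1 devices still leaves one copy. The dictionary and function names are illustrative only and are not part of btrfs.

```python
# Copy counts per profile, taken from the table above.
PROFILE_COPIES = {"RAID1": 2, "RAID1C3": 3, "RAID1C4": 4}


def utilization_percent(copies):
    """Share of raw space usable for data, rounded down to a whole percent."""
    return 100 // copies


def min_devices(copies):
    """Each copy must be on a different device."""
    return copies


def tolerated_failures(copies):
    """At least one copy survives as long as fewer than `copies` devices die."""
    return copies - 1


summary = {
    name: (utilization_percent(n), min_devices(n), tolerated_failures(n))
    for name, n in PROFILE_COPIES.items()
}
```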

Compare that with a hypothetical RAID1CN, on a filesystem with M devices (N <= M). When the filesystem starts with 2 devices, equivalent to RAID1, adding a new one will have mixed redundancy guarantees after writing more data. Old data with RAID1, new with RAID1C3 – but all accounted under RAID1CN profile. A full re-balance would be required to make it a reliable 3-copy RAID1. Add another device, going to RAID1C4, same problem with more data to shuffle around.

The allocation policy would depend on number of devices, making it hard for the user to know the redundancy level. This is already the case for RAID0/RAID5/RAID6. For the striped profile RAID0 it’s not much of a problem, there’s no redundancy. For the parity profiles it’s been a known problem and new balance filter stripe has been added to support fine grained selection of block groups.

Speaking about RAID6, there’s the elephant in the room, trying to cover write hole. Lack of a resiliency against 2 device damage has been bothering all of us because of the known write hole problem in the RAID6 implementation. How this is going to be addressed is for another post, but for now, the newly added RAID1C3 profile is a reasonable substitute for RAID6.

How to use it

On a freshly created filesystem it’s simple:

# mkfs.btrfs -d raid1c3 -m raid1c4 /dev/sd[abcd]

The command combines both new profiles for sake of demonstration, you should always consider the expected use and required guarantees and choose the appropriate profiles.

Changing the profile later on an existing filesystem works as usual, you can use:

# btrfs balance start -mconvert=raid1c3 /mnt/path

Provided there are enough devices and enough space to do the conversion, this will go through all metadata block groups and after it finishes, all of them will be of the desired type.

Backward compatibility

The new block groups are not understood by old kernels and can’t be mounted, not even in the read-only mode. To prevent that a new incompatibility bit is introduced, called raid1c34. Its presence on a device can be checked by btrfs inspect-internal dump-super in the incompat_flags. On a running system the incompat features are exported in sysfs, /sys/fs/btrfs/UUID/features/raid1c34.
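
A script could test for the bit on a mounted filesystem by looking for the sysfs file named above. The following Python sketch assumes the feature is reported simply by the presence of that file; the function name and the sysfs_root parameter (there so the check can be pointed at a test directory) are made up for this example.

```python
import os


def has_raid1c34(fs_uuid, sysfs_root="/sys/fs/btrfs"):
    """Return True if the raid1c34 incompat feature file exists.

    fs_uuid is the filesystem UUID as it appears under /sys/fs/btrfs.
    Presence of the file is treated as "feature enabled" (assumption).
    """
    path = os.path.join(sysfs_root, fs_uuid, "features", "raid1c34")
    return os.path.exists(path)
```

A call such as has_raid1c34("<your filesystem UUID>") only works while the filesystem is mounted; for an unmounted device the dump-super route mentioned above is the one to use.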


There is no demand for RAID1C5 at the moment (I asked more than once). The space utilization is low already, the RAID1C4 survives 3 dead devices so IMHO this is enough for most users. Extending resilience to more devices should perhaps take a different route.

With more copies there’s potential for parallelization of reads from multiple devices. Up to now this is not optimal, there’s a decision logic that’s semi-random based on process ID of the btrfs worker threads or process submitting the IO. Better load balancing policy is a work in progress and could appear in 5.7 at the earliest (because 5.6 development is now in fixes-only mode).

Look back

The history of the patchset is a bit bumpy. There was enough motivation and requests for the functionality, so I started the analysis what needs to be done. Several cleanups were necessary to unify code and to make it easily extendable for more copies while using the same mirroring code. In the end change a few constants and be done.

Following with testing, I tried simple mkfs and conversions, that worked well. Then scrub, overwrite some blocks and let the auto-repair do the work. No hiccups. The remaining and important part was the device replace, as the expected use case was to substitute RAID6, replacing a missing or damaged disk. I wrote the test script, replace 1 missing, replace 2 missing. And it did not work. While the filesystem was mounted, everything seemed OK. Unmount, check again and the devices were still missing. Not cool, right.

Due to lack of time before the upcoming merge window (a code freeze before the next development cycle), I had to declare it not ready and put it aside. This was in late 2018. For a highly requested feature this was not an easy decision. Had it been something less important, the development cycle between rc1 and the final release would have provided enough time to fix things up. But due to the maintainer role with its demands I was not confident that I could find enough time to debug and fix the remaining problem. Also nobody offered help to continue the work, but that’s how it goes.

In late 2019 I had some spare time and looked at the pending work again. I enhanced the test script with more debugging messages and more checks. The code worked well; it was the test script that was subtly broken. Oh well, what a blunder. That cost a year, but on the other hand releasing a highly requested feature that lacks an important part was not an appealing option.

The patchset was added to the 5.5 development queue at about the last possible moment before the freeze; the final 5.5 release happened a week ago.

February 01, 2020 11:00 PM

January 28, 2020

Matthew Garrett: Avoiding gaps in IOMMU protection at boot

When you save a large file to disk or upload a large texture to your graphics card, you probably don't want your CPU to sit there spending an extended period of time copying data between system memory and the relevant peripheral - it could be doing something more useful instead. As a result, most hardware that deals with large quantities of data is capable of Direct Memory Access (or DMA). DMA-capable devices are able to access system memory directly without the aid of the CPU - the CPU simply tells the device which region of memory to copy and then leaves it to get on with things. However, we also need to get data back to system memory, so DMA is bidirectional. This means that DMA-capable devices are able to read and write directly to system memory.

As long as devices are entirely under the control of the OS, this seems fine. However, this isn't always true - there may be bugs, the device may be passed through to a guest VM (and so no longer under the control of the host OS) or the device may be running firmware that makes it actively malicious. The third is an important point here - while we usually think of DMA as something that has to be set up by the OS, at a technical level the transactions are initiated by the device. A device that's running hostile firmware is entirely capable of choosing what and where to DMA.

Most reasonably recent hardware includes an IOMMU to handle this. The CPU's MMU exists to define which regions of memory a process can read or write - the IOMMU does the same but for external IO devices. An operating system that knows how to use the IOMMU can allocate specific regions of memory that a device can DMA to or from, and any attempt to access memory outside those regions will fail. This was originally intended to handle passing devices through to guests (the host can protect itself by restricting any DMA to memory belonging to the guest - if the guest tries to read or write to memory belonging to the host, the attempt will fail), but is just as relevant to preventing malicious devices from extracting secrets from your OS or even modifying the runtime state of the OS.

But setting things up in the OS isn't sufficient. If an attacker is able to trigger arbitrary DMA before the OS has started then they can tamper with the system firmware or your bootloader and modify the kernel before it even starts running. So ideally you want your firmware to set up the IOMMU before it even enables any external devices, and newer firmware should actually do this automatically. It sounds like the problem is solved.

Except there's a problem. Not all operating systems know how to program the IOMMU, and if a naive OS fails to remove the IOMMU mappings and asks a device to DMA to an address that the IOMMU doesn't grant access to then things are likely to explode messily. EFI has an explicit transition between the boot environment and the runtime environment triggered when the OS or bootloader calls ExitBootServices(). Various EFI components have registered callbacks that are triggered at this point, and the IOMMU driver will (in general) then tear down the IOMMU mappings before passing control to the OS. If the OS is IOMMU aware it'll then program new mappings, but there's a brief window where the IOMMU protection is missing - and a sufficiently malicious device could take advantage of that.

The ideal solution would be a protocol that allowed the OS to indicate to the firmware that it supported this functionality and request that the firmware not remove it, but in the absence of such a protocol we're left with non-ideal solutions. One is to prevent devices from being able to DMA in the first place, which means the absence of any IOMMU restrictions is largely irrelevant. Every PCI device has a busmaster bit - if the busmaster bit is disabled, the device shouldn't start any DMA transactions. Clearing that seems like a straightforward approach. Unfortunately this bit is under the control of the device itself, so a malicious device can just ignore this and do DMA anyway. Fortunately, PCI bridges and PCIe root ports should only forward DMA transactions if their busmaster bit is set. If we clear that then any devices downstream of the bridge or port shouldn't be able to DMA, no matter how malicious they are. Linux will only re-enable the bit after it's done IOMMU setup, so we should then be in a much more secure state - we still need to trust that our motherboard chipset isn't malicious, but we don't need to trust individual third party PCI devices.
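
The reasoning about endpoints versus bridges can be illustrated with a small model. This is a simulation, not kernel code: the class and function names are made up, and the only real-hardware fact it relies on is that Bus Master Enable is bit 2 (0x4) of the PCI command register.

```python
from dataclasses import dataclass
from typing import Optional

PCI_COMMAND_MASTER = 0x4  # Bus Master Enable, bit 2 of the PCI command register


@dataclass
class PciNode:
    name: str
    command: int = PCI_COMMAND_MASTER   # busmastering enabled by default here
    parent: Optional["PciNode"] = None  # upstream bridge or root port
    malicious: bool = False             # a malicious endpoint ignores its own bit


def clear_bus_master(node):
    """Clear the busmaster bit in the node's (simulated) command register."""
    node.command &= ~PCI_COMMAND_MASTER


def dma_reaches_memory(device):
    """Would a DMA transaction started by `device` make it to system memory?"""
    # A well-behaved endpoint does not start DMA while its own bit is clear.
    if not device.malicious and not device.command & PCI_COMMAND_MASTER:
        return False
    # Every bridge on the way up forwards DMA only if its own bit is set.
    bridge = device.parent
    while bridge is not None:
        if not bridge.command & PCI_COMMAND_MASTER:
            return False
        bridge = bridge.parent
    return True


root_port = PciNode("root-port")
evil = PciNode("evil-card", parent=root_port, malicious=True)

clear_bus_master(evil)
print(dma_reaches_memory(evil))   # the malicious device ignores its own bit

clear_bus_master(root_port)
print(dma_reaches_memory(evil))   # the root port no longer forwards its DMA
```

The model still trusts the bridge itself to honour its bit, which mirrors the point above about having to trust the motherboard chipset.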

This patch just got merged, adding support for this. My original version did nothing other than clear the bits on bridge devices, but this did have the potential for breaking devices that were still carrying out DMA at the moment this code ran. Ard modified it to call the driver shutdown code for each device behind a bridge before disabling DMA on the bridge, which in theory makes this safe but does still depend on the firmware drivers behaving correctly. As a result it's not enabled by default - you can either turn it on in kernel config or pass the efi=disable_early_pci_dma kernel command line argument.
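
A quick way to see whether the command line option was passed on a running system is to look at /proc/cmdline. The sketch below assumes that efi= takes a comma-separated list of options; the function name is made up, and it says nothing about whether the feature was turned on through the kernel config instead.

```python
def early_pci_dma_disabled(cmdline):
    """Return True if efi=disable_early_pci_dma appears on the kernel command line."""
    for arg in cmdline.split():
        key, _, value = arg.partition("=")
        # Compare whole options so that a longer option name containing
        # the same text is not mistaken for this one.
        if key == "efi" and "disable_early_pci_dma" in value.split(","):
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(early_pci_dma_disabled(f.read()))
    except OSError:
        print("cannot read /proc/cmdline")
```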

In combination with firmware that does the right thing, this should ensure that Linux systems can be protected against malicious PCI devices throughout the entire boot process.

January 28, 2020 11:19 PM

January 27, 2020

Pete Zaitcev: Too Real

From CKS:

The first useful property Python has is that you can't misplace the source code for your deployed Python programs.

January 27, 2020 04:46 PM