Kernel Planet

October 14, 2020

Paul E. Mc Kenney: Stupid RCU Tricks: Torturing RCU Fundamentally, Part II

Further reading of the Linux-kernel Documentation/RCU/Design/Requirements/Requirements.rst file encounters RCU's publish/subscribe guarantee. This guarantee ensures that RCU readers that traverse a newly inserted element of an RCU-protected data structure never see pre-initialization garbage in that element. In CONFIG_PREEMPT_NONE=y kernels, this guarantee combined with the grace-period guarantee permits RCU readers to traverse RCU-protected data structures using exactly the same sequence of instructions that would be used if these data structures were immutable. As always, free is a very good price!

However, some care is required to make use of this publish-subscribe guarantee. When inserting a new element, updaters must take care to first initialize everything that RCU readers might access and only then use an RCU primitive to carry out the insertion. Such primitives include rcu_assign_pointer() and list_add_rcu(), but please see The RCU API, 2019 edition or the Linux-kernel source code for the full list.
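For example, here is a minimal sketch of a correct insertion (struct foo, its init_foo() helper, and the gp pointer are hypothetical):

struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

if (p) {
  init_foo(p);                /* Fully initialize *p first... */
  rcu_assign_pointer(gp, p);  /* ...and only then publish it to readers. */
}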

For their part, readers must use an RCU primitive to carry out their traversals, for example, rcu_dereference() or list_for_each_entry_rcu(). Again, please see The RCU API, 2019 edition or the Linux-kernel source code for the full list of such primitives.
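And a matching reader-side sketch, reusing the hypothetical gp pointer and struct foo from above (do_something_with() is likewise hypothetical):

rcu_read_lock();
p = rcu_dereference(gp);     /* Subscribe: ordered against rcu_assign_pointer(). */
if (p)
  do_something_with(p->a);   /* Guaranteed not to see pre-initialization garbage. */
rcu_read_unlock();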

Of course, rcutorture needs to test this publish/subscribe guarantee. It does this using yet another field in the rcu_torture structure:

struct rcu_torture {
  struct rcu_head rtort_rcu;    /* RCU callback structure. */
  int rtort_pipe_count;         /* Grace periods since removal. */
  struct list_head rtort_free;  /* Links this element onto the freelist. */
  int rtort_mbtest;             /* Publish/subscribe check: 1 while live. */
};

This additional field is ->rtort_mbtest, which is set to zero when a given rcu_torture structure is freed for reuse (see the rcu_torture_pipe_update_one() function), and then set to 1 just before that structure is made available to readers (see the rcu_torture_writer() function). For its part, the rcu_torture_one_read() function checks to see if this field is zero, and if so flags the error by atomically incrementing the global n_rcu_torture_mberror counter. As you would expect, any run ending with a non-zero value in this counter is considered to be a failure.
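The following sketch, simplified from the actual rcutorture code, shows how these pieces fit together:

/* Writer, just before making the new element available to readers: */
rp->rtort_mbtest = 1;
rcu_assign_pointer(rcu_torture_current, rp);

/* Reader, having subscribed to the current element: */
p = rcu_dereference(rcu_torture_current);
if (p->rtort_mbtest == 0)
  atomic_inc(&n_rcu_torture_mberror);  /* Saw pre-publication state: failure. */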

Thus we have an important fundamental property of RCU that nevertheless happens to have a simple but effective test strategy. To the best of my knowledge, this was also the first aspect of Linux-kernel RCU that was subjected to an automated proof of correctness.

Sometimes you get lucky! ;-)

October 14, 2020 11:16 PM

October 12, 2020

Linux Plumbers Conference: LPC 2020 Survey Results

We had 185 responses to the Linux Plumbers survey in 2020 which, out of 809 total conference registrants, gives us confidence in the feedback. Given that the event was held virtually this year, it’s encouraging to see the community remaining engaged. So we are pleased to offer an especially heartfelt “thank you” to everyone who participated in this survey!

Overall
98.4% of respondents were positive or neutral about the event, with only 1.6% indicating they were dissatisfied. Given the fact we had to shift the event to be online this year, that is a very encouraging result. Co-location with the Kernel Summit continues to prove popular (67.5% considered it helpful/very helpful), and the first-time introduction of the GNU Tools track was very well received, with 68% of the respondents considering it helpful/very helpful as well. One thing we were a bit worried about was whether the online format would enable discussions that help resolve problems: 73% found them useful, which, compared to most online events, is a great result.

The BOF track was very popular and we’re looking to include this again in 2021. Conference participation was up from 2019, and even though we increased the capacity to 810, we sold out of regular tickets again. Because participants adhered to the online-attendance guidelines, we didn’t bump into the capacity limits we were worried about, so we are considering raising the cap next year if we need to be virtual. From the survey, the overwhelming majority of attendees prefer us to try to hold the conference in person, with a fallback to virtual. With this in mind, we’re working with the Linux Foundation events team to identify options in Dublin for a hybrid event, but may fall back to being entirely online.

Based on the fact we sold out, we live-streamed and videotaped all of the sessions. All the live streams are available for playback now on our YouTube channel. There are over 120 hours of video for 2020 already and we are adding more. The committee is in the process of re-rendering them and linking them to the detailed schedule. The Microconferences are recorded as one long video block, but clicking on the video link of a particular discussion topic will take you to the time index in that file where the chosen discussion begins. The recorded BoFs will also be posted soon.

Content
In terms of track feedback, the Linux Plumbers Refereed track and the Kernel Summit track were indicated as very relevant by almost all respondents who attended. The BOFs track was positively received and will continue. The hallway track continues to be regarded as very important and appreciated. Based on the feedback, if we have to be virtual again, we will look at options for making more hack rooms available, as they were well received for follow-on conversations. If we are able to meet in person, we will evaluate options for making private meeting rooms available for groups who need to meet onsite.

Communication
The emails from the committee continue to be positively received, as was our new website. There were some excellent suggestions in this year’s write-in comments, and we’ll be looking into options to incorporate them. In particular, because we were online, it was possible for more people to join us who would not have been able to get travel funding or visas approved.

There were lots of great suggestions to the “what one thing would you like to see changed” question, and the program committee has been studying them to see what is possible to implement this year. Thank you again to the participants for their input and help on improving the Linux Plumbers Conference. More information on the 2021 conference will be shared early in the new year.

October 12, 2020 03:03 PM

October 09, 2020

Paul E. Mc Kenney: Stupid RCU Tricks: Torturing RCU Fundamentally, Part I

A quick look at the beginning of the Documentation/RCU/Design/Requirements/Requirements.rst file in a recent Linux-kernel source tree might suggest that testing RCU's fundamental requirements is Job One. And that suggestion would be quite correct. This post describes how rcutorture tests RCU's grace-period guarantee, which is usually used to make sure that data is not freed out from under an RCU reader. Later posts will describe how the other fundamental guarantees are tested.

What Exactly is RCU's Fundamental Grace-Period Guarantee?

Any RCU reader that started before the start of a given grace period is guaranteed to complete before that grace period completes. This is shown in the following diagram:

Diagram of RCU grace-period guarantee 1

Similarly, any RCU reader that completes after the end of a given grace period is guaranteed to have started after that grace period started. And this is shown in this diagram:

Diagram of RCU grace-period guarantee 2

More information is available in the aforementioned Documentation/RCU/Design/Requirements/Requirements.rst file.

Whose Fault is This rcutorture Failure, Anyway?

Suppose an rcutorture test fails, perhaps by triggering a WARN_ON() that normally indicates a problem in some other area of the kernel. But how do we know this failure is not instead RCU's fault?

One straightforward way to test RCU's grace-period guarantee would be to maintain a single RCU-protected pointer (let's call it rcu_torture_current) to a single structure, perhaps defined as follows:

struct rcu_torture {
  struct rcu_head rtort_rcu;  /* RCU callback structure. */
  atomic_t rtort_nreaders;    /* Number of readers that saw this element. */
  int rtort_pipe_count;       /* Set non-zero just before freeing. */
} *rcu_torture_current;

Readers could then do something like this in a loop:

rcu_read_lock();
p = rcu_dereference(rcu_torture_current);
atomic_inc(&p->rtort_nreaders);
burn_a_random_amount_of_cpu_time();
WARN_ON(READ_ONCE(p->rtort_pipe_count)); /* Splat if *p was freed too soon. */
rcu_read_unlock();

An updater could do something like this, also in a loop:

p = kzalloc(sizeof(*p), GFP_KERNEL);
q = xchg(&rcu_torture_current, p);        /* Publish new element, capture old. */
call_rcu(&q->rtort_rcu, rcu_torture_cb);  /* Free old element after a grace period. */

And the rcu_torture_cb() function might be defined as follows:

static void rcu_torture_cb(struct rcu_head *p)
{
  struct rcu_torture *rp = container_of(p, struct rcu_torture, rtort_rcu);

  accumulate_stats(atomic_read(&rp->rtort_nreaders));
  WRITE_ONCE(rp->rtort_pipe_count, 1);
  burn_a_bit_more_cpu_time();
  kfree(rp);
}

This approach is of course problematic, never mind that one of rcutorture's predecessors actually did something like this. For one thing, a reader might be interrupted or (in CONFIG_PREEMPT=y kernels) preempted between its rcu_dereference() and its atomic_inc(). Then a too-short RCU grace period could result in the above reader doing its atomic_inc() on some structure that had already been freed and allocated as some other data structure used by some other part of the kernel. This could in turn result in a confusing failure in that other part of the kernel that was really RCU's fault.

In addition, the read-side atomic_inc() will result in expensive cache misses that will end up synchronizing multiple tasks concurrently executing the RCU reader code shown above. This synchronization will reduce read-side concurrency, which will in turn likely reduce the probability of these readers detecting a too-short grace period.

Finally, using the passage of time for synchronization is almost always a bad idea, so burn_a_bit_more_cpu_time() really needs to go. One might suspect that burn_a_random_amount_of_cpu_time() is also a bad idea, but we will see the wisdom in it.

Making rcutorture Preferentially Break RCU

The rcutorture module reduces the probability of false-positive non-RCU failures using these straightforward techniques:

  1. Allocate the memory to be referenced by rcu_torture_current in an array, whose elements are only ever used by rcutorture.
  2. Once an element is removed from rcu_torture_current, keep it in a special rcu_torture_removed list for some time before allowing it to be reused.
  3. Keep the random time delays in the rcutorture readers.
  4. Run rcutorture on an otherwise idle system, or, more commonly these days, within an otherwise idle guest OS.
  5. Make rcutorture place a relatively heavy load on RCU.

Use of the array keeps rcutorture's use-after-free accesses from clobbering other kernel subsystems' data structures; keeping to-be-freed elements on the rcu_torture_removed list increases the probability that rcutorture will detect a too-short grace period; the random delays in the readers likewise increase that probability; and ensuring that most of the RCU activity is done at rcutorture's behest decreases the probability that any too-short grace periods will clobber other kernel subsystems.

The rcu_torture_alloc() and rcu_torture_free() functions manage a freelist of array elements. The freelist is a simple list creatively named rcu_torture_freelist and guarded by a global rcu_torture_lock. Because allocation and freeing happen at most once per grace period, this global lock is just fine: It is nowhere near being any sort of performance or scalability bottleneck.
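A sketch of the allocation side, simplified from the actual rcutorture code (which also maintains statistics):

static struct rcu_torture *rcu_torture_alloc(void)
{
  struct list_head *p;

  spin_lock_bh(&rcu_torture_lock);
  if (list_empty(&rcu_torture_freelist)) {
    spin_unlock_bh(&rcu_torture_lock);
    return NULL;  /* All elements still in flight. */
  }
  p = rcu_torture_freelist.next;
  list_del_init(p);
  spin_unlock_bh(&rcu_torture_lock);
  return container_of(p, struct rcu_torture, rtort_free);
}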

The rcu_torture_removed list is handled by the rcu_torture_pipe_update_one() function that is invoked by rcutorture callbacks and the rcu_torture_pipe_update() function that is invoked by rcu_torture_writer() after completing a synchronous RCU grace period. The rcu_torture_pipe_update_one() function updates only the specified array element, and the rcu_torture_pipe_update() function updates all of the array elements residing on the rcu_torture_removed list. These updates each increment the ->rtort_pipe_count field. When the value of this field reaches RCU_TORTURE_PIPE_LEN (by default 10), the array element is freed for reuse.
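In other words, something like this simplified sketch:

static void rcu_torture_pipe_update_one(struct rcu_torture *rp)
{
  int i = rp->rtort_pipe_count;

  WRITE_ONCE(rp->rtort_pipe_count, i + 1);
  if (i + 1 >= RCU_TORTURE_PIPE_LEN) {
    rp->rtort_mbtest = 0;  /* Mark element off-limits to readers... */
    rcu_torture_free(rp);  /* ...and return it to the freelist. */
  }
}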

The rcu_torture_reader() function handles the random time delays and leverages the awesome power of multiple kthreads to maintain a high read-side load on RCU. The rcu_torture_writer() function runs in a single kthread in order to simplify synchronization, but it enlists the help of several other kthreads repeatedly invoking the rcu_torture_fakewriter() function in order to keep the update-side load on RCU at a respectable level.

 

This blog post described RCU's fundamental grace-period guarantee and how rcutorture stress-tests it. It also described a few simple ways that rcutorture increases the probability that any failures to provide this guarantee are attributed to RCU and not to some hapless innocent bystander.

October 09, 2020 08:49 PM

September 21, 2020

Kees Cook: security things in Linux v5.7

Previously: v5.6

Linux v5.7 was released at the end of May. Here’s my summary of various security things that caught my attention:

arm64 kernel pointer authentication
While the ARMv8.3 CPU “Pointer Authentication” (PAC) feature landed for userspace already, Kristina Martsenko has now landed PAC support in kernel mode. The current implementation uses PACIASP, which signs the saved return address (keyed against the stack pointer), similar to the existing CONFIG_STACKPROTECTOR feature, only faster. This also paves the way to sign and check pointers stored in the heap, as a way to defeat function pointer overwrites in those memory regions too. Since the behavior is different from the traditional stack protector, Amit Daniel Kachhap added an LKDTM test for PAC as well.

BPF LSM
The kernel’s Linux Security Module (LSM) API provides a way to write security modules that have traditionally implemented various Mandatory Access Control (MAC) systems like SELinux, AppArmor, etc. The LSM hooks are numerous and no one LSM uses them all, as some hooks are much more specialized (like those used by IMA, Yama, LoadPin, etc). There was not, however, any way to externally attach to these hooks (not even through a regular loadable kernel module) nor build fully dynamic security policy, until KP Singh landed the API for building LSM policy using BPF. With this, it is possible (for a privileged process) to write kernel LSM hooks in BPF, allowing for totally custom security policy (and reporting).

execve() deadlock refactoring
There have been a number of long-standing races in the kernel’s process launching code where ptrace could deadlock. Fixing these has been attempted several times over the last many years, but Eric W. Biederman and Bernd Edlinger decided to dive in, and successfully landed a series of refactorings, splitting up the problematic locking and refactoring their uses to remove the deadlocks. While he was at it, Eric also extended the exec_id counter to 64 bits to avoid the possibility of the counter wrapping and allowing an attacker to send arbitrary signals to processes they normally shouldn’t be able to.

slub freelist obfuscation improvements
After Silvio Cesare observed some weaknesses in the implementation of CONFIG_SLAB_FREELIST_HARDENED‘s freelist pointer content obfuscation, I improved their bit diffusion, which makes attacks require significantly more memory content exposures to defeat the obfuscation. As part of the conversation, Vitaly Nikolenko pointed out that the freelist pointer’s location made it relatively easy to target too (for either disclosures or overwrites), so I moved it away from the edge of the slab, making it harder to reach through small-sized overflows (which usually target the freelist pointer). As it turns out, there were a few assumptions in the kernel about the location of the freelist pointer, which had to also get cleaned up.
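The stored freelist pointer ends up XORed with both a per-cache random value and the byte-swapped address at which it is stored, along the lines of this sketch of mm/slub.c's freelist_ptr():

static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                                 unsigned long ptr_addr)
{
    /* swab() mixes the high-entropy low bits of the storage address
     * into the high bits of the obfuscated pointer. */
    return (void *)((unsigned long)ptr ^ s->random ^ swab(ptr_addr));
}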

RISCV page table dumping
Following v5.6’s generic page table dumping work, Zong Li landed the RISCV page dumping code. This means it’s much easier to examine the kernel’s page table layout when running a debug kernel (built with PTDUMP_DEBUGFS), visible in /sys/kernel/debug/kernel_page_tables.

array index bounds checking
This is a pretty large area of work that touches a lot of overlapping elements (and history) in the Linux kernel. The short version is: C is bad at noticing when it uses an array index beyond the bounds of the declared array, and we need to fix that. For example, don’t do this:

int foo[5];
...
foo[8] = bar;

The long version gets complicated by the evolution of “flexible array” structure members, so we’ll pause for a moment and skim the surface of this topic. While things like CONFIG_FORTIFY_SOURCE try to catch these kinds of cases in the memcpy() and strcpy() family of functions, it doesn’t catch it in open-coded array indexing, as seen in the code above. GCC has a warning (-Warray-bounds) for these cases, but it was disabled by Linus because of all the false positives seen due to “fake” flexible array members. Before flexible arrays were standardized, GNU C supported “zero sized” array members. And before that, C code would use a 1-element array. These were all designed so that some structure could be the “header” in front of some data blob that could be addressable through the last structure member:

/* 1-element array */
struct foo {
    ...
    char contents[1];
};

/* GNU C extension: 0-element array */
struct foo {
    ...
    char contents[0];
};

/* C standard: flexible array */
struct foo {
    ...
    char contents[];
};

instance = kmalloc(sizeof(struct foo) + content_size, GFP_KERNEL);

Converting all the zero- and one-element array members to flexible arrays is one of Gustavo A. R. Silva’s goals, and hundreds of these changes started landing. Once fixed, -Warray-bounds can be re-enabled. Much more detail can be found in the kernel’s deprecation docs.
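As a sketch of the result, the flexible-array version of the allocation above can also use the kernel's struct_size() helper, which saturates rather than wrapping on arithmetic overflow:

/* Equivalent to sizeof(struct foo) + content_size, but overflow-safe. */
instance = kmalloc(struct_size(instance, contents, content_size), GFP_KERNEL);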

However, that will only catch the “visible at compile time” cases. For runtime checking, the Undefined Behavior Sanitizer has an option for adding runtime array bounds checking for catching things like this where the compiler cannot perform a static analysis of the index values:

int foo[5];
...
for (i = 0; i < some_argument; i++) {
    ...
    foo[i] = bar;
    ...
}

It was, however, not separate (via kernel Kconfig) until Elena Petrova and I split it out into CONFIG_UBSAN_BOUNDS, which is fast enough for production kernel use. With this enabled, it's now possible to instrument the kernel to catch these conditions, which seem to come up with some regularity in Wi-Fi and Bluetooth drivers for some reason. Since UBSAN (and the other Sanitizers) only WARN() by default, system owners need to set panic_on_warn=1 too if they want to defend against attacks targeting these kinds of flaws. Because of this, and to avoid bloating the kernel image with all the warning messages, I introduced CONFIG_UBSAN_TRAP which effectively turns these conditions into a BUG() without needing additional sysctl settings.

Fixing "additive" snprintf() usage
A common idiom in C for building up strings is to use sprintf()'s return value to increment a pointer into a string, and build a string with more sprintf() calls:

/* safe if strlen(foo) + 1 < sizeof(string) */
wrote  = sprintf(string, "Foo: %s\n", foo);
/* overflows if strlen(foo) + strlen(bar) > sizeof(string) */
wrote += sprintf(string + wrote, "Bar: %s\n", bar);
/* writing way beyond the end of "string" now ... */
wrote += sprintf(string + wrote, "Baz: %s\n", baz);

The risk is that if these calls eventually walk off the end of the string buffer, they will start writing into other memory and create some bad situations. Switching these to snprintf() does not, however, make anything safer, since snprintf() returns how much it would have written:

/* safe, assuming available <= sizeof(string), and for this example
 * assume strlen(foo) < sizeof(string) */
wrote  = snprintf(string, available, "Foo: %s\n", foo);
/* if (strlen(bar) > available - wrote), this is still safe since the
 * write into "string" will be truncated, but now "wrote" has been
 * incremented by how much snprintf() *would* have written, so "wrote"
 * is now larger than "available". */
wrote += snprintf(string + wrote, available - wrote, "Bar: %s\n", bar);
/* string + wrote is beyond the end of string, and available - wrote wraps
 * around to a giant positive value, making the write effectively
 * unbounded. */
wrote += snprintf(string + wrote, available - wrote, "Baz: %s\n", baz);

So while the first overflowing call would be safe, the next one would be targeting beyond the end of the array with a size calculation that had wrapped around to a giant limit. Replacing this idiom with scnprintf() solves the issue because it only reports what was actually written. To this end, Takashi Iwai has been landing a bunch of scnprintf() fixes.
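For illustration, here is the earlier example converted to scnprintf(); because the return value never exceeds the available space, the pointer and size arithmetic stay in bounds:

wrote  = scnprintf(string, available, "Foo: %s\n", foo);
/* If truncation happened above, wrote is the truncated length, so
 * "available - wrote" cannot wrap and the next write stays in bounds. */
wrote += scnprintf(string + wrote, available - wrote, "Bar: %s\n", bar);
wrote += scnprintf(string + wrote, available - wrote, "Baz: %s\n", baz);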

That's it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.8.

© 2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

September 21, 2020 11:32 PM

September 17, 2020

James Bottomley: Creating a Home IPv6 Network

One of the recent experiences of Linux Plumbers Conference convinced me that if you want to be part of a true open source WebRTC based peer to peer audio/video interaction, you need an internet address that’s not behind a NAT. In reality, the protocol still works as long as you can contact a stun server to tell you what your external address is and possibly a turn server to proxy the packets if both endpoints are NATed but all this seeking external servers takes time as those of you who complained about the echo test found. The solution to all this is to connect over IPv6 which has an address space large enough to support every device on the planet having its own address. All modern Linux distributions support IPv6 out of the box so the chances are you’ve actually accidentally used it without ever noticing, which is one of the beauties of IPv6 autoconfiguration (it’s supposed to just work).

However, I recently moved, and in doing so traded my fibre internet connection for cable, but cable that did come with an IPv6 address, so this is my story of getting it all to work. If you don’t really care about the protocol basics, you can skip down to the how. This guide is also focussed on a “dual stack” configuration (one that has both IPv6 and IPv4 addresses). Pure IPv6 configurations are possible, but because some parts of the internet are still IPv4 only, they’re not complete unless you set up an IPv4 encapsulating bridge.

The Basics of IPv6

IPv6 has been a mature protocol for a long time now, so I erroneously assumed there’d be a load of good HOWTOs about it. However, after reading 20 different descriptions of how the IPv6 128 bit address space works and not much else, I gave up in despair and read the RFCs instead. I’ll assume you’ve read at least one of these HOWTOs, so I don’t have to go into IPv6 address prefixes, suffixes, interface IDs or subnets, and I’ll begin where most of the HOWTOs end.

How does IPv6 Just Work?

In IPv4 there’s a protocol called dynamic host configuration protocol (DHCP) so as long as you can find a DHCP server you can get all the information you need to connect (local address, router, DNS server, time server, etc). However, this service has to be set up by someone and IPv6 is designed to configure a network without it.

The first assumption IPv6 StateLess Address AutoConfiguration (SLAAC) makes is that it’s on a /64 subnet (so every subnet in IPv6 contains 10^10 times as many addresses as the entire IPv4 internet). This means that, since most real subnets contain <100 systems, they can simply choose a random address and be very unlikely to clash with the existing systems. In fact, there are three current ways of choosing an address in the /64:

  1. EUI-64 (RFC 4291) based on the MAC address which is basically the MAC with one bit flipped and ff:fe placed in the middle.
  2. Stable Private (RFC 7217) addresses, which are generated from a hash based on a static key, the interface, the prefix and a counter (the counter is incremented if there is a clash). These are preferred to the EUI-64 ones, which give away any configuration associated with the MAC address (such as what type of network card you have).
  3. Privacy Extension Addresses (RFC 4941) which are very similar to stable private addresses except they change over time using the IPv6 address deprecation mechanism and are for client systems who want to preserve anonymity.

The next problem in Linux is who configures the interface? The Kernel IPv6 stack is actually designed to do it, and will unless told not to, but most of the modern network controllers (like NetworkManager) are control freaks and turn off the kernel’s auto configuration so they can do it themselves. They also default to stable private addressing using a static secret maintained in the filesystem (/var/lib/NetworkManager/secret_key).

The next thing to understand about IPv6 addresses is that they are divided into scopes, the most important being link local (unrouteable) addresses which conventionally always have the prefix fe80::/64. The link local address is configured first using one of the above methods and then used to probe the network.

Multicast and Neighbour Discovery

Unlike IPv4, IPv6 has no broadcast capability so all discovery is done by multicast. Nodes coming up on the network subscribe to particular multicast addresses, via special packets intercepted by the switch, and won’t receive any multicast to which they’re not subscribed. Conventionally, all link local multicast addresses have the prefix ff02::/64 (for other types of multicast address see RFC 4291). All nodes subscribe to the “all nodes” multicast address ff02::1 and also must subscribe to their own solicited node multicast address at ff02::1:ffXX:XXXX where the last 24 bits correspond to the lowest 24 bits of the node’s IPv6 address. This latter is to avoid the disruption that used to occur in IPv4 from ARP broadcasts because now you can target a specific subset of nodes for address resolution.
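For example, a node with the (made-up) address 2001:db8::aa:1234:5678 takes the low 24 bits (34:5678) and subscribes to the solicited node multicast address:

ff02::1:ff34:5678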

The IPv6 address resolution protocol is called Neighbour Solicitation (NS), described in RFC 4861 (and its use with SLAAC described in RFC 4862), and is done by sending a multicast to the solicited node address of the node you want to discover containing the full IPv6 address you want to know; a node with the matching address replies with its link layer (MAC) address in a Neighbour Advertisement (NA) packet.

Once a node has chosen its link local address, it first sends out an NS packet to its chosen address to see if anyone replies; if no-one does, it assumes it is OK to keep the address, otherwise it follows the collision avoidance protocol associated with its particular form of address. Once it has found a unique address, the node configures this link local address and looks for a router. Note that if an IPv6 network isn’t present, discovery stops here, which is why most network interfaces always show a link local IPv6 address.

Router Discovery

Once the node has its own unique link local address, it uses it to send out Router Solicitation (RS) packets to the “all routers” multicast address ff02::2. Every router on the network responds with a Router Advertisement (RA) packet which describes (among other things) the router lifetime, the network MTU, a set of one or more prefixes the router is responsible for, the router’s link address and a set of option flags including the M (Managed) and O (Other Configuration) flags and possibly a set of DNS servers.

Each advertised prefix contains the prefix and prefix length, a set of flags including the A (autonomous configuration) and L (link local) flags, and a set of lifetimes. The link local prefixes tell you what global prefixes the local network uses (there may be more than one) and whether you are allowed to do SLAAC on the global prefix (if the A flag is clear, you must ask the router for an address using DHCPv6). If the router has a non zero lifetime, you may assume it is a default router for the subnet.

Now that the node has discovered one or more routers it may configure its own global address (note that every IPv6 routeable node has at least two addresses: a link local and a global). How it does this depends on the router and prefix flags.

Global Address Configuration

The first thing a node needs to know is whether to use SLAAC for the global address or DHCPv6. This is entirely determined by the A flag of any link local prefix in the RA packet. If A is set, then the node may use SLAAC and if A is clear then the node must use DHCPv6 to obtain an address. If A is set and also the M (Managed) flag then the node may use either SLAAC or DHCPv6 (or both) to obtain an address and if the M flag is clear, but the O (Other Config) flag is present then the node must use SLAAC but may use DHCPv6 to obtain other information about the network (usually DNS).

Once the node has a global address it now needs a default route. It forms the default route list from the RA packets that have a non-zero router lifetime. All of these are configured as default routes to their link local address with the RA specified hop count. Finally, the node may add specific prefix routes from RA packets with zero router lifetimes but non-link-local prefixes.

DHCPv6 is a fairly complex configuration protocol (see RFC 8415) but it cannot specify either prefix length (meaning all obtained addresses are configured as /128) or routes (these must be obtained from RA packets). This leads to a subtlety of outbound address selection in that the most specific is always preferred, so if you configure both by SLAAC and DHCPv6, the SLAAC address will be added as /64 and the DHCPv6 address as /128 meaning your outbound IP address will always be the DHCPv6 one (although if an external entity knows your SLAAC address, they will still be able to reach you on it).

The How: Configuring your own Home Router

One of the things you’d think from the above is that IPv6 always auto configures and, while it is true that if you simply plug your laptop into the ethernet port of a cable modem it will just automatically configure, most people have a more complex home setup involving a router, which needs some special coaxing before it will work. That means you need to obtain additional features from your ISP using special DHCPv6 requests.

This section is written from my own point of view: I have a rather complex IPv4 network which has a completely open but bandwidth limited (to untrusted clients) wifi network, and several protected internal networks (one for my lab, one for my phones and one for the household video cameras), so I need at least 4 subnets to give every device in my home an IPv6 address. I also use OpenWRT as my router distribution, so all the IPv6 configuration information is highly specific to it (although it should be noted that things like NetworkManager can also do all of this if you’re prepared to dig in the documentation).

Prefix Delegation

Since DHCPv6 only hands out a /128 address, this isn’t sufficient because it’s the IP address of the router itself. In order to become a router, you must request delegation of part of the IPv6 address space via the Identity Association for Prefix Delegation (IA_PD) option of DHCPv6. Once this is done the router IP address will be assumed by the ISP to be the route for all of the delegated prefixes. The subtlety here is that if you want more than one subnet, you have to ask for it specifically (the client must specify the exact prefix length it’s looking for) and since it’s a prefix length, and your default subnet should be /64, if you request a prefix length of 64 you only have one subnet. If you request 63 you have 2 and so on. The problem is how do you know how many subnets the ISP is willing to give you? Unfortunately there’s no way of finding this out (I had to do an internet search to discover my ISP, Comcast, was willing to delegate a prefix length of 60, meaning 16 subnets). If searching doesn’t tell you how much your ISP is willing to delegate, you could try starting at 48 and working your way to 64 in increments of 1 to find the largest delegation you can get away with (there have been reports of ISPs locking you at your first delegated prefix length, so don’t start at 64). The final subtlety is that the prefix you’re delegated may not be the same prefix as the address your router obtained (my current Comcast configuration has my router at 2001:558:600a:… but my delegated prefix is 2601:600:8280:66d0::/60). Note you can run odhcp6c manually with the -P option if you have to probe your ISP to find out what size of prefix you can get.
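For example, a manual probe for a /60 delegation might look like this (the interface name is illustrative):

# Ask the ISP for a /60 prefix delegation on the WAN interface
odhcp6c -P 60 eth0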

Configuring the Router for Prefix Delegation

In OpenWRT terms, the router WAN DHCP(v6) configuration is controlled by /etc/config/network. You’ll already have a WAN interface (likely called ‘wan’) for DHCPv4, so you simply add an additional ‘wan6’ interface to get an additional IPv6 address and become dual stack. In my configuration this looks like

config interface 'wan6'
        option ifname '@wan'
        option proto 'dhcpv6'
        option reqprefix 60

The slight oddity is the ifname: @wan simply tells the config to use the same ifname as the ‘wan’ interface. Naming it this way is essential if your wan is a bridge, but it’s good practice anyway. The other option ‘reqprefix’ tells DHCPv6 to request a /60 prefix delegation.

Handing Out Delegated Prefixes

This turns out to be remarkably simple. Firstly you have to assign a delegated prefix to each of your other interfaces on the router, but you can do this without adding a new OpenWRT interface for each of them. My internal IPv4 network has all static addresses, so you add three directives to each of the interfaces:

config interface 'lan'
        ... interface designation (bridge for me)
        option proto 'static'
        ... ipv4 addresses
        option ip6assign '64'
        option ip6hint '1'
        option ip6ifaceid '::ff'

ip6assign ‘N’ means you are a /N network (so this is always /64 for me), ip6hint ‘N’ means use N as your subnet id, and ip6ifaceid ‘S’ means use S as the IPv6 suffix (this defaults to ::1, so if you’re OK with that, omit this directive). So given I have a 2601:600:8280:66d0::/60 prefix, the global address of this interface will be 2601:600:8280:66d1::ff. Now the acid test: if you got this right, this global address should be pingable from anywhere on the IPv6 internet (if it isn’t, it’s likely a firewall issue, see below).

Advertising as a Router

Simply getting a delegated prefix onto a local router interface is insufficient. Now you need to get your router to respond to Router Solicitations on ff02::2 and optionally do DHCPv6. Unfortunately, OpenWRT has two mechanisms for doing this, usually both installed: odhcpd and dnsmasq. What I found was that none of my directives in /etc/config/dhcp would take effect until I disabled odhcpd completely

/etc/init.d/odhcpd stop
/etc/init.d/odhcpd disable

and since I use dnsmasq extensively elsewhere (split DNS for internal/external networks), that suited me fine. I’ll describe firstly what options you need in dnsmasq and secondly how you can achieve this using entries in the OpenWRT /etc/config/dhcp file (I find this useful because it’s always wise to check what OpenWRT has put in the /var/etc/dnsmasq.conf file).

The first dnsmasq option you need is ‘enable-ra’ which is a global parameter instructing dnsmasq to handle router advertisements. The next parameter you need is the per-interface ‘ra-param’ which specifies the global router advertisement parameters and must appear once for every interface you want to advertise on. Finally the ‘dhcp-range’ option allows more detailed configuration of the type of RA flags and optional DHCPv6.

SLAAC or DHCPv6 (or both)

In many ways this is a matter of personal choice. If you allow SLAAC, hosts which want to use privacy extension addresses (like Android phones) can do so, which is a good thing. If you also allow DHCPv6 address selection you will have a list of addresses assigned to hosts and dnsmasq will do DNS resolution for them (although it can do DNS for SLAAC addresses provided it gets told about them). A special tag ‘constructor’ exists for the ‘dhcp-range’ option which tells it to construct the supplied address (for either RA or DHCPv6) from the IPv6 global prefix of the specified interface, which is how you pass out your delegated prefix addresses. The modes for ‘dhcp-range’ are ‘ra-only’ to disallow DHCPv6 entirely, ‘slaac’ to allow DHCPv6 address selection and ‘ra-stateless’ to disallow DHCPv6 address selection but allow other DHCPv6 configuration information.
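Pulling this together, a dnsmasq configuration sketch for SLAAC plus full DHCPv6 (the br-lan interface name is illustrative; compare against what OpenWRT actually generates in /var/etc/dnsmasq.conf):

enable-ra
ra-param=br-lan
# Construct the range from br-lan's delegated prefix; 'slaac' also
# permits SLAAC address selection alongside DHCPv6
dhcp-range=::1000,::ffff,constructor:br-lan,slaac,1h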

Based on trial and error (and finally examining the scripting in /etc/init.d/dnsmasq), the OpenWRT options required to achieve the above dnsmasq options are

config dhcp lan
        option interface lan
        option start 100
        option limit 150
        option leasetime 1h
        option dhcpv6 'server'
        option ra_management '1'
        option ra 'server'

with ‘ra_management’ as the key option with ‘0’ meaning SLAAC with DHCPv6 options, ‘1’ meaning SLAAC with full DHCPv6, ‘2’ meaning DHCPv6 only and ‘3’ meaning SLAAC only. Another OpenWRT oddity is that there doesn’t seem to be a way of setting the lease range: it always defaults to either static only or ::1000 to ::ffff.

Firewall Configuration

One of the things that trips people up is the fact that Linux has two completely separate firewalls: one for IPv4 and one for IPv6. If you’ve ever written any custom rules for them, the chances are you did it in the OpenWRT /etc/firewall.user file and you used the iptables command, which means you only added the rules to the IPv4 firewall. To add the same rule for IPv6 you need to duplicate it using the ip6tables command. Another significant problem, if you’re using connection tracking for port knock detection like I am, is that Linux connection tracking has difficulty with IPv6 multicast, so packets that go out to a multicast but come back as unicast (as most of the discovery protocols do) get the wrong conntrack state. To fix this, I eventually had to have an INPUT rule just accepting all ICMPv6 and DHCPv6 (udp ports 546 [client] and 547 [server]). The other firewall consideration is that now everyone has their own IP address, so there’s no need for NAT (OpenWRT can be persuaded to take care of this automatically, but if you’re duplicating the IPv4 rules manually, don’t duplicate the NAT rules). The final one is likely more applicable to me: my wifi interface is designed to be an extension of the local internet and all machines connecting to it are expected to be able to protect themselves since they’ll migrate to such hostile environments as airport wifi, thus I do complete exposure of wifi connected devices to the general internet for all ports, including port probes. For my internal devices, I have a RELATED,ESTABLISHED rule to make sure they’re not probed since they’re not designed to migrate off the internal network.
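For example, the ICMPv6 and DHCPv6 acceptance rules just described might look like this in /etc/firewall.user:

# IPv6 has its own firewall: these rules must use ip6tables, not iptables
ip6tables -A INPUT -p ipv6-icmp -j ACCEPT
# DHCPv6 client and server ports
ip6tables -A INPUT -p udp --dport 546:547 -j ACCEPT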

Now the problems with OpenWRT: since you want NAT on IPv4 but not on IPv6 you have to have two separate wan zones for them: if you try to combine them (as I first did), then OpenWRT will add an IPv6 --ctstate INVALID rule which will prevent Neighbour Discovery from working because of the conntrack issues with IPv6 multicast, so my wan zones are (well, this is a lie because my firewall is now hand crafted, but this is what I checked worked before I put the hand crafted firewall in place):

config zone
        option name 'wan'
        option network 'wan'
        option masq '1'
        ...

config zone
        option name 'wan6'
        option network 'wan6'
        ...

And the routing rules for the lan zone (fully accessible) are

config forwarding
        option src 'lan'
        option dest 'wan'

config forwarding
        option src 'lan'
        option dest 'wan6'

config forwarding
        option src 'wan6'
        option dest 'lan'

Putting it Together: Getting the Clients IPv6 Connected

Now that you have your router configured, everything should just work. If it did, your laptop wifi interface should now have a global IPv6 address:

ip -6 address show dev wlan0

If that comes back empty, you need to enable IPv6 on your distribution. If it has only a link local (fe80:: prefix) address, IPv6 is enabled but your router isn’t advertising (suspect firewall issues with discovery packets or dnsmasq misconfiguration). If you see a global address, you’re done. Now you should be able to go to https://testv6.com and secure a 10/10 score.

The final piece of the puzzle is preferring your new IPv6 connection when DNS offers a choice of IPv4 or IPv6 addresses. All modern Linux clients should prefer IPv6 when available if connected to a dual stack network, so try it: if you ping, say, www.google.com and see an IPv6 address, you’re done. If not, you need to get into the murky world of IPv6 address labelling (RFC 6724) and gai.conf.

Conclusion

Adding IPv6 to an existing IPv4 setup is currently not a simple plug in and go operation. However, provided you understand a handful of differences between the two protocols, it’s not an insurmountable problem either. I have also glossed over many of the problems you might encounter with your ISP. Some people have reported that their ISPs only hand out one IPv6 address with no prefix delegation, in which case I think finding a new ISP would be wisest. Others report that the ISP will only delegate one /64 prefix, so your choice here is either to run only one subnet (likely sufficient for a lot of home networks), or to subnet at greater than /64 and forbid SLAAC, which is definitely not a recommended configuration. However, provided your ISP is reasonable, this blog post should at least help get you started.

September 17, 2020 10:23 PM

September 07, 2020

Paul E. Mc Kenney: The Old Man and His Smartphone, 2020 “See You in September” Episode

The continuing COVID-19 situation renders my smartphone's location services less than useful, though a number of applications still beg me to enable them, preferring to know my present location rather than consider my past habits. One in particular does have a “Don't ask me again” link, but it asks each time anyway. Given that I have only ever used one of that business's locations, you would think that it would not be all that hard to figure out which location I was going to be using next. But perhaps I am the only one who habitually disables location services.

Using the smartphone for breakfast-time Internet browsing has avoided almost all flat-battery incidents. One recent exception occurred while preparing for a hike. But I still have my old digital camera, so I plugged the smartphone into its charger and took my digital camera instead. I have previously commented on the excellent quality of my smartphone's cameras, but there is nothing quite like going back to the old digital camera (never mind my long-departed 35mm SLR) to drive that lesson firmly home.

I was recently asked to text a photo, and saw no obvious way to do this. There was some urgency, so I asked for an email address and emailed the photo instead. This did get the job done, but let's just say that it appears that asking for an email address is no longer a sign of youth, vigor, or with-it-ness. Thus chastened, I experimented in a calmer time, learning that the trick is to touch the greater-than icon to the left of the text-message-entry bar, which produces an option to select from your gallery and also to include a newly taken picture.

The appearance of Comet Neowise showcased my smartphone's ability to orient itself and to display the relevant star charts. Nevertheless, my wife expressed confidence in this approach only after seeing the large number of cars parked in the same area that my smartphone and I had selected. I hadn't intended to take a photo of the comet because the professionals do a much better job, especially those who are willing to travel far away from city lights and low altitudes. But here was my smartphone and there was the comet, so why not? The resulting photo was quite unsatisfactory, with so much pixelated noise that the comet was just barely discernible.

It was some days later that I found the smartphone's night mode. This is quite impressive. In this mode, the smartphone can form low-light images almost as well as my eyes can, which is saying something. It is also extremely good with point sources of light.

One recent trend in clothing is pockets for smartphones. This trend prompted my stepfather to suggest that the smartphone is the pocket watch of the 21st century. This might well be, but I still wear a wristwatch.

My refusal to use my smartphone's location services does not mean that location services cannot get me in trouble. Far from it! One memorable incident took place on BPA Road in Forest Park. A group of hikers asked me to verify their smartphone's chosen route, which would have taken them past the end of Firelane 13 and eventually down a small cliff. I advised them to choose a different route.

But I had seen the little line that their smartphone had drawn, and a week or so later found myself unable to resist checking it out. Sure enough, when I peered through the shrubbery marking the end of Firelane 13, I saw an unassuming but very distinct trail. Of course I followed it. Isn't that what trails are for? Besides, maybe someone had found a way around the cliff I knew to be at the bottom of that route.

To make a long story short, no one had found a way around that cliff. Instead, the trail went straight down it. For all but about eight feet of the trail, it was possible to work my way down via convenient handholds in the form of ferns, bushes, and trees. My plan for that eight feet was to let gravity do the work, and to regain control through use of a sapling at the bottom of that stretch of the so-called trail. Fortunately for me, that sapling was looking out for this old man, but unfortunately this looking out took the form of ensuring that I had a subcutaneous hold on its bark. Thankfully, the remainder of the traverse down the cliff was reasonably uneventful.

Important safety tip: If you absolutely must use that trail, wear a pair of leather work gloves!

September 07, 2020 04:02 AM

September 05, 2020

Paul E. Mc Kenney: Stupid RCU Tricks: Enlisting the Aid of a Debugger

Using Debuggers With rcutorture



So rcutorture found a bug, you have figured out how to reproduce it, git bisect was unhelpful (perhaps because the bug has been around forever), and the bug happens to be one of those rare RCU bugs for which a debugger might be helpful. What can you do?

What I have traditionally done is to get partway through figuring out how to make gdb work with rcutorture, then suddenly realize what the bug's root cause must be. At this point, I of course abandon gdb in favor of fixing the bug. As a result, although I have tried to apply gdb to the Linux kernel many times over the past 20 years, I never have actually succeeded in doing so. Now, this is not to say that gdb is useless to Linux-kernel hackers. Far from it! For one thing, the act of trying to use gdb has inspired me to perceive the root cause of a great many bugs, which means that it has served as a great productivity aid. For another thing, I frequently extract Linux-kernel code into a usermode scaffolding and use gdb in that context. And finally, there really are a number of Linux-kernel hackers who make regular use of gdb.

One of these hackers is Omar Sandoval, who happened to mention that he had used gdb to track down a Linux-kernel bug. And without first extracting the code to userspace. I figured that it was time for this old dog to learn a new trick, so I asked Omar how he made this happen.

Omar pointed out that because rcutorture runs in guest OSes, gdb can take advantage of the debugging support provided by qemu. To make this work, you build a kernel with CONFIG_DEBUG_INFO=y (which supplies gdb with additional symbols), provide the nokaslr kernel boot parameter (which prevents kernel address-space randomization from invalidating these symbols), and supply qemu with the -s -S command-line arguments (which causes it to wait for gdb to connect instead of immediately booting the kernel). You then specify the vmlinux file's pathname as the sole command-line argument to gdb. Once you see the (gdb) prompt, the target remote :1234 command will connect to qemu and then the continue command will boot the kernel.
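Assembled into one place, the manual sequence looks something like this (the paths and the omitted qemu arguments are illustrative):

qemu-system-x86_64 -s -S ...   # qemu waits for gdb instead of booting
gdb /path/to/vmlinux           # built with CONFIG_DEBUG_INFO=y, booted with nokaslr
(gdb) target remote :1234      # connect to qemu's gdb stub
(gdb) continue                 # boot the kernel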

I tried this, and it worked like a charm.

Alternatively, you can now use the shiny new rcutorture --gdb command-line argument in the -rcu tree, which will automatically set up the kernel and qemu, and will print out the required gdb commands, including the path to the newly built vmlinux file.

And yes, I do owe Omar a --drgn command-line argument, which I will supply once he lets me know how to connect drgn to qemu. :-)

In the meantime, the following sections cover a couple of uses I have made of --gdb, mostly to get practice with this approach to Linux-kernel debugging.

Case study 1: locktorture

For example, let's use gdb to investigate a long-standing locktorture hang when running scenario LOCK05:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock \
    --duration 3 --configs LOCK05 --gdb

This will print out the following once the kernel is built and qemu has started:

Waiting for you to attach a debug session, for example:
    gdb /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2020.08.27-14.51.45/LOCK05/vmlinux
After symbols load and the "(gdb)" prompt appears:
    target remote :1234
    continue

Once you have started gdb and entered the two suggested commands, the kernel will start. You can track its console output by locating its console.log file as described in an earlier post. Or you can use the ps command to dump the qemu command line, looking for the -serial file: argument, which is followed by the pathname of the file receiving the console output.

Once the kernel is sufficiently hung, that is, more than 15 seconds elapses after the last statistics output line (Writes: Total: 27668769 Max/Min: 27403330/34661 Fail: 0), you can hit control-C at gdb. The usual info threads command will show the CPUs' states, here with the 64-bit hexadecimal addresses abbreviated:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 1 (CPU#0 [running]) stutter_wait (title=0xf... "lock_torture_writer")
    at kernel/torture.c:615
  2    Thread 2 (CPU#1 [running]) 0xf... in stutter_wait (
    title=0xf... "lock_torture_writer") at kernel/torture.c:615
  3    Thread 3 (CPU#2 [halted ]) default_idle () at arch/x86/kernel/process.c:689
  4    Thread 4 (CPU#3 [halted ]) default_idle () at arch/x86/kernel/process.c:689

It is odd that CPUs 0 and 1 are in stutter_wait(), spinning on the global variable stutter_pause_test. Even more odd is that the value of this variable is not zero, as it should be at the end of the test, but rather the value two. After all, all paths out of torture_stutter() should zero this variable.

But maybe torture_stutter() is still stuck in the loop prior to the zeroing of stutter_pause_test. A quick look at torture_stutter_init() shows us that the task_struct pointer to the task running torture_stutter() lives in stutter_task, which is non-NULL, meaning that this task still lives. One might hope to use sched_show_task(), but this sadly fails with Could not fetch register "fs_base"; remote failure reply 'E14'.

The value of stutter_task.state is zero, which indicates that this task is running. But on what CPU? CPUs 0 and 1 are both spinning in stutter_wait, and the other two CPUs are in the idle loop. So let's look at stutter_task.on_cpu, which is zero, as in not on a CPU. In addition, stutter_task.cpu has the value one, and CPU 1 is definitely running some other task.

It would be good to just be able to print the stack of the blocked task, but it is also worth just rerunning this test, but this time with the locktorture.stutter module parameter set to zero. This test completed successfully, in particular, with no hangs. Given that no other locktorture or rcutorture scenario suffers from similar hangs, perhaps the problem is in rt_mutex_lock() itself. To check this, let's restart the test, but with the default value of the locktorture.stutter module parameter. After letting it hang and interrupting it with control-C, even though it still feels strange to control-C a kernel:

(gdb)  print torture_rtmutex
$1 = {wait_lock = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000', pending = 0 '\000'}, {
          locked_pending = 0, tail = 0}}}}, waiters = {rb_root = {rb_node = 0xffffc9000025be50}, 
    rb_leftmost = 0xffffc90000263e50}, owner = 0x1 <fixed_percpu_data+1>}

The owner = 0x1 looks quite strange for a task_struct pointer, but the block comment preceding rt_mutex_set_owner() says that this value is legitimate, and represents one of two transitional states. So maybe it is time for CONFIG_DEBUG_RT_MUTEXES=y, but enabling this Kconfig option produces little additional enlightenment.

However, the torture_rtmutex.waiters field indicates that there really is something waiting on the lock. Of course, it might be that we just happened to catch the lock at this point in time. To check on this, let's add a variable to capture the time of the last lock release. I empirically determined that it is necessary to use WRITE_ONCE() to update this variable in order to prevent the compiler from optimizing it out of existence. Learn from my mistakes!
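The instrumentation was a sketch along these lines (the variable name is hypothetical):

static unsigned long last_release_jiffies;  /* Time of last lock release. */

/* At each lock release; WRITE_ONCE() keeps the compiler from
 * optimizing away this otherwise-unread store. */
WRITE_ONCE(last_release_jiffies, jiffies);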

With the addition of WRITE_ONCE(), the next run showed that the last lock operation was more than three minutes in the past and that the transitional lock state still persisted, which provides strong evidence that this is the result of a race condition in the locking primitive itself. Except that a quick scan of the code didn't immediately identify a race condition. Furthermore, the failure happens even with CONFIG_DEBUG_RT_MUTEXES=y, which disables the lockless fastpaths (or the obvious lockless fastpaths, anyway).

Perhaps this is instead a lost wakeup? This would be fortuitous given that there are rare lost-IPI issues, and having this reproduce so easily on my laptop would be extremely convenient. And adding a bit of debug code to mark_wakeup_next_waiter() and lock_torture_writer() show that there is a task that was awakened, but that never exited from rt_mutex_lock(). And this task is runnable, that is, its ->state value is zero. But it is clearly not running very far! And further instrumentation demonstrates that control is not reaching the __smp_call_single_queue() call from __ttwu_queue_wakelist(). The chase is on!

Except that the problem ended up being in stutter_wait(). As the name suggests, this function controls stuttering, that is, periodically switching between full load and zero load. Such stuttering can expose bugs that a pure full-load stress test would miss.

The stutter_wait() function uses adaptive waiting, so that schedule_timeout_interruptible() is used early in each no-load interval, but a tight loop containing cond_resched() is used near the end of the interval. The point of this is to more tightly synchronize the transition from no-load to full load. But the LOCK05 scenario's kernel is built with CONFIG_PREEMPT=y, which causes cond_resched() to be a no-op. In addition, the kthreads doing the write locking lower their priority using set_user_nice(current, MAX_NICE), which appears to be preventing preemption. (We can argue that even MAX_NICE should not indefinitely prevent preemption, but the multi-minute waits that have been observed are for all intents and purposes indefinite.)

The fix (or workaround, as the case might be) is for stutter_wait() to block periodically, thus allowing other tasks to run.
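A sketch of that approach (details differ from the actual fix):

int i = 0;

while (stutter_pause_test) {
    if (!(++i % 100))
        schedule_timeout_interruptible(1);  /* Really block, so that even
                                             * MAX_NICE tasks cannot starve
                                             * other runnable tasks. */
    else
        cond_resched();  /* A no-op in CONFIG_PREEMPT=y kernels. */
}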

Case study 2: RCU Tasks Trace

I designed RCU Tasks Trace for the same grace-period latency that I had designed RCU Tasks for, namely roughly one second. Unfortunately, this proved to be about 40x too slow, so adjustments were called for.

After those reporting the issue kindly verified for me that this was not a case of too-long readers, I used --gdb to check statistics and state. I used rcuscale, which is a member of the rcutorture family designed to measure performance and scalability of the various RCU flavors' grace periods:

tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus \
    --configs TRACE01 --bootargs "rcuscale.nreaders=0 rcuscale.nwriters=10" \
    --trust-make --gdb

Once the (gdb) prompt appears, we connect to qemu, set a break point, and then continue execution:

(gdb) target remote :1234
Remote debugging using :1234
0x000000000000fff0 in exception_stacks ()
(gdb) b rcu_scale_cleanup
Breakpoint 1 at 0xffffffff810d27a0: file kernel/rcu/rcuscale.c, line 505.
(gdb) cont
Continuing.
Remote connection closed
(gdb)

Unfortunately, as shown above, this gets us Remote connection closed instead of a breakpoint. Apparently, the Linux kernel does not take kindly to debug exception instructions being inserted into its code. Fortunately, gdb also supplies a hardware breakpoint command:

(gdb) target remote :1234
Remote debugging using :1234
0x000000000000fff0 in exception_stacks ()
(gdb) hbreak rcu_scale_cleanup
Hardware assisted breakpoint 1 at 0xffffffff810d27a0: file kernel/rcu/rcuscale.c, line 505.
(gdb) cont
Continuing.
[Switching to Thread 12]

Thread 12 hit Breakpoint 1, rcu_scale_cleanup () at kernel/rcu/rcuscale.c:505
505     {

This works much better, and the various data structures may now be inspected to check the validity of various optimization approaches. Of course, as the optimization effort continued, hand-typing gdb commands became onerous, and was therefore replaced with crude but automatic accumulation and display of relevant statistics.

Of course, Murphy being who he is, the eventual grace-period speedup also caused a few heretofore latent race conditions to be triggered by a few tens of hours of rcutorture. These race conditions resulted in rcu_torture_writer() stalls, along with the occasional full-fledged RCU-Tasks-Trace CPU stall warning.

Now, rcutorture does dump out RCU grace-period kthread state when these events occur, but in the case of the rcu_torture_writer() stalls, this state is for vanilla RCU rather than the flavor of RCU under test. Which is an rcutorture bug that will be fixed. But in the meantime, gdb provides a quick workaround by setting a hardware breakpoint on the ftrace_dump() function, which is called when either of these sorts of stalls occur. When the breakpoint triggers, it is easy to manually dump the data pertaining to the grace-period kthread of your choice.

For those who are curious, the race turned out to be an IPI arriving between a pair of stores in rcu_read_unlock_trace() that could leave the corresponding task forever blocking the current RCU Tasks Trace grace period. The solution, as with vanilla RCU in the v3.0 timeframe, is to set the read-side nesting value to a negative number while clearing the .need_qs field indicating that a quiescent state is required. The buggy code is as follows:

if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
    // BUG: IPI here sets .need_qs after check!!!
    WRITE_ONCE(t->trc_reader_nesting, nesting);
    return;  // We assume shallow reader nesting.
}

Again, the fix is to set the nesting count to a large negative number, which allows the IPI handler to detect this race and refrain from updating the .need_qs field when the ->trc_reader_nesting field is negative, thus avoiding the grace-period hang:

WRITE_ONCE(t->trc_reader_nesting, INT_MIN); // FIX
if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
    WRITE_ONCE(t->trc_reader_nesting, nesting);
    return;  // We assume shallow reader nesting.
}

This experience of course suggests testing with grace period latencies tuned much more aggressively than they are in production, with an eye to finding additional low-probability race conditions.

Case study 3: x86 IPIs

Tracing the x86 IPI code path can be challenging because function pointers are heavily used. Unfortunately, some of these function pointers are initialized at runtime, so simply running gdb on the vmlinux binary does not suffice. However, we can again set a breakpoint somewhere in the run and check these pointers after initialization is complete:

tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 5 --gdb --configs "NOPREEMPT" --bootargs "scftorture.stat_interval=15 scftorture.verbose=1"

We can then set a hardware-assisted breakpoint as shown above at any convenient runtime function.

Once this breakpoint is encountered:

(gdb) print smp_ops
$2 = {smp_prepare_boot_cpu = 0xffffffff82a13833,
  smp_prepare_cpus = 0xffffffff82a135f9,
  smp_cpus_done = 0xffffffff82a13897,
  stop_other_cpus = 0xffffffff81042c40,
  crash_stop_other_cpus = 0xffffffff8104d360,
  smp_send_reschedule = 0xffffffff81047220,
  cpu_up = 0xffffffff81044140,
  cpu_disable = 0xffffffff81044aa0,
  cpu_die = 0xffffffff81044b20,
  play_dead = 0xffffffff81044b80,
  send_call_func_ipi = 0xffffffff81047280,
  send_call_func_single_ipi = 0xffffffff81047260}

This shows that smp_ops.send_call_func_single_ipi is native_send_call_func_single_ipi(), which helps to demystify arch_send_call_function_single_ipi(). Except that this native_send_call_func_single_ipi() function is just a wrapper around apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR).
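
For reference, and modulo version-to-version drift, that wrapper is tiny; something like the following lives in arch/x86/kernel/smp.c:

/* Paraphrased from arch/x86/kernel/smp.c; check your kernel version. */
static void native_send_call_func_single_ipi(int cpu)
{
	apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR);
}

So: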

(gdb) print *apic
$4 = {eoi_write = 0xffffffff8104b8c0,
  native_eoi_write = 0x0, write = 0xffffffff8104b8c0,
  read = 0xffffffff8104b8d0,
  wait_icr_idle = 0xffffffff81046440,
  safe_wait_icr_idle = 0xffffffff81046460,
  send_IPI = 0xffffffff810473c0,
  send_IPI_mask = 0xffffffff810473f0,
  send_IPI_mask_allbutself = 0xffffffff81047450,
  send_IPI_allbutself = 0xffffffff81047500,
  send_IPI_all = 0xffffffff81047510,
  send_IPI_self = 0xffffffff81047520, dest_logical = 0, disable_esr = 0,
  irq_delivery_mode = 0, irq_dest_mode = 0,
  calc_dest_apicid = 0xffffffff81046f90,
  icr_read = 0xffffffff810464f0,
  icr_write = 0xffffffff810464b0,
  probe = 0xffffffff8104bb00,
  acpi_madt_oem_check = 0xffffffff8104ba80,
  apic_id_valid = 0xffffffff81047010,
  apic_id_registered = 0xffffffff8104b9c0,
  check_apicid_used = 0x0,
  init_apic_ldr = 0xffffffff8104b9a0,
  ioapic_phys_id_map = 0x0, setup_apic_routing = 0x0,
  cpu_present_to_apicid = 0xffffffff81046f50,
  apicid_to_cpu_present = 0x0,
  check_phys_apicid_present = 0xffffffff81046ff0,
  phys_pkg_id = 0xffffffff8104b980,
  get_apic_id = 0xffffffff8104b960,
  set_apic_id = 0xffffffff8104b970, wakeup_secondary_cpu = 0x0,
  inquire_remote_apic = 0xffffffff8104b9b0,
  name = 0xffffffff821f0802 "physical flat"}

Thus, in this configuration the result is default_send_IPI_single_phys(cpu, CALL_FUNCTION_SINGLE_VECTOR). And this function invokes __default_send_IPI_dest_field() with interrupts disabled, which in turn, after some setup work, writes a command word that includes the desired IPI vector to offset 0x300 from the APIC base.
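
As a hedged sketch of that final step, modeled loosely on __default_send_IPI_dest_field(), assuming xAPIC mode, and with the wait-for-ICR-idle and field-encoding details omitted:

/*
 * Fragment sketch, not the verbatim kernel code: 'dest' is the target
 * CPU's APIC ID and 'vector' would be CALL_FUNCTION_SINGLE_VECTOR here.
 * The destination goes into ICR2 (offset 0x310 from the APIC base) and
 * the command word carrying the vector goes into ICR (the offset 0x300
 * mentioned above); writing ICR is what actually fires the IPI.
 */
native_apic_mem_write(APIC_ICR2, dest << 24);  /* xAPIC destination field */
native_apic_mem_write(APIC_ICR, APIC_DEST_PHYSICAL | vector);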

To be continued...

September 05, 2020 12:12 AM

September 03, 2020

James Bottomley: Lessons from the GNOME Patent Troll Incident

First, for all the lawyers who are eager to see the Settlement Agreement, here it is. The reason I can do this is that I’ve released software under an OSI approved licence, so I’m covered by the Releases and thus entitled to a copy of the agreement under section 10, but I’m not a party to any of the Covenants so I’m not forbidden from disclosing it.

Analysis of the attack

The Rothschild Modus Operandi is to obtain a fairly bogus patent (in this case, patent 9,936,086), form a limited liability corporation (LLC) that only holds the one patent and then sue a load of companies with vaguely related businesses for infringement. A key element of the attack is to offer a settlement licensing the patent for a sum less than it would cost even to mount an initial defence (usually around US$50k), which is how the Troll makes money: since the cost to file is fairly low, as long as there’s no court appearance, the amount gained is close to US$50k if the target accepts the settlement offer and, since most targets know how much any defence of the patent would cost, they do.

One of the problems for the target is that once the patent is issued by the USPTO, the court must presume it is valid, so any defence that impugns the validity of the patent can’t be decided at summary judgment. In the GNOME case, the sued project, shotwell, predated the filing of the patent by several years, so it should be obvious that even if shotwell did infringe the patent, it would have been prior art which should have prevented the issuing of the patent in the first place. Unfortunately such an obvious problem can’t be used to get the case tossed on summary judgment because it impugns the validity of the patent. Put simply, once the USPTO issues a patent it’s pretty much impossible to defend against accusations of infringement without an expensive trial, which makes the settlement for small sums look very tempting.

If the target puts up any sort of fight, Rothschild, knowing the lack of merits to the case, will usually reduce the amount offered for settlement or, in extreme cases, simply drop the lawsuit. The last line of defence is the LLC. If the target finds some way to win damages (as ADS did in 2017), the only thing on the hook is the LLC, with the limited liability shielding Rothschild personally.

How it Played out Against GNOME

This description is somewhat brief, for a more in-depth description see the Medium article by Amanda Brock and Matt Berkowitz.

Rothschild performed the initial attack under the LLC RPI (Rothschild Patent Imaging). GNOME was fortunate enough to receive an offer of Pro Bono representation from Shearman and Sterling and immediately launched a defence fund (expecting that the cost of at least getting into court would be around US$200k, even with pro bono representation). One of its first actions, besides defending the claim, was to launch a counterclaim against RPI alleging exceptional practices in bringing the claim. This serves two purposes: firstly, RPI can’t now simply decide to drop the lawsuit, because the counterclaim survives, and secondly, by alleging potential misconduct it seeks to pierce the LLC liability shield. GNOME also decided to try to obtain as much as it could for the whole of open source in the settlement.

As it became clear to Rothschild that GNOME wouldn’t just pay up and that they would create a potential liability problem in court, the offers of settlement came thick and fast, culminating in an offer of a free licence with each side paying its own costs. However GNOME persisted with the counterclaim and insisted it would settle for nothing less than the elimination of the Rothschild patent threat from all of open source. The ultimate agreement reached, as you can read, does just that: it gives a perpetual covenant not to sue any project under an OSI approved open source licence for any patent naming Leigh Rothschild as the inventor (i.e. the settlement terms go far beyond the initial patent claim and effectively free all of open source from any future litigation by Rothschild).

Analysis of the Agreement

Although the agreement achieves its aim of ridding all of Open Source of the Rothschild menace, it also contains several clauses which are suboptimal, but which had to be included to get a speedy resolution. In particular, Clause 10 forbids the GNOME foundation or its affiliates from publishing the agreement, which has caused much angst in open source circles about how watertight the agreement actually was. Secondly, Clause 11 prohibits GNOME or its affiliates from pursuing any further invalidity challenges to any Rothschild patents, leaving Rothschild free to pursue any non open source targets.

Fortunately the effect of clause 10 is now mitigated by me publishing the agreement and the effect of clause 11 by the fact that the Open Invention Network is now pursuing IPR invalidity actions against the Rothschild patents.

Lessons for the Future

The big lesson is that Troll based attacks are a growing threat to the Open Source movement. Even though the Rothschild source of such attacks may have been neutralized, others may be tempted to follow his MO, so all open source projects have to be prepared for a troll attack.

The first lesson should necessarily be that if you’re on the receiving end of a Troll attack, tell everyone. As an open source organization, you’re not going to be able to settle, and you won’t get either pro bono representation or the funds to fight the action unless people know about it.

The second lesson is that the community will rally, especially with financial aid, if you put out a call for help (and remember, you may be looking at legal bills in the six figure range).

The third lesson is to always file a counterclaim to give you significant leverage over the Troll in settlement negotiations.

And the fourth lesson is to refuse to settle for anything less than neutralization of the threat to the entirety of open source.

Conclusion

While the lessons above should work if another Rothschild like Troll comes along, it’s by no means guaranteed, and the fact that Open Source projects don’t have the funding to defend themselves (even if they could raise it from the community) makes them look vulnerable. One thing the entire community could do to mitigate this problem is set up a community defence fund. We did this once before 16 years ago when SCO was threatening to sue Linux users and we could do it again. Knowing there was a deep pot to draw on would certainly make any Rothschild like Troll think twice about the vulnerability of an Open Source project, and may even deter the usual NPE type troll with more resources and better crafted patents.

Finally, it should be noted that this episode demonstrates how broken the patent system still is. The key element Rothschild like trolls require is the presumption of validity of a granted patent. In theory, in the light of the Alice decision, the USPTO should never have granted the patent, but it did, and once that happened the troll’s targets have no option other than to pay up the smaller sum requested or expend a larger sum on fighting in court. Perhaps if the USPTO can’t stop the issuing of bogus patents it’s time to remove the presumption of their validity in court … or at least provide some sort of prima facie invalidity test to apply at summary judgment (like the project being older than the patent, perhaps).

September 03, 2020 04:53 PM

September 02, 2020

Kees Cook: security things in Linux v5.6

Previously: v5.5.

Linux v5.6 was released back in March. Here’s my quick summary of various features that caught my attention:

WireGuard
The widely used WireGuard VPN has been out-of-tree for a very long time. After 3 1/2 years since its initial upstream RFC, Ard Biesheuvel and Jason Donenfeld finished the work getting all the crypto prerequisites sorted out for the v5.5 kernel. For this release, Jason has gotten WireGuard itself landed. It was a twisty road, and I’m grateful to everyone involved for sticking it out and navigating the compromises and alternative solutions.

openat2() syscall and RESOLVE_* flags
Aleksa Sarai has added a number of important path resolution “scoping” options to the kernel’s open() handling, covering things like not walking above a specific point in a path hierarchy (RESOLVE_BENEATH), disabling the resolution of various “magic links” (RESOLVE_NO_MAGICLINKS) in procfs (e.g. /proc/$pid/exe) and other pseudo-filesystems, and treating a given lookup as happening relative to a different root directory (as if it were in a chroot, RESOLVE_IN_ROOT). As part of this, it became clear that there wasn’t a way to correctly extend the existing openat() syscall, so he added openat2() (which is a good example of the efforts being made to codify “Extensible Syscall” arguments). The RESOLVE_* set of flags also covers prior behaviors like RESOLVE_NO_XDEV and RESOLVE_NO_SYMLINKS.
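
For a concrete feel, here is a hedged userspace sketch of the new syscall; glibc had no wrapper at the time, so the raw syscall is used, the header and flag names require Linux 5.6+ UAPI headers, and the path is invented for illustration:

#include <fcntl.h>
#include <linux/openat2.h>	/* struct open_how, RESOLVE_* flags */
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct open_how how = {
		.flags = O_RDONLY,
		.resolve = RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS,
	};
	/* Fails (EXDEV/ELOOP) if "conf/app.conf" tries to escape the
	   current directory or traverses a symlink. */
	int fd = syscall(SYS_openat2, AT_FDCWD, "conf/app.conf",
			 &how, sizeof(how));

	return fd < 0;
}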

pidfd_getfd() syscall
In the continuing growth of the much-needed pidfd APIs, Sargun Dhillon has added the pidfd_getfd() syscall, which is a way to gain access to file descriptors of a process in a race-less way (or when /proc is not mounted). Before, it wasn’t always possible to make sure that opening file descriptors via /proc/$pid/fd/$N was actually going to be associated with the correct PID. Much more detail about this has been written up at LWN.
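
A hedged sketch of that flow (raw syscalls again, since glibc wrappers came later; the PID and fd number are invented, and the caller needs ptrace permission over the target):

#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	int pidfd = syscall(SYS_pidfd_open, 1234, 0);	/* hypothetical PID */
	/* Obtain a duplicate of the target process's fd 3. */
	int fd = syscall(SYS_pidfd_getfd, pidfd, 3, 0);

	return fd < 0;
}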

openat() via io_uring
With my “attack surface reduction” hat on, I remain personally suspicious of the io_uring() family of APIs, but I can’t deny their utility for certain kinds of workloads. Being able to pipeline reads and writes without the overhead of actually making syscalls is pretty great for performance. Jens Axboe has added the IORING_OP_OPENAT command so that existing io_urings can open files to be added on the fly to the mapping of available read/write targets of a given io_uring. While LSMs are still happily able to intercept these actions, I remain wary of the growing “syscall multiplexer” that io_uring is becoming. I am, of course, glad to see that it has a comprehensive (if “out of tree”) test suite as part of liburing.
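
As a hedged sketch of what the new command looks like from userspace via liburing (file name invented; error handling omitted):

#include <fcntl.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	io_uring_queue_init(8, &ring, 0);
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_openat(sqe, AT_FDCWD, "data.txt", O_RDONLY, 0);
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);	/* cqe->res is the fd or -errno */
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}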

removal of blocking random pool
After making algorithmic changes to obviate separate entropy pools for random numbers, Andy Lutomirski removed the blocking random pool. This simplifies the kernel pRNG code significantly without compromising the userspace interfaces designed to fetch “cryptographically secure” random numbers. To quote Andy, “This series should not break any existing programs. /dev/urandom is unchanged. /dev/random will still block just after booting, but it will block less than it used to.” See LWN for more details on the history and discussion of the series.

arm64 support for on-chip RNG
Mark Brown added support for the future ARMv8.5’s RNG (SYS_RNDR_EL0), which is, from the kernel’s perspective, similar to x86’s RDRAND instruction. This will provide a bootloader-independent way to add entropy to the kernel’s pRNG for early boot randomness (e.g. stack canary values, memory ASLR offsets, etc). Until folks are running on ARMv8.5 systems, they can continue to depend on the bootloader for randomness (via the UEFI RNG interface) on arm64.

arm64 E0PD
Mark Brown added support for the future ARMv8.5’s E0PD feature (TCR_E0PD1), which causes all memory accesses from userspace into kernel space to fault in constant time. This is an attempt to remove any possible timing side-channel signals when probing kernel memory layout from userspace, as an alternative way to protect against Meltdown-style attacks. The expectation is that E0PD would be used instead of the more expensive Kernel Page Table Isolation (KPTI) features on arm64.

powerpc32 VMAP_STACK
Christophe Leroy added VMAP_STACK support to powerpc32, joining x86, arm64, and s390. This helps protect against the various classes of attacks that depend on exhausting the kernel stack in order to collide with neighboring kernel stacks. (Another common target, the sensitive thread_info, had already been moved away from the bottom of the stack by Christophe Leroy in Linux v5.1.)

generic Page Table dumping
Related to RISCV’s work to add page table dumping (via /sys/kernel/debug/kernel_page_tables), Steven Price extracted the existing implementations from multiple architectures and created a common page table dumping framework (and then refactored all the other architectures to use it). I’m delighted to have this because I still remember when not having a working page table dumper for ARM delayed me for a while when trying to implement upstream kernel memory protections there. Anything that makes it easier for architectures to get their kernel memory protection working correctly makes me happy.

That’s it for now; let me know if there’s anything you think I missed. Next up: Linux v5.7.

© 2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

September 02, 2020 11:22 PM

August 21, 2020

Linux Plumbers Conference: Watch the LPC 2020 Plenary Session

Welcome to LPC 2020! This year we have a record number of attendees,
around 950. We hope you’ll find the conference as engaging and
productive as the ones we had in person for the past 12 years.

Please watch this LPC 2020 welcome message from our Committee Chair, Laura
Abbott, in lieu of our usual plenary session, which also contains useful
information about how this year’s conference will take place.

Enjoy!!

August 21, 2020 09:56 PM

Linux Plumbers Conference: LPC 2020 Is Sold Out

LPC 2020 is sold out. No more tickets are available. We have reached the maximum capacity for our server infrastructure.

Please be considerate, there is no need to contact us asking for tickets, as we are very busy finalizing all the details of the virtual conference.

If you do not have a ticket, you will be able to watch live starting Monday!
Please follow the links here.

August 21, 2020 04:33 PM

August 20, 2020

Pete Zaitcev: Memoir

I fancied writing a memoir, put 11 short posts or chapters at Meenuvia.

August 20, 2020 10:43 PM

Dave Airlie (blogspot): Vallium: a *software* swrast vulkan layer FAQ

I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.

tl;dr DO NOT USE VALLIUM OVER A GALLIUM HW DRIVER.

What is vallium?

The vallium layer is a gallium frontend. It takes the Vulkan API and roughly translates it into the gallium API.

How does it do that?

Vulkan is a low-level API: it allows the user to allocate memory, create resources, and record command buffers, amongst other things. When a hw vulkan driver is recording a command buffer, it is putting hw specific commands into it that will be run directly on the GPU. These command buffers are submitted to queues when the app wants to execute them.

Gallium is a context level API, i.e. like OpenGL/D3D10. The user has to create resources and contexts and the driver internally manages command buffers etc. The driver controls internal flushing and queuing of command buffers.
 
In order to bridge the gap, the vallium layer abstracts the gallium context into a separate thread of execution. When recording a vulkan command buffer it creates a CPU side command buffer containing an encoding of the Vulkan API. It passes that recorded CPU command buffer to the thread on queue submission. The thread then creates a gallium context, and replays the whole CPU recorded command buffer into the context, one command at a time.

That sounds horrible, isn't it slow?

Yes.

Why doesn't that matter for *software* drivers?

Software rasterizers are a very different proposition from an overhead point of view than real hardware. CPU rasterization is pretty heavy on the CPU load, so nearly always 90% of your CPU time will be in the rasterizer and fragment shader. Having some minor CPU overheads around command submission and queuing isn't going to matter in the overall profile of the user application. CPU rasterization is already slow, the Vulkan->gallium translation overhead isn't going to be the reason for making it much slower.

For real HW drivers which are meant to record their own command buffers in the GPU domain and submit them direct to the hw, adding in a CPU layer that just copies the command buffer data is a massive overhead and one that can't easily be removed from the vallium layer.

The vallium execution context is also pretty horrible: it has to connect all the state pieces like shaders etc. to the gallium context, and disconnect them all at the end of each command buffer. There is only one command submission queue and one context to be used. A lot of hardware exposes more queues etc. that this will never model.

I still don't want to write a vulkan driver, give me more reasons.

Pipeline barriers:

Pipeline barriers in Vulkan are essential to efficient driver hw usage. They are among the most difficult to understand and hardest to get right pieces of writing a vulkan driver. For a software rasterizer they are also mostly unneeded. When I get a barrier I just completely hardflush the gallium context because I know the sw driver behind it. For a real hardware driver this would be a horrible solution. You spend a lot of time trying to make anything optimal here.

Memory allocation:

Vulkan is built around the idea of separate memory allocation and objects binding to those allocations. Gallium is built around object allocation with the memory allocs happening implicitly. I've added some simple memory allocation objects to the gallium API for swrast. These APIs are in no way useful for hw drivers. There is no way to expose memory types or heaps from gallium usefully. The current memory allocation API works for software drivers because I know all they want is an aligned_malloc. There is no decent way to bridge this gap without writing a new gallium API that looks like Vulkan. (in which case just write a vulkan driver already).

Can this make my non-Vulkan capable hw run Vulkan?

No. If the hardware can't do virtual memory properly or expose the features vulkan requires, this can't be fixed with a software layer that just introduces overhead.


August 20, 2020 08:08 PM

Linux Plumbers Conference: How to Join Virtual LPC 2020

Only 4 days to the beginning of LPC 2020!

A reminder about how to attend our virtual edition of the Linux Plumbers Conference.

If you are registered, you can participate by joining the Meeting Rooms on our Big Blue Button instance, starting Monday August 24th. You will find a front end showing the schedule for the current day with all the active sessions you can join. If you are having issues, please consult the LPC 2020 Participant Guide.

If you are not registered, you can still watch LPC live streams on YouTube. For how to do this, please refer to this page on our website.

August 20, 2020 06:33 PM

August 19, 2020

Linux Plumbers Conference: LPC 2020 Schedule Finalized, CfP closed

We are very pleased to announce that our final schedule is public!

Please take a look at all the great technical content at this year’s virtual LPC.
You can view the schedule by main blocks, or by track, or as a complete detailed view.

At this time we are closing the CfPs for all tracks. We still have room for a limited number of Birds of a Feather sessions. If you want to propose one, even during the conference, and the necessary participants are all registered, please send an email to our lpc-contact@lists.linuxplumbersconf.org mailing list.

A heartfelt Thank You to all our Speakers and Microconference Leaders, you all have done an incredible job in making this year’s conference successful, in spite of all the challenges that a remote and distributed conference entails.

See you virtually next week!

August 19, 2020 11:12 PM

August 17, 2020

Linux Plumbers Conference: Conference Is Sold Out, Watch Live Instead

Hello there, thank you all for the very strong interest in LPC 2020!

We have sold out the last ticket today. We have a lot of attendees and speakers this year for our virtual conference! Almost 1000 registrations!

Do not despair though, because there will be the opportunity to watch the conference live streaming on YouTube. We are still sorting out our channels, but keep an eye on our blog and social media in the next couple of days, where we’ll announce how to watch live.

Thanks, and see you next week!

August 17, 2020 09:38 PM

Linux Plumbers Conference: LPC 2020 T-Shirts and Other Items Are Available

We have received several requests for T-shirts this year.

We have always produced T-shirts for attendees since the early days of LPC, and we don’t want to miss the opportunity to offer them this year too.

Since we are all remote, we have set up a LPC Gift Shop where you can order T-shirts in your favorite colors and sizes. We also have made the designs available if you want to print your own.

Enjoy!

August 17, 2020 07:29 PM

August 14, 2020

Linux Plumbers Conference: Final passes for sale for Linux Plumbers

We hit our registration cap again and have added a few more passes. The final date for purchasing passes is August 19th at 11:59pm PST. If the passes sell out before then we will not be adding more. Thank you all once again for your enthusiasm and we look forward to seeing you August 24-28!

August 14, 2020 04:21 PM

August 13, 2020

Michael Kerrisk (manpages): man-pages-5.08 is released

I've released man-pages-5.08. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from more than 30 contributors. The release includes more than 190 commits that change around 340 pages.

The most notable of the changes in man-pages-5.08 are the following:

August 13, 2020 11:57 AM

Daniel Vetter: Lockdep False Positives, some stories about

“Lockdep is giving false positives” is the new “the compiler is broken”.

— David Airlie (@DaveAirlie) August 8, 2020

Recently we’ve looked a bit at lockdep annotations in the GPU subsystems, and I figured it’s a good opportunity to explain how this all works, and what the tradeoffs are. Creating working locking hierarchies for the kernel isn’t easy, making sure the kernel’s locking validator lockdep is happy and reviewers don’t have their brains explode even more so.

First things first, and the fundamental issue:

Lockdep is about trading false positives against better testing.

The only way to avoid false positives for deadlocks is to only report a deadlock when the kernel actually deadlocked. Which is useless, since the entire point of lockdep is to catch potential deadlock issues before they actually happen. Hence false positives are not avoidable, at least not in theory, if potential issues are to be reported before they hang the machine. Read on for what to do in practice.

We need to understand how exactly lockdep trades false positives for better discovery of locking inconsistencies. Lockdep makes a few assumptions about how real code does locking in practice:

Invariance of locking rules over time

First assumption baked into lockdep is that the locking rules for a given lock do not change over the lifetime of the lock’s existence. This already throws out a large chunk of perfectly correct locking designs, since state transitions can control how an object is accessed, and therefore how the lock is used. Examples include different rules for creation and destruction, or whether an object is on a specific list (e.g. only a gpu buffer object that’s in the lru can be evicted). It’s not possible to prove automatically that certain code flat out won’t ever run together with some other code on the same structure, at least not in generality. Hence this is pretty much a required assumption to make lockdep useful - if every new lock() call could follow new rules there’s nothing to check. Besides realizing that an actual deadlock indeed occurred and all is lost already.

And of course getting such state transitions correct, with the guarantee that all the old code will no longer run, is tricky to get right, and very hard on reviewers. It’s a good thing lockdep has problems with such code too.

Common locking rules for the same objects

Second assumption is that all locks initialized by the same code are following the same locking rules. This is achieved by making all lock initializers C macros, which create the corresponding lockdep class as a static variable within the calling function. Again this is pretty much required, since to spot inconsistencies you need as many observations as possible of all the different code paths, and those observations are best shared across all objects initialized by the same code. Also a distinct lockdep class for each individual object would explode the runtime overhead in both memory and cpu cycles.

And again this is good from a code design point too, since having the same data structure and code follow different locking rules for different objects is at best very confusing for reviewers.

Fighting lockdep, badly

Now when things go wrong and you have a lockdep splat on your hands, you conclude it’s a false positive and go ahead trying to teach lockdep about what’s going on. The first class of annotations are the special lock_nested(lock, subclass) functions. Without lockdep nothing in the generated code changes, but they tell lockdep that for this lock acquisition, we’re using a different class to track the observed locking.

This breaks both the time invariance - nothing is stopping you from using different classes for the same lock at different times - and commonality of locking for the same objects. Worse, you can write code which obviously deadlocks, but lockdep will think everything is perfectly fine:

mutex_init(&A);

mutex_lock(&A);
mutex_lock_nested(&A, SINGLE_DEPTH_NESTING);
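/* Deadlocks at runtime: &A is already held.  But because the _nested()
   acquisition is tracked under a separate class, lockdep stays silent. */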

This is no good and puts a huge burden on reviewers to carefully check all these places themselves, manually. Exactly the kind of tedious and error prone work lockdep was meant to take over.

Slightly better are the annotations which adjust the lockdep class once, when the object is initialized, using lockdep_set_class() and related functions. This at least does not break time invariance, and hence guarantees that lockdep spots the deadlock at the latest when it actually happens. It still reduces how much of what’s going on lockdep can connect, but occasionally “rewrite the entire subsystem” to resolve a locking inconsistency is just not a reasonable option.
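
A minimal sketch of this kind of annotation, with the device type, field, and key name all invented for illustration:

/* Sketch only: one distinct lockdep class for this kind of object. */
static struct lock_class_key mydev_lock_key;

static void mydev_lock_init(struct mydev *dev)
{
	spin_lock_init(&dev->lock);
	/* From now on this lock is tracked under its own class. */
	lockdep_set_class(&dev->lock, &mydev_lock_key);
}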

It still means that reviewers always need to remember what the locking rules for all types of different objects behind the same structure are, instead of just one. And then check against every path whether that code needs to work with all of them, or just some, or only one. Again tedious work that really lockdep is supposed to help with. If it’s hard to come by a system where you can easily run the code for the different types of object without rebooting, then lockdep cannot help at all.

All these annotations have in common that they don’t change the code logic, only how lockdep interprets what’s going on.

An even more insidious trick on reviewers and lockdep is to push locking into an asynchronous worker of some sorts. This hides issues because lockdep does not follow dependencies between threads through waiter/wakee relationships like wait_for_completion() and complete(), or through wait queues. There are lockdep annotations for specific dependencies, like in the kernel’s workqueue code when flushing workers or specific work items with flush_work(). Automatic annotations have been attempted with the lockdep cross-release extension, which for various reasons had to be backed out again. Therefore hand-rolled asynchronous code is a great place to create complexity and hide locking issues from both lockdep and reviewers.

Playing to lockdep’s strength

Except when there’s very strong justification for all the complexity, the real fix is to change the locking and make it simpler. Simple enough for lockdep to understand what’s going on, which also makes reviewers’ lives a lot better. Often this means substantial code rework, but at least in some cases there are useful tricks.

A special kind of annotations are the lock_nest_lock(lock, superlock) family of functions - these tell lockdep that when multiple locks of the same class are acquired, it’s all serialized by the single superlock. Lockdep then validates that the right superlock is indeed held. A great example is mm_take_all_locks(), which as the name implies, takes all locks related to the given mm_struct. In a sense this is not a pure annotation, unlike the ones above, since it requires that the superlock is actually locked. For reviewers too, not just for lockdep, that’s generally an easier scheme to understand than some clever ordering of lock acquisitions.
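
A hedged fragment sketch of the pattern, with the object type, list, and lock names invented (compare mm_take_all_locks() for the real thing):

/* Every obj->lock shares one lockdep class; taking them in bulk is
 * safe because 'super' serializes all such bulk acquisitions. */
mutex_lock(&super);			/* the superlock */
list_for_each_entry(obj, &all_objs, node)
	/* Lockdep verifies that 'super' is actually held here. */
	mutex_lock_nest_lock(&obj->lock, &super);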

A different situation often arises when creating or destroying an object. But at that stage often no other thread has a reference to the object, and therefore none can take the lock, so the best way to resolve locking inconsistency over the lifetime of an object due to creation and destruction code is to not take any locks at all in these paths. There is nothing to protect against, after all!

In all these cases the best option for long term maintainability is to simplify the locking design, not reduce lockdep’s power by reducing the amount of false positives it reports. And that should be the general principle.

tldr; do not fix lockdep false positives, fix your locking

August 13, 2020 12:00 AM

August 10, 2020

Linux Plumbers Conference: Linux Plumbers Releasing More Passes

After a careful review we have decided to release more passes. We are thrilled with the interest for this first ever online Linux Plumbers. The highlight of Linux Plumbers is the microconferences which are heavily focused on discussion and problem solving. To give the best experience for discussion, we have chosen to use an open source virtual platform that offers video for all participants. The platform recommends not having more than a certain number of people in each room at a time, hence putting a cap on registration to avoid hitting that limit. We do have solutions that will hopefully allow as many people as possible to experience Plumbers. We appreciate your patience and enthusiasm.

August 10, 2020 06:41 PM

August 08, 2020

Linux Plumbers Conference: Linux Plumbers currently sold out

Linux Plumbers is currently sold out of regular registration tickets. Although the conference is virtual this year our virtual platform cannot support an unlimited number of attendees, hence the cap on registration. We are currently reviewing our capacity limits to see if we can allow more people to attend without over burdening the virtual platform and potentially preventing discussion. We will make another announcement next week regarding registration.

August 08, 2020 05:09 PM

August 07, 2020

Linux Plumbers Conference: Toolchain Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Toolchain Microconference has been accepted into the 2020 Linux Plumbers Conference!

The GNU toolchain has direct impact on the development of the Linux kernel and it is imperative that the developers of both ecosystems have an understanding of each other’s needs. Linux Plumbers is the perfect venue for the two communities to interact, and the GNU Toolchain microconference’s purpose is to facilitate that happening.

Last year’s meetup at Linux Plumbers proved that it is critical that the two communities communicate with each other. As a result of last year’s microconference, the GNU toolchain has completed adding support for BPF in a more flexible and usable way, and system call wrappers in glibc were improved. There have been security features derived from the discussions, such as zeroing of registers when entering a function and implicit initialization of automatic variables.

This year’s topics to be discussed include:

Come and join us in the discussion about innovating the most efficient and functional toolchain for building the Linux kernel.

We hope to see you there!

August 07, 2020 07:22 PM

August 05, 2020

Linux Plumbers Conference: Application Ecosystem Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Application Ecosystem Microconference has been accepted into the 2020 Linux Plumbers Conference!

The Linux kernel is the foundation of Linux systems, but it is not much use without applications that run on top of it. The application experience relies on the kernel for performance, stability and responsiveness. Plumbers is the perfect venue to have the kernel and app ecosystems under one roof to discuss and learn together and make a better application experience on the Linux platform.

This year’s topics to be discussed include:

Come and join the discussion on making this the year of the Linux Desktop!

We hope to see you there!

August 05, 2020 02:14 PM

Linux Plumbers Conference: Power Management and Thermal Control Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Power Management and Thermal Control Microconference has been accepted into the 2020 Linux Plumbers Conference!

Power management and thermal control is an important area in the Linux ecosystem to help with the global environment. Optimizing the amount of work that is achieved while having long battery life and keeping the box from overheating is critical in today’s world. This meeting will focus on continuing to have Linux be an efficient operating system while still lowering the cost of running a data center.

Last year’s meetup at Linux Plumbers resulted in the introduction of thermal pressure support into the CPU scheduler as well as several improvements to the thermal framework, such as a netlink implementation of thermal notification and improvements to CPU cooling. Discussions from last year also helped to improve system-wide suspend testing tools.

This year’s topics to be discussed include:

Come and join us in the discussion about extending the battery life of your laptop and keeping it cool.

We hope to see you there!

August 05, 2020 03:15 AM

August 02, 2020

Linux Plumbers Conference: VFIO/IOMMU/PCI Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the VFIO/IOMMU/PCI Microconference has been accepted into the 2020 Linux Plumbers Conference!

The PCI interconnect specification, the devices implementing it, and the system IOMMUs providing memory/access control to them are incorporating more and more features aimed at high performance systems (e.g., PCI ATS (Address Translation Service)/PRI (Page Request Interface), enabling Shared Virtual Addressing (SVA) between devices and CPUs), features that require the kernel to coordinate the PCI devices, the IOMMUs they are connected to, and the VFIO layer used to manage them (for userspace access and device passthrough), with related kernel interfaces that have to be designed in sync for all three subsystems.

The kernel code that enables these new system features requires coordination between VFIO/IOMMU/PCI subsystems, so that kernel interfaces and userspace APIs can be designed in a clean way.

The following was a result of last year’s successful Linux Plumbers microconference:

Last year’s Plumbers resulted in a write-up justifying the dual-stage SMMUv3 integration, but more work is needed to persuade the relevant maintainers.

Topics for this year include (but are not limited to):

Come and join us in the discussion in helping Linux keep up with the new features being added to the PCI interconnect specification.

We hope to see you there!

August 02, 2020 02:04 PM

August 01, 2020

Linux Plumbers Conference: RISC-V Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the RISC-V Microconference has been accepted into the 2020 Linux Plumbers Conference!

The RISC-V ecosystem is gaining momentum at such an astounding speed that it wouldn’t be unfair to compare it to the early days of the Linux ecosystem’s growth. There are a plethora of Linux kernel features that have been added to RISC-V and many more are waiting to be reviewed in the mailing list. Some of them resulted from direct discussions during last year’s RISC-V microconference. For example, RISC-V has a standard boot process along with a well-defined supervisor binary specification (SBI) and cpu hotplug feature. KVM support is very close to being merged and just waiting for official ratification of the H extension. NoMMU support for Linux kernel has already been merged.

Here are a few of the expected topics and current problems in RISC-V Linux land that we would like to cover.

Come join us and participate in the discussion on how we can improve the support for RISC-V in the Linux kernel.

We hope to see you there!

August 01, 2020 04:48 PM

Linux Plumbers Conference: You, Me, and IoT Two Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the You, Me, and IoT Microconference has been accepted into the 2020 Linux Plumbers Conference!

As everyday devices start to become more connected to the internet, the infrastructure around them constantly needs to be developed. The Internet of Things (IoT) in the Linux ecosystem is looking brighter every day. The development rate of the Zephyr RTOS in particular is accelerating dramatically, and we are now up to 2 commits per hour[1]! LoRa WAN made it into Zephyr release 2.2 as well.

The principles for IoT are still the same: data-driven controls for remote endpoints such as

A large focus of industry heavyweights continues to be interoperability; we are seeing a growing trend in moving toward IP-centric network communications. Using IP natively ensures that it is extremely easy for end-nodes and edge devices to communicate to The Cloud but it also means that IoT device security is more important than ever.

Last year’s successful microconference has brought about several changes in the IoT space. The Linux + Zephyr + Greybus solution now works over nearly all physical layers (#exactsteps for IEEE 802.15.4 and BLE). BeagleBoard.org is also now preparing a next-gen hardware revision of the BeagleConnect to provide both a hobbyist and professional-friendly IoT platform. BlueZ has begun making quarterly releases, much to the delight of last year’s attendees, and members of the linux-wpan / netdev community have implemented RPL, an IPv6 routing protocol for lossy networks.

This year’s topics to be discussed include:

Come and join us in some heated but productive discussions in making your everyday devices communicate with the world around them.

[1] For reference, Linux receives approximately 9 commits per hour

We hope to see you there!

August 01, 2020 01:18 AM

July 30, 2020

Linux Plumbers Conference: LLVM Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the LLVM Microconference has been accepted into the 2020 Linux Plumbers Conference!

The LLVM toolchain has made significant progress over the years and many kernel developers are now using it to build their kernels. It is still the one toolchain that can natively compile C into BPF byte code. Clang (the C frontend to LLVM) is used to build Android and ChromeOS kernels and others are in the process of testing to use Clang to build their kernels.

Many topics still need to be resolved, and are planned to be discussed here.
These include (but are not limited to):

Come and join us in the discussion of improving this new toolchain to make it the most usable for everyone!

We hope to see you there!

July 30, 2020 07:49 PM

Paul E. Mc Kenney: Stupid RCU Tricks: Failure Probability and CPU Count

So rcutorture found a bug, whether in RCU or elsewhere, and it is now time to reproduce that bug, whether to make good use of git bisect or to verify an alleged fix. One problem is that, rcutorture being what it is, that bug is likely a race condition and it likely takes longer than you would like to reproduce. Assuming that it reproduces at all.

How to make it reproduce faster? Or at all, as the case may be?

One approach is to tweak the Kconfig options and maybe even the code to make the failure more probable. Another is to find a “near miss” that is related to and more probable than the actual failure.

But given that we are trying to make a race condition happen more frequently, it is only natural to try tweaking the number of CPUs. After all, one would hope that increasing the number of CPUs would increase the probability of hitting the race condition. So the straightforward answer is to use all available CPUs.

But how to use them? Run a single rcutorture scenario covering all the CPUs, give or take the limitations imposed by qemu and KVM? Or run many instances of that same scenario, with each instance using a small fraction of the available CPUs?

As is so often the case, the answer is: “It depends!”

If the race condition happens randomly between any pair of CPUs, then bigger is better. To see this, consider the following old-school ASCII-art comparison:

+---------------------+
|        N * M        |
+---+---+---+-----+---+
| N | N | N | ... | N |
+---+---+---+-----+---+

If there are n CPUs that can participate in the race condition, then at any given time there are n(n-1)/2 possible races. The upper row has N*M CPUs, and thus N*M*(N*M-1)/2 possible races. The lower row has M sets of N CPUs, and thus M*N*(N-1)/2, which is almost a factor of M smaller. For this type of race condition, you should therefore run a small number of scenarios with each using as many CPUs as possible, and preferably only one scenario that uses all of the CPUs. For example, to make the TREE03 scenario run on 64 CPUs, edit the tools/testing/selftests/rcutorture/configs/rcu/TREE03 file so as to set CONFIG_NR_CPUS=64.
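
To put numbers on the comparison above: with N = 16 and M = 4 on a 64-CPU system, one large scenario has 64*63/2 = 2016 possible racing pairs of CPUs, while four 16-CPU scenarios have only 4*(16*15/2) = 480 among them, a ratio of 2016/480 = 4.2, or roughly M.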

But there is no guarantee that the race condition will be such that all CPUs participate with equal probability. For example, suppose that the bug was due to a race between RCU's grace-period kthread (named either rcu_preempt or rcu_sched, depending on your Kconfig options) and its expedited grace period, which at any given time will be running on at most one workqueue kthread.

In this case, no matter how many CPUs were available to a given rcutorture scenario, at most two of them could be participating in this race. In this case, it is instead best to run as many two-CPU rcutorture scenarios as possible, give or take the memory footprint of that many guest OSes (one per rcutorture scenario). For example, to make 32 TREE03 scenarios run on 64 CPUs, edit the tools/testing/selftests/rcutorture/configs/rcu/TREE03 file so as to set CONFIG_NR_CPUS=2 and remember to pass either the --allcpus or the --cpus 64 argument to kvm.sh.

What happens in real life?

For a race condition that rcutorture uncovered during the v5.8 merge window, running one large rcutorture instance instead of 14 smaller ones (very) roughly doubled the probability of locating the race condition.

In other words, real life is completely capable of lying somewhere between the two theoretical extremes outlined above.

July 30, 2020 12:30 AM

July 27, 2020

Matthew Garrett: Filesystem deduplication is a sidechannel

First off - nothing I'm going to talk about in this post is novel or overly surprising, I just haven't found a clear writeup of it before. I'm not criticising any design decisions or claiming this is an important issue, just raising something that people might otherwise be unaware of.

With that out of the way: Automatic deduplication of data is a feature of modern filesystems like zfs and btrfs. It takes two forms - inline, where the filesystem detects that data being written to disk is identical to data that already exists on disk and simply references the existing copy rather than storing it again, and offline, where tooling retroactively identifies duplicated data and removes the duplicate copies (zfs supports inline deduplication, btrfs only currently supports offline). In a world where disks end up with multiple copies of cloud or container images, deduplication can free up significant amounts of disk space.

What's the security implication? The problem is that deduplication doesn't recognise ownership - if two users have copies of the same file, only one copy of the file will be stored[1]. So, if user a stores a file, the amount of free space will decrease. If user b stores another copy of the same file, the amount of free space will remain the same. If user b is able to check how much free space is available, user b can determine whether the file already exists.
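
To make the probe concrete, here is a hedged sketch from user b's perspective; it assumes zfs-style inline deduplication, invents the mount point and file path for illustration, and omits error handling. Real filesystems also consume metadata blocks, so an exact-equality check like this one is an oversimplification:

#include <stdio.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Returns nonzero if writing the data consumed no new blocks, hinting
 * that an identical copy already existed on the filesystem. */
static int probably_existed(const char *mnt, const char *path,
			    const void *buf, size_t len)
{
	struct statvfs before, after;
	FILE *f;

	statvfs(mnt, &before);
	f = fopen(path, "w");
	fwrite(buf, 1, len, f);
	fclose(f);
	sync();				/* let the block counts settle */
	statvfs(mnt, &after);
	return after.f_bfree == before.f_bfree;
}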

This doesn't seem like a huge deal in most cases, but it is a violation of expected behaviour (if user b doesn't have permission to read user a's files, user b shouldn't be able to determine whether user a has a specific file). But we can come up with some convoluted cases where it becomes more relevant, such as law enforcement gaining unprivileged access to a system and then being able to demonstrate that a specific file already exists on that system. Perhaps more interestingly, it's been demonstrated that free space isn't the only sidechannel exposed by deduplication - deduplication has an impact on access timing, and can be used to infer the existence of data across virtual machine boundaries.

As I said, this is almost certainly not something that matters in most real world scenarios. But with so much discussion of CPU sidechannels over the past couple of years, it's interesting to think about what other features also end up leaking information in ways that may not be obvious.

(Edit to add: deduplication isn't enabled on zfs by default and is explicitly triggered on btrfs, so unless it's something you've enabled then this isn't something that affects you)

[1] Deduplication is usually done at the block level rather than the file level, but given zfs's support for variable sized blocks, identical files should be deduplicated even if they're smaller than the maximum record size

July 27, 2020 10:22 PM

July 17, 2020

Linux Plumbers Conference: Open Printing Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Open Printing Microconference has been accepted into the 2020 Linux Plumbers Conference!

Building on the work already done in driverless printing since last year’s microconference session, driverless scanning has emerged as an active new topic since last year’s Plumbers. We’re seeing many new printer application projects emerge that will benefit 3D printing as well. With driverless scanning and printing making good progress and improvements, now is the time to talk about driverless/IPP fax as well.

Topics to discuss include

Come join us and participate in the discussion to give Linux printing, scanning and fax a better experience.

If you already want to start the discussion right now or tell us
something before the conference starts, do it in the comments sections
of the linked pages.

We hope to see you there!

July 17, 2020 09:04 PM

July 15, 2020

Paul E. Mc Kenney: Stupid RCU Tricks: So rcutorture is Not Aggressive Enough For You?

So you read the previous post, but simply running rcutorture did not completely vent your frustration. What can you do?

One thing you can do is to tweak a number of rcutorture settings to adjust the manner and type of torture that your testing inflicts.

RCU CPU Stall Warnings

If you are not averse to a quick act of vandalism, then you might wish to induce an RCU CPU stall warning. The --bootargs argument can be used for this, for example as follows:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make \
    --bootargs "rcutorture.stall_cpu=22 rcutorture.fwd_progress=0"

The rcutorture.stall_cpu=22 says to stall a CPU for 22 seconds, that is, one second longer than the default RCU CPU stall timeout in mainline. If you are instead using a distribution kernel, you might need to specify 61 seconds (as in “rcutorture.stall_cpu=61”) in order to allow for the typical 60-second RCU CPU stall timeout. The rcutorture.fwd_progress=0 has no effect except to suppress a warning message (with stack trace included free of charge) that questions the wisdom of running both RCU-callback forward-progress tests and RCU CPU stall tests at the same time. In fact, the code not only emits the warning message, it also automatically suppresses the forward-progress tests. If you prefer living dangerously and don't mind the occasional out-of-memory (OOM) lockup accompanying your RCU CPU stall warnings, feel free to edit kernel/rcu/rcutorture.c to remove this automatic suppression.

If you are running on a large system that takes more than ten seconds to boot, you might need to increase the RCU CPU stall holdoff interval. For example, adding rcutorture.stall_cpu_holdoff=120 to the --bootargs list would wait for two minutes before stalling a CPU instead of the default holdoff of 10 seconds. If simply spinning a CPU with preemption disabled does not fully vent your ire, you could undertake a more profound act of vandalism by adding rcutorture.stall_cpu_irqsoff=1 so as to cause interrupts to be disabled on the spinning CPU.

Some flavors of RCU such as SRCU permit general blocking within their read-side critical sections, and you can exercise this capability by adding rcutorture.stall_cpu_block=1 to the --bootargs list. Better yet, you can use this kernel-boot parameter to torture flavors of RCU that forbid blocking within read-side critical sections, which allows you to see them complain about such mistreatment.

The vanilla flavor of RCU has a grace-period kthread, and stalling this kthread is another good way to torture RCU. Simply add rcutorture.stall_gp_kthread=22 to the --bootargs list, which delays the grace-period kthread for 22 seconds. Doing this will normally elicit strident protests from mainline kernels.

Finally, you could starve rcutorture of CPU time by running a large number of them concurrently (each in its own Linux-kernel source tree), thereby overcommitting the CPUs.
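
For example (a sketch only; the tree paths are hypothetical), two competing runs might be launched as follows:

(cd ~/linux-tree-1 && tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --trust-make) &
(cd ~/linux-tree-2 && tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --trust-make) &
wait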

But maybe you would prefer to deprive RCU of memory. If so, read on!

Running rcutorture Out of Memory

By default, each rcutorture guest OS is allotted 512MB of memory. But perhaps you would like to have it make do with only 128MB:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --trust-make --memory 128M

You could go further by making the RCU need-resched testing more aggressive, for example, by increasing the duration of this testing from the default three-quarters of the RCU CPU stall timeout to (say) seven-eighths:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --trust-make --memory 128M \
    --bootargs "rcutorture.fwd_progress_div=8"

More to the point, you might make the RCU callback-flooding tests more aggressive, for example by adjusting the values of the MAX_FWD_CB_JIFFIES, MIN_FWD_CB_LAUNDERS, or MIN_FWD_CBS_LAUNDERED macros and rebuilding the kernel. Alternatively, you could use kill -STOP on one of the vCPUs in the middle of an rcutorture run. Either way, if you break it, you buy it!
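
As a sketch of the kill -STOP approach (the process and thread selection below is purely illustrative, and a randomly chosen QEMU thread is not guaranteed to be a vCPU thread):

qemu_pid="$(pgrep -f qemu-system | head -n 1)"  # assumes a qemu-system-* process
tid="$(ls /proc/${qemu_pid}/task | tail -n 1)"  # pick one of its threads
kill -STOP "${tid}"   # freeze it mid-run...
sleep 30
kill -CONT "${tid}"   # ...then let it resume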

Or perhaps you would rather attempt to drown rcutorture in memory, perhaps forcing a full 16GB onto each guest OS:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --trust-make --memory 16G

Another productive torture method involves unusual combinations of Kconfig options, a topic taken up in the next section.

Confused Kconfig Options

The Kconfig options for a given rcutorture scenario are specified by the corresponding file in the tools/testing/selftests/rcutorture/configs/rcu directory. For example, the Kconfig options for the infamous TREE03 scenario may be found in tools/testing/selftests/rcutorture/configs/rcu/TREE03.

But why not just use the --kconfig argument and be happy, as described previously?

One reason is that there are a few Kconfig options that the rcutorture scripting refers to early in the process, before the --kconfig parameter's additions have been processed; for example, changing CONFIG_NR_CPUS should be done in the scenario file rather than via the --kconfig parameter. Another reason is to avoid having to supply the same --kconfig argument for each of many repeated rcutorture runs. But perhaps most important, if you want some scenarios to be built with one Kconfig option and others built with a different Kconfig option, modifying each scenario's file avoids the need for multiple rcutorture runs.

For example, you could edit the tools/testing/selftests/rcutorture/configs/rcu/TREE03 file to change the CONFIG_NR_CPUS=16 to instead read CONFIG_NR_CPUS=4, and then run the following on a 12-CPU system:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --trust-make --configs "3*TREE03"

This would run three concurrent copies of TREE03, but with each guest OS restricted to only 4 CPUs.

Finally, if a given Kconfig option applies to all rcutorture runs and you are tired of repeatedly entering --kconfig arguments, you can instead add that option to the tools/testing/selftests/rcutorture/configs/rcu/CFcommon file.
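
For example, something like the following sketch (using CONFIG_PROVE_LOCKING purely as an illustration) applies a Kconfig option to every scenario:

echo CONFIG_PROVE_LOCKING=y >> tools/testing/selftests/rcutorture/configs/rcu/CFcommon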

But sometimes Kconfig options just aren't enough. And that is why we have kernel boot parameters, the subject of the next section.

Boisterous Boot Parameters

We have supplied kernel boot parameters using the --bootargs parameter, but sometimes ordering considerations or sheer laziness motivate greater permanence. Either way, the scenario's .boot file may be brought to bear. For example, the TREE03 scenario's file is located at tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot.

As of the v5.7 Linux kernel, this file contains the following:

rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30
rcutree.gp_preinit_delay=12
rcutree.gp_init_delay=3
rcutree.gp_cleanup_delay=3
rcutree.kthread_prio=2
threadirqs

For example, the probability of RCU's grace period processing overlapping with CPU-hotplug operations may be adjusted by decreasing the value of the rcutorture.onoff_interval from its default of 200 milliseconds or by adjusting the various grace-period delays specified by the rcutree.gp_preinit_delay, rcutree.gp_init_delay, and rcutree.gp_cleanup_delay parameters. In fact, chasing bugs involving races between RCU grace periods and CPU-hotplug operations often involves tuning these four parameters to maximize race probability, thus decreasing the required rcutorture run durations.
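
For example, a race-chasing run might override these parameters as follows (a sketch only; the specific values are illustrative rather than tuned):

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --trust-make --configs "TREE03" \
    --bootargs "rcutorture.onoff_interval=100 rcutree.gp_preinit_delay=20 rcutree.gp_init_delay=6 rcutree.gp_cleanup_delay=6"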

The possibilities for the .boot file contents are limited only by the extent of Documentation/admin-guide/kernel-parameters.txt. And actually not even by that, given the all-too-real possibility of undocumented kernel boot parameters.

You can also create your own rcutorture scenarios by creating a new set of files in the tools/testing/selftests/rcutorture/configs/rcu directory. You can make them run by default (or in response to passing the CFLIST string to the --configs parameter) by adding their names to the tools/testing/selftests/rcutorture/configs/rcu/CFLIST file. For example, you could create a MYSCENARIO file containing Kconfig options and (optionally) a MYSCENARIO.boot file containing kernel boot parameters in the tools/testing/selftests/rcutorture/configs/rcu directory, and make them run by default by adding a line reading MYSCENARIO to the tools/testing/selftests/rcutorture/configs/rcu/CFLIST file.
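
A minimal sketch of this procedure (the file contents shown are illustrative only):

cd tools/testing/selftests/rcutorture/configs/rcu
cp TREE03 MYSCENARIO                              # start from an existing scenario's Kconfig options
echo "rcutree.gp_init_delay=6" > MYSCENARIO.boot  # optional kernel boot parameters
echo MYSCENARIO >> CFLIST                         # run it by default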

Summary

This post discussed enhancing rcutorture through use of stall warnings, memory limitations, Kconfig options, and kernel boot parameters. The special case of adjusting CONFIG_NR_CPUS deserves more attention, and that is the topic of the next post.

July 15, 2020 09:13 PM

Pete Zaitcev: Cries of the vanquished

The post at roguelazer's is so juicy from every side that I'd need to quote it whole to do it justice (h/t ~avg). But its ostensible meat is etcd.[1] In it, he builds a narrative of the package being elegant at first, and bloating later.

This tool was originally written in 2013 for a ... project called CoreOS. ... etcd was greater than its original use-case. Etcd provided a convenient and simple set of primitives (set a key, get a key, set-only-if-unchanged, watch-for-changes) with a drop-dead simple HTTP API on top of them.

Kubernetes was quickly changed to use etcd as its state store. Thus began the rapid decline of etcd.

... a large number of Xooglers who decided to infect etcd with Google technologies .... Etcd's simple HTTP API was replaced by a "gRPC" version; the simple internal data model was replaced by a dense and non-orthogonal data model with different types for leases, locks, transactions, and plain-old-keys.

Completely omitted from this tale is that etcd was created as a clone of Google Chubby, which did not use HTTP. The HTTP interface was implemented in etcd for expediency. So, the nostalgic image of early etcd he's projecting is in fact a primitive early draft.

It's interesting that he only mentions leases and locks in passing, painting them as a late addition, whereas the concept of coarse locking was more important for Chubby than the registry.

[1] Other matters are taken up in the footnotes, at length. You'd think that it would be a simple matter to create a separate post to decry the evils of HTTP/2, but not for this guy! I may write another entry on the evils of bloat and how sympathetic I am to his cause later.

July 15, 2020 05:41 PM

Brendan Gregg: Systems Performance: Enterprise and the Cloud, 2nd Edition

Eight years ago I wrote _Systems Performance: Enterprise and the Cloud_ (aka the "sysperf" book) on the performance of computing systems, and this year I'm excited to be releasing the second edition. The first edition was successful, selling over 10k copies and becoming required or recommended reading at many companies (and even mentioned in [job descriptions]). Thanks to everyone for their support. I've received feedback that it is useful, not just for learning performance, but also for showing how computers work internally: essential knowledge for all engineers.

The second edition adds content on BPF, BCC, bpftrace, perf, and Ftrace, mostly removes Solaris, makes numerous updates to Linux and cloud computing, and includes general improvements and additions. It is written by a more experienced version of myself than I was for the first edition, including my six years of experience as a senior performance engineer at Netflix. This edition has also been improved by a new technical review team of over 30 engineers.

How much has changed since the first edition? It's hard to say, but easy to visualize. As an example, the following shows Chapter 6, CPUs, where black text is from the first edition and colored text marks the updates (this is a color scheme I use to show reviewers when text was changed; from oldest changes to newest: yellow, green, aqua, blue, purple, red):

Chapter 6, CPUs, changes colored
Here is the entire book as a 3.1 Mbyte jpg. (Note that these visualizations are not final, as I'm still making updates, and this doesn't highlight figure and copy-edit changes.) The book will be released in November 2020 by Addison-Wesley, and will be around 800 pages. It's already listed on Amazon.com.

A year ago I announced [BPF Performance Tools: Linux System and Application Observability]. In a way, Systems Performance is volume 1 and BPF Performance Tools is volume 2. Sysperf provides balanced coverage of models, theory, architecture, observability tools (traditional and tracing), experimental tools, and tuning. The BPF tools book focuses on BPF tracing tools only, with brief summaries of architecture and traditional tools.

Which book should you buy? Both, of course. :-) Since they are both performance books there is a little overlap between them, but not much. I think sysperf has a wider audience: it is a handbook for anyone to learn performance and computer internals. The BPF tools book will satisfy those wishing to jump ahead and run advanced tools for some quick wins.

For more information, including links showing where to buy the book, please see its website: [Systems Performance: Enterprise and the Cloud, 2nd Edition].

[job descriptions]: https://www.themuse.com/jobs/affirm/senior-performance-engineer
[Systems Performance: Enterprise and the Cloud, 2nd Edition]: /systems-performance-2nd-edition-book.html
[BPF Performance Tools: Linux System and Application Observability]: /blog/2019-07-15/bpf-performance-tools-book.html

July 15, 2020 07:00 AM

July 14, 2020

Linux Plumbers Conference: Reminder for LPC 2020 Town Hall: The Kernel Report

Thursday is approaching!

On July 16th at 8am PDT / 11am EDT / 3pm GMT, the Kernel Report talk by Jon Corbet of LWN will take place on the LPC BigBlueButton platform! It will also be available on a YouTube Live stream.

Please join us at this URL: https://linuxplumbers.lwn.net/b/LPC-kernel-report.

The Linux kernel is at the core of any Linux system; the performance and capabilities of the kernel will, in the end, place an upper bound on what the system as a whole can do. This talk will review recent events in the kernel development community, discuss the current state of the kernel and the challenges it faces, and look forward to how the kernel may address those challenges. Attendees of any technical ability should gain a better understanding of how the kernel got to its current state and what can be expected in the near future.

The Plumbers Code of Conduct will be in effect for this event. The event will be recorded.

July 14, 2020 10:59 PM

Linux Plumbers Conference: linux/arch/* Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the linux/arch/* Microconference has
been accepted into the 2020 Linux Plumbers Conference!

Linux supports over twenty architectures.

Each architecture has its own sub-directory within the Linux-kernel arch/ directory containing code specific to that architecture. But that code is not always unique to the architecture.

In many cases, code in one architecture was copy-pasted from another, leading to a lot of unnecessary code duplication. This makes it harder to fix, update, and maintain functionality relying on architecture-specific code.

There’s room to improve, consolidate and generalize the code in these
directories, and that is the goal of this microconference.

Topics to discuss include:

Come join us and participate in the discussion to bring Linux architectures closer together.

We hope to see you there!

July 14, 2020 03:10 PM

July 13, 2020

Linux Plumbers Conference: Android Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Android Microconference has been accepted into the 2020 Linux Plumbers Conference!

A few years ago the Android team announced their desire to try to set a path for creating a Generic Kernel Image (GKI) which would enable the decoupling of Android kernel releases from hardware enablement. Since then, much work has been done by many parties to make this vision a reality. Last year's Linux Plumbers Android microconference brought about work on monitoring and stabilizing the Android in-kernel ABI, solutions to issues associated with modules and supplier-consumer dependencies have landed in the upstream Linux kernel, and vendors have started migrating from using the ION driver to the DMA-BUF heaps that are now supported in upstream Linux. For a report on progress made since last year's MC, see here.

This year, several devices work with GKI, making their kernels upgradable without requiring porting efforts, but this work has exposed several additional issues. Thus, the topics for this year's Android microconference include:

Come and join us in helping to make the upstream Linux kernel work out of the box on your Android device!

We hope to see you there!

July 13, 2020 02:42 AM

July 11, 2020

Linux Plumbers Conference: GNU Tools Track Added to Linux Plumbers Conference 2020

We are pleased to announce that we have added an additional track to LPC 2020: the GNU Tools track. The track will run for the 5 days of the conference.
For more information please see the track wiki page.
The call for papers is now open and will close on July 31, 2020. To submit a proposal please refer to the wiki page above.

July 11, 2020 03:11 PM

Linux Plumbers Conference: Systems Boot and Security Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Systems Boot and Security Microconference has been accepted into the 2020 Linux Plumbers Conference!

Computer-system security is an important topic to many. Maintaining data security and system integrity is crucial for businesses and individuals. Computer security is paramount even at system boot-up, as firmware attacks can compromise the system before the operating system starts. In order to keep the integrity of the system intact, both the firmware and the rest of the system must be vigilant in monitoring and preventing malware intrusion.

As a result of last year's microconference, Oracle sent out patches to support TrenchBoot in the Linux kernel and in GRUB2. An agreement was also reached on problems with the TPM 2.0 Linux sysfs interface.

Over the past year, 3mdeb has been working on various open-source contributions to LandingZone, as well as to GRUB2 and the Linux kernel, to improve TrenchBoot support.

This year’s topics to be discussed include:

Come and join us in the discussion about how to keep your system secure even at bootup. We hope to see you there!

July 11, 2020 12:16 AM

July 06, 2020

Linux Plumbers Conference: Testing and Fuzzing Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Testing and Fuzzing Microconference has been accepted into the 2020 Linux Plumbers Conference!

Testing and fuzzing are crucial to the stability that the Linux kernel demands. Last year's meetup helped make KernelCI a Linux Foundation hosted project, with collaboration between Red Hat CKI and KernelCI. On the more technical side, KUnit was merged upstream and its KernelCI integration is underway, syzkaller reproducers are being included in the Linux Test Project, and Clang is integrated into KernelCI.

This year’s topics to be discussed include:

Come and join us in the discussion of keeping Linux the fastest-moving, most reliable piece of software in the world!

We hope to see you there!

July 06, 2020 03:12 PM

July 03, 2020

Linux Plumbers Conference: Linux Plumbers Conference is Not Sold Out

We’re really sorry, but apparently the Cvent registration site we use has suffered a bug which is causing it to mark the conference as “Sold Out” and, unfortunately, since today is the beginning of the American Independence Day weekend, we can’t get anyone to fix it until Monday. However, rest assured there are plenty of places still available, so if you can wait until Monday, you should be able to register for the conference as soon as the site is fixed.

Again, we’re really sorry for the problem and the fact that fixing it will take a further three days.

July 03, 2020 05:32 PM

July 01, 2020

Linux Plumbers Conference: Networking and BPF Summit CfP Now Open

We are pleased to announce that the Call for Proposals for the Networking and BPF Summit at Linux Plumbers Conference 2020 is now open.

Please submit your proposals here.

Looking forward to seeing your great contributions!

July 01, 2020 10:41 PM

Linux Plumbers Conference: Announcing Town Hall #2: The Kernel Weather Report

Thank you to everyone who attended the Linux Plumbers town hall on June 25th. It was successful thanks to your participation. We’re pleased to announce another town hall on July 16th at 8am PDT / 11am EDT / 3pm GMT. This town hall will feature Jon Corbet of LWN giving “The Kernel Weather Report”.

The Linux kernel is at the core of any Linux system; the performance and capabilities of the kernel will, in the end, place an upper bound on what the system as a whole can do. This talk will review recent events in the kernel development community, discuss the current state of the kernel and the challenges it faces, and look forward to how the kernel may address those challenges. Attendees of any technical ability should gain a better understanding of how the kernel got to its current state and what can be expected in the near future.

Please note that the Plumbers Code of Conduct will be in effect for this event. We also plan to record this event. We will post the URL for the town hall on the LPC blog prior to the event. We hope to see you there and help make Plumbers the best conference for everyone.

July 01, 2020 10:02 PM

June 25, 2020

Linux Plumbers Conference: How to Join the LPC Town Hall

Please use the following link on Thursday, June 25, 2020 at 8am PDT / 11am EDT / 3pm GMT to join the LPC Town Hall:
https://linuxplumbers.lwn.net/b/lin-rs8-zoh
Note that no account is necessary!

Please refer to the previous post about the Town Hall to get more info.
See you there!

June 25, 2020 12:41 AM

June 24, 2020

Matthew Garrett: Making my doorbell work

I recently moved house, and the new building has a Doorbird to act as a doorbell and open the entrance gate for people. There's a documented local control API (no cloud dependency!) and a Home Assistant integration, so this seemed pretty straightforward.

Unfortunately not. The Doorbird is on a separate network that's shared across the building, provided by Monkeybrains. We're also a Monkeybrains customer, so our network connection is plugged into the same router and antenna as the Doorbird one. And, as is common, there's port isolation between the networks in order to avoid leakage of information between customers. Rather perversely, we are the only people with an internet connection who are unable to ping my doorbell.

I spent most of the past few weeks digging myself out from under a pile of boxes, but we'd finally reached the point where spending some time figuring out a solution to this seemed reasonable. I spent a while playing with port forwarding, but that wasn't ideal - the only server I run is in the UK, and having packets round trip almost 11,000 miles so I could speak to something a few metres away seemed like a bad plan. Then I tried tethering an old Android device with a data-only SIM, which worked fine but only in one direction (I could see what the doorbell could see, but I couldn't get notifications that someone had pushed a button, which was kind of the point here).

So I went with the obvious solution - I added a wifi access point to the doorbell network, and my home automation machine now exists on two networks simultaneously (nmcli device modify wlan0 ipv4.never-default true is the magic for "ignore the gateway that the DHCP server gives you" if you want to avoid this), and I could now do link local service discovery to find the doorbell if it changed addresses after a power cut or anything. And then, like magic, everything worked - I got notifications from the doorbell when someone hit our button.

But knowing that an event occurred without actually doing something in response seems fairly unhelpful. I have a bunch of Chromecast targets around the house (a mixture of Google Home devices and Chromecast Audios), so just pushing a message to them seemed like the easiest approach. Home Assistant has a text to speech integration that can call out to various services to turn some text into a sample, and then push that to a media player on the local network. You can group multiple Chromecast audio sinks into a group that then presents as a separate device on the network, so I could then write an automation to push audio to the speaker group in response to the button being pressed.

That's nice, but it'd also be nice to do something in response. The Doorbird exposes API control of the gate latch, and Home Assistant exposes that as a switch. I'm using Home Assistant's Google Assistant integration to expose devices Home Assistant knows about to voice control. Which means when I get a house-wide notification that someone's at the door I can just ask Google to open the door for them.

So. Someone pushes the doorbell. That sends a signal to a machine that's bridged onto that network via an access point. That machine then sends a protobuf command to speakers on a separate network, asking them to stream a sample it's providing. Those speakers call back to that machine, grab the sample and play it. At this point, multiple speakers in the house say "Someone is at the door". I then say "Hey Google, activate the front gate" - the device I'm closest to picks this up and sends it to Google, where something turns my speech back into text. It then looks at my home structure data and realises that the "Front Gate" device is associated with my Home Assistant integration. It then calls out to the home automation machine that received the notification in the first place, asking it to trigger the front gate relay. That device calls out to the Doorbird and asks it to open the gate. And now I have functionality equivalent to a doorbell that completes a circuit and rings a bell inside my home, and a button inside my home that completes a circuit and opens the gate, except it involves two networks inside my building, callouts to the cloud, at least 7 devices inside my home that are running Linux and I really don't want to know how many computational cycles.

The future is wonderful.

(I work for Google. I do not work on any of the products described in this post. Please god do not ask me how to integrate your IoT into any of this)

June 24, 2020 08:25 AM

June 23, 2020

Linux Plumbers Conference: Registration for Linux Plumbers Conference 2020 is now open

Registration is now open for the 2020 edition of the Linux Plumbers Conference (LPC). It will be held August 24 – 28, virtually. Go to the attend page for more information.

Note that the CFPs for microconferences, refereed track talks, and BoFs are still open, please see this page for more information.

As always, please contact the organizing committee if you have questions.

June 23, 2020 09:30 PM

June 22, 2020

Linux Plumbers Conference: Kernel Dependability and Assurance Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Kernel Dependability & Assurance Microconference has been accepted into the 2020 Linux Plumbers Conference!

Linux is now being used in applications that require a high degree of confidence that the kernel will behave as expected. Some of the key areas where we're now seeing Linux used are medical devices, civil infrastructure, caregiving robots, automotive systems, and so on. This brings up a number of concerns that must be addressed. What sort of uptime can we count on? Should safety analysis be reevaluated after a bug fix has been made? Are all the system requirements being satisfied by Linux? What tooling is there to answer these questions?

This microconference is the place where the kernel community can come together and discuss these major issues. Topics to be discussed include:

Come and join us in making the most popular operating system the most dependable as well. We hope to see you there!

June 22, 2020 01:52 PM

June 19, 2020

Linux Plumbers Conference: Announcing a Linux Plumbers Virtual Town Hall

The Linux Plumbers Committee is pleased to announce a Town Hall meeting on June 25 at 8am PDT / 11am EDT / 3pm GMT. This meeting serves two purposes. The first purpose is to test our remote conference setup. This is the first time we are holding Linux Plumbers virtually, and while we can run simulated tests, it’s much more effective to test our setup with actual participants with differing hardware setups around the world. The second purpose is to present our planning and give everyone a little bit of an idea of what to expect when we hold Plumbers at the end of August. We plan to have time for questions.

Given this is a test, the number of participants will be capped at 250 people. The purpose of this test is to examine the scale to which the infrastructure can handle the expected demand for a virtual Linux Plumbers Conference. If you can’t make this day or are blocked by the participation cap from joining, we expect to be running more tests in the days to come.

Please note that the Plumbers Code of Conduct will be in effect for this event. We also plan to record this event as we will be recording sessions at the actual conference. We will post the URL for the town hall on the LPC blog prior to the event. We hope to see you there and help make Plumbers the best conference for everyone.

June 19, 2020 04:03 PM

June 16, 2020

Paul E. Mc Kenney: Stupid RCU Tricks: So you want to torture RCU?

Let's face it, using synchronization primitives such as RCU can be frustrating. And it is only natural to wish to get back, somehow, at the source of such frustration. In short, it is quite understandable to want to torture RCU. (And other synchronization primitives as well, but you have to start somewhere!) Another benefit of torturing RCU is that doing so sometimes uncovers bugs in other parts of the kernel. You see, RCU is not always willing to suffer alone.

One long-standing RCU-torture approach is to use modprobe and rmmod to install and remove the rcutorture module, as described in the torture-test documentation. However, this approach requires considerable manual work to check for errors.

On the other hand, this approach avoids any concern about the underlying architecture or virtualization technology. This means that use of modprobe and rmmod is the method of choice if you wish to torture RCU on (say) SPARC or when running on Hyper-V (this last according to people actually doing this). This method is also necessary when you want to torture RCU on a very specific kernel configuration or when you need to torture RCU on bare metal.

But for those of us running mainline kernels on x86 systems supporting virtualization, the approach described in the remainder of this document will usually be more convenient.

Running rcutorture in a Guest OS

If you have an x86 system (or, with luck, an ARMv8 or PowerPC system) set up to run qemu and KVM, you can instead use the rcutorture scripting, which automates running rcutorture over a full set of configurations, as well as automating analysis of the build products and console output. Running this can be as simple as:

tools/testing/selftests/rcutorture/bin/kvm.sh

As of v5.8-rc1, this will build and run each of nineteen combinations of Kconfig options, with each run taking 30 minutes for a total of 9.5 hours, not including the time required to build the kernel, boot the guest OS, and analyze the test results. Given that a number of the scenarios use only a single CPU, this approach can be quite wasteful, especially on the well-endowed systems of the year 2020.

This waste can be avoided by using the --cpus argument. For example, for the 12-hardware-thread laptop on which I am typing this, you could do the following:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12

This command would run up to 12 CPUs worth of rcutorture scenarios concurrently, so that the nineteen combinations would be run in eight batches. Because TREE03 and TREE07 each want 16 CPUs, rcutorture will complain in its run summary as follows:

 --- Mon Jun 15 10:23:02 PDT 2020 Test summary:
Results directory: /home/git/linux/tools/testing/selftests/rcutorture/res/2020.06.15-10.23.02
tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12 --duration 5 --trust-make
RUDE01 ------- 2102 GPs (7.00667/s) [tasks-rude: g0 f0x0 ]
SRCU-N ------- 42229 GPs (140.763/s) [srcu: g549860 f0x0 ]
SRCU-P ------- 11887 GPs (39.6233/s) [srcud: g110444 f0x0 ]
SRCU-t ------- 59641 GPs (198.803/s) [srcu: g1 f0x0 ]
SRCU-u ------- 59209 GPs (197.363/s) [srcud: g1 f0x0 ]
TASKS01 ------- 1029 GPs (3.43/s) [tasks: g0 f0x0 ]
TASKS02 ------- 1043 GPs (3.47667/s) [tasks: g0 f0x0 ]
TASKS03 ------- 1019 GPs (3.39667/s) [tasks: g0 f0x0 ]
TINY01 ------- 43373 GPs (144.577/s) [rcu: g0 f0x0 ] n_max_cbs: 34463
TINY02 ------- 46519 GPs (155.063/s) [rcu: g0 f0x0 ] n_max_cbs: 2197
TRACE01 ------- 756 GPs (2.52/s) [tasks-tracing: g0 f0x0 ]
TRACE02 ------- 559 GPs (1.86333/s) [tasks-tracing: g0 f0x0 ]
TREE01 ------- 8930 GPs (29.7667/s) [rcu: g64765 f0x0 ]
TREE02 ------- 17514 GPs (58.38/s) [rcu: g138645 f0x0 ] n_max_cbs: 18010
TREE03 ------- 15920 GPs (53.0667/s) [rcu: g159973 f0x0 ] n_max_cbs: 1025308
CPU count limited from 16 to 12
TREE04 ------- 10821 GPs (36.07/s) [rcu: g70293 f0x0 ] n_max_cbs: 81293
TREE05 ------- 16942 GPs (56.4733/s) [rcu: g123745 f0x0 ] n_max_cbs: 99796
TREE07 ------- 8248 GPs (27.4933/s) [rcu: g52933 f0x0 ] n_max_cbs: 183589
CPU count limited from 16 to 12
TREE09 ------- 39903 GPs (133.01/s) [rcu: g717745 f0x0 ] n_max_cbs: 83002

However, other than these two complaints, this is what the summary of an uneventful rcutorture run looks like.

Whatever is the meaning of all those numbers in the summary???

The console output for each run and much else besides may be found in the /home/git/linux/tools/testing/selftests/rcutorture/res/2020.06.15-10.23.02 directory called out above.

The more CPUs you have, the fewer batches are required:

CPUs   Batches
   1        19
   2        16
   4        13
   8        10
  16         6
  32         3
  64         2
 128         1


If you specify more CPUs than your system actually has, kvm.sh will ignore your fantasies in favor of your system's reality.

Specifying Specific Scenarios

Sometimes it is useful to take one's ire out on a specific type of RCU, for example, SRCU. You can use the --configs argument to select specific scenarios:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12 \
    --configs "SRCU-N SRCU-P SRCU-t SRCU-u"

This runs in two batches, but the second batch uses only two CPUs, which is again wasteful. Given that SRCU-P requires eight CPUs, SRCU-N four CPUs, and SRCU-t and SRCU-u one each, it would cost nothing to run two instances of each of these scenarios other than SRCU-N as follows:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12 \
    --configs "SRCU-N 2*SRCU-P 2*SRCU-t 2*SRCU-u"

This same notation can be used to run multiple copies of the entire list of scenarios. For example (again, in v5.7), a system with 384 CPUs can use --configs 4*CFLIST to run four copies of the full set of scenarios as follows:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 384 --configs "4*CFLIST"

Mixing and matching is permissible, for example:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 384 --configs "3*CFLIST 12*TREE02"

A kvm.sh script that is to run on a wide variety of systems can benefit from --allcpus (expected to appear in v5.9), which acts like --cpus N, where N is the number of CPUs on the current system:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --configs "3*CFLIST 12*TREE02"

Build time can dominate when running a large number of short-duration runs, for example, when chasing down a low-probability non-deterministic boot-time failure. Use of --trust-make can be very helpful in this case:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 384 --duration 2 \
    --configs "1000*TINY01" --trust-make

Without --trust-make, rcutorture will play it safe by forcing your source tree to a known state between each build. In addition to --trust-make, there are a number of tools such as ccache that can also greatly reduce build times.
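
For example, one way to combine --trust-make with ccache (a sketch that assumes your distribution installs ccache's compiler symlinks in /usr/lib/ccache) is to prepend that directory to PATH:

PATH=/usr/lib/ccache:$PATH tools/testing/selftests/rcutorture/bin/kvm.sh \
    --cpus 384 --duration 2 --configs "1000*TINY01" --trust-make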

Locating Test Failures

Although the ability to automatically run many tens of scenarios can be very convenient, it can also cause significant eyestrain staring through a long “summary” checking for test failures. Therefore, if there are failures, this is noted at the end of the summary, for example, as shown in the following abbreviated output from a --configs "28*TREE03" run:

TREE03.8 ------- 1195094 GPs (55.3284/s) [rcu: g11475633 f0x0 ] n_max_cbs: 1449125
TREE03.9 ------- 1202936 GPs (55.6915/s) [rcu: g11572377 f0x0 ] n_max_cbs: 1514561
3 runs with runtime errors.

Of course, picking the three errors out of the 28 runs can also cause eyestrain, so there is yet another useful little script:

tools/testing/selftests/rcutorture/bin/kvm-find-errors.sh \
    /home/git/linux/tools/testing/selftests/rcutorture/res/2020.06.15-10.23.02

This will run your editor on the make output for each build error and on the console output for each runtime failure, greatly reducing eyestrain. Users of vi can also edit a summary of the runtime errors from each failing run as follows:

vi /home/git/linux/tools/testing/selftests/rcutorture/res/2020.06.15-10.23.02/*/console.log.diags

Enlisting Torture Assistance

If rcutorture produces a failure-free run, that is a failure on the part of rcutorture. After all, there are bugs in there somewhere, and rcutorture failed to find them!

One approach is to increase the duration, for example, to 12 hours (also known as 720 minutes):

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12 --duration 720

Another approach is to enlist the help of other in-kernel torture features, for example, lockdep. The --kconfig parameter to kvm.sh can be used to this end:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12 --configs "TREE03" \
    --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y"

The aid of the kernel address sanitizer (KASAN) can be enlisted using the --kasan argument:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12 --kasan

The kernel concurrency sanitizer (KCSAN) can also be brought to bear, but proper use of KCSAN requires some thought (see part 1 and part 2 of the LWN “Concurrency bugs should fear the big bad data-race detector” article) and also version 11 or later of Clang/LLVM (and a patch for GCC has been accepted). Once you have all of that in place, the --kcsan argument invokes KCSAN and also generates a summary as described in part 1 of the aforementioned LWN article. Note again that only very recent compiler versions (such as Clang-11) support KCSAN, so a --kmake "CC=clang-11" or similar argument might also be necessary.
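
Putting those pieces together, a KCSAN-enabled run might look like the following sketch (assuming clang-11 is installed):

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12 --kcsan --kmake "CC=clang-11"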

Selective Torturing

Sometimes enlisting debugging aid is the best approach, but other times greater selectivity is the best way forward.

Sometimes simply building a kernel is torture enough, especially when building with unusual Kconfig options (see the discussion of --kconfig above). In this case, specifying the --buildonly argument will build the kernels, but refrain from running them. This approach can also be useful for running multiple copies of the resulting binaries on multiple systems: you can use --buildonly to build the kernels and qemu-cmd scripts, and then run these files on the other systems, given suitable adjustments to the qemu-cmd scripts.
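
For example, a build-only exercise of a single scenario might look like this sketch:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --buildonly --configs "TREE03" --trust-make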

Other times it is useful to torture some specific portion of RCU. For example, one wishing to vent their ire solely on expedited grace periods could add --bootargs "rcutorture.gp_exp=1" to the kvm.sh command line. This argument causes rcutorture to run a stress test using only expedited RCU grace periods, which can be helpful when attempting to work out whether a too-short RCU grace period is due to a bug in the normal or the expedited grace-period code. Similarly, the callback-flooding aspects of rcutorture stress testing can be disabled using --bootargs "rcutorture.fwd_progress=0". It is possible to specify both in one run using --bootargs "rcutorture.gp_exp=1 rcutorture.fwd_progress=0".
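
Putting these together, a sketch of an expedited-only run with the callback-flooding tests disabled:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --configs "TREE03" \
    --bootargs "rcutorture.gp_exp=1 rcutorture.fwd_progress=0"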

Enlisting Debugging Assistance

Still other times, it is helpful to enable event tracing. For example, if the rcu_barrier() event traces are of interest, use --bootargs "trace_event=rcu:rcu_barrier". The trace buffer will be dumped automatically upon specific rcutorture failures. If the failure mode is instead a non-rcutorture-specific oops, use this: --bootargs "trace_event=rcu:rcu_barrier ftrace_dump_on_oops". If it is also necessary to dump the trace buffers on warnings, a (heavy handed) way to achieve this is to use --bootargs "trace_event=rcu:rcu_barrier ftrace_dump_on_oops panic_on_warn".

If you have many tens of rcutorture instances that all decide to flush their trace buffers at about the same time, the combined flushing operations can take considerable time, especially if the underlying system features rotating rust. If only the most recent activity is of interest, specifying a small trace buffer can help: --bootargs "trace_event=rcu:rcu_barrier ftrace_dump_on_oops panic_on_warn trace_buf_size=3k".

If only the oopsing/warning CPU's traces are relevant, the orig_cpu modifier can be helpful: --bootargs "trace_event=rcu:rcu_barrier ftrace_dump_on_oops=orig_cpu panic_on_warn trace_buf_size=3k".

More information on tracing can be found in Documentation/trace, and more on kernel boot parameters in general may be found in kernel-parameters.txt. Given the multi-thousand-line heft of the latter, there is clearly great scope for tweaking your torturing of RCU!

Why Stop at Torturing RCU?

After all, locking can sometimes be almost as annoying as RCU. And it is possible to torture locking:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock

This locktorture stress test does not get as much love and attention as does rcutorture, but it is at least a start.

There are also a couple of RCU performance tests and an upcoming smp_call_function*() stress test that use this same torture-test infrastructure. Please note that the details of the summary output vary from test to test.

In short, you can do some serious torturing of RCU, and much else besides! So show them no mercy!!! :-)

June 16, 2020 08:42 PM

James Morris: Linux Security Summit North America 2020: Online Schedule

Just a quick update on the Linux Security Summit North America (LSS-NA) for 2020.

The event will take place over two days as an online event, due to COVID-19.  The dates are now July 1-2, and the full schedule details may be found here.

The main talks are:

There are also short (30 minute) topics:

This year we will also have a Q&A panel at the end of each day, moderated by Elena Reshetova. The panel speakers are:

LSS-NA this year is included with OSS+ELC registration, which is USD $50 all up.  Register here.

Hope to see you soon!

June 16, 2020 01:28 AM

June 15, 2020

Linux Plumbers Conference: Linux Plumbers Conference Registration Opening Postponed

The committee is relentlessly working on recreating online the Linux Plumbers Conference (LPC) experience that we have all come to appreciate, and take for granted, over the past few years.

We had initially planned to open registration on June 15th. While travel planning is no longer among them, there are still very many aspects of the conference being worked on. We are now aiming to open registration for the Linux Plumbers Conference (LPC) on June 23rd.

Right now we have shortlisted BigBlueButton as our online conferencing solution. One of our objectives is to run LPC 2020 online on a fully open software stack.

We anticipate running our usual set of parallel tracks, including microconferences, each day. With our globally distributed participants, identifying the most convenient timezone is still a work in progress. There will be a timezone question on our registration form; please make sure to answer it.

To help us test part of the online platform, and to offer transparency about where things stand with LPC 2020 preparation, the committee is currently planning the first ever “LPC Town Hall Meeting”. We hope to host it very soon, and more information will be made available shortly.

As previously announced, we are reducing the conference registration fee to US$50. Registration availability has been an issue in past years, and we have no way to anticipate what the uptake will be for LPC 2020 registration, but the committee will try its best to meet demand. Also, several Calls for Proposals are open and awaiting your contributions.

We will be sharing more information with everyone here soon. Looking forward to LPC 2020 together with you.

June 15, 2020 05:20 PM

Linux Plumbers Conference: Real-time Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Real-time Microconference has been accepted into the 2020 Linux Plumbers Conference!

After another successful Real-time microconference at LPC last year, there’s still more work to be done. The PREEMPT_RT patch set (aka “The Real-Time Patch”) was created in 2004 in an effort to make Linux into a hard real-time operating system. Over the years much of the RT patch has made it into mainline Linux, including mutexes, lockdep, high resolution timers, Ftrace, RCU_PREEMPT, priority inheritance, threaded interrupts and much more. There’s just a little left to get RT fully into mainline, and the light at the end of the tunnel is finally in view. It is expected that the RT patch will be in mainline within a year (and possibly before Plumbers begins!), which changes the topics of discussion. Once it is in Linus’s tree, a whole new set of issues must be handled.

The focus of this year’s Plumbers event will include:

June 15, 2020 02:19 PM

June 11, 2020

Linux Plumbers Conference: Scheduler Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Scheduler Microconference has been accepted into the 2020 Linux Plumbers Conference!

The scheduler is an important piece of the Linux kernel, as it decides what gets to run, when, and for how long. With different topologies and workloads, giving the user the best possible experience is no easy task. During the Scheduler microconference at LPC last year, we started the work to make SCHED_DEADLINE safe for kthreads and to improve load balancing. This year, we continue working on core scheduling, unifying the interface for TurboSched and task latency nice, and continuing the discussion on proxy execution.

Topics to be discussed include:

Come and join us in the discussion of controlling what tasks get to run on your machine and when. We hope to see you there!

June 11, 2020 02:11 PM

June 09, 2020

Michael Kerrisk (manpages): man-pages-5.07 is released

I've released man-pages-5.07. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from more than 80 contributors. The release includes more than 380 commits that change more than 380 pages. One new page was added in this release, and one page was removed.

The most notable of the changes in man-pages-5.07 are the following:

June 09, 2020 05:09 PM

May 27, 2020

Kees Cook: security things in Linux v5.5

Previously: v5.4.

I got a bit behind on this blog post series! Let’s get caught up. Here are a bunch of security things I found interesting in the Linux kernel v5.5 release:

restrict perf_event_open() from LSM
Given the recurring flaws in the perf subsystem, there has been a strong desire to be able to entirely disable the interface. While the kernel.perf_event_paranoid sysctl knob has existed for a while, attempts to extend its control to “block all perf_event_open() calls” have failed in the past. Distribution kernels have carried the rejected sysctl patch for many years, but now Joel Fernandes has implemented a solution that was deemed acceptable: instead of extending the sysctl, add LSM hooks so that LSMs (e.g. SELinux, Apparmor, etc) can make these choices as part of their overall system policy.

generic fast full refcount_t
Will Deacon took the recent refcount_t hardening work for both x86 and arm64 and distilled the implementations into a single architecture-agnostic C version. The result was almost as fast as the x86 assembly version, but it covered more cases (e.g. increment-from-zero), and is now available by default for all architectures. (There is no longer any Kconfig associated with refcount_t; the use of the primitive provides full coverage.)

linker script cleanup for exception tables
When Rick Edgecombe presented his work on building Execute-Only memory under a hypervisor, he noted a region of memory that the kernel was attempting to read directly (instead of execute). He rearranged things for his x86-only patch series to work around the issue. Since I’d just been working in this area, I realized the root cause of this problem was the location of the exception table (which is strictly a lookup table and is never executed) and built a fix for the issue and applied it to all architectures, since it turns out the exception tables for almost all architectures are just a data table. Hopefully this will help clear the path for more Execute-Only memory work on all architectures. In the process of this, I also updated the section fill bytes on x86 to be a trap (0xCC, int3), instead of a NOP instruction so functions would need to be targeted more precisely by attacks.

KASLR for 32-bit PowerPC
Joining many other architectures, Jason Yan added kernel text base-address offset randomization (KASLR) to 32-bit PowerPC.

seccomp for RISC-V
After a bit of a long road, David Abdurachmanov has added seccomp support to the RISC-V architecture. The series uncovered some more corner cases in the seccomp self tests code, which is always nice since then we get to make it more robust for the future!

seccomp USER_NOTIF continuation
When the seccomp SECCOMP_RET_USER_NOTIF interface was added, it seemed like it would only be used in very limited conditions, so the idea of needing to handle “normal” requests didn’t seem very onerous. However, since then, it has become clear that having a monitor process perform lots of “normal” open() calls on behalf of the monitored process looks more and more slow and fragile. To deal with this, there needed to be a way for the USER_NOTIF interface to indicate that seccomp should just continue as normal and allow the syscall without any special handling. Christian Brauner implemented SECCOMP_USER_NOTIF_FLAG_CONTINUE to get this done. It comes with a bit of a disclaimer due to the chance that monitors may use it in places where ToCToU is a risk, and for possible conflicts with SECCOMP_RET_TRACE. But overall, this is a net win for container monitoring tools.

EFI_RNG_PROTOCOL for x86
Some EFI systems provide a Random Number Generator interface, which is useful for gaining some entropy in the kernel during very early boot. The arm64 boot stub has been using this for a while now, but Dominik Brodowski has now added support for x86 to do the same. This entropy is useful for kernel subsystems performing very early initialization where random numbers are needed (like randomizing aspects of the SLUB memory allocator).

FORTIFY_SOURCE for MIPS
As has been enabled on many other architectures, Dmitry Korotin got MIPS building with CONFIG_FORTIFY_SOURCE, so compile-time (and some run-time) buffer overflows during calls to the memcpy() and strcpy() families of functions will be detected.

limit copy_{to,from}_user() size to INT_MAX
As done for VFS, vsnprintf(), and strscpy(), I went ahead and limited the size of copy_to_user() and copy_from_user() calls to INT_MAX in order to catch any weird overflows in size calculations.

Other things
Alexander Popov pointed out some more v5.5 features that I missed in this blog post. I’m repeating them here, with some minor edits/clarifications. Thank you Alexander!

Edit: added Alexander Popov’s notes

That’s it for v5.5! Let me know if there’s anything else that I should call out here. Next up: Linux v5.6.

© 2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

May 27, 2020 08:04 PM

Rusty Russell: 57 Varieties of Pyrite: Exchanges Are Now The Enemy of Bitcoin

TL;DR: exchanges are casinos and don’t want to onboard anyone into bitcoin. Avoid.

There’s a classic scam in the “crypto” space: advertise Bitcoin to get people in, then sell suckers something else entirely. Over the last few years, this bait-and-switch has become the core competency of “bitcoin” exchanges.

I recently visited the homepage of Australian exchange btcmarkets.net: what a mess. There was a list of dozens of identical-looking “cryptos”, with bitcoin second after something called “XRP”; seems like it was sorted by volume?

Incentives have driven exchanges to become casinos, and they’re doing exactly what you’d expect unregulated casinos to do. This is no place you ever want to send anyone.

Incentives For Exchanges

Exchanges make money on trading, not on buying and holding. Despite the fact that bitcoin is the only real attempt to create an open source money, scams with no future are given false equivalence, because more assets means more trading. Worse than that, they are paid directly to list new scams (the crappier, the more money they can charge!) and have recently taken the logical step of introducing and promoting their own crapcoins directly.

It’s like a gold dealer who also sells 57 varieties of pyrite, which give more margin than selling actual gold.

For a long time, I thought exchanges were merely incompetent. Most can’t even give out fresh addresses for deposits, batch their outgoing transactions, pay competent fee rates, perform RBF or use segwit.

But I misunderstood: they don’t want to sell bitcoin. They use bitcoin to get you in the door, but they want you to gamble. This matters: you’ll find subtle and not-so-subtle blockers to simply buying bitcoin on an exchange. If you send a friend off to buy their first bitcoin, they’re likely to come back with something else. That’s no accident.

Looking Deeper, It Gets Worse.

Regrettably, looking harder at specific exchanges makes the picture even bleaker.

Consider Binance: this mainland China backed exchange pretending to be a Hong Kong exchange appeared out of nowhere with fake volume and demonstrated the gullibility of the entire industry by being treated as if it were a respected member. They lost at least 40,000 bitcoin in a known hack, and they also lost all the personal information people sent them to KYC. They aggressively market their own coin. But basically, they’re just MtGox without Mark Karpelès’ PHP skills or moral scruples, and with much better marketing.

Coinbase is more interesting: an MBA-run “bitcoin” company which really dislikes bitcoin. They got where they are by spending big on regulations compliance in the US so they could operate in (almost?) every US state. (They don’t do much to dispel the wide belief that this regulation protects their users, when in practice it seems only USD deposits have any guarantee). Their natural interest is in increasing regulation to maintain that moat, and their biggest problem is Bitcoin.

They have much more affinity for the centralized coins (Ethereum) where they can have influence and control. The anarchic nature of a genuine open source community (not to mention the developers’ oft-stated aim to improve privacy over time) is not culturally compatible with a top-down company run by the Big Dog. It’s a running joke that their CEO can’t say the word “Bitcoin”, but their recent “what will happen to cryptocurrencies in the 2020s” article is breathtaking in its boldness: innovation is mainly happening on altcoins, and they’re going to overtake bitcoin any day now. Those scaling problems which the Bitcoin developers say they don’t know how to solve? This non-technical CEO knows better.

So, don’t send anyone to an exchange, especially not a “market leading” one. Find some service that actually wants to sell them bitcoin, like CashApp or Swan Bitcoin.

May 27, 2020 12:49 AM

May 20, 2020

Dave Airlie (blogspot): DirectX on Linux - what it is/isn't

This morning I saw two things that were Microsoft and Linux graphics related.

https://devblogs.microsoft.com/commandline/the-windows-subsystem-for-linux-build-2020-summary/

a) DirectX on Linux for compute workloads
b) Linux GUI apps on Windows

At first I thought these were related, but it appears at least presently these are quite orthogonal projects.

First up, a clarification for the people who jump to insane conclusions:

The DX on Linux is a WSL2 only thing. Microsoft are not any way bringing DX12 to Linux outside of the Windows environment. They are also in no way open sourcing any of the DX12 driver code. They are recompiling the DX12 userspace drivers (from GPU vendors) into Linux shared libraries, and running them on a kernel driver shim that transfers the kernel interface up to the closed source Windows kernel driver. This is in no way useful for having DX12 on Linux baremetal or anywhere other than in a WSL2 environment. It is not useful for Linux gaming.

Microsoft have submitted to the upstream kernel the shim driver to support this. This driver exposes their D3DKMT kernel interface from Windows over virtual channels into a Linux driver that provides an ioctl interface. The kernel drivers are still all running on the Windows side.

Now, I read the Linux GUI apps bit and assumed these things were the same; well, it turns out the DX12 stuff doesn't address presentation at all. It's currently only for compute/ML workloads using CUDA/DirectML. There isn't a way to put the results of DX12 rendering from the Linux guest applications onto the screen at all. The other project is a wayland/RDP integration server that connects Linux apps via wayland to an RDP client on the Windows display. Integrating that with DX12 will be a tricky project, and then integrating that upstream with the Linux stack is another step completely.

Now I'm sure this will be resolved, but it has certain implications for how the driver architecture works and for how much of the rest of the Linux graphics ecosystem you have to interact with, and that means that the current driver might not be a great fit in the long run, and upstreaming it prematurely might be a bad idea.

From my point of view the kernel shim driver doesn't really bring anything to Linux, it's just a tunnel for some binary data between a host windows kernel binary and a guest linux userspace binary. It doesn't enhance the Linux graphics ecosystem in any useful direction, and as such I'm questioning why we'd want this upstream at all.

May 20, 2020 12:01 AM

May 19, 2020

Linux Plumbers Conference: Containers and Checkpoint/Restore Microconference Accepted into 2020 Linux Plumbers Conference

We are pleased to announce that the Containers and Checkpoint/Restore Microconference has been accepted into the 2020 Linux Plumbers Conference!

After another successful Containers Microconference last year, there’s still a lot more work to be done. Last year we discussed the intersection between the new mount API and containers, various new VFS features (including a strong and fruitful discussion about ID shifting), several new security-hardening aspects, and improvements when restarting syscalls during checkpoint/restore. Last year’s microconference topics led to quite a few patches that have since landed in the upstream kernel, with others actively being discussed. These include various improvements to seccomp syscall interception, the implementation of a new process-creation syscall, the implementation of pidfds, and the addition of time namespaces.

This year’s topics include:

Come join us and participate in the discussion of what holds “The Cloud” together.

We hope to see you there!

Christian, Mike, Stéphane

May 19, 2020 04:19 PM

May 15, 2020

Linux Plumbers Conference: Linux Plumbers Conference 2020 Goes Virtual

As previously promised, we are announcing today that we have decided to hold the Linux Plumbers Conference 2020 virtually instead of in person. We value the safety and health of our community and do not wish to expose anyone to unnecessary risks.

We do appreciate that it is the in-person aspect of Plumbers (the hallway track) which attendees find the most valuable. An online Linux Plumbers Conference will clearly be different from past events. We are working hard to find ways to preserve as much of the LPC experience as we can while also taking advantage of any new opportunities that the online setting offers us. Since we no longer have many of the fixed expenses of an in-person conference, we are able to reduce the registration fee to $50. In addition, we are pushing back the opening of registration to June 15, 2020.

We’ll provide more details as we figure them out, thanks for your patience and support.

Do not forget to send your contribution.

We do have great proposals and if you have submitted, thank you very much. Our microconference capacity is filling up quickly, if you want your microconference to be considered, act now! We are still looking for proposals for refereed talks as well.

CfP: https://www.linuxplumbersconf.org/event/7/abstracts/

The LPC 2020 Planning Committee

May 15, 2020 04:08 PM