Kernel Planet

August 03, 2022

Daniel Vetter: Locking Engineering Hierarchy

The first part of this series covered principles of locking engineering. This part goes through a pile of locking patterns and designs, from the most favourable ones, which are easiest to adjust and hence result in a long term maintainable code base, to the least favourable ones, which are the hardest to get right and to keep correct as the code evolves. For convenience it's even color coded, with the dangerous levels getting progressively more crispy red to indicate how close to the burning fire you are! Think of it as Dante’s Inferno, but for locking.

As a reminder from the intro of the first part, with locking engineering I mean the art of ensuring that there’s sufficient consistency in reading and manipulating data structures, and not just sprinkling mutex_lock() and mutex_unlock() calls around until the result looks reasonable and lockdep has gone quiet.

Level 0: No Locking

The dumbest possible locking is no need for locking at all. Which does not mean extremely clever lockless tricks for a “look, no calls to mutex_lock()” feint, but an overall design which guarantees that any writers cannot exist concurrently with any other access at all. This removes the need for consistency guarantees while accessing an object at the architectural level.

There’s a few standard patterns to achieve locking nirvana.

Locking Pattern: Immutable State

The lesson in graphics API design over the last decade is that immutable state objects rule, because they both lead to simpler driver stacks and also better performance. Vulkan instead of OpenGL with its ridiculous amount of mutable and implicit state is the big example, but atomic instead of legacy kernel mode setting or Wayland instead of X11 are also built on the assumption that immutable state objects are a Great Thing (tm).

The usual pattern is:

  1. A single thread fully constructs an object, including any sub structures and anything else you might need. Often subsystems provide initialization helpers for objects that drivers can subclass through embedding, e.g. drm_connector_init() for initializing a kernel modesetting output object. Additional functions can set up different or optional aspects of an object, e.g. drm_connector_attach_encoder() sets up the invariant links to the preceding element in a kernel modesetting display chain.

  2. The fully formed object is published to the world, in the kernel this often happens by registering it under some kind of identifier. This could be a global identifier like register_chrdev() for character devices, something attached to a device like registering a new display output on a driver with drm_connector_register() or some struct xarray in the file private structure. Note that this step here requires memory barriers of some sort. If you hand roll the data structure like a list or lookup tree with your own fancy locking scheme instead of using existing standard interfaces you are on a fast path to level 3 locking hell. Don’t do that.

  3. From this point on there are no consistency issues anymore and all threads can access the object without any locking.
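
Sketched in code, the pattern could look roughly like the following. Everything here is a made-up illustration (struct foo_output, foo_outputs, foo_output_create()); only the xarray calls are real kernel interfaces:

#include <linux/slab.h>
#include <linux/xarray.h>

struct foo_output {
	/* Invariant after construction, hence no locking needed to read. */
	const char *name;
	unsigned int pipe;
};

static DEFINE_XARRAY_ALLOC(foo_outputs);

static int foo_output_create(const char *name, unsigned int pipe, u32 *id)
{
	struct foo_output *out;
	int ret;

	/* Step 1: a single thread fully constructs the object. */
	out = kzalloc(sizeof(*out), GFP_KERNEL);
	if (!out)
		return -ENOMEM;
	out->name = name;
	out->pipe = pipe;

	/*
	 * Step 2: publish it under an identifier. xa_alloc() provides the
	 * required memory barriers, no hand-rolled trickery needed. From
	 * here on (step 3) any thread may look the object up and use it
	 * without locking, since nothing mutates it anymore.
	 */
	ret = xa_alloc(&foo_outputs, id, out, xa_limit_32b, GFP_KERNEL);
	if (ret)
		kfree(out);
	return ret;
}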

Locking Pattern: Single Owner

Another way to ensure there’s no concurrent access is by only allowing one thread to own an object at any given point in time, and having well defined handover points if those are necessary.

Most often this pattern is used for asynchronously processing a userspace request:

  1. The syscall or IOCTL constructs an object with sufficient information to process the userspace’s request.

  2. That object is handed over to a worker thread with e.g. queue_work().

  3. The worker thread is now the sole owner of that piece of memory and can do whatever it feels like with it.

Again the second step requires memory barriers, which means if you hand roll your own lockless queue you’re firmly in level 3 territory and won’t get rid of the burned in red hot afterglow in your retina for quite some time. Use standard interfaces like struct completion or even better libraries like the workqueue subsystem here.
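
As a rough sketch, with made-up structure and function names (only INIT_WORK(), queue_work() and container_of() are real workqueue interfaces):

#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct foo_request {
	struct work_struct work;
	u32 arg;	/* owned by whoever currently holds the request */
};

static void foo_request_work(struct work_struct *work)
{
	struct foo_request *req = container_of(work, struct foo_request, work);

	/* Step 3: the worker is now the sole owner of the request. */
	pr_info("processing request %u\n", req->arg);
	kfree(req);
}

static int foo_submit(u32 arg)	/* called from the syscall or ioctl */
{
	struct foo_request *req;

	/* Step 1: construct the request. */
	req = kzalloc(sizeof(*req), GFP_KERNEL);
	if (!req)
		return -ENOMEM;
	req->arg = arg;
	INIT_WORK(&req->work, foo_request_work);

	/* Step 2: hand it over, queue_work() provides the barriers. */
	queue_work(system_wq, &req->work);
	return 0;
}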

Note that the handover can also be chained or split up, e.g. for a nonblocking atomic kernel modeset request there are three asynchronous processing pieces involved:

Locking Pattern: Reference Counting

Users generally don’t appreciate if the kernel leaks memory too much, and cleaning up objects by freeing their memory and releasing any other resources tends to be an operation of the very much mutable kind. Reference counting to the rescue!

Note that this scheme falls apart when released objects are put into some kind of cache and can be resurrected. In that case your cleanup code needs to somehow deal with these zombies and ensure there’s no confusion, and vice versa any code that resurrects a zombie needs to deal with the wooden spikes the cleanup code might throw at an inopportune time. The worst example of this kind is SLAB_TYPESAFE_BY_RCU, where readers that are only protected with rcu_read_lock() may need to deal with objects potentially going through simultaneous zombie resurrections, potentially multiple times, while the readers are trying to figure out what is going on. This generally leads to lots of sorrow, wailing and ill-tempered maintainers, as the GPU subsystem has experienced and continues to experience with struct dma_fence.

Hence use standard reference counting, and don’t be tempted by the siren of trying to implement clever caching of any kind.
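
The standard version of this, sketched with a made-up struct foo (struct kref, kref_init(), kref_get() and kref_put() are the real interfaces):

#include <linux/kref.h>
#include <linux/slab.h>

struct foo {
	struct kref kref;	/* kref_init(&foo->kref) in the constructor */
	/* ... payload ... */
};

static void foo_release(struct kref *kref)
{
	struct foo *foo = container_of(kref, struct foo, kref);

	/* Runs exactly once, when the last reference is dropped. */
	kfree(foo);
}

static struct foo *foo_get(struct foo *foo)
{
	kref_get(&foo->kref);
	return foo;
}

static void foo_put(struct foo *foo)
{
	kref_put(&foo->kref, foo_release);
}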

Level 1: Big Dumb Lock

It would be great if nothing ever changes, but sometimes that cannot be avoided. At that point you add a single lock for each logical object. An object could be just a single structure, but it could also be multiple structures that are dynamically allocated and freed under the protection of that single big dumb lock, e.g. when managing GPU virtual address space with different mappings.

The tricky part is figuring out what is an object to ensure that your lock is neither too big nor too small:

Ideally, your big dumb lock would always be right-sized every time the requirements on the data structures change. But working magic 8 balls tend to be in short supply, and you tend to only find out that your guess was wrong when the pain of the lock being too big or too small is already substantial. The inherent struggle of resizing a lock as the code evolves then keeps pushing you further away from the optimum instead of closer. Good luck!
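
For the GPU virtual address space example mentioned above, a big dumb lock could look roughly like this sketch, with all names made up:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct foo_vm {
	struct mutex lock;		/* one big dumb lock for everything below */
	struct list_head mappings;	/* protected by @lock */
};

struct foo_vm_mapping {
	struct list_head node;		/* protected by foo_vm.lock */
	u64 start, length;
};

static int foo_vm_map(struct foo_vm *vm, u64 start, u64 length)
{
	struct foo_vm_mapping *map;

	map = kzalloc(sizeof(*map), GFP_KERNEL);
	if (!map)
		return -ENOMEM;
	map->start = start;
	map->length = length;

	mutex_lock(&vm->lock);
	list_add(&map->node, &vm->mappings);
	mutex_unlock(&vm->lock);

	return 0;
}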

Level 2: Fine-grained Locking

It would be great if this is all the locking we ever need, but sometimes there’s functional reasons that force us to go beyond the single lock for each logical object approach. This section will go through a few of the common examples, and the usual pitfalls to avoid.

But before we delve into the details, remember to document your scheme in kerneldoc, with the inline per-member kerneldoc comment style, once you go beyond a simple single lock per object. It’s the best place for future bug fixers and reviewers - meaning you - to find the rules for how things were at least meant to work.
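
A sketch of what that looks like with the inline per-member kerneldoc comment style (the struct itself is made up):

/**
 * struct foo - some driver-private object
 */
struct foo {
	/** @lock: Protects @state and @pending. */
	struct mutex lock;

	/** @state: Current state machine state, only changed under @lock. */
	u32 state;

	/** @pending: List of queued requests, protected by @lock. */
	struct list_head pending;
};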

Locking Pattern: Object Tracking Lists

One of the main duties of the kernel is to track everything, not least to make sure there are no leaks and everything gets cleaned up again. But there are other reasons to maintain lists (or other container structures) of objects.

Now sometimes there’s a clear parent object, with its own lock, which could also protect the list with all the objects, but this does not always work:

Simplicity should still win, therefore only add a (nested) lock for lists or other container objects if there’s really no suitable object lock that could do the job instead.
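
When a suitable parent object with its own lock does exist, the tracking list simply hides under that lock. A minimal sketch, with made-up names:

#include <linux/list.h>
#include <linux/mutex.h>

struct foo_parent {
	struct mutex lock;		/* also protects @children */
	struct list_head children;
};

struct foo_child {
	struct list_head node;		/* protected by foo_parent.lock */
};

static void foo_add_child(struct foo_parent *parent, struct foo_child *child)
{
	mutex_lock(&parent->lock);
	list_add_tail(&child->node, &parent->children);
	mutex_unlock(&parent->lock);
}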

Locking Pattern: Interrupt Handler State

Another example that requires nested locking is when part of the object is manipulated from a different execution context. The prime example here are interrupt handlers. Interrupt handlers can only use interrupt safe spinlocks, but often the main object lock must be a mutex to allow sleeping or allocating memory or nesting with other mutexes.

Hence the need for a nested spinlock to just protect the object state shared between the interrupt handler and code running from process context. Process context should generally only acquire the spinlock nested with the main object lock, to avoid surprises and limit any concurrency issues to just the singleton interrupt handler.
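
Roughly, with all structure and function names made up:

#include <linux/interrupt.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>

struct foo_device {
	struct mutex lock;	/* main object lock, process context only */
	spinlock_t irq_lock;	/* nests within @lock, shared with the irq handler */
	u32 irq_status;		/* protected by @irq_lock */
};

static irqreturn_t foo_irq_handler(int irq, void *data)
{
	struct foo_device *foo = data;

	spin_lock(&foo->irq_lock);
	foo->irq_status |= 1;	/* e.g. latch hardware status */
	spin_unlock(&foo->irq_lock);

	return IRQ_HANDLED;
}

static void foo_reset(struct foo_device *foo)
{
	mutex_lock(&foo->lock);
	/* can sleep, allocate memory or take other mutexes here ... */

	/* ... and only grab the nested spinlock for the shared state. */
	spin_lock_irq(&foo->irq_lock);
	foo->irq_status = 0;
	spin_unlock_irq(&foo->irq_lock);

	mutex_unlock(&foo->lock);
}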

Locking Pattern: Async Processing

Very similar to the interrupt handler problems is coordination with async workers. The best approach is the single owner pattern, but often state needs to be shared between the worker and other threads operating on the same object.

The naive approach of just using a single object lock tends to deadlock:

struct obj {
	struct mutex lock;
	struct work_struct work;
	/* ... data shared with the worker ... */
};

static void start_processing(struct obj *obj)
{
	mutex_lock(&obj->lock);
	/* set up the data for the async work */
	schedule_work(&obj->work);
	mutex_unlock(&obj->lock);
}

static void stop_processing(struct obj *obj)
{
	mutex_lock(&obj->lock);
	/* clear the data for the async work */
	cancel_work_sync(&obj->work);
	mutex_unlock(&obj->lock);
}

static void work_fn(struct work_struct *work)
{
	struct obj *obj = container_of(work, struct obj, work);

	mutex_lock(&obj->lock);
	/* do some processing */
	mutex_unlock(&obj->lock);
}

Do not worry if you don’t spot the deadlock, because it is a cross-release dependency between the entire work_fn() and cancel_work_sync(), and these are a lot trickier to spot. Since cross-release dependencies are an entire huge topic on their own I won’t go into more details; a good starting point is this LWN article.

There’s a bunch of variations of this theme, with problems in different scenarios:

Like with interrupt handlers, the clean solution tends to be an additional nested lock which protects just the mutable state shared with the work function and nests within the main object lock. That way the work can be cancelled while the main object lock is held, which avoids a ton of races - but without holding the sublock that work_fn() needs, which avoids the deadlock.
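
Sketched out, sticking with the obj layout from the deadlock example above and adding a made-up worker_lock spinlock for the shared state:

struct obj {
	struct mutex lock;		/* main object lock */
	spinlock_t worker_lock;		/* nests within @lock, shared with work_fn() */
	struct work_struct work;
	/* ... data shared with the worker, protected by @worker_lock ... */
};

static void stop_processing(struct obj *obj)
{
	mutex_lock(&obj->lock);

	spin_lock(&obj->worker_lock);
	/* clear the data for the async work */
	spin_unlock(&obj->worker_lock);

	/* safe: work_fn() only needs @worker_lock, which is not held here */
	cancel_work_sync(&obj->work);

	mutex_unlock(&obj->lock);
}

static void work_fn(struct work_struct *work)
{
	struct obj *obj = container_of(work, struct obj, work);

	spin_lock(&obj->worker_lock);
	/* do some processing */
	spin_unlock(&obj->worker_lock);
}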

Note that in some cases the superior lock doesn’t need to exist, e.g. struct drm_connector_state is protected by the single owner pattern, but drivers might have a need for some further decoupled asynchronous processing, e.g. for handling the content protection or link training machinery. In that case only the sublock for the mutable driver private state shared with the worker exists.

Locking Pattern: Weak References

Reference counting is a great pattern, but sometimes you need to be able to store pointers without them holding a full reference. This could be for lookup caches, or because your userspace API mandates that some references do not keep the object alive - we’ve unfortunately committed that mistake in the GPU world. Or because holding full references everywhere would lead to unreclaimable reference loops and there’s no better way to break them than to make some of the references weak. In languages with a garbage collector weak references are implemented by the runtime, and so are no real worry. But in the kernel the concept has to be implemented by hand.

Since weak references are such a standard pattern, struct kref has ready-made support for them. The simple approach is using kref_put_mutex() with the same lock that also protects the structure containing the weak reference. This guarantees that either the weak reference pointer is gone too, or there is still a strong reference around somewhere and it is therefore safe to call kref_get(). But there are some issues with this approach:

The much better approach is using kref_get_unless_zero(), together with a spinlock for your data structure containing the weak reference. This looks especially nifty in combination with struct xarray.

/* obj_cache is the struct xarray used as the lookup cache. */
struct obj *obj_find_in_cache(unsigned long id)
{
	struct obj *obj;

	xa_lock(&obj_cache);
	obj = xa_load(&obj_cache, id);
	/* Only hand the object out if it still has a strong reference. */
	if (obj && !kref_get_unless_zero(&obj->kref))
		obj = NULL;
	xa_unlock(&obj_cache);

	return obj;
}

With this all the issues are resolved:

With both together, the locking no longer leaks beyond the lookup structure and its associated code, unlike with kref_put_mutex() and similar approaches. Thankfully kref_get_unless_zero() has become the much more popular approach since it was added 10 years ago!

Locking Antipattern: Confusing Object Lifetime and Data Consistency

We’ve now seen a few examples where the “no locking” patterns from level 0 collide in annoying ways when more locking is added, to the point where we seem to violate the principle to protect data, not code. It’s worth looking at this a bit more closely, since we can generalize what’s going on here into a fairly high-level antipattern.

The key insight is that the “no locking” patterns all rely on memory barrier primitives in disguise, not classic locks, to synchronize access between multiple threads. In the case of the single owner pattern there might also be blocking semantics involved, when the next owner needs to wait for the previous owner to finish processing first. These are functions like flush_work() or the various wait functions like wait_event() or wait_completion().

Calling these barrier functions while holding locks commonly leads to issues:

For these reasons, try to hold as few locks as feasible - ideally none - when calling any of these memory-barriers-in-disguise functions used to manage object lifetime or ownership in general. The antipattern here is abusing locks to fix lifetime issues. We have seen two specific instances thus far:

We will see some more, but the antipattern holds in general as a source of troubles.

Level 2.5: Splitting Locks for Performance Reasons

We’ve looked at a pile of functional reasons for complicating the locking design, but sometimes you need to add more fine-grained locking for performance reasons. This is already getting dangerous, because it’s very tempting to tune some microbenchmark just because we can, or maybe delude ourselves that it will be needed in the future. Therefore only complicate your locking if:

Only then make your future maintenance pain worse, guaranteed, by applying trickier locking than the bare minimum necessary for correctness. Even then, go with the simplest approach; often converting a lock to its read-write variant is good enough.

Sometimes this isn’t enough, and you actually have to split up a lock into more fine-grained locks to achieve more parallelism and less contention among threads. Note that doing so blindly will backfire, because locks are not free. When common operations still have to take most of the locks anyway, even if only for a short time and in strict succession, the performance hit on single threaded workloads will not justify any benefit in more threaded use-cases.

Another issue with more fine-grained locking is that often you cannot define a strict nesting hierarchy, or worse might need to take multiple locks of the same object or lock class. I’ve written previously about this specific issue, and more importantly, how to teach lockdep about lock nesting, the bad and the good ways.

One really entertaining story from the GPU subsystem, for bystanders at least, is that we really screwed this up for good by de facto allowing userspace to control the lock order of all the objects involved in an IOCTL - while disjoint operations should still actually proceed without contention. If you ever manage to repeat this feat you can take a look at the wait-wound mutexes. Or if you just want some pretty graphs, LWN has an old article about wait-wound mutexes too.

Level 3: Lockless Tricks

Do not go here wanderer!

Seriously, I have seen a lot of very fancy driver subsystem locking designs, but I have not yet found many that were actually justified. Because only real world, non-contrived performance issues can ever justify reaching for this level, and in almost all cases algorithmic or architectural fixes yield much better improvements than any kind of (locking) micro-optimization could ever hope for.

Hence this is just a long list of antipatterns, so that people who do not yet have a grumpy expression permanently chiseled into their facial structure know when they’re in trouble.

Note that this section isn’t limited to lockless tricks in the academic sense of guaranteed constant overhead forward progress, meaning no spinning or retrying anywhere at all. It’s for everything which doesn’t use standard locks like struct mutex, spinlock_t, struct rw_semaphore, or any of the others provided in the Linux kernel.

Locking Antipattern: Using RCU

Yeah RCU is really awesome and impressive, but it comes at serious costs:

Altogether, all that freely using RCU achieves is proving that there really is no bottom on the code maintainability scale. It is not a great day when your driver dies in synchronize_rcu() and lockdep has no idea what’s going on, and I’ve seen such days.

Personally I think that in driver subsystems the most that’s still a legit and justified use of RCU is object lookup with struct xarray and kref_get_unless_zero(), with cleanup handled entirely by kfree_rcu(). Anything more and you’re very likely chasing a rabbit down its hole and have not realized it yet.
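
That combination, sketched with made-up names (xa_load(), kref_get_unless_zero(), kfree_rcu() and rcu_read_lock() are the real interfaces), keeps all the RCU trickery confined to the lookup and the final free:

#include <linux/kref.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/xarray.h>

struct foo {
	struct kref kref;
	struct rcu_head rcu;
	unsigned long id;
};

static DEFINE_XARRAY(foo_cache);

static void foo_release(struct kref *kref)
{
	struct foo *foo = container_of(kref, struct foo, kref);

	xa_erase(&foo_cache, foo->id);
	/* Readers under rcu_read_lock() may still see the object ... */
	kfree_rcu(foo, rcu);
}

static struct foo *foo_lookup(unsigned long id)
{
	struct foo *foo;

	rcu_read_lock();
	foo = xa_load(&foo_cache, id);
	/* ... hence it might already be on its way out, check for that. */
	if (foo && !kref_get_unless_zero(&foo->kref))
		foo = NULL;
	rcu_read_unlock();

	return foo;
}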

Locking Antipattern: Atomics

To start with, Linux atomics have two annoying properties:

Those are a lot of unnecessary trap doors, but the really bad part is what people tend to build with atomic instructions:

In short, unless you’re actually building a new locking or synchronization primitive in the core kernel, you most likely do not want to get seen even looking at atomic operations as an option.

Locking Antipattern: preempt/local_irq/bh_disable() and Friends …

This one is simple: Lockdep doesn’t understand them. The real-time folks hate them. Whatever it is you’re doing, use proper primitives instead, and at least read up on the LWN coverage on why these are problematic and what to do instead. If you need some kind of synchronization primitive - maybe to avoid the lifetime vs. consistency antipattern pitfalls - then use the proper functions for that, like synchronize_irq().

Locking Antipattern: Memory Barriers

Or more often, lack of them, incorrect or imbalanced use of barriers, badly or wrongly or just not at all documented memory barriers, or …

Fact is that the vast majority of kernel hackers, and driver people even more so, have no useful understanding of the Linux kernel’s memory model, and should never be caught entertaining the use of explicit memory barriers in production code. Personally I’m pretty good at spotting holes, but I’ve had to learn the hard way that I’m not even close to being able to positively prove correctness. And for better or worse, nothing short of that tends to cut it.

For a still fairly cursory discussion read the LWN series on lockless algorithms. If the code comments and commit message are anything less rigorous than that it’s fairly safe to assume there’s an issue.

Now don’t get me wrong, I love to read an article or watch a talk by Paul McKenney on RCU like anyone else to get my brain fried properly. But aside from extreme exceptions this kind of maintenance cost has simply no justification in a driver subsystem. At least unless it’s packaged in a driver hacker proof library or core kernel service of some sorts with all the memory barriers well hidden away where ordinary fools like me can’t touch them.

Closing Thoughts

I hope you enjoyed this little tour of progressively more worrying levels of locking engineering, with really just one key take away:

Simple, dumb locking is good locking, since with that you have a fighting chance to make it correct locking.

Thanks to Daniel Stone and Jason Ekstrand for reading and commenting on drafts of this text.

August 03, 2022 12:00 AM

July 29, 2022

Linux Plumbers Conference: LPC 2022 Schedule is posted!

 

The schedule for when the miniconferences and tracks are going to occur is now posted at: https://lpc.events/event/16/timetable/#all

The runners for the miniconferences will be adding more details to each of their schedules over the coming weeks.

The Linux Plumbers Refereed track schedule and Kernel Summit schedule is now available at: https://lpc.events/event/16/timetable/#all.detailed

The leads for the networking and toolchain tracks will be adding more details to each of their schedules over the coming weeks, as well.

For those that are registered as in person, you are free to continue to submit Birds of a Feather (BOF) sessions. They will be allocated space in the BOF rooms on a first come, first served basis. Please note that the BOFs will not be recorded.

We’re looking forward to a great 3 days of presentations and discussions. We hope you can join us either in-person or virtually!

July 29, 2022 03:50 PM

July 28, 2022

Matthew Garrett: UEFI rootkits and UEFI secure boot

Kaspersky describes a UEFI implant used to attack Windows systems. Based on it appearing to require patching of the system firmware image, they hypothesise that it's propagated by manually dumping the contents of the system flash, modifying it, and then reflashing it back to the board. This probably requires physical access to the board, so it's not especially terrifying - if you're in a situation where someone's sufficiently enthusiastic about targeting you that they're reflashing your computer by hand, it's likely that you're going to have a bad time regardless.

But let's think about why this is in the firmware at all. Sophos previously discussed an implant that's sufficiently similar in some technical details that Kaspersky suggest they may be related to some degree. One notable difference is that the MyKings implant described by Sophos installs itself into the boot block of legacy MBR partitioned disks. This code will only be executed on old-style BIOS systems (or UEFI systems booting in BIOS compatibility mode), and they have no support for code signatures, so there's no need to be especially clever. Run malicious code in the boot block, patch the next stage loader, follow that chain all the way up to the kernel. Simple.

One notable distinction here is that the MBR boot block approach won't be persistent - if you reinstall the OS, the MBR will be rewritten[1] and the infection is gone. UEFI doesn't really change much here - if you reinstall Windows a new copy of the bootloader will be written out and the UEFI boot variables (that tell the firmware which bootloader to execute) will be updated to point at that. The implant may still be on disk somewhere, but it won't be run.

But there's a way to avoid this. UEFI supports loading firmware-level drivers from disk. If, rather than providing a backdoored bootloader, the implant takes the form of a UEFI driver, the attacker can set a different set of variables that tell the firmware to load that driver at boot time, before running the bootloader. OS reinstalls won't modify these variables, which means the implant will survive and can reinfect the new OS install. The only way to get rid of the implant is to either reformat the drive entirely (which most OS installers won't do by default) or replace the drive before installation.

This is much easier than patching the system firmware, and achieves similar outcomes - the number of infected users who are going to wipe their drives to reinstall is fairly low, and the kernel could be patched to hide the presence of the implant on the filesystem[2]. It's possible that the goal was to make identification as hard as possible, but there's a simpler argument here - if the firmware has UEFI Secure Boot enabled, the firmware will refuse to load such a driver, and the implant won't work. You could certainly just patch the firmware to disable secure boot and lie about it, but if you're at the point of patching the firmware anyway you may as well just do the extra work of installing your implant there.

I think there's a reasonable argument that the existence of firmware-level rootkits suggests that UEFI Secure Boot is doing its job and is pushing attackers into lower levels of the stack in order to obtain the same outcomes. Technologies like Intel's Boot Guard may (in their current form) tend to block user choice, but in theory should be effective in blocking attacks of this form and making things even harder for attackers. It should already be impossible to perform attacks like the one Kaspersky describes on more modern hardware (the system should identify that the firmware has been tampered with and fail to boot), which pushes things even further - attackers will have to take advantage of vulnerabilities in the specific firmware they're targeting. This obviously means there's an incentive to find more firmware vulnerabilities, which means the ability to apply security updates for system firmware as easily as security updates for OS components is vital (hint hint if your system firmware updates aren't available via LVFS you're probably doing it wrong).

We've known that UEFI rootkits have existed for a while (Hacking Team had one in 2015), but it's interesting to see a fairly widespread one out in the wild. Protecting against this kind of attack involves securing the entire boot chain, including the firmware itself. The industry has clearly been making progress in this respect, and it'll be interesting to see whether such attacks become more common (because Secure Boot works but firmware security is bad) or not.

[1] As we all remember from Windows installs overwriting Linux bootloaders
[2] Although this does run the risk of an infected user booting another OS instead, and being able to see the implant

July 28, 2022 10:19 PM

July 27, 2022

Daniel Vetter: Locking Engineering Principles

For various reasons I spent way too much of the last two years looking at code with terrible locking design and trying to rectify it, instead of spending a lot more time actually building cool things. Symptomatic of that is that the last post here on my neglected blog is also a rant on lockdep abuse.

I tried to distill all the lessons learned into some training slides, and this two-part series is the writeup of the same. There are some GPU specific rules, but I think the key points should at least apply to kernel drivers in general.

The first part here lays out some principles, the second part builds a hierarchy of locking engineering design patterns, from the easiest to understand and maintain to the most nightmare inducing approaches.

Also, with locking engineering I mean the general problem of protecting data structures against concurrent access by multiple threads, and trying to ensure that each thread sees a sufficiently consistent view of the data it reads and that the updates it commits won’t result in confusion. Of course what exactly sufficiently consistent means highly depends upon the precise requirements, but figuring out these kinds of questions is out of scope for this little series here.

Priorities in Locking Engineering

Designing a correct locking scheme is hard, validating that your code actually implements your design is harder, and then debugging when - not if! - you screwed up is even worse. Therefore the absolute most important rule in locking engineering, at least if you want to have any chance at winning this game, is to make the design as simple and dumb as possible.

1. Make it Dumb

Since this is the key principle the entire second part of this series will go through a lot of different locking design patterns, from the simplest and dumbest and easiest to understand, to the most hair-raising horrors of complexity and trickiness.

Meanwhile let’s continue to look at everything else that matters.

2. Make it Correct

Since simple doesn’t necessarily mean correct, especially when transferring a concept from design to code, we need guidelines. On the design front the most important one is to design for lockdep, and not fight it, for which I already wrote a full length rant. Here I will only go through the main lessons: Validating locking by hand against all the other locking designs and nesting rules the kernel has overall is nigh impossible, extremely slow, something only a few people can do with any chance of success, and hence in almost all cases a complete waste of time. We need tools to automate this, and in the Linux kernel this is lockdep.

Therefore if lockdep doesn’t understand your locking design your design is at fault, not lockdep. Adjust accordingly.

A corollary is that you actually need to teach lockdep your locking rules, because otherwise different drivers or subsystems will end up with de facto incompatible nesting and dependencies. Which, as long as you never exercise them on the same kernel boot-up, much less the same machine, won’t make lockdep grumpy. But it will make maintainers very much question why they are doing what they’re doing.

Hence at driver/subsystem/whatever load time, when CONFIG_LOCKDEP is enabled, take all key locks in the correct order. One example for this relevant to GPU drivers is in the dma-buf subsystem.
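
A sketch of what such priming can look like, with made-up locks and init function (DEFINE_MUTEX(), IS_ENABLED() and the mutex calls are the real interfaces):

#include <linux/module.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(foo_ctx_lock);
static DEFINE_MUTEX(foo_hw_lock);	/* must always nest within foo_ctx_lock */

static int __init foo_init(void)
{
	if (IS_ENABLED(CONFIG_LOCKDEP)) {
		/*
		 * Teach lockdep the intended nesting order once at load
		 * time, so that violations are caught on every boot-up
		 * and not only when the rare path that nests them runs.
		 */
		mutex_lock(&foo_ctx_lock);
		mutex_lock(&foo_hw_lock);
		mutex_unlock(&foo_hw_lock);
		mutex_unlock(&foo_ctx_lock);
	}

	return 0;
}
module_init(foo_init);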

In the same spirit, at every entry point to your library or subsystem, or anything else big, validate that the callers hold up the locking contract with might_lock(), might_sleep(), might_alloc() and all the variants and more specific implementations of this. Note that there’s a huge overlap between locking contracts and calling context in general (like interrupt safety, or whether memory allocation is allowed to call into direct reclaim), and since all these functions compile away to nothing when debugging is disabled there’s really no cost in sprinkling them around very liberally.
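
For example (foo_do_something() and struct foo are made up; might_sleep(), might_alloc() and might_lock() are the real annotations):

#include <linux/lockdep.h>
#include <linux/mutex.h>
#include <linux/sched/mm.h>

struct foo {
	struct mutex lock;
	/* ... */
};

/* Made-up subsystem entry point: check the calling context contract up front. */
int foo_do_something(struct foo *foo)
{
	might_sleep();			/* callers must be allowed to block */
	might_alloc(GFP_KERNEL);	/* we may call into direct reclaim */
	might_lock(&foo->lock);		/* we will take the object lock */

	/* ... actual implementation ... */
	return 0;
}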

On the implementation and coding side there’s a few rules of thumb to follow:

3. Make it Fast

Speed doesn’t matter if you don’t understand the design anymore in the future. You need simplicity first.

Speed doesn’t matter if all you’re doing is crashing faster. You need correctness before speed.

Finally speed doesn’t matter where users don’t notice it. If you micro-optimize a path that doesn’t even show up in real world workloads users care about, all you’ve done is wasted time and committed to future maintenance pain for no gain at all.

Similarly, optimizing code paths which should never be run in the first place, when you could instead improve your design, is not worth it. This holds especially for GPU drivers, where the real application interfaces are OpenGL, Vulkan or similar, and there’s an entire driver in the userspace side - the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts.

The big example here is GPU address patch list processing at command submission time, which was necessary for old hardware that completely lacked any useful concept of a per process virtual address space. But that has changed, which means virtual addresses can stay constant, while the kernel can still freely manage the physical memory by manipulating pagetables, like on the CPU. Unfortunately one driver in the DRM subsystem instead spent an easy engineer decade of effort to tune relocations, write lots of testcases for the resulting corner cases in the multi-level fastpath fallbacks, and even more time handling the impressive amounts of fallout in the form of bugs and future headaches due to the resulting unmaintainable code complexity …

In other subsystems where the kernel ABI is the actual application contract these kind of design simplifications might instead need to be handled between the subsystem’s code and driver implementations. This is what we’ve done when moving from the old kernel modesetting infrastructure to atomic modesetting. But sometimes no clever tricks at all help and you only get true speed with a radically revamped uAPI - io_uring is a great example here.

Protect Data, not Code

A common pitfall is to design locking by looking at the code, perhaps just sprinkling locking calls over it until it feels like it’s good enough. The right approach is to design locking for the data structures, which means specifying for each structure or member field how it is protected against concurrent changes, and how the necessary amount of consistency is maintained across the entire data structure with rules that stay invariant, irrespective of how code operates on the data. Then roll it out consistently to all the functions, because the code-first approach tends to have a lot of issues:

The big antipattern of how you end up with code centric locking is to protect an entire subsystem (or worse, a group of related subsystems) with a single huge lock. The canonical example was the big kernel lock BKL, that’s gone, but in many cases it’s just replaced by smaller, but still huge locks like console_lock().

This results in a lot of long term problems when trying to adjust the locking design later on:

For these reasons big subsystem locks tend to live way past their justified usefulness until code maintenance becomes nigh impossible: Because no individual bugfix is worth the task to really rectify the design, but each bugfix tends to make the situation worse.

From Principles to Practice

Stay tuned for next week’s installment, which will cover what these principles mean when applying to practice: Going through a large pile of locking design patterns from the most desirable to the most hair raising complex.

July 27, 2022 12:00 AM

July 20, 2022

Dave Airlie (blogspot): lavapipe Vulkan 1.3 conformant

The software Vulkan renderer in Mesa, lavapipe, achieved official Vulkan 1.3 conformance. The official entry in the table is here. We can now remove the nonconformant warning from the driver. Thanks to everyone involved!

July 20, 2022 12:42 AM

July 12, 2022

Matthew Garrett: Responsible stewardship of the UEFI secure boot ecosystem

After I mentioned that Lenovo are now shipping laptops that only boot Windows by default, a few people pointed to a Lenovo document that says:

Starting in 2022 for Secured-core PCs it is a Microsoft requirement for the 3rd Party Certificate to be disabled by default.

"Secured-core" is a term used to describe machines that meet a certain set of Microsoft requirements around firmware security, and by and large it's a good thing - devices that meet these requirements are resilient against a whole bunch of potential attacks in the early boot process. But unfortunately the 2022 requirements don't seem to be publicly available, so it's difficult to know what's being asked for and why. But first, some background.

Most x86 UEFI systems that support Secure Boot trust at least two certificate authorities:

1) The Microsoft Windows Production PCA - this is used to sign the bootloader in production Windows builds. Trusting this is sufficient to boot Windows.
2) The Microsoft Corporation UEFI CA - this is used by Microsoft to sign non-Windows UEFI binaries, including built-in drivers for hardware that needs to work in the UEFI environment (such as GPUs and network cards) and bootloaders for non-Windows.

The apparent secured-core requirement for 2022 is that the second of these CAs should not be trusted by default. As a result, drivers or bootloaders signed with this certificate will not run on these systems. This means that, out of the box, these systems will not boot anything other than Windows[1].

Given the association with the secured-core requirements, this is presumably a security decision of some kind. Unfortunately, we have no real idea what this security decision is intended to protect against. The most likely scenario is concerns about the (in)security of binaries signed with the third-party signing key - there are some legitimate concerns here, but I'm going to cover why I don't think they're terribly realistic.

The first point is that, from a boot security perspective, a signed bootloader that will happily boot unsigned code kind of defeats the point. Kaspersky did it anyway. The second is that even a signed bootloader that is intended to only boot signed code may run into issues in the event of security vulnerabilities - the Boothole vulnerabilities are an example of this, covering multiple issues in GRUB that could allow for arbitrary code execution and potential loading of untrusted code.

So we know that signed bootloaders that will (either through accident or design) execute unsigned code exist. The signatures for all the known vulnerable bootloaders have been revoked, but that doesn't mean there won't be other vulnerabilities discovered in future. Configuring systems so that they don't trust the third-party CA means that those signed bootloaders won't be trusted, which means any future vulnerabilities will be irrelevant. This seems like a simple choice?

There's actually a couple of reasons why I don't think it's anywhere near that simple. The first is that whenever a signed object is booted by the firmware, the trusted certificate used to verify that object is measured into PCR 7 in the TPM. If a system previously booted with something signed with the Windows Production CA, and is now suddenly booting with something signed with the third-party UEFI CA, the values in PCR 7 will be different. TPMs support "sealing" a secret - encrypting it with a policy that the TPM will only decrypt it if certain conditions are met. Microsoft make use of this for their default Bitlocker disk encryption mechanism. The disk encryption key is encrypted by the TPM, and associated with a specific PCR 7 value. If the value of PCR 7 doesn't match, the TPM will refuse to decrypt the key, and the machine won't boot. This means that attempting to attack a Windows system that has Bitlocker enabled using a non-Windows bootloader will fail - the system will be unable to obtain the disk unlock key, which is a strong indication to the owner that they're being attacked.

The second is that this is predicated on the idea that removing the third-party bootloaders and drivers removes all the vulnerabilities. In fact, there's been rather a lot of vulnerabilities in the Windows bootloader. A broad enough vulnerability in the Windows bootloader is arguably a lot worse than a vulnerability in a third-party loader, since it won't change the PCR 7 measurements and the system will boot happily. Removing trust in the third-party CA does nothing to protect against this.

The third reason doesn't apply to all systems, but it does to many. System vendors frequently want to ship diagnostic or management utilities that run in the boot environment, but would prefer not to have to go to the trouble of getting them all signed by Microsoft. The simple solution to this is to ship their own certificate and sign all their tooling directly - the secured-core Lenovo I'm looking at currently is an example of this, with a Lenovo signing certificate. While everything signed with the third-party signing certificate goes through some degree of security review, there's no requirement for any vendor tooling to be reviewed at all. Removing the third-party CA does nothing to protect the user against the code that's most likely to contain vulnerabilities.

Obviously I may be missing something here - Microsoft may well have a strong technical justification. But they haven't shared it, and so right now we're left making guesses. And right now, I just don't see a good security argument.

But let's move on from the technical side of things and discuss the broader issue. The reason UEFI Secure Boot is present on most x86 systems is that Microsoft mandated it back in 2012. Microsoft chose to be the only trusted signing authority. Microsoft made the decision to assert that third-party code could be signed and trusted.

We've certainly learned some things since then, and a bunch of things have changed. Third-party bootloaders based on the Shim infrastructure are now reviewed via a community-managed process. We've had a productive coordinated response to the Boothole incident, which also taught us that the existing revocation strategy wasn't going to scale. In response, the community worked with Microsoft to develop a specification for making it easier to handle similar events in future. And it's also worth noting that after the initial Boothole disclosure was made to the GRUB maintainers, they proactively sought out other vulnerabilities in their codebase rather than simply patching what had been reported. The free software community has gone to great lengths to ensure third-party bootloaders are compatible with the security goals of UEFI Secure Boot.

So, to have Microsoft, the self-appointed steward of the UEFI Secure Boot ecosystem, turn round and say that a bunch of binaries that have been reviewed through processes developed in negotiation with Microsoft, implementing technologies designed to make management of revocation easier for Microsoft, and incorporating fixes for vulnerabilities discovered by the developers of those binaries who notified Microsoft of these issues despite having no obligation to do so, and which have then been signed by Microsoft are now considered by Microsoft to be insecure is, uh, kind of impolite? Especially when unreviewed vendor-signed binaries are still considered trustworthy, despite no external review being carried out at all.

If Microsoft had a set of criteria used to determine whether something is considered sufficiently trustworthy, we could determine which of these we fell short on and do something about that. From a technical perspective, Microsoft could set criteria that would allow a subset of third-party binaries that met additional review be trusted without having to trust all third-party binaries[2]. But, instead, this has been a decision made by the steward of this ecosystem without consulting major stakeholders.

If there are legitimate security concerns, let's talk about them and come up with solutions that fix them without doing a significant amount of collateral damage. Don't complain about a vendor blocking your apps and then do the same thing yourself.

[Edit to add: there seems to be some misunderstanding about where this restriction is being imposed. I bought this laptop because I'm interested in investigating the Microsoft Pluton security processor, but Pluton is not involved at all here. The restriction is being imposed by the firmware running on the main CPU, not any sort of functionality implemented on Pluton]

[1] They'll also refuse to run any drivers that are stored in flash on Thunderbolt devices, which means eGPU setups may be more complicated, as will netbooting off Thunderbolt-attached NICs
[2] Use a different leaf cert to sign the new trust tier, add the old leaf cert to dbx unless a config option is set, leave the existing intermediate in db

July 12, 2022 08:25 AM

July 09, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Rust

Linux Plumbers Conference 2022 is pleased to host the Rust MC

Rust is a systems programming language that is making great strides in becoming the next big one in the domain.

Rust for Linux aims to bring it into the kernel since it has a key property that makes it very interesting to consider as the second language in the kernel: it guarantees no undefined behavior takes place (as long as unsafe code is sound). This includes no use-after-free mistakes, no double frees, no data races, etc.

This microconference intends to cover talks and discussions on both Rust for Linux as well as other non-kernel Rust topics.

Possible Rust for Linux topics:

Possible Rust topics:

Please come and join us in the discussion about Rust in the Linux ecosystem.

We hope to see you there!

July 09, 2022 01:50 PM

July 08, 2022

Matthew Garrett: Lenovo shipping new laptops that only boot Windows by default

I finally managed to get hold of a Thinkpad Z13 to examine a functional implementation of Microsoft's Pluton security co-processor. Trying to boot Linux from a USB stick failed out of the box for no obvious reason, but after further examination the cause became clear - the firmware defaults to not trusting bootloaders or drivers signed with the Microsoft 3rd Party UEFI CA key. This means that given the default firmware configuration, nothing other than Windows will boot. It also means that you won't be able to boot from any third-party external peripherals that are plugged in via Thunderbolt.

There's no security benefit to this. If you want security here you're paying attention to the values measured into the TPM, and thanks to Microsoft's own specification for measurements made into PCR 7, switching from booting Windows to booting something signed with the 3rd party signing key will change the measurements and invalidate any sealed secrets. It's trivial to detect this. Distrusting the 3rd party CA by default doesn't improve security, it just makes it harder for users to boot alternative operating systems.

Lenovo, this isn't OK. The entire architecture of UEFI secure boot is that it allows for security without compromising user choice of OS. Restricting boot to Windows by default provides no security benefit but makes it harder for people to run the OS they want to. Please fix it.

July 08, 2022 06:49 AM

July 07, 2022

Vegard Nossum: Stigmergy in programming

Ants are known to leave invisible pheromones on their paths in order to inform both themselves and their fellow ants where to go to find food or signal that a path leads to danger. In biology, this phenomenon is known as stigmergy: the act of modifying your environment to manipulate the future behaviour of yourself or others. From the Wikipedia article:

Stigmergy (/ˈstɪɡmərdʒi/ STIG-mər-jee) is a mechanism of indirect coordination, through the environment, between agents or actions. The principle is that the trace left in the environment by an individual action stimulates the performance of a succeeding action by the same or different agent. Agents that respond to traces in the environment receive positive fitness benefits, reinforcing the likelihood of these behaviors becoming fixed within a population over time.

For ants in particular, stigmergy is useful as it alleviates the need for memory and more direct communication; instead of broadcasting a signal about where a new source of food has been found, you can instead just leave a breadcrumb trail of pheromones that will naturally lead your community to the food.

We humans also use stigmergy in a lot of ways: most notably, we write things down. From post-it notes posted on the fridge to remind ourselves to buy more cheese to writing books that can potentially influence the behaviour of a whole future generation of young people.

Let's face it: We don't have infinite brains and we need to somehow alleviate the need to remember everything. If you remember the movie Memento, the protagonist Leonard has lost his ability to form new long-term memories and relies on stigmergy to inform his future actions; everything that's important he writes down in a place he's sure to come across it again when needed. His most important discoveries he turns into tattoos that he cannot lose or avoid seeing when he wakes up in the morning.

Perhaps a biologist would object and say this is stretching the definition of stigmergy, but I contend that it fits: leaving a trace in the environment in order to stimulate a future action.

For stigmergy to be effective, it must be placed in the right location so that whoever comes across it will perform the correct action at that time. If we return briefly to the shopping list example, we typically keep the list close to the fridge because that is often where we are when we need to write something down -- or when we go to check what we need to buy.

Let's take an example from computing: Have you ever seen a line at the top of a file that says "AUTOMATICALLY GENERATED; DON'T MODIFY THIS"? Well, that's stigmergy. Somebody made sure that line would be there in order to influence the behaviour of whomever came across that file. A little note from the past placed in the environment to manipulate future actions.

In programming, stigmergy mainly manifests as comments scattered throughout the code -- the most common form is perhaps leaving a comment to explain what a piece of code is there to do, where we know somebody will find it and, hopefully, be able to make use of it. Another one is leaving a "TODO" comment where something isn't quite finished -- you may not know that a piece of code isn't handling some corner case just by glancing at it, but a "TODO" comment stands out and may even contain enough information to complete the implementation. In other cases, we see the opposite: "here be dragons"-type comments instructing the reader not to change something, perhaps because the code is known to be complicated, complex, brittle, or prone to breaking.

Stigmergy is a powerful idea, and once you are aware of it you can consciously make use of it to help yourself and others down the line. We're not robotic ants and we can make deliberate choices regarding when, where, and how we modify our environment in order to most effectively influence future behaviour.

July 07, 2022 01:47 PM

July 06, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Power Management and Thermal Control

Linux Plumbers Conference 2022 is pleased to host the Power Management and Thermal Control Microconference

The Power Management and Thermal Control microconference focuses on frameworks related to power management and thermal control, CPU and device power-management mechanisms, and thermal-control methods. In particular, we are interested in extending the energy-efficient scheduling concept beyond energy-aware scheduling (EAS), improving the thermal control framework in the kernel to cover more use cases, and making system-wide suspend (and power management in general) more robust.

The goal is to facilitate cross-framework and cross-platform discussions that can help improve energy-awareness and thermal control in Linux.

Suggested topics:

Please come and join us in the discussion about keeping your systems cool.

We hope to see you there!

July 06, 2022 05:07 PM

July 03, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: System Boot and Security

Linux Plumbers Conference 2022 is pleased to host the System Boot and Security Microconference

For the fourth year in a row, the System Boot and Security microconference is going to bring together people interested in firmware, bootloaders, system boot, security, etc., to discuss all these topics. This year we would particularly like to focus on better communication and closer cooperation between different Free Software and Open Source projects. In the past we have seen that the lack of cooperation between projects very often delays the introduction of very interesting and important features, with TrenchBoot being a very prominent example.

The System Boot and Security MC is very important for improving such communication and cooperation, but it is not limited to this kind of problem. We would like to encourage all stakeholders to bring and discuss issues that they encounter in the broad sense of system boot and security.

Expected topics:

Please come and join us in the discussion about how to keep your system secure from the very boot.

We hope to see you there!

July 03, 2022 02:39 PM

June 30, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: VFIO/IOMMU/PCI

Linux Plumbers Conference 2022 is pleased to host the VFIO/IOMMU/PCI Microconference

The PCI interconnect specification, the devices that implement it, and the system IOMMUs that provide memory and access control to them are nowadays a de-facto standard for connecting high-speed components, incorporating more and more features such as:

These features are aimed at high-performance systems, server and desktop computing, embedded and SoC platforms, virtualization, and ubiquitous IoT devices.

The kernel code that enables these new system features focuses on coordination between the PCI devices, the IOMMUs they are connected to and the VFIO layer used to manage them (for userspace access and device passthrough) with related kernel interfaces and userspace APIs to be designed in-sync and in a clean way for all three sub-systems.

The VFIO/IOMMU/PCI micro-conference focuses on the kernel code that enables these new system features that often require coordination between the VFIO, IOMMU and PCI sub-systems.

Tentative topics include (but are not limited to):

Come and join us in the discussion in helping Linux keep up with the new features being added to the PCI interconnect specification.

We hope to see you there!

June 30, 2022 04:56 PM

June 27, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Android

Linux Plumbers Conference 2022 is pleased to host the Android Microconference

Continuing in the same direction as last year, this year’s Android microconference will be an opportunity to foster collaboration between the Android and Linux kernel communities. Discussions will be centered on the goal of ensuring that Android and Linux development move in lockstep going forward.

Projected topics:

Please come and join us in the discussion of making Android a better partner with Linux.

We hope to see you there!

June 27, 2022 01:55 PM

June 24, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Open Printing

Linux Plumbers Conference 2022 is pleased to host the Open Printing Microconference

OpenPrinting has been improving the way we print in Linux. Over the years we have changed many conventional ways of printing and scanning. Over the last few years we have been emphasizing the fact that driverless print and scan has made life easier; however, this does not make us stop improving. Every day we are trying to design new ways of printing to make your printing and scanning experience better than it is today.

Proposed Topics:

Please come and join us in the discussion to give Linux printing, scanning and fax a better experience.

We hope to see you there!

June 24, 2022 09:20 PM

Kees Cook: finding binary differences

As part of the continuing work to replace 1-element arrays in the Linux kernel, it’s very handy to show that a source change has had no executable code difference. For example, if you started with this:

struct foo {
    unsigned long flags;
    u32 length;
    u32 data[1];
};

void foo_init(int count)
{
    struct foo *instance;
    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
    ...
    instance = kmalloc(bytes, GFP_KERNEL);
    ...
};

And you changed only the struct definition:

-    u32 data[1];
+    u32 data[];

The bytes calculation is going to be incorrect, since it is still subtracting 1 element’s worth of space from the desired count. (And let’s ignore for the moment the open-coded calculation that may end up with an arithmetic over/underflow here; that can be solved separately by using the struct_size() helper or the size_mul(), size_add(), etc family of helpers.)

The missed adjustment to the size calculation is relatively easy to find in this example, but sometimes it’s much less obvious how structure sizes might be woven into the code. I’ve been checking for issues by using the fantastic diffoscope tool. It can produce a LOT of noise if you try to compare builds without keeping in mind the issues solved by reproducible builds, with some additional notes. I prepare my build with the “known to disrupt code layout” options disabled, but with debug info enabled:

$ KBF="KBUILD_BUILD_TIMESTAMP=1970-01-01 KBUILD_BUILD_USER=user KBUILD_BUILD_HOST=host KBUILD_BUILD_VERSION=1"
$ OUT=gcc
$ make $KBF O=$OUT allmodconfig
$ ./scripts/config --file $OUT/.config \
        -d GCOV_KERNEL -d KCOV -d GCC_PLUGINS -d IKHEADERS -d KASAN -d UBSAN \
        -d DEBUG_INFO_NONE -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
$ make $KBF O=$OUT olddefconfig

Then I build a stock target, saving the output in “before”. In this case, I’m examining drivers/scsi/megaraid/:

$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/before
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/before/

Then I patch and build a modified target, saving the output in “after”:

$ vi the/source/code.c
$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/after
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/after/

And then run diffoscope:

$ diffoscope $OUT/before/ $OUT/after/

If diffoscope output reports nothing, then we’re done. 🥳

Usually, though, when source lines move around other stuff will shift too (e.g. WARN macros rely on line numbers, so the bug table may change contents a bit, etc), and diffoscope output will look noisy. To examine just the executable code, the command that diffoscope used is reported in the output, and we can run it directly, but with possibly shifted line numbers not reported, i.e. running objdump without --line-numbers:

$ ARGS="--disassemble --demangle --reloc --no-show-raw-insn --section=.text"
$ for i in $(cd $OUT/before && echo *.o); do
        echo $i
        diff -u <(objdump $ARGS $OUT/before/$i | sed "0,/^Disassembly/d") \
                <(objdump $ARGS $OUT/after/$i  | sed "0,/^Disassembly/d")
done

If I see an unexpected difference, for example:

-    c120:      movq   $0x0,0x800(%rbx)
+    c120:      movq   $0x0,0x7f8(%rbx)

Then I'll search for the pattern with line numbers added to the objdump output:

$ vi <(objdump --line-numbers $ARGS $OUT/after/megaraid_sas_fp.o)

I'd search for "0x0,0x7f8", find the source file and line number above it, open that source file at that position, and look to see where something was being miscalculated:

$ vi drivers/scsi/megaraid/megaraid_sas_fp.c +329

Once tracked down, I'd start over at the "patch and build a modified target" step above, repeating until there were no differences. For example, in the starting example, I'd also need to make this change:

-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = sizeof(*instance) + sizeof(u32) * count;

Though, as hinted earlier, better yet would be:

-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = struct_size(instance, data, count);

But sometimes adding the helper usage will add binary output differences since they're performing overflow checking that might saturate at SIZE_MAX. To help with patch clarity, those changes can be done separately from fixing the array declaration.
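
Putting both changes together, a minimal sketch of the fully converted opening example (same illustrative foo/foo_init names as above, with the allocation details still elided) would look like:

struct foo {
    unsigned long flags;
    u32 length;
    u32 data[];
};

void foo_init(int count)
{
    struct foo *instance;
    /* struct_size() is overflow-checked and saturates at SIZE_MAX. */
    size_t bytes = struct_size(instance, data, count);
    ...
    instance = kmalloc(bytes, GFP_KERNEL);
    ...
}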

© 2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

June 24, 2022 08:11 PM

June 22, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: CPU Isolation

Linux Plumbers Conference 2022 is pleased to host the CPU Isolation Microconference

CPU Isolation is the ability to shield workloads with extreme latency or performance requirements from interruptions (also known as Operating System noise), provided by a close combination of several kernel and userspace components. Examples of such workloads are DPDK use cases in Telco/5G, where even the shortest interruption can cause packet losses, eventually leading to exceeding QoS requirements.

Despite considerable improvements in the last few years towards implementing full CPU Isolation (nohz_full, rcu_nocb, isolcpus, etc.), there are issues to be addressed, as it is still relatively simple to highlight sources of OS noise just by running synthetic workloads that mimic the polling (always-running) type of applications similar to the ones mentioned above.

There have been recent improvements and discussions about CPU isolation features on LKML, and tools such as the osnoise tracer and rtla osnoise have improved CPU isolation analysis. Nevertheless, this is an ongoing process, and discussions are needed to speed up solutions for existing issues and to improve the existing tools and methods.

The purpose of CPU Isolation MC is to get together to discuss open problems, most notably: how to improve the identification of OS noise sources, how to track them publicly and how to tackle the sources of noise that have already been identified.

A non-exhaustive list of potential topics is:

Please come and join us in the discussion about CPU isolation.

We hope to see you there!

June 22, 2022 02:56 AM

Linux Plumbers Conference: Registration Still Sold Out, But There is Now a Waitlist

Because we ran out of places so fast, we are setting up a waitlist for in-person registration (virtual attendee places are still available). Please fill in this form and try to be clear about your reasons for wanting to attend. This year we’re giving waitlist priority to new attendees and people expected to contribute content. We expect to be able to accept our first group of attendees from the waitlist in mid July.

June 22, 2022 01:22 AM

June 18, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: IoTs a 4-Letter Word

Linux Plumbers Conference 2022 is pleased to host the IoT Microconference

The IoT microconference is back for its fourth year and our Open Source HW / SW / FW communities are productizing Linux and Zephyr in ways that we have never seen before.

A lot has happened in the last year that is worth discussing and bringing forward:

Each of the above items was a large effort made by Linux-centric communities actively pushing the bounds of what is possible in IoT.

Whether you are an apprentice or master, we welcome you to bring your plungers and join us for a deep dive into the pipework of Linux IoT!

We hope to see you there!

June 18, 2022 02:37 PM

June 15, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Real-time and Scheduling

Linux Plumbers Conference 2022 is pleased to host the Real-time and Scheduling Microconference

The real-time and scheduling micro-conference joins these two intrinsically connected communities to discuss the next steps together.

Over the past decade, many parts of PREEMPT_RT have been included in the official Linux codebase. Examples include real-time mutexes, high-resolution timers, lockdep, ftrace, RCU_PREEMPT, threaded interrupt handlers and more. The number of patches that still need integration has been significantly reduced, and the remaining ones are mature enough to make their way into mainline Linux.

The scheduler is the core of Linux performance. With different topologies and workloads, it is not an easy task to give the user the best experience possible, from low latency to high throughput, and from small power-constrained devices to HPC.

This year’s topics to be discussed include:

Please come and join us in the discussion of controlling what tasks get to run on your machine and when.

We hope to see you there!

June 15, 2022 04:10 PM

June 14, 2022

Linux Plumbers Conference: Registration Currently Sold Out, We’re Trying to Add More Places

Back in 2021 when we were planning this conference, everyone warned us that we’d still be doing social distancing and that in-person conferences were likely not to be as popular as they had been, so we lowered our headcount to fit within a socially distanced venue.   Unfortunately the enthusiasm of the plumbers community didn’t follow this conventional wisdom so the available registrations sold out within days of being released.  We’re now investigating how we might expand the venue capacity to accommodate some of the demand for in-person registration, so stay tuned for what we find out.

June 14, 2022 09:00 PM

June 13, 2022

Linux Plumbers Conference: CFP Deadline Extended – Refereed Presentations

This is the last year that we will be adhering to our long-standing tradition of extending the deadline by one week. In 2023, we will break from this tradition, so that the refereed-track deadline will be a hard deadline, not subject to extension.

But this is still 2022, and so we are taking this one last opportunity to announce that we are extending the Refereed-Track deadline from the current June 12 to June 19. Again, if you have already submitted a proposal, thank you very much! For the rest of you, there is one additional week in which to get your proposal submitted. We very much look forward to seeing what you all come up with.

June 13, 2022 03:12 PM

June 11, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Kernel Memory Management

This microconference supplements the LSF/MM event by providing an opportunity to discuss current topics with a different audience, in a different location, and at a different time of year.

The microconference is about current problems in kernel memory management, for example:

Please come and join us in the discussion about the rocket science of kernel memory management.

We hope to see you there!

June 11, 2022 01:55 PM

June 09, 2022

Linux Plumbers Conference: Registration for Linux Plumbers Conference is now open

We hope very much to see you in Dublin in September (12-14th). Please visit our attend page for all the details.

June 09, 2022 12:40 PM

June 08, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Compute Express Link

Linux Plumbers Conference 2022 is pleased to host the Compute Express Link Microconference

Compute Express Link is a cache-coherent fabric that is gaining a lot of momentum in the industry. Hardware vendors have begun to ramp up on CXL 2.0 hardware, and software must not lag behind. The current software ecosystem looks promising, with enough components ready to begin provisioning of test systems.

The Compute Express Link microconference focuses on how to evolve the Linux CXL kernel driver and userspace for full support of CXL 2.0 and beyond. It is also an opportunity to discuss the needs and expectations of everyone in the CXL community and to address the current state of development.

Suggested topics:

Please come and join us in the discussion about the Linux support of the next generation high speed interconnect.

We hope to see you there!

June 08, 2022 07:10 PM

June 05, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: RISC-V

Linux Plumbers Conference 2022 is pleased to host the RISC-V Microconference

The RISC-V software ecosystem continues to grow tremendously, with many RISC-V ISA extensions having been ratified last year. Many features supporting the ratified extensions are under development, for instance svpbmt, sstc, sscofpmf and cbo.
The RISC-V microconference will discuss these issues with the wider community to arrive at solutions, as has been done successfully in the past.

Here are a few of the expected topics and current problems in RISC-V Linux land that would be covered this year:

Please come and join us in the discussion on how we can improve the support for RISC-V in the Linux kernel.

We hope to see you there!

June 05, 2022 06:41 PM

June 02, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Zoned Storage Devices (SMR HDDs & ZNS SSDs)

Linux Plumbers Conference 2022 is pleased to host the Zoned Storage Devices (SMR HDDs & ZNS SSDs).

The Zoned Storage interface has been introduced to make more efficient use of the storage medium, improving both device raw capacity and performance. Zoned storage devices expose their storage through zone semantics with a set of read/write rules associated with each zone.

The Linux kernel has supported SMR HDDs since kernel 4.10 and ZNS SSDs since kernel 5.9. Furthermore, a few parts of the storage stack have been extended with zone support, for example the btrfs and f2fs filesystems, and the device mapper targets dm-linear and dm-zoned.

The Zoned Storage microconference aims to communicate the benefits of Zoned Storage to a broader audience, present the current and future challenges within the Linux storage stack, and collaborate with the wider community.

Please come and join us in the discussion about the advantages and challenges of Zoned Storage.

We hope to see you there!

June 02, 2022 01:08 PM

May 29, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Service Management and systemd

Linux Plumbers Conference 2022 is pleased to host the Service Management and systemd Microconference.

The focus of this microconference will be on topics related to the current state of host-level service management and ideas for the future.

Most of the topics will be around the systemd ecosystem, as it is the most widely adopted service manager. The Service Management and systemd microconference also welcomes proposals that are not specific to systemd, so we can discover and share new ideas on how to improve service management in general.

Please come and join us in the discussion about the future of service management.

We hope to see you there!

May 29, 2022 11:34 AM

Paul E. Mc Kenney: Stupid RCU Tricks: How Read-Intensive is The Kernel's Use of RCU?

RCU is a specialized synchronization mechanism, and is typically used where there are far more readers (rcu_read_lock(), rcu_read_unlock(), rcu_dereference(), and so on) than there are updaters (synchronize_rcu(), call_rcu(), rcu_assign_pointer(), and so on). But does the Linux kernel really make heavier use of RCU's read-side primitives than of its update-side primitives?
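
(As a quick refresher, and not part of the original post: a minimal sketch of what a reader and an updater typically look like, using a made-up foo structure, foo_lock, and global pointer.)

struct foo {
        int a;
};

static DEFINE_SPINLOCK(foo_lock);               /* serializes updaters */
static struct foo __rcu *global_foo;

int read_foo_a(void)                            /* read side: cheap and frequent */
{
        struct foo *p;
        int a = -1;

        rcu_read_lock();
        p = rcu_dereference(global_foo);
        if (p)
                a = p->a;
        rcu_read_unlock();
        return a;
}

void update_foo(struct foo *new_foo)            /* update side: rare and expensive */
{
        struct foo *old;

        spin_lock(&foo_lock);
        old = rcu_dereference_protected(global_foo, lockdep_is_held(&foo_lock));
        rcu_assign_pointer(global_foo, new_foo);
        spin_unlock(&foo_lock);

        synchronize_rcu();                      /* wait for pre-existing readers */
        kfree(old);
}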

One way to determine this would be to use something like ftrace to record all the calls to these functions. This works, but trace messages can be lost, especially when applied to frequently invoked functions. Also, dumping out the trace buffer can perturb the system. Another approach is to modify the kernel source code to count these function invocations in a cache-friendly manner, then come up with some way to dump this to userspace. This works, but I am lazy. Yet another approach is to ask the tracing folks for advice.

This last is what I actually did, and because the tracing person I happened to ask happened to be Andrii Nakryiko, I learned quite a bit about BPF in general and the bpftrace command in particular. If you don't happen to have Andrii on hand, you can do quite well with Appendix A and Appendix B of Brendan Gregg's “BPF Performance Tools”. You will of course need to install bpftrace itself, which is reasonably straightforward on many Linux distributions.

Linux-Kernel RCU Read Intensity

Those of you who have used sed and awk have a bit of a running start because you can invoke bpftrace with a -e argument and a series of tracepoint/program pairs, where a program is bpftrace code enclosed in curly braces. This code is compiled, verified, and loaded into the running kernel. When the code finishes executing, the results are printed right there for you on stdout. For example:

bpftrace -e 'kprobe:__rcu_read_lock { @rcu_reader = count(); } kprobe:rcu_gp_fqs_loop { @gp = count(); } interval:s:10 { exit(); }'

This command uses the kprobe facility to attach a program to the __rcu_read_lock() function and to attach a very similar program to the rcu_gp_fqs_loop() function, which happens to be invoked exactly once per RCU grace period. Both programs count the number of calls, with @gp being the bpftrace “variable” accumulating the count, and the count() function doing the counting in a cache-friendly manner. The final interval:s:10 in effect attaches a program to a timer, so that this last program will execute every 10 seconds (“s:10”). Except that the program invokes the exit() function that terminates this bpftrace program at the end of the very first 10-second time interval. Upon termination, bpftrace outputs the following on an idle system:

Attaching 3 probes...

@gp: 977
@rcu_reader: 6435368

In other words, there were about a thousand grace periods and more than six million RCU readers during that 10-second time period, for a read-to-grace-period ratio of more than six thousand. This certainly qualifies as read-intensive.

But what if the system is busy? Much depends on exactly how busy the system is, as well as exactly how it is busy, but let's use that old standby, the kernel build (but using the nice command to avoid delaying bpftrace). Let's also put the bpftrace script into a creatively named file rcu1.bpf like so:

kprobe:__rcu_read_lock
{
        @rcu_reader = count();
}

kprobe:rcu_gp_fqs_loop
{
        @gp = count();
}

interval:s:10
{
        exit();
}

This allows the command bpftrace rcu1.bpf to produce the following output:

Attaching 3 probes...

@gp: 274
@rcu_reader: 78211260

Where the idle system had about one thousand grace periods over the course of ten seconds, the busy system had only 274. On the other hand, the busy system had 78 million RCU read-side critical sections, more than ten times that of the idle system. The busy system had more than one quarter million RCU read-side critical sections per grace period, which is seriously read-intensive.

RCU works hard to make the same grace-period computation cover multiple requests. Because synchronize_rcu() invokes call_rcu(), we can use the number of call_rcu() invocations as a rough proxy for the number of updates, that is, the number of requests for a grace period. (The more invocations of synchronize_rcu_expedited() and kfree_rcu(), the rougher this proxy will be.)

We can make the bpftrace script more concise by assigning the same action to a group of tracepoints, as in the rcu2.bpf file shown here:

kprobe:__rcu_read_lock, kprobe:call_rcu, kprobe:rcu_gp_fqs_loop { @[func] = count(); } interval:s:10 { exit(); }

With this file in place, bpftrace rcu2.bpf produces the following output in the midst of a kernel build:

Attaching 4 probes...

@[rcu_gp_fqs_loop]: 128
@[call_rcu]: 195721
@[__rcu_read_lock]: 21985946

These results look quite different from the earlier kernel-build results, confirming any suspicions you might harbor about the suitability of kernel builds as a repeatable benchmark. Nevertheless, there are about 180K RCU read-side critical sections per grace period, which is still seriously read-intensive. Furthermore, there are also almost 2K call_rcu() invocations per RCU grace period, which means that RCU is able to amortize the overhead of a given grace period down to almost nothing per grace-period request.

Linux-Kernel RCU Grace-Period Latency

The following bpftrace program makes a histogram of grace-period latencies, that is, the time from the call to rcu_gp_init() to the return from rcu_gp_cleanup():

kprobe:rcu_gp_init {
        @start = nsecs;
}

kretprobe:rcu_gp_cleanup {
        if (@start) {
                @gplat = hist((nsecs - @start)/1000000);
        }
}

interval:s:10 {
        printf("Internal grace-period latency, milliseconds:\n");
        exit();
}

The kretprobe attaches the program to the return from rcu_gp_cleanup(). The hist() function computes a log-scale histogram. The check of the @start variable avoids a beginning-of-time value for this variable in the common case where this script starts in the middle of a grace period. (Try it without that check!)

The output is as follows:
Attaching 3 probes...
Internal grace-period latency, milliseconds:

@gplat:
[2, 4)               259 |@@@@@@@@@@@@@@@@@@@@@@                              |
[4, 8)               591 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)              137 |@@@@@@@@@@@@                                        |
[16, 32)               3 |                                                    |
[32, 64)               5 |                                                    |

@start: 95694642573968

Most of the grace periods complete in between four and eight milliseconds, with most of the remainder completing in either the two-to-four or the eight-to-sixteen millisecond buckets, but with a few stragglers taking up to 64 milliseconds. The final @start line shows that bpftrace simply dumps out all the variables. You can use the delete(@start) function to prevent printing of @start, but please note that the next invocation of rcu_gp_init() will re-create it.

It is nice to know the internal latency of an RCU grace period, but most in-kernel users will be more concerned about the latency of the synchronize_rcu() function, which will need to wait for the current grace period to complete and also for callback invocation. We can measure this function's latency with the following bpftrace script:

kprobe:synchronize_rcu {
        @start[tid] = nsecs;
}

kretprobe:synchronize_rcu {
        if (@start[tid]) {
                @srlat = hist((nsecs - @start[tid])/1000000);
                delete(@start[tid]);
        }
}

interval:s:10 {
        printf("synchronize_rcu() latency, milliseconds:\n");
        exit();
}

The tid variable contains the ID of the currently running task, which allows this script to associate a given return from synchronize_rcu() with the corresponding call by using tid as an index to the @start variable.

As you would expect, the resulting histogram is weighted towards somewhat longer latencies, though without the stragglers:

Attaching 3 probes...
synchronize_rcu() latency, milliseconds:

@srlat:
[4, 8)                 9 |@@@@@@@@@@@@@@@                                     |
[8, 16)               31 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)              31 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@start[4075307]: 96560784497352

In addition, we see not one but two values for @start. The delete statement gets rid of old ones, but any new call to synchronize_rcu() will create more of them.

Linux-Kernel Expedited RCU Grace-Period Latency

Linux kernels will sometimes execute synchronize_rcu_expedited() to obtain a faster grace period, and the following command will further cause synchronize_rcu() to act like synchronize_rcu_expedited():

echo 1 >  /sys/kernel/rcu_expedited

Doing this on a dual-socket system with 80 hardware threads might be ill-advised, but you only live once!

Ill-advised or not, the following bpftrace script measures synchronize_rcu_expedited() latency, but in microseconds rather than milliseconds:

kprobe:synchronize_rcu_expedited {
        @start[tid] = nsecs;
}

kretprobe:synchronize_rcu_expedited {
        if (@start[tid]) {
                @srelat = hist((nsecs - @start[tid])/1000);
                delete(@start[tid]);
        }
}

interval:s:10 {
        printf("synchronize_rcu() latency, microseconds:\n");
        exit();
}

The output of this script run concurrently with a kernel build is as follows:

Attaching 3 probes...
synchronize_rcu() latency, microseconds:


@srelat: 
[128, 256)            57 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)            14 |@@@@@@@@@@@@                                        |
[512, 1K)              1 |                                                    |
[1K, 2K)               2 |@                                                   |
[2K, 4K)               7 |@@@@@@                                              |
[4K, 8K)               2 |@                                                   |
[8K, 16K)              3 |@@                                                  |

@start[4140285]: 97489845318700

Most synchronize_rcu_expedited() invocations complete within a few hundred microseconds, but with a few stragglers around ten milliseconds.

But what about linear histograms? This is what the lhist() function is for, with added minimum, maximum, and bucket-size arguments:

kprobe:synchronize_rcu_expedited {
        @start[tid] = nsecs;
}

kretprobe:synchronize_rcu_expedited {
        if (@start[tid]) {
                @srelat = lhist((nsecs - @start[tid])/1000, 0, 1000, 100);
                delete(@start[tid]);
        }
}

interval:s:10 {
        printf("synchronize_rcu() latency, microseconds:\n");
        exit();
}

Running this with the usual kernel build in the background:

Attaching 3 probes...
synchronize_rcu() latency, microseconds:


@srelat: 
[100, 200)            26 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 300)            13 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[300, 400)             5 |@@@@@@@@@@                                          |
[400, 500)             1 |@@                                                  |
[500, 600)             0 |                                                    |
[600, 700)             2 |@@@@                                                |
[700, 800)             0 |                                                    |
[800, 900)             1 |@@                                                  |
[900, 1000)            1 |@@                                                  |
[1000, ...)           18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |

@start[4184562]: 98032023641157

The final bucket is overflow, containing measurements that exceeded the one-millisecond limit.

The above histogram had only a few empty buckets, but that is mostly because the 18 synchronize_rcu_expedited() instances that overflowed the one-millisecond limit are consolidated into a single [1000, ...) overflow bucket. This is sometimes what is needed, but other times losing the maximum latency can be a problem. This can be dealt with given the following bpftrace program:

kprobe:synchronize_rcu_expedited {
        @start[tid] = nsecs;
}

kretprobe:synchronize_rcu_expedited {
        if (@start[tid]) {
                @srelat[(nsecs - @start[tid])/100000*100] = count();
                delete(@start[tid]);
        }       
}       

interval:s:10 {
        printf("synchronize_rcu() latency, microseconds:\n");
        exit();
}

Given the usual kernel-build background load, this produces the following output:

Attaching 3 probes...
synchronize_rcu() latency, microseconds:


@srelat[1600]: 1
@srelat[500]: 1
@srelat[1000]: 1
@srelat[700]: 1
@srelat[1100]: 1
@srelat[2300]: 1
@srelat[300]: 1
@srelat[400]: 2
@srelat[600]: 3
@srelat[200]: 4
@srelat[100]: 20
@start[763214]: 17487881311831

This is a bit hard to read, but simple scripting can be applied to this output to produce something like this:

100: 20
200: 4
300: 1
400: 2
500: 1
600: 3
700: 1
1000: 1
1100: 1
1600: 1

This produces compact output despite outliers such as the last entry, corresponding to an invocation that took somewhere between 1.6 and 1.7 milliseconds.

Summary

The bpftrace command can be used to quickly and easily script compiled in-kernel programs that can measure and monitor a wide variety of things. This post focused on a few aspects of RCU, but quite a bit more material may be found in Brendan Gregg's “BPF Performance Tools” book.

May 29, 2022 04:08 AM

May 28, 2022

Linux Plumbers Conference: Linux Plumbers Conference Refereed-Track Deadlines

The proposal deadline is June 12, which is right around the corner.  We have excellent submissions, for which we gratefully thank our submitters! For the rest of you, we do have one problem, namely that we do not yet have your submission. So please point your browser at the call-for-proposals page and submit your proposal. After all, if you don’t submit it, we won’t accept it!

May 28, 2022 06:17 PM

May 27, 2022

Dave Airlie (blogspot): lavapipe Vulkan 1.2 conformant

The software Vulkan renderer in Mesa, lavapipe, achieved official Vulkan 1.2 conformance. The non-obvious entry in the table is here.

Thanks to all the Mesa team who helped achieve this. Shout out to Mike of Zink fame, who drove a bunch of pieces over the line, and to Roland, who helped review some of the funkier changes.

We will be submitting 1.3 conformance soon, just a few things to iron out.

May 27, 2022 07:58 PM

Paul E. Mc Kenney: Stupid RCU Tricks: Is RCU Watching?

It is just as easy to ask why RCU wouldn't be watching all the time. After all, you never know when you might need to synchronize!

Unfortunately, an eternally watchful RCU is impractical in the Linux kernel due to energy-efficiency considerations. The problem is that if RCU watches an idle CPU, RCU needs that CPU to execute instructions. And making an idle CPU unnecessarily execute instructions (for a rather broad definition of the word “unnecessarily”) will terminally annoy a great many people in the battery-powered embedded world. And for good reason: Making RCU avoid watching idle CPUs can provide 30-40% increases in battery lifetime.

In this, CPUs are not all that different from people. Interrupting someone who is deep in thought can cause them to lose 20 minutes of work. Similarly, when a CPU is deeply idle, asking it to execute instructions will consume not only the energy required for those instructions, but also much more energy to work its way out of that deep idle state, and then to return back to that deep idle state.

And this is why CPUs must tell RCU to stop watching them when they go idle. This allows RCU to ignore them completely, in particular, to refrain from asking them to execute instructions.

In some kernel configurations, RCU also ignores portions of the kernel's entry/exit code, that is, the last bits of kernel code before switching to userspace and the first bits of kernel code after switching away from userspace. This happens only in kernels built with CONFIG_NO_HZ_FULL=y, and even then only on CPUs mentioned in the CPU list passed to the nohz_full kernel parameter. This enables carefully configured HPC applications and CPU-bound real-time applications to get near-bare-metal performance from such CPUs, while still having the entire Linux kernel at their beck and call. Because RCU is not watching such applications, the scheduling-clock interrupt can be turned off entirely, thus avoiding disturbing such performance-critical applications.

But if RCU is not watching a given CPU, rcu_read_lock() has no effect on that CPU, which can come as a nasty shock to the corresponding RCU read-side critical section, which naively expected to be able to safely traverse an RCU-protected data structure. This can be a trap for the unwary, which is why kernels built with CONFIG_PROVE_LOCKING=y (lockdep) complain bitterly when rcu_read_lock() is invoked on CPUs that RCU is not watching.

But suppose that you have code using RCU that is invoked both from deep within the idle loop and from normal tasks.

Back in the day, this was not much of a problem. True to its name, the idle loop was not much more than a loop, and the deep architecture-specific code on the kernel entry/exit paths had no need of RCU. This has changed, especially with the advent of idle drivers and governors, to say nothing of tracing. So what can you do?

First, you can invoke rcu_is_watching(), which, as its name suggests, will return true if RCU is watching. And, as you might expect, lockdep uses this function to figure out when it should complain bitterly. The following example code lays out the current possibilities:

if (rcu_is_watching())
        printk("Invoked from normal or idle task with RCU watching.\n");
else if (is_idle_task(current))
        printk("Invoked from deep within the idle task where RCU is not watching.\n");
else
        printk("Invoked from nohz_full entry/exit code where RCU is not watching.\n");

Except that even invoking printk() is an iffy proposition while RCU is not watching.

So suppose that you invoke rcu_is_watching() and it helpfully returns false, indicating that you cannot invoke rcu_read_lock() and friends. What now?

You could do what the v5.18 Linux kernel's kernel_text_address() function does, which can be abbreviated as follows:

no_rcu = !rcu_is_watching();
if (no_rcu)
        rcu_nmi_enter(); // Make RCU watch!!!
do_rcu_traversals();
if (no_rcu)
        rcu_nmi_exit(); // Return RCU to its prior watchfulness state.

If your code is not so performance-critical, you can do what the arm64 implementation of the cpu_suspend() function does:

RCU_NONIDLE(__cpu_suspend_exit());

This macro forces RCU to watch while it executes its argument as follows:

#define RCU_NONIDLE(a) \
        do { \
                rcu_irq_enter_irqson(); \
                do { a; } while (0); \
                rcu_irq_exit_irqson(); \
        } while (0)

The rcu_irq_enter_irqson() and rcu_irq_exit_irqson() functions are essentially wrappers around the aforementioned rcu_nmi_enter() and rcu_nmi_exit() functions.

Although RCU_NONIDLE() is more compact than the kernel_text_address() approach, it is still annoying to have to pass your code to a macro. And this is why Peter Zijlstra has been reworking the various idle loops to cause RCU to be watching a much greater fraction of their code. This might well be an ongoing process as the idle loops continue gaining functionality, but Peter's good work thus far at least makes RCU watch the idle governors and a much larger fraction of the idle loop's trace events. When combined with the kernel entry/exit work by Peter, Thomas Gleixner, Mark Rutland, and many others, it is hoped that the functions not watched by RCU will all eventually be decorated with something like noinstr, for example:

static noinline noinstr unsigned long rcu_dynticks_inc(int incby)
{
        return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
}

We don't need to worry about exactly what this function does. For this blog entry, it is enough to know that its noinstr tag prevents tracing this function, making it less problematic for RCU to not be watching it.

What exactly are you prohibited from doing while RCU is not watching your code?

As noted before, RCU readers are a no-go. If you try invoking rcu_read_lock(), rcu_read_unlock(), rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), or rcu_read_unlock_sched() from regions of code where rcu_is_watching() would return false, lockdep will complain.

On the other hand, using SRCU (srcu_read_lock() and srcu_read_unlock()) is just fine, as is RCU Tasks Trace (rcu_read_lock_trace() and rcu_read_unlock_trace()). RCU Tasks Rude does not have explicit read-side markers, but anything that disables preemption acts as an RCU Tasks Rude reader no matter what rcu_is_watching() would return at the time.
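
For instance, code that must traverse shared data from a place where RCU might not be watching could use SRCU instead. Here is a minimal sketch (the my_srcu name and the caller are made up for illustration):

DEFINE_STATIC_SRCU(my_srcu);

static void traverse_from_idle_path(void)
{
        int idx;

        /* Legal even where rcu_is_watching() would return false. */
        idx = srcu_read_lock(&my_srcu);
        /* ... access data whose freeing waits on synchronize_srcu(&my_srcu) ... */
        srcu_read_unlock(&my_srcu, idx);
}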

RCU Tasks is an odd special case. Like RCU Tasks Rude, RCU Tasks has implicit read-side markers, which are any region of non-idle-task kernel code that does not do a voluntary context switch (the idle tasks are instead handled by RCU Tasks Rude). Except that in kernels built with CONFIG_PREEMPTION=n and without any of RCU's test suite, the RCU Tasks API maps to plain old RCU. This means that code not watched by RCU is ignored by the remapped RCU Tasks in such kernels. Given that RCU Tasks ignores the idle tasks, this affects only user entry/exit code in kernels built with CONFIG_NO_HZ_FULL=y, and even then, only on CPUs mentioned in the list given to the nohz_full kernel boot parameter. However, this situation can nevertheless be a trap for the unwary.

Therefore, in post-v5.18 mainline, you can build your kernel with CONFIG_FORCE_TASKS_RCU=y, in which case RCU Tasks will always be built into your kernel, avoiding this trap.

In summary, energy-efficiency, battery-lifetime, and application-performance/latency concerns force RCU to avert its gaze from idle CPUs, and, in kernels built with CONFIG_NO_HZ_FULL=y, also from nohz_full CPUs on the low-level kernel entry/exit code paths. Fortunately, recent changes have allowed RCU to watch more code, but this being the kernel, corner cases will always be with us. This corner-case code from which RCU must avert its gaze requires the special handling described in this blog post.

May 27, 2022 12:43 AM

May 26, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Containers and Checkpoint/Restore

Linux Plumbers Conference 2022 is pleased to host the Containers and Checkpoint/Restore Microconference

The Containers and Checkpoint/Restore Microconference focuses on both userspace and kernel-related work. The microconference targets the wider container ecosystem, ideally with participants from all major container runtimes as well as init system developers.

Potential discussion topics include:

Please come and join the discussion centered on what holds “The Cloud” together.

We hope to see you there!

May 26, 2022 09:55 AM

May 23, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Kernel Testing & Dependability

Linux Plumbers Conference 2022 is pleased to host the Kernel Testing & Dependability Microconference

The Kernel Testing & Dependability Microconference focuses on advancing the state of testing of the Linux kernel and testing on Linux in general. The main purpose is to improve software quality and dependability for applications that require predictability and trust. The microconference aims to create connections between folks working on similar projects, and to help individual projects make progress.

This microconference is a merge of the Testing and Fuzzing and the Kernel Dependability and Assurance microconferences into a single session. There was a lot of overlap in topics and attendees of these MCs, and combining the two tracks will promote collaboration between all the interested communities and people.

The Microconference is open to all topics related to testing on Linux, not necessarily in the kernel space.

Please come and join us in the discussion on how we can assure that Linux becomes the most trusted and dependable software in the world!

We hope to see you there!

May 23, 2022 09:23 AM

May 20, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: linux/arch

Linux Plumbers Conference 2022 is pleased to host the linux/arch Microconference

The linux/arch microconference aims to bring architecture maintainers together in one room to discuss how the code in arch/ can be improved, consolidated and generalized.

Potential topics for the discussion are:

Please come and join us in the discussion about improving the integration of architecture code with generic kernel code!

We hope to see you there!

May 20, 2022 08:48 AM

May 17, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Confidential Computing

Linux Plumbers Conference 2022 is pleased to host the Confidential Computing Microconference.

The Confidential Computing Microconference brings together plumbers enabling secure execution features in hypervisors, firmware and the Linux kernel, over low-level user space, up to container runtimes.

Good progress has been made on a couple of topics since last year, but enabling Confidential Computing in the Linux ecosystem is an ongoing process, and there are still many problems to solve. The most important ones are:

Please come and join us in the discussion for solutions to the open problems for supporting these technologies!

We hope to see you there!

May 17, 2022 02:56 PM

May 16, 2022

Matthew Garrett: Can we fix bearer tokens?

Last month I wrote about how bearer tokens are just awful, and a week later Github announced that someone had managed to exfiltrate bearer tokens from Heroku that gave them access to, well, a lot of Github repositories. This has inevitably resulted in a whole bunch of discussion about a number of things, but people seem to be largely ignoring the fundamental issue that maybe we just shouldn't have magical blobs that grant you access to basically everything even if you've copied them from a legitimate holder to Honest John's Totally Legitimate API Consumer.

To make it clearer what the problem is here, let's use an analogy. You have a safety deposit box. To gain access to it, you simply need to be able to open it with a key you were given. Anyone who turns up with the key can open the box and do whatever they want with the contents. Unfortunately, the key is extremely easy to copy - anyone who is able to get hold of your keyring for a moment is in a position to duplicate it, and then they have access to the box. Wouldn't it be better if something could be done to ensure that whoever showed up with a working key was someone who was actually authorised to have that key?

To achieve that we need some way to verify the identity of the person holding the key. In the physical world we have a range of ways to achieve this, from simply checking whether someone has a piece of ID that associates them with the safety deposit box all the way up to invasive biometric measurements that supposedly verify that they're definitely the same person. But computers don't have passports or fingerprints, so we need another way to identify them.

When you open a browser and try to connect to your bank, the bank's website provides a TLS certificate that lets your browser know that you're talking to your bank instead of someone pretending to be your bank. The spec allows this to be a bi-directional transaction - you can also prove your identity to the remote website. This is referred to as "mutual TLS", or mTLS, and a successful mTLS transaction ends up with both ends knowing who they're talking to, as long as they have a reason to trust the certificate they were presented with.

That's actually a pretty big constraint! We have a reasonable model for the server - it's something that's issued by a trusted third party and it's tied to the DNS name for the server in question. Clients don't tend to have stable DNS identity, and that makes the entire thing sort of awkward. But, thankfully, maybe we don't need to? We don't need the client to be able to prove its identity to arbitrary third party sites here - we just need the client to be able to prove it's a legitimate holder of whichever bearer token it's presenting to that site. And that's a much easier problem.

Here's the simple solution - clients generate a TLS cert. This can be self-signed, because all we want to do here is be able to verify whether the machine talking to us is the same one that had a token issued to it. The client contacts a service that's going to give it a bearer token. The service requests mTLS auth without being picky about the certificate that's presented. The service embeds a hash of that certificate in the token before handing it back to the client. Whenever the client presents that token to any other service, the service ensures that the mTLS cert the client presented matches the hash in the bearer token. Copy the token without copying the mTLS certificate and the token gets rejected. Hurrah hurrah hats for everyone.
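
To make the check concrete, here is a rough sketch of the service-side comparison (my illustration, not code from any existing implementation), using OpenSSL's X509_digest() to hash the certificate presented during the mTLS handshake and comparing it against the hash carried in the token:

#include <openssl/evp.h>
#include <openssl/x509.h>
#include <stdbool.h>
#include <string.h>

/* token_cert_hash is assumed to have been parsed out of the bearer token
 * elsewhere; client_cert is the certificate presented during the mTLS
 * handshake. */
static bool token_bound_to_cert(X509 *client_cert,
                                const unsigned char token_cert_hash[32])
{
        unsigned char digest[EVP_MAX_MD_SIZE];
        unsigned int digest_len = 0;

        /* Hash the DER encoding of the presented certificate. */
        if (!X509_digest(client_cert, EVP_sha256(), digest, &digest_len))
                return false;

        return digest_len == 32 &&
               memcmp(digest, token_cert_hash, 32) == 0;
}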

Well except for the obvious problem that if you're in a position to exfiltrate the bearer tokens you can probably just steal the client certificates and keys as well, and now you can pretend to be the original client and this is not adding much additional security. Fortunately pretty much everything we care about has the ability to store the private half of an asymmetric key in hardware (TPMs on Linux and Windows systems, the Secure Enclave on Macs and iPhones, either a piece of magical hardware or Trustzone on Android) in a way that avoids anyone being able to just steal the key.

How do we know that the key is actually in hardware? Here's the fun bit - it doesn't matter. If you're issuing a bearer token to a system then you're already asserting that the system is trusted. If the system is lying to you about whether or not the key it's presenting is hardware-backed then you've already lost. If it lied and the system is later compromised then sure all your apes get stolen, but maybe don't run systems that lie and avoid that situation as a result?

Anyway. This is covered in RFC 8705 so why aren't we all doing this already? From the client side, the largest generic issue is that TPMs are astonishingly slow in comparison to doing a TLS handshake on the CPU. RSA signing operations on TPMs can take around half a second, which doesn't sound too bad, except your browser is probably establishing multiple TLS connections to subdomains on the site it's connecting to and performance is going to tank. Fixing this involves doing whatever's necessary to convince the browser to pipe everything over a single TLS connection, and that's just not really where the web is right at the moment. Using EC keys instead helps a lot (~0.1 seconds per signature on modern TPMs), but it's still going to be a bottleneck.

The other problem, of course, is that ecosystem support for hardware-backed certificates is just awful. Windows lets you stick them into the standard platform certificate store, but the docs for this are hidden in a random PDF in a Github repo. Macs require you to do some weird bridging between the Secure Enclave API and the keychain API. Linux? Well, the standard answer is to do PKCS#11, and I have literally never met anybody who likes PKCS#11 and I have spent a bunch of time in standards meetings with the sort of people you might expect to like PKCS#11 and even they don't like it. It turns out that loading a bunch of random C bullshit that has strong feelings about function pointers into your security critical process is not necessarily something that is going to improve your quality of life, so instead you should use something like this and just have enough C to bridge to a language that isn't secretly plotting to kill your pets the moment you turn your back.

And, uh, obviously none of this matters at all unless people actually support it. Github has no support at all for validating the identity of whoever holds a bearer token. Most issuers of bearer tokens have no support for embedding holder identity into the token. This is not good! As of last week, all three of the big cloud providers support virtualised TPMs in their VMs - we should be running CI on systems that can do that, and tying any issued tokens to the VMs that are supposed to be making use of them.

So sure this isn't trivial. But it's also not impossible, and making this stuff work would improve the security of, well, everything. We literally have the technology to prevent attacks like Github suffered. What do we have to do to get people to actually start working on implementing that?

May 16, 2022 07:48 AM

May 09, 2022

Rusty Russell: Pickhardt Payments Implementation: Finding ?!

So, I’ve finally started implementing Pickhardt Payments in Core Lightning (#cln) and there are some practical complications beyond the paper which are worth noting for others who consider this!

In particular, the cost function in the paper cleverly combines the probability of success with the fee charged by the channel, giving a cost function of:

-log( (ce + 1 - fe) / (ce + 1)) + λ · fe · fee(e)

Which is great: bigger λ means fees matter more, smaller means they matter less. And the paper suggests various ways of adjusting them if you don’t like the initial results.

But, what’s a reasonable λ value? 1? 1000? 0.00001? Since the left term is the negative log of a probability, and the right is a value in millisats, it’s deeply unclear to me!

So it’s useful to look at the typical ranges of the first term, and the typical fees (the rest of the second term which is not λ), using stats from the real network.

If we want these two terms to be equal, we get:

-log( (ce + 1 - fe) / (ce + 1)) = λ · fe · fee(e)
=> λ = -log( (ce + 1 - fe) / (ce + 1)) / ( fe · fee(e))

Let’s assume that fee(e) is the median fee: 51 parts per million. I chose to look at amounts of 1sat, 10sat, 100sat, 1000sat, 10,000sat, 100,000sat and 1M sat, and calculated the λ values for each channel. It turns out that, for almost all those values, the 10th percentile λ value is 0.125 times the median, and the 90th percentile λ value is 12.5 times the median, though for 1M sats it’s 0.21 and 51x, which probably reflects that the median fee is not 51 for these channels!

Nonetheless, this suggests we can calculate the “expected λ” using the median capacity of channels we could use for a payment (i.e. those with capacity >= amount), and the median feerate of those channels. We can then bias it by a factor of 10 or so either way, to reasonably promote certainty over fees or vice versa.

So, in the internal API for the moment I accept a frugality factor, generally 0.1 (not frugal, prefer certainty to fees) to 10 (frugal, prefer fees to certainty), and derive λ:

λ = -log((median_capacity_msat + 1 - amount_msat) / (median_capacity_msat + 1)) * frugality / (median_fee + 1)

The median is selected only from the channels with capacity > amount, and the +1 on the median_fee covers the case where median fee turns out to be 0 (such as in one of my tests!).

Note that it’s possible to try to send a payment larger than any channel in the network, using MPP. This is a corner case, where you generally care less about fees, so I set median_capacity_msat in the “no channels” case to amount_msat, and the resulting λ is really large, but at that point you can’t be fussy about fees!
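
For illustration, a rough sketch of that derivation in C (not the actual Core Lightning source; the function name, parameters and types are mine):

#include <math.h>

static double derive_lambda(double median_capacity_msat,
                            double amount_msat,
                            double median_fee,
                            double frugality)
{
        /* Corner case: no channel is large enough, so pretend the median
         * capacity equals the amount; lambda then comes out very large. */
        if (median_capacity_msat < amount_msat)
                median_capacity_msat = amount_msat;

        return -log((median_capacity_msat + 1 - amount_msat)
                    / (median_capacity_msat + 1))
               * frugality / (median_fee + 1);
}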

May 09, 2022 05:37 AM

May 02, 2022

Brendan Gregg: Brendan@Intel.com

I'm thrilled to be joining Intel to work on the performance of everything, apps to metal, with a focus on cloud computing. It's an exciting time to be joining: The geeks are back with [Pat Gelsinger] and [Greg Lavender] as the CEO and CTO; new products are launching including the Sapphire Rapids processor; there are more competitors, which will drive innovation and move the whole industry forward more quickly; and Intel are building new fabs on US soil. It's a critical time to join, and an honour to do so as an Intel fellow, based in Australia.

My dream is to turn computer performance analysis into a science, one where we can completely understand the performance of everything: of applications, libraries, kernels, hypervisors, firmware, and hardware. These were the opening words of my 2019 [AWS re:Invent talk], which I followed by demonstrating rapid on-the-fly dynamic instrumentation of the Intel wireless driver. With the growing complexities of our industry, both hardware and software offerings, it has become increasingly challenging to find the root causes for system performance problems. I dream of solving this: to be able to observe everything and to provide complete answers to any performance question, for any workload, any operating system, and any hardware type.

A while ago I began exploring the idea of building this performance analysis capability for a major cloud, and use it to help find performance improvements to make that cloud industry-leading. The question was: Which cloud should I make number one? I'm grateful for the companies who explored this idea with me and offered good opportunities. I wasn't thinking Intel to start with, but after spending much time talking to Greg and other Intel leaders, I realized the massive opportunity and scope of an Intel role: I can work on new performance and debugging technologies for everything from apps down to silicon, across all xPUs (CPUs, GPUs, IPUs, etc.), and have a massive impact on the world. It was the most challenging of the options before me, just as Netflix was when I joined it, and at that point it gets hard for me to say no. I want to know if I can do this. Why climb the highest mountain?

Intel is a deeply technical company and a leader in high performance computing, and I'm excited to be working with others who have a similar appetite for deep technical work. I'll also have the opportunity to hire and mentor staff and build a team of the world's best performance and debugging engineers. My work will still involve hands-on engineering, but this time with resources. I was reminded of this while interviewing with other companies: One interviewer who had studied my work asked "How many staff report to you?" "None." He kept returning to this question. I got the feeling that he didn't actually believe it, and thought if he asked enough times I'd confess to having a team. (I'm reminded of my days at Sun Microsystems, where the joke was that I must be a team of clones to get so much done – so I made a [picture].) People have helped me build things (I have detailed Acknowledgement lists in my books), but I've always been an individual contributor. I now have an opportunity at Intel to grow further in my career, and to help grow other people. Teaching others is another passion of mine – it's what drives me to write books, create training materials, and write blog posts here.

I'll still be working on eBPF and other open source projects, and I'm glad that Intel's leadership has committed to continue supporting [open source]. Intel has historically been a top contributor to the Linux kernel, and supports countless other projects. Many of the performance tools I've worked on are open source, in particular eBPF, which plays an important role for understanding the performance of everything. eBPF is in development for Windows as well (not just Linux where it originated).

For cloud computing performance, I'll be working on projects such as the [Intel DevCloud], run by [Markus Flierl], Corporate VP, General Manager, Intel Developer Cloud Platforms. I know Markus from Sun and I'm glad to be working for him at Intel. While at Netflix in the last few years I've had more regular meetings with Intel than any other company, to the point where we had started joking that I already worked for Intel. I've found them to be not only the deepest technical company – capable of analysis and debugging at a mind-blowing atomic depth – but also professional and a pleasure to work with. This became especially clear after I recently worked with another hardware vendor, who were initially friendly and supportive but after evaluations of their technology went poorly became bullying and misleading. You never know a company (or person) until you see them on their worst day. Over the years I've seen Intel on good days and bad, and they have always been professional and respectful, and work hard to do right by the customer.

A bonus of the close relationship between Intel and Netflix, and my focus on cloud computing at Intel, is that I'll likely continue to help the performance of the Netflix cloud, as well as other clouds (that might mean you!). I'm looking forward to meeting new people in this bigger ecosystem, making computers faster everywhere, and understanding the performance of everything. The title of this post is indeed my email address (I also used to be Brendan@Sun.com).

[picture]: https://www.brendangregg.com/Images/brendan_clones2006.jpg
[open source]: https://www.linkedin.com/pulse/open-letter-ecosystem-pat-gelsinger
[Pat Gelsinger]: https://twitter.com/PGelsinger
[Greg Lavender]: https://twitter.com/GregL_Intel
[Intel DevCloud]: https://www.intel.com/content/www/us/en/developer/tools/devcloud/overview.html
[AWS re:Invent talk]: https://www.youtube.com/watch?v=16slh29iN1g
[Markus Flierl]: https://www.linkedin.com/in/markus-flierl-375185/

May 02, 2022 12:00 AM

April 17, 2022

Matthew Garrett: The Freedom Phone is not great at privacy

The Freedom Phone advertises itself as a "Free speech and privacy first focused phone". As documented on the features page, it runs ClearOS, an Android-based OS produced by Clear United (or maybe one of the bewildering array of associated companies, we'll come back to that later). It's advertised as including Signal, but what's shipped is not the version available from the Signal website or any official app store - instead it's this fork called "ClearSignal".

The first thing to note about ClearSignal is that the privacy policy link from that page 404s, which is not a great start. The second thing is that it has a version number of 5.8.14, which is strange because upstream went from 5.8.10 to 5.9.0. The third is that, despite Signal being GPL 3, there's no source code available. So, I grabbed jadx and started looking for differences between ClearSignal and the upstream 5.8.10 release. The results were, uh, surprising.

First up is that they seem to have integrated ACRA, a crash reporting framework. This feels a little odd - in the absence of a privacy policy, it's unclear what information this gathers or how it'll be stored. Having a piece of privacy software automatically uploading information about what you were doing in the event of a crash with no notification other than a toast that appears saying "Crash Report" feels a little dubious.

Next is that Signal (for fairly obvious reasons) warns you if your version is out of date and eventually refuses to work unless you upgrade. ClearSignal has dealt with this problem by, uh, simply removing that code. The MacOS version of the desktop app they provide for download seems to be derived from a release from last September, which for an Electron-based app feels like a pretty terrible idea. Weirdly, for Windows they link to an official binary release from February 2021, and for Linux they tell you how to use the upstream repo properly. I have no idea what's going on here.

They've also added support for network backups of your Signal data. This involves the backups being pushed to an S3 bucket using credentials that are statically available in the app. It's ok, though, each upload has some sort of nominally unique identifier associated with it, so it's not trivial to just download other people's backups. But, uh, where does this identifier come from? It turns out that Clear Center, another of the Clear family of companies, employs a bunch of people to work on a ClearID[1], some sort of decentralised something or other that seems to be based on KERI. There's an overview slide deck here which didn't really answer any of my questions and as far as I can tell this is entirely lacking any sort of peer review, but hey it's only the one thing that stops anyone on the internet being able to grab your Signal backups so how important can it be.

The final thing, though? They've extended Signal's invitation support to encourage users to get others to sign up for Clear United. There's an exposed API endpoint called "get_user_email_by_mobile_number" which does exactly what you'd expect - if you give it a registered phone number, it gives you back the associated email address. This requires no authentication. But it gets better! The API to generate a referral link to send to others sends the name and phone number of everyone in your phone's contact list. There does not appear to be any indication that this is going to happen.

So, from a privacy perspective, going to go with things being some distance from ideal. But what's going on with all these Clear companies anyway? They all seem to be related to Michael Proper, who founded the Clear Foundation in 2009. They are, perhaps unsurprisingly, heavily invested in blockchain stuff, while Clear United also appears to be some sort of multi-level marketing scheme which has a membership agreement that includes the somewhat astonishing claim that:

Specifically, the initial focus of the Association will provide members with supplements and technologies for:

9a. Frequency Evaluation, Scans, Reports;

9b. Remote Frequency Health Tuning through Quantum Entanglement;

9c. General and Customized Frequency Optimizations;


- there's more discussion of this and other weirdness here. Clear Center, meanwhile, has a Chief Physics Officer? I have a lot of questions.

Anyway. We have a company that seems to be combining blockchain and MLM, has some opinions about Quantum Entanglement, bases the security of its platform on a set of novel cryptographic primitives that seem to have had no external review, has implemented an API that just hands out personal information without any authentication and an app that appears more than happy to upload all your contact details without telling you first, has failed to update this app to keep up with upstream security updates, and is violating the upstream license. If this is their idea of "privacy first", I really hate to think what their code looks like when privacy comes further down the list.

[1] Pointed out to me here


April 17, 2022 12:23 AM

April 15, 2022

Brendan Gregg: Netflix End of Series 1

A large and unexpected opportunity has come my way outside of Netflix that I've decided to try. Netflix has been the best job of my career so far, and I'll miss my colleagues and the culture.


[Photos: offer letter logo (2014); flame graphs (2014); eBPF tools (2014-2019); PMC analysis (2017); my pandemic-abandoned desk (2020); office wall]
I joined Netflix in 2014, a company at the forefront of cloud computing with an attractive [work culture]. It was the most challenging job among those I interviewed for. On the Netflix Java/Linux/EC2 stack there were no working mixed-mode flame graphs, no production safe dynamic tracer, and no PMCs: all tools I used extensively for advanced performance analysis. How would I do my job? I realized that this was a challenge I was best suited to fix. I could help not only Netflix but all customers of the cloud.

Since then I've done just that. I developed the original JVM changes to allow [mixed-mode flame graphs], I pioneered using [eBPF for observability] and helped develop the [front-ends and tools], and I worked with Amazon to get [PMCs enabled] and developed tools to use them. Low-level performance analysis is now possible in the cloud, and with it I've helped Netflix save a very large amount of money, mostly from service teams using flame graphs. There is also now a flourishing industry of observability products based on my work.

Apart from developing tools, much of my time has been spent helping teams with performance issues and evaluations. The Netflix stack is more diverse than I was expecting, and is explained in detail in the [Netflix tech blog]: The production cloud is AWS EC2, Ubuntu Linux, Intel x86, mostly Java with some Node.js (and other languages), microservices, Cassandra (storage), EVCache (caching), Spinnaker (deployment), Titus (containers), Apache Spark (analytics), Atlas (monitoring), FlameCommander (profiling), and at least a dozen more applications and workloads (but no 3rd party agents in the BaseAMI). The Netflix CDN runs FreeBSD and NGINX (not Linux: I published a Netflix-approved [footnote] in my last book to explain why). This diverse environment has always provided me with interesting things to explore, to understand, analyze, debug, and improve.

I've also used and helped develop many other technologies for debugging, primarily perf, Ftrace, eBPF (bcc and bpftrace), PMCs, MSRs, Intel vTune, and of course, [flame graphs] and [heat maps]. Martin Spier and I also created [Flame Scope] while at Netflix, to analyze perturbations and variation in profiles.

I've also had the chance to do other types of work. For 18 months I joined the [CORE SRE team] rotation, and was the primary contact for Netflix outages. It was difficult and fascinating work. I've also created internal training materials and classes, apart from my books. I've worked with awesome colleagues not just in cloud engineering, but also in open connect, studio, DVD, NTech, talent, immigration, HR, PR/comms, legal, and most recently ANZ content.

Last time I quit a job, I wanted to share publicly the reasons why I left, but I ultimately did not. I've since been asked many times why I resigned that job (not unlike [The Prisoner]) along with much speculation (none true). I wouldn't want the same thing happening here, with people wondering if something bad happened at Netflix that caused me to leave: I had a great time and it's a great company! I'm thankful for the opportunities and support I've had, especially from my former managers [Coburn] and [Ed]. I'm also grateful for the support for my work by other companies, technical communities, social communities (Twitter, HackerNews), conference organizers, and all who have liked my work, developed it further, and shared it with others. Thank you.
I hope my last two books, [Systems Performance 2nd Ed] and [BPF Performance Tools], serve Netflix, and everyone else who reads them, well in my absence. I'll still be posting here in my next job. More on that soon...

[work culture]: /blog/2017-11-13/brilliant-jerks.html
[work culture2]: https://www.slideshare.net/kevibak/netflix-culture-deck-77978007
[Systems Performance 2nd Ed]: /systems-performance-2nd-edition-book.html
[BPF Performance Tools]: /bpf-performance-tools-book.html
[mixed-mode flame graphs]: http://techblog.netflix.com/2015/07/java-in-flames.html
[eBPF for observability]: /blog/2015-05-15/ebpf-one-small-step.html
[front-ends and tools]: /Perf/bcc_tracing_tools.png
[front-ends and tools2]: /blog/2019-01-01/learn-ebpf-tracing.html
[PMCs enabled]: /blog/2017-05-04/the-pmcs-of-ec2.html
[CORE SRE team]: /blog/2016-05-04/srecon2016-perf-checklists-for-sres.html
[footnote]: /blog/images/2022/netflixcdn.png
[Coburn]: https://www.linkedin.com/in/coburnw/
[Ed]: https://www.linkedin.com/in/edwhunter/
[Netflix tech blog]: https://netflixtechblog.com/
[Why Don't You Use ...]: /blog/2022-03-19/why-dont-you-use.html
[The Prisoner]: https://en.wikipedia.org/wiki/The_Prisoner
[flame graphs]: /flamegraphs.html
[heat maps]: /heatmaps.html
[Flame Scope]: https://netflixtechblog.com/netflix-flamescope-a57ca19d47bb

April 15, 2022 12:00 AM

April 09, 2022

Brendan Gregg: TensorFlow Library Performance

A while ago I helped a colleague, Vadim, debug a performance issue with TensorFlow in an unexpected location. I thought this was a bit interesting so I've been meaning to share it; here's a rough post of the details.

## 1. The Expert's Eye

Vadim had spotted something unusual in this CPU flamegraph (redacted); do you see it?:

I'm impressed he found it so quickly, but then if you look at enough flame graphs the smaller unusual patterns start to jump out. In this case there's an orange tower (kernel code) that's unusual, the cause of which I've highlighted here: 10% of total CPU time in page faults. At Netflix, 10% of CPU time somewhere unexpected can be a large costly issue across thousands of server instances. We'll use flame graphs to chase down the 1%s.

## 2. Why is it Still Faulting?

Browsing the stack trace shows these are from __memcpy_avx_unaligned(). Well, at least that makes sense: memcpy would be faulting in data segment mappings. But this process had been up and running for hours, and shouldn't still be doing so much page fault growth of its RSS. You see that early on when segments and the heap are fresh and don't have mappings yet, but after hours they are mostly faulted in (mapped to physical memory) and you see the faults dwindle.

Sometimes processes page fault a lot because they call mmap()/munmap() to add/remove memory. I used my eBPF tools to trace them ([mmapsnoop.py]) but there was no activity. So how is it still page faulting? Is it doing madvise() and dropping memory? A search for madvise in the flame graph showed it was 0.8% of CPU, so it definitely was, and madvise() was calling zap_page_range(), which was triggering the faults. (Click on the [flame graph] and try Ctrl-F searching for "madvise" and zooming in.)

## 3. Premature Optimization

I read the kernel code related to madvise() and zap_page_range() in mm/madvise.c. That showed it calls zap_page_range() for the MADV_DONTNEED flag. (I could have also traced sys_madvise() flags using kprobes/tracepoints.) This seemed to be a premature optimization gone bad: the allocator was calling dontneed on memory pages that it did in fact need. The dontneed dropped the virtual to physical mapping, which would soon afterwards cause a page fault to bring the mapping back.

## 4. Allocator Issue

I suggested looking into the allocator, and Vadim said it was jemalloc, a configure option. He rebuilt with glibc, and the problem was fixed! Here's the fixed flame graph:
Initial testing showed only a 3% win (can be verified by the flame graphs). We were hoping for 10%!

[mmapsnoop.py]: https://github.com/brendangregg/bpf-perf-tools-book/blob/master/originals/Ch07_Memory/mmapsnoop.py
[flame graph]: /blog/images/2022/cpuflamegraph.tensorflow0-red.svg
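
(Aside: if you want to confirm which advice flag is being passed, as in section 3 above, the syscall tracepoint is enough. Here's a rough bpftrace sketch of the kind of one-liner I mean, with MADV_DONTNEED being 4 on Linux:)

# Count madvise(MADV_DONTNEED) calls by process name
bpftrace -e 'tracepoint:syscalls:sys_enter_madvise /args->behavior == 4/ { @[comm] = count(); }'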

April 09, 2022 12:00 AM

April 05, 2022

Matthew Garrett: Bearer tokens are just awful

As I mentioned last time, bearer tokens are not super compatible with a model in which every access is verified to ensure it's coming from a trusted device. Let's talk about that in a bit more detail.

First off, what is a bearer token? In its simplest form, it's an opaque blob that you give to a user after an authentication or authorisation challenge, and then they show it to you to prove that they should be allowed access to a resource. In theory you could just hand someone a randomly generated blob, but then you'd need to keep track of which blobs you've issued, when they should expire, and who they correspond to. So frequently this is actually done using JWTs, which contain some base64 encoded JSON that describes the user and group membership and so on, and then have a signature associated with them - whenever the user presents one you can just validate the signature and then assume that the contents of the JSON are trustworthy.
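
To make the "base64 encoded JSON" bit concrete, here's a toy sketch (all claims made up, header and signature faked; real tokens won't always decode this cleanly without re-adding base64 padding) showing that anyone holding a JWT can read its contents without any key at all:

# Build a fake JWT-style payload, then read it straight back out of the token
CLAIMS='{"sub":"alice","groups":["admin"],"exp":1700000000}'
PAYLOAD=$(printf '%s' "$CLAIMS" | base64 | tr -d '=\n' | tr '/+' '_-')
TOKEN="fake-header.${PAYLOAD}.fake-signature"
echo "$TOKEN" | cut -d. -f2 | tr '_-' '/+' | base64 -d; echo

The only thing protecting those claims is the signature, not any kind of secrecy.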

One thing to note here is that the crypto is purely between whoever issued the token and whoever validates the token - as far as the server is concerned, any client who can just show it the token is just fine as long as the signature is verified. There's no way to verify the client's state, so one of the core ideas of Zero Trust (that we verify that the client is in a trustworthy state on every access) is already violated.

Can we make things not terrible? Sure! We may not be able to validate the client state on every access, but we can validate the client state when we issue the token in the first place. When the user hits a login page, we do state validation according to whatever policy we want to enforce, and if the client violates that policy we refuse to issue a token to it. If the token has a sufficiently short lifetime then an attacker is only going to have a short period of time to use that token before it expires and then (with luck) they won't be able to get a new one because the state validation will fail.

Except! This is fine for cases where we control the issuance flow. What if we have a scenario where a third party authenticates the client (by verifying that they have a valid token issued by their ID provider) and then uses that to issue their own token that's much longer lived? Well, now the client has a long-lived token sitting on it. And if anyone copies that token to another device, they can now pretend to be that client.

This is, sadly, depressingly common. A lot of services will verify the user, and then issue an oauth token that'll expire some time around the heat death of the universe. If a client system is compromised and an attacker just copies that token to another system, they can continue to pretend to be the legitimate user until someone notices (which, depending on whether or not the service in question has any sort of audit logs, and whether you're paying any attention to them, may be once screenshots of your data show up on Twitter).

This is a problem! There's no way to fit a hosted service that behaves this way into a Zero Trust model - the best you can say is that a token was issued to a device that was, around that time, apparently trustworthy, and now it's some time later and you have literally no idea whether the device is still trustworthy or if the token is still even on that device.

But wait, there's more! Even if you're nowhere near doing any sort of Zero Trust stuff, imagine the case of a user having a bunch of tokens from multiple services on their laptop, and then they leave their laptop unlocked in a cafe while they head to the toilet and whoops it's not there any more, better assume that someone has access to all the data on there. How many services has our opportunistic new laptop owner gained access to as a result? How do we revoke all of the tokens that are sitting there on the local disk? Do you even have a policy for dealing with that?

There isn't a simple answer to all of these problems. Replacing bearer tokens with some sort of asymmetric cryptographic challenge to the client would at least let us tie the tokens to a TPM or other secure enclave, and then we wouldn't have to worry about them being copied elsewhere. But that wouldn't help us if the client is compromised and the attacker simply keeps using the compromised client. The entire model of simply proving knowledge of a secret being sufficient to gain access to a resource is inherently incompatible with a desire for fine-grained trust verification on every access, but I don't see anything changing until we have a standard for third party services to be able to perform that trust verification against a customer's policy.

Still, at least this means I can just run weird Android IoT apps through mitmproxy, pull the bearer token out of the request headers and then start poking the remote API with curl. It may all be broken, but it's also got me a bunch of bug bounty credit, so, it's impossible to say if it's bad or not.
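
For illustration, the "poking" step is nothing more sophisticated than this (hostname and path invented, token copied out of mitmproxy):

# Replay a captured bearer token from a completely different machine - the API
# neither knows nor cares that this isn't the device the token was issued to
TOKEN="<value lifted from the Authorization header in mitmproxy>"
curl -s -H "Authorization: Bearer $TOKEN" https://api.iot-vendor.example.com/v1/devices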

(Addendum: this suggestion that we solve the hardware binding problem by simply passing all the network traffic through some sort of local enclave that could see tokens being set and would then sequester them and reinject them into later requests is OBVIOUSLY HORRIFYING and is also probably going to be at least three startup pitches by the end of next week)


April 05, 2022 06:59 AM

Kees Cook: security things in Linux v5.10

Previously: v5.9

Linux v5.10 was released in December, 2020. Here’s my summary of various security things that I found interesting:

AMD SEV-ES
While guest VM memory encryption with AMD SEV has been supported for a while, Joerg Roedel, Thomas Lendacky, and others added register state encryption (SEV-ES). This means it’s even harder for a VM host to reconstruct a guest VM’s state.

x86 static calls
Josh Poimboeuf and Peter Zijlstra implemented static calls for x86, which operates very similarly to the “static branch” infrastructure in the kernel. With static branches, an if/else choice can be hard-coded, instead of being run-time evaluated every time. Such branches can be updated too (the kernel just rewrites the code to switch around the “branch”). All these principles apply to static calls as well, but they’re for replacing indirect function calls (i.e. a call through a function pointer) with a direct call (i.e. a hard-coded call address). This eliminates the need for Spectre mitigations (e.g. RETPOLINE) for these indirect calls, and avoids a memory lookup for the pointer. For hot-path code (like the scheduler), this has a measurable performance impact. It also serves as a kind of Control Flow Integrity implementation: an indirect call got removed, and the potential destinations have been explicitly identified at compile-time.

network RNG improvements
In an effort to improve the pseudo-random number generator used by the network subsystem (for things like port numbers and packet sequence numbers), Linux’s home-grown pRNG has been replaced by the SipHash round function, and perturbed by (hopefully) hard-to-predict internal kernel states. This should make it very hard to brute force the internal state of the pRNG and make predictions about future random numbers just from examining network traffic. Similarly, ICMP’s global rate limiter was adjusted to avoid leaking details of network state, as a start to fixing recent DNS Cache Poisoning attacks.

SafeSetID handles GID
Thomas Cedeno improved the SafeSetID LSM to handle group IDs (which required teaching the kernel about which syscalls were actually performing setgid.) Like the earlier setuid policy, this lets the system owner define an explicit list of allowed group ID transitions under CAP_SETGID (instead of to just any group), providing a way to keep the power of granting this capability much more limited. (This isn’t complete yet, though, since handling setgroups() is still needed.)

improve kernel’s internal checking of file contents
The kernel provides LSMs (like the Integrity subsystem) with details about files as they’re loaded. (For example, loading modules, new kernel images for kexec, and firmware.) There wasn’t very good coverage for cases where the contents were coming from things that weren’t files. To deal with this, new hooks were added that allow the LSMs to introspect the contents directly, and to do partial reads. This will give the LSMs much finer grain visibility into these kinds of operations.

set_fs removal continues
With the earlier work landed to free the core kernel code from set_fs(), Christoph Hellwig made it possible for set_fs() to be optional for an architecture. Subsequently, he removed set_fs() entirely for x86, riscv, and powerpc. These architectures will now be free from the entire class of “kernel address limit” attacks that only needed to corrupt a single value in struct thread_info.

sysfs_emit() replaces sprintf() in /sys
Joe Perches tackled one of the most common bug classes with sprintf() and snprintf() in /sys handlers by creating a new helper, sysfs_emit(). This will handle the cases where kernel code was not correctly dealing with the length results from sprintf() calls, which might lead to buffer overflows in the PAGE_SIZE buffer that /sys handlers operate on. With the helper in place, it was possible to start the refactoring of the many sprintf() callers.

nosymfollow mount option
Mattias Nissler and Ross Zwisler implemented the nosymfollow mount option. This entirely disables symlink resolution for the given filesystem, similar to other mount options where noexec disallows execve(), nosuid disallows setid bits, and nodev disallows device files. Quoting the patch, it is “useful as a defensive measure for systems that need to deal with untrusted file systems in privileged contexts.” (i.e. for when /proc/sys/fs/protected_symlinks isn’t a big enough hammer.) Chrome OS uses this option for its stateful filesystem, as symlink traversal has been a common attack-persistence vector.
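
As a rough sketch of what that looks like in practice (device and paths invented; needs a mount(8) new enough to know the option):

# Mount an untrusted filesystem with symlink resolution disabled
mkdir -p /mnt/untrusted
mount -o nosymfollow /dev/sdb1 /mnt/untrusted
ln -s /etc/shadow /mnt/untrusted/innocent-looking-file
cat /mnt/untrusted/innocent-looking-file   # fails (ELOOP) instead of following the link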

ARMv8.5 Memory Tagging Extension support
Vincenzo Frascino added support to arm64 for the coming Memory Tagging Extension, which will be available for ARMv8.5 and later chips. It provides 4 bits of tags (covering multiples of 16 byte spans of the address space). This is enough to deterministically eliminate all linear heap buffer overflow flaws (1 tag for “free”, and then rotate even values and odd values for neighboring allocations), which is probably one of the most common bugs being currently exploited. It also makes use-after-free and over/under indexing much more difficult for attackers (but still possible if the target’s tag bits can be exposed). Maybe some day we can switch to 128 bit virtual memory addresses and have fully versioned allocations. But for now, 16 tag values is better than none, though we do still need to wait for anyone to actually be shipping ARMv8.5 hardware.

fixes for flaws found by UBSAN
The work to make UBSAN generally usable under syzkaller continues to bear fruit, with various fixes all over the kernel for stuff like shift-out-of-bounds, divide-by-zero, and integer overflow. Seeing these kinds of patches land reinforces the rationale for shifting the burden of these kinds of checks to the toolchain: these run-time bugs continue to pop up.

flexible array conversions
The work on flexible array conversions continues. Gustavo A. R. Silva and others continued to grind on the conversions, getting the kernel ever closer to being able to enable the -Warray-bounds compiler flag and clear the path for saner bounds checking of array indexes and memcpy() usage.

That’s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.11.

© 2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

April 05, 2022 12:01 AM

March 31, 2022

Matthew Garrett: ZTA doesn't solve all problems, but partial implementations solve fewer

Traditional network access controls work by assuming that something is trustworthy based on some other factor - for example, if a computer is on your office network, it's trustworthy because only trustworthy people should be able to gain physical access to plug something in. If you restrict access to your services to requests coming from trusted networks, then you can assert that those requests are coming from trusted devices.

Of course, this isn't necessarily true. A machine on your office network may be compromised. An attacker may obtain valid VPN credentials. Someone could leave a hostile device plugged in under a desk in a meeting room. Trust is being placed in devices that may not be trustworthy.

A Zero Trust Architecture (ZTA) is one where a device is granted no inherent trust. Instead, each access to a service is validated against some policy - if the policy is satisfied, the access is permitted. A typical implementation involves granting each device some sort of cryptographic identity (typically a TLS client certificate) and placing the protected services behind a proxy. The proxy verifies the device identity, queries another service to obtain the current device state (we'll come back to that in a moment), compares the state against a policy, and either passes the request through to the service or rejects it. Different services can have different policies (eg, you probably want a lax policy around whatever's hosting the documentation for how to fix your system if it's being refused access to something for being in the wrong state), and if you want you can also tie it to proof of user identity in some way.
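
As a sketch of the client side of that model (names and paths invented; in a real deployment the private key would ideally live in a TPM rather than a file), each request simply presents the device certificate to the proxy, which then applies policy per request:

# Every request carries the device's TLS client certificate; the access proxy
# verifies it, checks current device state against policy, then forwards or rejects
curl --cert /etc/device-id/client.crt --key /etc/device-id/client.key \
     https://service.access-proxy.example.com/api/status

A real deployment would of course wire this into the browser or OS certificate store rather than invoking curl by hand.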

From a user perspective, this is entirely transparent. The proxy is made available on the public internet, DNS for the services points to the proxy, and every time your users try to access the service they hit the proxy instead and (if everything's ok) gain access to it no matter which network they're on. There's no need to connect to a VPN first, and there's no worries about accidentally leaking information over the public internet instead of over a secure link.

It's also notable that traditional solutions tend to be all-or-nothing. If I have some services that are more sensitive than others, the only way I can really enforce this is by having multiple different VPNs and only granting access to sensitive services from specific VPNs. This obviously risks combinatorial explosion once I have more than a couple of policies, and it's a terrible user experience.

Overall, ZTA approaches provide more security and an improved user experience. So why are we still using VPNs? Primarily because this is all extremely difficult. Let's take a look at an extremely recent scenario. A device used by customer support technicians was compromised. The vendor in question has a solution that can tie authentication decisions to whether or not a device has a cryptographic identity. If this was in use, and if the cryptographic identity was tied to the device hardware (eg, by being generated in a TPM), the attacker would not simply be able to obtain the user credentials and log in from their own device. This is good - if the attacker wanted to maintain access to the service, they needed to stay on the device in question. This increases the probability of the monitoring tooling on the compromised device noticing them.

Unfortunately, the attacker simply disabled the monitoring tooling on the compromised device. If device state was being verified on each access then this would be noticed before too long - the last data received from the device would be flagged as too old, and the requests would no longer satisfy any reasonable access control policy. Instead, the device was assumed to be trustworthy simply because it could demonstrate its identity. There's an important point here: just because a device belongs to you doesn't mean it's a trustworthy device.

So, if ZTA approaches are so powerful and user-friendly, why aren't we all using one? There's a few problems, but the single biggest is that there's no standardised way to verify device state in any meaningful way. Remote Attestation can both prove device identity and the device boot state, but the only product on the market that does much with this is Microsoft's Device Health Attestation. DHA doesn't solve the broader problem of also reporting runtime state - it may be able to verify that endpoint monitoring was launched, but it doesn't make assertions about whether it's still running. Right now, people are left trying to scrape this information from whatever tooling they're running. The absence of any standardised approach to this problem means anyone who wants to deploy a strong ZTA has to integrate with whatever tooling they're already running, and that then increases the cost of migrating to any other tooling later.

But even device identity is hard! Knowing whether a machine should be given a certificate or not depends on knowing whether or not you own it, and inventory control is a surprisingly difficult problem in a lot of environments. It's not even just a matter of whether a machine should be given a certificate in the first place - if a machine is reported as lost or stolen, its trust should be revoked. Your inventory system needs to tie into your device state store in order to ensure that your proxies drop access.

And, worse, all of this depends on you being able to put stuff behind a proxy in the first place! If you're using third-party hosted services, that's a problem. In the absence of a proxy, trust decisions are probably made at login time. It's possible to tie user auth decisions to device identity and state (eg, a self-hosted SAML endpoint could do that before passing through to the actual ID provider), but that's still going to end up providing a bearer token of some sort that can potentially be exfiltrated, and will continue to be trusted even if the device state becomes invalid.

ZTA doesn't solve all problems, and there isn't a clear path to it doing so without significantly greater industry support. But a complete ZTA solution is significantly more powerful than a partial one. Verifying device identity is a step on the path to ZTA, but in the absence of device state verification it's only a step.


March 31, 2022 11:06 PM

March 23, 2022

Matthew Garrett: AMD's Pluton implementation seems to be controllable

I've been digging through the firmware for an AMD laptop with a Ryzen 6000 that incorporates Pluton for the past couple of weeks, and I've got some rough conclusions. Note that these are extremely preliminary and may not be accurate, but I'm going to try to encourage others to look into this in more detail. For those of you at home, I'm using an image from here, specifically version 309. The installer is happy to run under Wine, and if you tell it to "Extract" rather than "Install" it'll leave a file sitting in C:\DRIVERS\ASUS_GA402RK_309_BIOS_Update_20220322235241 which seems to have an additional 2K of header on it. Strip that and you should have something approximating a flash image.
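
(For anyone following along at home, the extraction amounts to something like this - a rough sketch, assuming "2K" means exactly 2048 bytes and using binutils' strings for the UTF-16 search:)

# Strip the 2K vendor header from the extracted update file to get the flash image
dd if=ASUS_GA402RK_309_BIOS_Update_20220322235241 of=flash.bin bs=2048 skip=1
# Scan the image for UTF-16 little-endian strings
strings -e l flash.bin | less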

Looking for UTF16 strings in this reveals something interesting:

Pluton (HSP) X86 Firmware Support
Enable/Disable X86 firmware HSP related code path, including AGESA HSP module, SBIOS HSP related drivers.
Auto - Depends on PcdAmdHspCoreEnable build value
NOTE: PSP directory entry 0xB BIT36 have the highest priority.
NOTE: This option will NOT put HSP hardware in disable state, to disable HSP hardware, you need setup PSP directory entry 0xB, BIT36 to 1.
// EntryValue[36] = 0: Enable, HSP core is enabled.
// EntryValue[36] = 1: Disable, HSP core is disabled then PSP will gate the HSP clock, no further PSP to HSP commands. System will boot without HSP.

"HSP" here means "Hardware Security Processor" - a generic term that refers to Pluton in this case. This is a configuration setting that determines whether Pluton is "enabled" or not - my interpretation of this is that it doesn't directly influence Pluton, but disables all mechanisms that would allow the OS to communicate with it. In this scenario, Pluton has its firmware loaded and could conceivably be functional if the OS knew how to speak to it directly, but the firmware will never speak to it itself. I took a quick look at the Windows drivers for Pluton and it looks like they won't do anything unless the firmware wants to expose Pluton, so this should mean that Windows will do nothing.

So what about the reference to "PSP directory entry 0xB BIT36 have the highest priority"? The PSP is the AMD Platform Security Processor - it's an ARM core on the CPU package that boots before the x86. The PSP firmware lives in the same flash image as the x86 firmware, so the PSP looks for a header that points it towards the firmware it should execute. This gives a pointer to a "directory" - a list of different object types and where they're located in flash (there's a description of this for slightly older AMDs here). Type 0xb is treated slightly specially. Where most types contain the address of where the actual object is, type 0xb contains a 64-bit value that's interpreted as enabling or disabling various features - something AMD calls "soft fusing" (Intel have something similar that involves setting bits in the Firmware Interface Table). The PSP looks at the bits that are set here and alters its behaviour. If bit 36 is set, the PSP tells Pluton to turn itself off and will no longer send any commands to it.

So, we have two mechanisms to disable Pluton - the PSP can tell it to turn itself off, or the x86 firmware can simply never speak to it or admit that it exists. Both of these imply that Pluton has started executing before it's shut down, so it's reasonable to wonder whether it can still do stuff. In the image I'm looking at, there's a blob starting at 0x0069b610 that appears to be firmware for Pluton - it contains chunks that appear to be the reference TPM2 implementation, and it broadly decompiles as valid ARM code. It should be viable to figure out whether it can do anything in the face of being "disabled" via either of the above mechanisms.
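
(If you want to poke at that blob yourself, carving it out is straightforward - a sketch, using the flash.bin name from the earlier extraction step; the offset is the one quoted above, and since the blob's length isn't obvious this just takes everything to the end of the image:)

# Carve the suspected Pluton firmware out of the flash image, starting at 0x0069b610
dd if=flash.bin of=pluton.bin iflag=skip_bytes skip=$((0x0069b610))
# Then feed pluton.bin to your disassembler of choice as little-endian ARM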

Unfortunately for me, the system I'm looking at does set bit 36 in the 0xb entry - as a result, Pluton is disabled before x86 code starts running and I can't investigate further in any straightforward way. The implication that the user-controllable mechanism for disabling Pluton merely disables x86 communication with it rather than turning it off entirely is a little concerning, although (assuming Pluton is behaving as a TPM rather than having an enhanced set of capabilities) skipping any firmware communication means the OS has no way to know what happened before it started running even if it has a mechanism to communicate with Pluton without firmware assistance. In that scenario it'd be viable to write a bootloader shim that just faked up the firmware measurements before handing control to the OS.

The bit 36 disabling mechanism seems more solid? Again, it should be possible to analyse the Pluton firmware to determine whether it actually pays attention to a disable command being sent. But even if it chooses to ignore that, if the PSP is in a position to just cut the clock to Pluton, it's not going to be able to do a lot. At that point we're trusting AMD rather than trusting Microsoft, but given that you're also trusting AMD to execute the code you're giving them to execute, it's hard to avoid placing trust in them.

Overall: I'm reasonably confident that systems that ship with Pluton disabled via setting bit 36 in the soft fuses are going to disable it sufficiently hard that the OS can't do anything about it. Systems that give the user an option to enable or disable it are a little less clear in that respect, and it's possible (but not yet demonstrated) that an OS could communicate with Pluton anyway. However, if that's true, and if the firmware never communicates with Pluton itself, the user could install a stub loader in UEFI that mimics the firmware behaviour and leaves the OS thinking everything was good when it absolutely is not.

So, assuming that Pluton in its current form on AMD has no capabilities outside those we know about, the disabling mechanisms are probably good enough. It's tough to make a firm statement on this before I have access to a system that doesn't just disable it immediately, so stay tuned for updates.


March 23, 2022 08:42 AM

March 19, 2022

Brendan Gregg: Why Don't You Use ...

Working for a famous tech company, I get asked a lot "Why don't you use technology X?" X may be an application, programming language, operating system, hypervisor, processor, or tool. It may be because:

- It performs poorly.
- It is too expensive.
- It is not open source.
- It lacks features.
- It lacks a community.
- It lacks debug tools.
- It has serious bugs.
- It is poorly documented.
- It lacks timely security fixes.
- It lacks subject matter expertise.
- It's developed for the wrong audience.
- Our custom internal solution is good enough.
- Its longevity is uncertain: Its startup may be dead or sold soon.
- We know under NDA of a better solution.
- We know other bad things under NDA.
- Key contributors told us it was doomed.
- It made sense a decade ago but doesn't today.
- It made false claims in articles/blogs/talks and has lost credibility.
- It tolerates brilliant jerks and has no effective CoC.
- Our lawyers won't accept its T&Cs or license.

It's rarely because we don't know about it or don't understand it. Sometimes we do know about it, but have yet to find time to check it out. For big technical choices, it is often the result of an internal evaluation that involved various teams, and for some combination of the above reasons. Companies typically do not share these internal product evaluations publicly.

It's easier to say what you do use and why, than what you don't use and why not.
I used to be the outsider asking the big companies, and their silence would drive me nuts. Do they not know about this technology? Why wouldn't they use it?... I finally get it now, having seen it from the other side. They are usually well aware of various technologies, but have reasons not to use them which they usually won't share.

## Why companies won't say why

**Private reasons**:

- It is too expensive.
- We know under NDA of a better solution.
- We know other bad things under NDA.
- Key contributors told us it was doomed.

These have all happened to me. Technology X may be too expensive because we're using another technology with a special discount that's confidential. If the reasons are under NDA, then I also cannot share them. In one case I was interested in a technology, only to have the CEO of the company that developed it, under NDA, tell me that he was abandoning it. They had not announced this publicly. I've also privately chatted with key technology contributors at conferences who are looking for something different to work on, because they believe their own technology is doomed. At some companies, whatever the reason is may just be considered competitive knowledge and thus confidential.

**Complicated reasons**:

- It performs poorly.
- Our lawyers won't accept its T&Cs or license ([example]).
- Our custom internal solution is good enough.
- It made sense a decade ago but doesn't today.
- It tolerates brilliant jerks and has no effective CoC.
- It lacks subject matter expertise.
- It's developed for the wrong audience.

These are complicated and time consuming to explain properly, and it may not be a good use of engineering time. If you rush a quick answer instead, you can put the company in a worse position than by saying nothing: that of providing a _weak_ explanation. Those vested in the technology will find it easy to attack your response – and have more time, energy, and interest in doing so – which can make your company look worse.

I've found one of these reasons is common: discovering that a product was built without subject matter expertise. I've seen startups that do nothing new and nothing well in a space that's crying for new or better solutions. I usually find out later that the development team has no prior domain experience. It's complicated to explain, even to the developer team, as they lack the background to best understand why they missed the mark.

A common reason I've seen at smaller companies is when an internal solution is good enough. Adopting new technologies isn't "free": It takes (their limited) engineering time away from other projects, it adds technical debt, and it may add security risk (agents running as root). In this case, the other technology just doesn't have enough features or improved performance to justify a switch.

**Safer reasons**:

- It is poorly documented.
- It lacks features.
- It lacks debug tools.
- It has serious bugs.
- etc.

Those are objective and easy to discuss, right? Sort of. There are usually multiple reasons not to use a product, which might include all three: "safe", "complicated", and "private." It can be misleading to only mention the safe reasons. Also, if you do, and they get fixed, people will feel let down that you don't switch.

**Bad reasons**:

- Not invented here syndrome.
- Pride/ego.
- Corporate politics.
- etc.

I didn't include them in the above list, but there can be some poor reasons behind technology choices.
At large tech companies (with many staff and their collective expertise) I'd expect that most technical choices have valid reasons, even if the company won't share them, and poor reasons are the exception and not the rule. People assume poor technical choices are to blame far more often than they actually are. Just as an example, I've been asked many times why my company uses FreeBSD for its CDN, with the assumption that there must be some dumb reason we didn't choose Linux. No, we do regular Linux vs FreeBSD production tests (which I've helped analyze) and FreeBSD is faster for that workload. For a long time we just avoided this question. I finally got approval to mention it in a footnote in my last book (SysPerf2 page 124).

The only sure way to find out why a company doesn't use a given technology is to join that company and then ask the right team. Otherwise, try asking questions that are safer to answer. For example: "Are you aware of technology X?" and "Are you aware that technology X claims to be better at Y?". This can at least rule out ignorance.

Update: This was discussed on [Hacker News], which included some other reasons, including:

- Engineers don't know X and can't hire X engineers (certainly true at smaller companies).
- The cost of switching outweighs the benefits.
- Legacy.

And I liked the comment about "...random drive-by comments where the person is really just advertising x, rather than being genuinely curious."

[example]: /blog/2014-05-17/free-as-in-we-own-your-ip.html
[Hacker News]: https://news.ycombinator.com/item?id=30755861

March 19, 2022 12:00 AM

March 15, 2022

Lucas De Marchi: Tracepoints and kernel modules

While debugging, it’s invaluable to be able to trace functions in the Linux kernel. There are several tools for that task: perf, ftrace (with helper tools like trace-cmd), bpftrace, etc.

For my own debug of kernel internals and in the last few years to debug the i915 module for Intel graphics, I usually use tracepoints or dynamic tracepoints (with perf probe) as the trigger for something I want to trace.
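
For reference, creating such a dynamic probe on a module function is normally a one-liner (a rough sketch; this only works once the module's symbols are available, which is exactly the catch discussed next):

# Create a dynamic tracepoint on a function inside the i915 module, then record hits
perf probe -m i915 --add i915_driver_probe
perf record -e probe:i915_driver_probe -a -- sleep 10
perf script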

One question I sometimes get is “how do I trace the initial probe of a kernel module?”. At that time you (usually) don't yet have the module loaded, and as a consequence the symbols of that module are not yet available for use in tracepoints. For example:

# trace-cmd list -e '^i915.*'
#

One thing to realize is that a lot of the time we don't want to debug the module loading, but rather the module binding to a device. These are two separate things that happen in a cascade of events. It probably also adds to the confusion that the userspace tools to load the modules are not named very consistently: modprobe and insmod, and they both end up calling the [f]init_module syscall.

The various buses available in the kernel have mechanisms for stopping the autoprobe (and hence the binding to a device) when a module is loaded. With i915, all we have to do is set a few things in /sys/bus/pci/:

# echo 0 > /sys/bus/pci/drivers_autoprobe
# modprobe i915

With the i915 module loaded, but not attached to any device, we are ready to attach to any tracepoint or create new dynamic probes. Let's suppose I want to debug the i915_driver_probe() function, and any function called by it during the initialization. This is one of the functions that initializes the GPU in i915, called when we bind to a device.

F=i915_driver_probe

echo 0 > /sys/bus/pci/drivers_autoprobe 
modprobe i915

cd /sys/kernel/debug/tracing
cat /dev/null >  trace
echo $F > set_graph_function
echo _printk > set_graph_notrace
echo 10 > max_graph_depth
echo function_graph > current_tracer
echo 1 > tracing_on

echo 0000:03:00.0 | tee /sys/bus/pci/drivers/i915/bind
cp trace /tmp/trace.txt

echo 0 > tracing_on
cat /dev/null >  trace

With the snippet above we will start tracing whenever the function i915_driver_probe() is executed. We also set a few additional parameters: we set the maximum call depth to graph, and disable graphing of printk() since it's usually not very interesting. Depending on the size of your trace you may also need to increase buffer_size_kb to stop the ring buffer wrapping, or have something pump data out of the ring buffer to a file.
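
For example, from the same tracing directory (the size here is arbitrary; buffer_size_kb is per CPU):

# Grow the ring buffer so a long trace doesn't wrap and overwrite earlier events
echo 65536 > buffer_size_kb
# Alternatively, stream events out continuously instead of copying 'trace' at the end
cat trace_pipe > /tmp/trace_stream.txt &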

Even after we enable tracing by echo'ing 1 to the tracing_on file, nothing happens as we are not automatically binding to the device. In the snippet above we bind it manually, by writing the pci slot to the /sys/bus/pci/drivers/i915/bind file. We should then start seeing the function being traced:

# head /tmp/trace.txt
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
  3)               |  i915_driver_probe [i915]() {
  3)               |    __devm_drm_dev_alloc() {
  3)               |      __kmalloc() {
  3)               |        kmalloc_order_trace() {
  3)               |          kmalloc_order() {
  3)               |            __alloc_pages() {

Another very useful thing to do is to attach to a specific tracepoint. For example, to trace MMIO reads/writes we can use the commands below. This time I'm using trace-cmd directly and omitting the device bind, which we should do after we start recording:

# trace-cmd record -e i915:i915_reg_rw
Hit Ctrl^C to stop recording
^C
# trace-cmd report
...
tee-2239  [009]   778.693172: i915_reg_rw:          read reg=0x51000, len=4, val=(0xc1900, 0x0)
tee-2239  [009]   778.693174: i915_reg_rw:          write reg=0xc6204, len=4, val=(0x10251000, 0x0)
tee-2239  [009]   778.693796: i915_reg_rw:          read reg=0xa26c, len=4, val=(0x4, 0x0)
tee-2239  [009]   778.693805: i915_reg_rw:          read reg=0xd00, len=4, val=(0x14, 0x0)
tee-2239  [009]   778.693808: bprint:               intel_gt_init_clock_frequency: 0000:03:00.0 Using clock frequency: 19200kHz, period: 53ns, wrap: 113816ms
tee-2239  [009]   778.693817: i915_reg_rw:          read reg=0x913c, len=4, val=(0xff00, 0x0)
tee-2239  [009]   778.693825: i915_reg_rw:          read reg=0x9144, len=4, val=(0xff00, 0x0)
tee-2239  [009]   778.693908: i915_reg_rw:          read reg=0x9134, len=4, val=(0xff, 0x0)
tee-2239  [009]   778.693918: i915_reg_rw:          read reg=0x9118, len=4, val=(0xd02, 0x0)
tee-2239  [009]   778.693933: i915_reg_rw:          read reg=0x9140, len=4, val=(0x30005, 0x0)
tee-2239  [009]   778.693941: i915_reg_rw:          read reg=0x911c, len=4, val=(0x3000000, 0x0)

March 15, 2022 05:00 PM

March 12, 2022

Linux Plumbers Conference: Proposed Microconferences

We are pleased to announce the first batch of proposed microconferences at the Linux Plumbers Conference (LPC) 2022:

The call for microconference proposals will close on Saturday, 2 April 2022. The slots are filling up fast, so we strongly encourage everyone thinking about submitting a microconference to do so as soon as possible!

LPC 2022 is currently planned to take place in Dublin, Ireland from 12 September to 14 September. For details about the location and co-location with other events, see our website and social media for updates.

We do hope that LPC 2022 will be mainly an in-person event. Ideally, microconference runners should be willing and able to attend in person.

March 12, 2022 01:03 PM

February 17, 2022

Dave Airlie (blogspot): optimizing llvmpipe vertex/fragment processing.

Around 2 years ago, while I was working on tessellation support for llvmpipe and running the heaven benchmark on my Ryzen, I noticed that heaven, despite running slowly, wasn't saturating all the cores. I dug in a bit, and found that llvmpipe, despite threading the rasterization, fragment shading and blending stages, never did anything else while those were happening.

I dug into the code as I clearly remembered seeing a concept of a "scene" into which all the primitives were binned and then dispatched. It turned out the "scene" was always executed synchronously.

At the time I wrote support to allow multiple scenes to exist, so that while one scene was executing, the vertex shading and binning for the next scene could execute and be queued up. For heaven at the time I saw some places where it would build 36 scenes. However heaven was still 1fps with tess, and regressions in other areas were rampant, so I mostly left the patches in a branch.

The reason so many things were broken by the patches was that large parts of llvmpipe, and also lavapipe, weren't ready for the async pipeline processing. The concept of a fence after the pipeline finished was there, but wasn't used properly everywhere. A lot of operations assumed there was nothing going on behind the scenes so never fenced. Lots of things like queries broke due to the fact that a query would always be ready in the old model, but now query availability could return unavailable like a real hw driver. Resource tracking existed but was incomplete, so knowing when to flush wasn't always accurate. Presentation was broken due to incorrect waiting both for GL and Lavapipe. Lavapipe needed semaphore support that actually did things, as apps used it between the render and present pipeline pieces.

Mesa CI recently got some paraview traces added to it, and I was doing some perf traces with them. Paraview is a data visualization tool, and it generates vertex heavy workloads, as opposed to compositors and even games. It turned out binning was most of the overhead, and I realized the overlapping series could help this sort of workload. I dusted off the patch series and nailed down all the issues.

Emma Anholt ran some benchmarks on the results with the paraview traces and got

  • pv-waveletvolume fps +13.9279% +/- 4.91667% (n=15)
  • pv-waveletcountour fps +67.8306% +/- 11.4762% (n=3)
which seems like a good return on the investment.

I've got it all lined up in a merge request and it doesn't break CI anymore, so hopefully I'll get it landed in the next while, once I clean up any misc bits.

February 17, 2022 03:33 AM

February 11, 2022

Linux Plumbers Conference: CFP Open – Refereed Presentations

The Call for Refereed Presentation Proposals for the 2022 edition of the Linux Plumbers Conference (LPC) is now open.  We plan to hold LPC in Dublin, Ireland on September 12-14 in conjunction with The Linux Foundation Open Source Summit. 

If an in-person conference should prove to be impossible due to the circumstances at that time, Linux Plumbers will switch to a virtual-only conference. Submitters should ideally be able to give their presentation in person if circumstances permit, although presenting remotely will be possible in either case. Please see our website or social media for regular updates.

Refereed Presentations are 45 minutes in length and should focus on a specific aspect of the “plumbing” in a Linux system. Examples of Linux plumbing include core kernel subsystems, init systems, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problem statements, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

The Refereed Presentations track will be running throughout all three days of the conference. Note that the current Linux Plumbers Refereed track may overlap with the Open Source Summit.

Linux Plumbers Conference Program Committee members will be reviewing all submitted proposals.  High-quality submissions that cannot be accepted due to the limited number of slots will be forwarded to both the Open Source Summit and to organizers of suitable Linux Plumbers Microconferences for further consideration.

To submit a Refereed Track Presentation proposal, follow the instructions here [1].

Submissions are due on or before 11:59PM UTC on Sunday, June 12, 2022.

[1] https://lpc.events/event/16/abstracts/

February 11, 2022 07:24 PM

February 04, 2022

Linux Plumbers Conference: CFP Open – Microconferences

We are pleased to announce the call for papers (cfp) for microconferences at the Linux Plumbers Conference (LPC) 2022.

LPC 2022 is currently planned to take place in Dublin, Ireland from 12 September to 14 September. For details about the location and co-location with other events, see our website and social media for updates.

We do hope that LPC 2022 will be mainly an in-person event. Ideally, microconference runners should be willing and able to attend in person.

As the name suggests, LPC is concerned with Linux plumbing encompassing topics from kernel and userspace. A microconference is a set of sessions organized around a particular topic. The topic can be a kernel subsystem or a specific problem area in either kernel or userspace.

A microconference is supposed to be research and development in action and an abstract for a microconference should be thought of as a set of research questions and problem statements.

The sessions in each microconference are expected to address specific problems and should generate new ideas, solutions, and patches. Sessions should be focussed on discussion. Presentations should always aim to aid or kick off a discussion. If your presentation feels like a talk, we would recommend considering a submission to the LPC refereed track.

In past years microconferences were organized around topics such as security, scalability, energy efficiency, toolchains, containers, printing, system boot, Android, scheduling, filesystems, tracing, or real-time. The LPC microconference track is open to a wide variety of topics as long as it is focussed, concerned with interesting problems, and is related to open source and the wider Linux ecosystem. We are happy about a wide range of topics!

A microconference submission should outline the overall topic and list key people and problems which can be discussed. The list of problems and specific topics in a microconference can be continuously updated until fairly late. This will allow microconferences to cover topics that pop up after submission and to address new developments or problems.

Microconferences that have been at previous LPCs should list results and accomplishments in the submission and should make sure to cover follow-up work and new topics.

After a microconference has been accepted, microconference organizers are expected to write a short blogpost for the LPC website to announce and advertise their topic.

February 04, 2022 03:48 PM

February 03, 2022

Pete Zaitcev: Cura on Fedora is dead, use Slic3r

Was enjoying my Prusa i3S for a few months, but had to use my Lulzbot Mini today, and it was something else.

In the past, I used the cura-lulzbot package. It went through difficult times, with a Russian take-over and Qtfication. But I persisted in suffering, because, well, it was turnkey and I was a complete novice.

So, I went to install Cura on Fedora 35, and found that package cura-lulzbot is gone. Probably failed to build, and with spot no longer at Red Hat, nobody was motivated enough to keep it going.

The "cura" package is the Ultimaker Cura. It's an absolute dumpster fire of pretend open source. Tons of plug-ins, but basic materials are missing. I print in BASF Ultrafuse ABS, but the nearest available material is the PC/ABS mix.

The material problem is fixable with configuration, but a more serious problem is that the UI is absolutely bonkers with crazy flashing - and it does not work. They have menus that cannot be reached: as I move the cursor into a submenu, it disappears. Something is seriously broken in Qt on F35.

BTW, OpenSCAD suffers from incorrect refresh too on F35. It's super annoying, but at least it works, mostly.

Fortunately, "dnf remove cura" also removes 743 trash packages that it pulls in.

Then, I installed Slic3r, and that turned out to be pretty dope. It's a well put together package, and it has a graphical UI nowadays, operation of which is mostly bug-free and makes sense.

However, my first print popped off. As it turned out, Lulzbot requires an initial sequence that auto-levels the printer, and I missed that. I could have extracted it from my old dot files, but in the end I downloaded a settings package from the Lulzbot website.

February 03, 2022 12:55 AM

January 20, 2022

Linux Plumbers Conference: Welcome to the 2022 Linux Plumbers Conference

Planning for the 2022 Linux Plumbers Conference is well underway. The hope is to be in Dublin co-located with OSS EU (although with hopefully non-overlapping dates). However, the Linux Foundation is still negotiating for a suitable venue so we can’t fully confirm the location yet.

There is an outside (and hopefully receding) chance that we may have to go back to being fully on-line this year, but if that happens, we’ll be sure to alert you through the usual channels of this blog and twitter.

January 20, 2022 05:55 PM

January 17, 2022

Matthew Garrett: Boot Guard and PSB have user-hostile defaults

Compromising an OS without it being detectable is hard. Modern operating systems support the imposition of a security policy or the launch of some sort of monitoring agent sufficiently early in boot that even if you compromise the OS, you're probably going to have left some sort of detectable trace[1]. You can avoid this by attacking the lower layers - if you compromise the bootloader then it can just hotpatch a backdoor into the kernel before executing it, for instance.

This is avoided via one of two mechanisms. Measured boot (such as TPM-based Trusted Boot) makes a tamper-proof cryptographic record of what the system booted, with each component in turn creating a measurement of the next component in the boot chain. If a component is tampered with, its measurement will be different. This can be used either to prevent the release of a cryptographic secret if the boot chain is modified (for instance, using the TPM to encrypt the disk encryption key), or to attest the boot state to another device which can tell you whether you're safe or not. The other approach is verified boot (such as UEFI Secure Boot), where each component in the boot chain verifies the next component before executing it. If the verification fails, execution halts.
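
As an aside, the measured boot side of this is easy to inspect from a running Linux system, assuming the tpm2-tools package is installed - a sketch (which PCR records what is defined by the TCG PC Client spec: PCR 0 covers the core firmware code, 4 the boot manager, 7 the Secure Boot policy):

# Dump the PCRs that record firmware, boot manager and Secure Boot policy measurements
tpm2_pcrread sha256:0,4,7
# Parse the event log that produced those values
tpm2_eventlog /sys/kernel/security/tpm0/binary_bios_measurements | head -40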

In both cases, each component in the boot chain measures and/or verifies the next. But something needs to be the first link in this chain, and traditionally this was the system firmware. Which means you could tamper with the system firmware and subvert the entire process - either have the firmware patch the bootloader in RAM after measuring or verifying it, or just load a modified bootloader and lie about the measurements or ignore the verification. Attackers had already been targeting the firmware (Hacking Team had something along these lines, although this was pre-secure boot so just dropped a rootkit into the OS), and given a well-implemented measured and verified boot chain, the firmware becomes an even more attractive target.

Intel's Boot Guard and AMD's Platform Secure Boot attempt to solve this problem by moving the validation of the core system firmware to an (approximately) immutable environment. Intel's solution involves the Management Engine, a separate x86 core integrated into the motherboard chipset. The ME's boot ROM verifies a signature on its firmware before executing it, and once the ME is up it verifies that the system firmware's bootblock is signed using a public key that corresponds to a hash blown into one-time programmable fuses in the chipset. What happens next depends on policy - it can either prevent the system from booting, allow the system to boot to recover the firmware but automatically shut it down after a while, or flag the failure but allow the system to boot anyway. Most policies will also involve a measurement of the bootblock being pushed into the TPM.

AMD's Platform Secure Boot is slightly different. Rather than the root of trust living in the motherboard chipset, it's in AMD's Platform Security Processor which is incorporated directly onto the CPU die. Similar to Boot Guard, the PSP has ROM that verifies the PSP's own firmware, and then that firmware verifies the system firmware signature against a set of blown fuses in the CPU. If that fails, system boot is halted. I'm having trouble finding decent technical documentation about PSB, and what I have found doesn't mention measuring anything into the TPM - if this is the case, PSB only implements verified boot, not measured boot.

What's the practical upshot of this? The first is that you can't replace the system firmware with anything that doesn't have a valid signature, which effectively means you're locked into firmware the vendor chooses to sign. This prevents replacing the system firmware with either a replacement implementation (such as Coreboot) or a modified version of the original implementation (such as firmware that disables locking of CPU functionality or removes hardware allowlists). In this respect, enforcing system firmware verification works against the user rather than benefiting them. Of course, it also prevents an attacker from doing the same thing, but while this is a real threat to some users, I think it's hard to say that it's a realistic threat for most users.

The problem is that vendors are shipping with Boot Guard and (increasingly) PSB enabled by default. In the AMD case this causes another problem - because the fuses are in the CPU itself, a CPU that's had PSB enabled is no longer compatible with any motherboards running firmware that wasn't signed with the same key. If a user wants to upgrade their system's CPU, they're effectively unable to sell the old one. But in both scenarios, the user's ability to control what their system is running is reduced.

As I said, the threat that these technologies seek to protect against is real. If you're a large company that handles a lot of sensitive data, you should probably worry about it. If you're a journalist or an activist dealing with governments that have a track record of targeting people like you, it should probably be part of your threat model. But otherwise, the probability of you being hit by a purely userland attack is so ludicrously high compared to you being targeted this way that it's just not a big deal.

I think there's a more reasonable tradeoff than where we've ended up. Tying things like disk encryption secrets to TPM state means that if the system firmware is measured into the TPM prior to being executed, we can at least detect that the firmware has been tampered with. In this case nothing prevents the firmware being modified, there's just a record in your TPM that it's no longer the same as it was when you encrypted the secret. So, here's what I'd suggest:

1) The default behaviour of technologies like Boot Guard or PSB should be to measure the firmware signing key and whether the firmware has a valid signature into PCR 7 (the TPM register that is also used to record which UEFI Secure Boot signing key is used to verify the bootloader).
2) If the PCR 7 value changes, the disk encryption key release will be blocked, and the user will be redirected to a key recovery process. This should include remote attestation, allowing the user to be informed that their firmware signing situation has changed.
3) Tooling should be provided to switch the policy from merely measuring to verifying, and users at meaningful risk of firmware-based attacks should be encouraged to make use of this tooling.

This would allow users to replace their system firmware at will, at the cost of having to re-seal their disk encryption keys against the new TPM measurements. It would provide enough information that, in the (unlikely for most users) scenario that their firmware has actually been modified without their knowledge, they can identify that. And it would allow users who are at high risk to switch to a higher security state, and for hardware that is explicitly intended to be resilient against attacks to have different defaults.
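To make the sealing side of this concrete, here is a rough simulation (plain Python, not the TPM's actual sealing interface, and the measurement strings are made up): the disk encryption key is released only while PCR 7 still matches the value it was sealed against, otherwise the user lands in key recovery as in point 2 above.

  import hashlib

  def extend(pcr: bytes, measurement: bytes) -> bytes:
      return hashlib.sha256(pcr + hashlib.sha256(measurement).digest()).digest()

  def seal(secret: bytes, pcr: bytes) -> dict:
      # A real TPM enforces this policy in hardware; here it is just data.
      return {"secret": secret, "sealed_pcr": pcr}

  def unseal(blob: dict, current_pcr: bytes) -> bytes:
      if current_pcr != blob["sealed_pcr"]:
          raise PermissionError("PCR 7 changed: block release, go to key recovery")
      return blob["secret"]

  pcr7 = extend(bytes(32), b"vendor signing key, firmware signature valid")
  blob = seal(b"disk encryption key", pcr7)

  assert unseal(blob, pcr7) == b"disk encryption key"  # same firmware situation

  # User flashes firmware signed with their own key: release is blocked,
  # but the machine still boots and the key can be re-sealed afterwards.
  new_pcr7 = extend(bytes(32), b"user signing key, firmware signature valid")
  try:
      unseal(blob, new_pcr7)
  except PermissionError:
      pass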

This is frustratingly close to possible with Boot Guard, but I don't think it's quite there. Before you've blown the Boot Guard fuses, the Boot Guard policy can be read out of flash. This means that you can drop a Boot Guard configuration into flash telling the ME to measure the firmware but not prevent it from running. But there are two problems remaining:

1) The measurement is made into PCR 0, and PCR 0 changes every time your firmware is updated. That makes it a bad default for sealing encryption keys.
2) It doesn't look like the policy is measured before being enforced. This means that an attacker can simply reflash modified firmware with a policy that disables measurement and then make a fake measurement that makes it look like the firmware is ok.

Fixing this seems simple enough - the Boot Guard policy should always be measured, and measurements of the policy and the signing key should be made into a PCR other than PCR 0. If an attacker modified the policy, the PCR value would change. If an attacker modified the firmware without modifying the policy, the PCR value would also change. People who are at high risk would run an app that would blow the Boot Guard policy into fuses rather than just relying on the copy in flash, and enable verification as well as measurement. Now if an attacker tampers with the firmware, the system simply refuses to boot and the attacker doesn't get anything.

Things are harder on the AMD side. I can't find any indication that PSB supports measuring the firmware at all, which obviously makes this approach impossible. I'm somewhat surprised by that, and so wouldn't be surprised if it does do a measurement somewhere. If it doesn't, there's a rather more significant problem - if a system has a socketed CPU, and someone has sufficient physical access to replace the firmware, they can just swap out the CPU as well with one that doesn't have PSB enabled. Under normal circumstances the system firmware can detect this and prompt the user, but given that the attacker has just replaced the firmware we can assume that they'd do so with firmware that doesn't decide to tell the user what just happened. In the absence of better documentation, it's extremely hard to say that PSB actually provides meaningful security benefits.

So, overall: I think Boot Guard protects against a real-world attack that matters to a small but important set of targets. I think most of its benefits could be provided in a way that still gave users control over their system firmware, while also permitting high-risk targets to opt-in to stronger guarantees. Based on what's publicly documented about PSB, it's hard to say that it provides real-world security benefits for anyone at present. In both cases, what's actually shipping reduces the control people have over their systems, and should be considered user-hostile.

[1] Assuming that someone's both turning this on and actually looking at the data produced


January 17, 2022 04:37 AM

January 09, 2022

Matthew Garrett: Pluton is not (currently) a threat to software freedom

At CES this week, Lenovo announced that their new Z-series laptops would ship with AMD processors that incorporate Microsoft's Pluton security chip. There's a fair degree of cynicism around whether Microsoft have the interests of the industry as a whole at heart or not, so unsurprisingly people have voiced concerns about Pluton allowing for platform lock-in and future devices no longer booting non-Windows operating systems. Based on what we currently know, I think those concerns are understandable but misplaced.

But first it's helpful to know what Pluton actually is, and that's hard because Microsoft haven't actually provided much in the way of technical detail. The best I've found is a discussion of Pluton in the context of Azure Sphere, Microsoft's IoT security platform. This, in association with the block diagrams on pages 12 and 13 of this slide deck, suggests that Pluton is a general purpose security processor in a similar vein to Google's Titan chip. It has a relatively low powered CPU core, an RNG, and various hardware cryptography engines - there's nothing terribly surprising here, and it's pretty much the same set of components that you'd find in a standard Trusted Platform Module of the sort shipped in pretty much every modern x86 PC. But unlike Titan, Pluton seems to have been designed with the explicit goal of being incorporated into other chips, rather than being a standalone component. In the Azure Sphere case, we see it directly incorporated into a Mediatek chip. In the Xbox Series devices, it's incorporated into the SoC. And now, we're seeing it arrive on general purpose AMD CPUs.

Microsoft's announcement says that Pluton can be shipped in three configurations: as the Trusted Platform Module; as a security processor used for non-TPM scenarios like platform resiliency; or OEMs can choose to ship with Pluton turned off. What we're likely to see to begin with is the former - Pluton will run firmware that exposes a Trusted Computing Group compatible TPM interface. This is almost identical to the status quo. Microsoft have required that all Windows certified hardware ship with a TPM for years now, but for cost reasons this is often not in the form of a separate hardware component. Instead, both Intel and AMD provide support for running the TPM stack on a component separate from the main execution cores on the system - for Intel, this TPM code runs on the Management Engine integrated into the chipset, and for AMD on the Platform Security Processor that's integrated into the CPU package itself.

So in this respect, Pluton changes very little; the only difference is that the TPM code is running on hardware dedicated to that purpose, rather than alongside other code. Importantly, in this mode Pluton will not do anything unless the system firmware or OS ask it to. Pluton cannot independently block the execution of any other code - it knows nothing about the code the CPU is executing unless explicitly told about it. What the OS can certainly do is ask Pluton to verify a signature before executing code, but the OS could also just verify that signature itself. Windows can already be configured to reject software that doesn't have a valid signature. If Microsoft wanted to enforce that they could just change the default today, there's no need to wait until everyone has hardware with Pluton built-in.

The two things that seem to cause people concerns are remote attestation and the fact that Microsoft will be able to ship firmware updates to Pluton via Windows Update. I've written about remote attestation before, so won't go into too many details here, but the short summary is that it's a mechanism that allows your system to prove to a remote site that it booted a specific set of code. What's important to note here is that the TPM (Pluton, in the scenario we're talking about) can't do this on its own - remote attestation can only be triggered with the aid of the operating system. Microsoft's Device Health Attestation is an example of remote attestation in action, and the technology definitely allows remote sites to refuse to grant you access unless you booted a specific set of software. But there are two important things to note here: first, remote attestation cannot prevent you from booting whatever software you want, and second, as evidenced by Microsoft already having a remote attestation product, you don't need Pluton to do this! Remote attestation has been possible since TPMs started shipping over two decades ago.
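A toy version of that flow, just to pin down who does what (this is a simulation: an HMAC stands in for the attestation key signature and the "TPM" is a plain function, so none of this is a real attestation protocol):

  import hashlib, hmac, os

  AK = os.urandom(32)  # attestation key; in reality the verifier only holds the public half

  def pcr_digest(pcrs: dict, nonce: bytes) -> bytes:
      return hashlib.sha256(repr(sorted(pcrs.items())).encode() + nonce).digest()

  def tpm_quote(pcrs: dict, nonce: bytes) -> bytes:
      # Device side: the OS asks the TPM to sign its PCR values plus a fresh nonce.
      return hmac.new(AK, pcr_digest(pcrs, nonce), hashlib.sha256).digest()

  def attest_ok(expected: dict, reported: dict, nonce: bytes, sig: bytes) -> bool:
      # Verifier side: check the signature, then compare the reported PCRs
      # against whatever it considers an acceptable boot state.
      good = hmac.compare_digest(hmac.new(AK, pcr_digest(reported, nonce), hashlib.sha256).digest(), sig)
      return good and reported == expected

  nonce = os.urandom(16)
  pcrs = {7: "secure boot key X"}
  assert attest_ok({7: "secure boot key X"}, pcrs, nonce, tpm_quote(pcrs, nonce))
  # If the OS simply never calls tpm_quote(), the remote site learns nothing -
  # which is why attestation cannot stop you from booting whatever you want.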

The other concern is Microsoft having control over the firmware updates. The context here is that TPMs are not magically free of bugs, and sometimes these can have security consequences. One example is Infineon TPMs producing weak RSA keys, a vulnerability that could be rectified by a firmware update to the TPM. Unfortunately these updates had to be issued by the device manufacturer rather than Infineon being able to do so directly. This meant users had to wait for their vendor to get around to shipping an update, something that might not happen at all if the machine was sufficiently old. From a security perspective, being able to ship firmware updates for the TPM without them having to go through the device manufacturer is a huge win.

Microsoft's obviously in a position to ship a firmware update that modifies the TPM's behaviour - there would be no technical barrier to them shipping code that resulted in the TPM just handing out your disk encryption secret on demand. But Microsoft already control the operating system, so they already have your disk encryption secret. There's no need for them to backdoor the TPM to give them something that the TPM's happy to give them anyway. If you don't trust Microsoft then you probably shouldn't be running Windows, and if you're not running Windows Microsoft can't update the firmware on your TPM.

So, as of now, Pluton running firmware that makes it look like a TPM just isn't a terribly interesting change to where we are already. It can't block you running software (either apps or operating systems). It doesn't enable any new privacy concerns. There's no mechanism for Microsoft to forcibly push updates to it if you're not running Windows.

Could this change in future? Potentially. Microsoft mention another use-case for Pluton "as a security processor used for non-TPM scenarios like platform resiliency", but don't go into any more detail. At this point, we don't know the full set of capabilities that Pluton has. Can it DMA? Could it play a role in firmware authentication? There are scenarios where, in theory, a component such as Pluton could be used in ways that would make it more difficult to run arbitrary code. It would be reassuring to hear more about what the non-TPM scenarios are expected to look like and what capabilities Pluton actually has.

But let's not lose sight of something more fundamental here. If Microsoft wanted to block free operating systems from new hardware, they could simply mandate that vendors remove the ability to disable secure boot or modify the key databases. If Microsoft wanted to prevent users from being able to run arbitrary applications, they could just ship an update to Windows that enforced signing requirements. If they want to be hostile to free software, they don't need Pluton to do it.

(Edit: it's been pointed out that I kind of gloss over the fact that remote attestation is a potential threat to free software, as it theoretically allows sites to block access based on which OS you're running. There's various reasons I don't think this is realistic - one is that there's just way too much variability in measurements for it to be practical to write a policy that's strict enough to offer useful guarantees without also blocking a number of legitimate users, and the other is that you can just pass the request through to a machine that is running the appropriate software and have it attest for you. The fact that nobody has actually bothered to use remote attestation for this purpose even though most consumer systems already ship with TPMs suggests that people generally agree with me on that)


January 09, 2022 02:35 AM

January 04, 2022

Pete Zaitcev: PyPI is not trustworthy

I was dealing with a codebase S at work that uses a certain Python package N (I'll name it in the end, because its identity is so odious that it will distract from the topic at hand). Anyhow, S failed tests because N didn't work on my Fedora 35. That happened because S installed N with pip(1), which pulls from PyPI, and the archive at PyPI contained broken code.

The code for N in its source repository was fine, only PyPI was bad.

When I tried to find out what happened, it turned out that there is no audit trail for the code in PyPI. In addition, it is not possible to contact the listed maintainers of N on PyPI, and there is no way to report the problem: PyPI's problem tracking system is plastered with warnings not to use it for problems with packages, but only for problems with the PyPI software itself.

By fuzzy-matching provided personal details with git logs, I was able to contact the maintainers. To my great surprise, two out of three even responded, but they disclaimed any knowledge of what went on.

So, an unknown entity was able to insert certain code into a package at PyPI, and pip(1) was downloading it for years. This only came to light because the inserted code failed on my Fedora test box.

At this point I can only conclude that PyPI is not trustworthy.

Oh, yeah. The package N is actually nose. I am aware that it was dead and unmaintained, and nobody should be using it anymore, least of all S. I'm working on it.
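One partial mitigation, sketched here rather than taken from the post: pin the hash of the exact artifact you reviewed (pip can enforce this itself with --require-hashes), and refuse anything else. The digest below is a placeholder.

  import hashlib, sys

  PINNED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"  # placeholder

  def verify_sdist(path: str) -> bool:
      # Compare a downloaded archive against a hash recorded when it was known good.
      h = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              h.update(chunk)
      return h.hexdigest() == PINNED_SHA256

  if __name__ == "__main__":
      sys.exit(0 if verify_sdist(sys.argv[1]) else 1)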

January 04, 2022 08:34 PM

December 31, 2021

Matthew Garrett: Update on Linux hibernation support when lockdown is enabled

Some time back I wrote up a description of my proposed (and implemented) solution for making hibernation work under Linux even within the bounds of the integrity model. It's been a while, so here's an update.

The first is that localities just aren't an option. It turns out that they're optional in the spec, and TPMs are entirely permitted to say they don't support them. The only time they're likely to work is on platforms that support DRTM implementations like TXT. Most consumer hardware doesn't fall into that category, so we don't get to use that solution. Unfortunate, but, well.

The second is that I'd ignored an attack vector. If the kernel is configured to restrict access to PCR 23, then yes, an attacker is never able to modify PCR 23 to be in the same state it would be if hibernation were occurring and the key certification data will fail to validate. Unfortunately, an attacker could simply boot into an older kernel that didn't implement the PCR 23 restriction, and could fake things up there (yes, this is getting a bit convoluted, but the entire point here is to make this impossible rather than just awkward). Once PCR 23 was in the correct state, they would then be able to write out a new swap image, boot into a new kernel that supported the secure hibernation solution, and have that resume successfully in the (incorrect) belief that the image was written out in a secure environment.

This felt like an awkward problem to fix. We need to be able to distinguish between the kernel having modified the PCRs and userland having modified the PCRs, and we need to be able to do this without modifying any kernels that have already been released[1]. The normal approach to determining whether an event occurred in a specific phase of the boot process is to "cap" the PCR - extend it with a known value that indicates a transition between stages of the boot process. Any events that occur before the cap event must have occurred in the previous stage of boot, and since the final PCR value depends on the order of measurements and not just the contents of those measurements, if a PCR is capped before userland runs, userland can't fake the same PCR value afterwards. If Linux capped a PCR before userland started running, we'd be able to place a measurement there before the cap occurred and then prove that that extension occurred before userland had the opportunity to interfere. We could simply place a statement that the kernel supported the PCR 23 restrictions there, and we'd be fine.

Unfortunately Linux doesn't currently do this, and adding support for doing so doesn't fix the problem - if an attacker boots a kernel that doesn't cap a PCR, they can just cap it themselves from userland. So, we're faced with the same problem: booting an older kernel allows the system to be placed in an identical state to the current kernel, and a fake hibernation image can be written out. Solving this required a PCR that was being modified after kernel code was running, but before userland was started, even with existing kernels.

Thankfully, there is one! PCR 5 is defined as containing measurements related to boot management configuration and data. One of the measurements it contains is the result of the UEFI ExitBootServices() call. ExitBootServices() is called at the transition from the UEFI boot environment to the running OS, and the kernel contains code that executes before it. So, if we measure an assertion regarding whether or not we support restricted access to PCR 23 into PCR 5 before we call ExitBootServices(), this will prevent userspace from spoofing us (because userspace will only be able to extend PCR 5 after the firmware extended PCR 5 in response to ExitBootServices() being called). Obviously this depends on the firmware actually performing the PCR 5 extension when ExitBootServices() is called, but if firmware's out of spec then I don't think there's any real expectation of it being secure enough for any of this to buy you anything anyway.
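The reason this works is that PCR values are order-dependent: a measurement made before the firmware's ExitBootServices() extension cannot be reproduced by extending the same event afterwards. A minimal sketch of just that property (conceptual, not real TPM code):

  import hashlib

  def extend(pcr: bytes, event: bytes) -> bytes:
      return hashlib.sha256(pcr + hashlib.sha256(event).digest()).digest()

  pcr5 = bytes(32)

  # New kernel: asserts PCR 23 support, then the firmware logs ExitBootServices().
  honest = extend(extend(pcr5, b"kernel supports PCR 23 restriction"), b"ExitBootServices()")

  # Old kernel: ExitBootServices() has already been logged; userland extending
  # the same assertion afterwards produces a different value, so the spoof shows.
  spoofed = extend(extend(pcr5, b"ExitBootServices()"), b"kernel supports PCR 23 restriction")

  assert honest != spoofed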

My current tree is here, but there's a couple of things I want to do before submitting it, including ensuring that the key material is wiped from RAM after use (otherwise it could potentially be scraped out and used to generate another image afterwards) and, uh, actually making sure this works (I no longer have the machine I was previously using for testing, and switching my other dev machine over to TPM 2 firmware is proving troublesome, so I need to pull another machine out of the stack and reimage it).

[1] The linear nature of time makes feature development much more frustrating


December 31, 2021 03:36 AM

December 24, 2021

Paul E. Mc Kenney: Parallel Programming: December 2021 Update

It is past time for another release of Is Parallel Programming Hard, And, If So, What Can You Do About It?. But first, what is the difference between an edition and a release?

The main difference is the level of validation. For example, during the several months leading up to the second edition, I read the entire book, fixing issues as I found them. So where an edition is a completed work, a release is primarily for the benefit of people who would like to see a recent snapshot of this book, but without having to install and run LaTeX and its dependencies.

Having cleared that up, here are the big-animal changes since the second edition:


  1. A lot of detailed work to make the new ebook-sized PDF usable, along with other formatting improvements, courtesy of Akira Yokosawa, SeongJae Park, and Balbir Singh.
  2. The nq build now places quick quizzes at the end of each chapter, courtesy of Akira Yokosawa.
  3. There are a few new sections in the “Locking” chapter, the “Putting It All Together” chapter, the “Advanced Synchronization: Memory Ordering” chapter, and the “Important Questions” appendix.
  4. There is now more validation information associated with much of the sample code throughout the book.
  5. The “RCU Usage” section has received a much-needed upgrade.
  6. There is now an index and a list of acronyms, courtesy of Akira Yokosawa.
  7. A new install_latex_package.sh script, courtesy of SeongJae Park.
  8. Greatly improved error handling, including scripts that check for common LaTeX formatting mistakes, courtesy of Akira Yokosawa.


Yang Lu and Zhouyi Zhou are translating the Second Edition to Chinese, and would greatly appreciate additional help.

Everyone mentioned above contributed a great many wordsmithing fixes, as did Chin En Lin, Elad Lahav, Zhouyi Zhou, and GitHub user “rootbeer”. A grateful “thank you” to everyone who contributed!

December 24, 2021 08:22 PM

December 23, 2021

Pete Zaitcev: Adventures in tech support

OVH was pestering me about migrating my VPS from its previous range to the new (and more expensive) one. I finally agreed to it. Migrated the VM to the new host; it launches with no networking. Not entirely unexpected, but it gets better.

The root cause is the DHCP server at OVH returning a lease with netmask /32. In that situation, it's not possible to add a default route, because the next hop is outside of the netmask.
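To see why, a quick check with Python's ipaddress module (the addresses are made up from the 192.0.2.0/24 documentation range):

  import ipaddress

  addr, gateway = "192.0.2.10", "192.0.2.1"

  for prefix in (24, 32):
      net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
      print(f"/{prefix}: gateway on-link: {ipaddress.ip_address(gateway) in net}")
  # /24: True  - "default via 192.0.2.1" is accepted
  # /32: False - the kernel refuses the route because the next hop isn't on-link
  #
  # The usual workarounds are adding a host route to the gateway first, or
  # telling iproute2 to trust you: ip route add default via 192.0.2.1 dev eth0 onlink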

Seems like a simple enough problem, so I filed a ticket in OVH support, basically saying "your DHCP server supplies incorrect netmask, please fix it." Their ultimate answer was, and I quote:

Please note that the VPS IP and our failover IPs are a static IPs, configuring the DHCP server may cause a network issues.

I suppose I knew what I was buying when I saw the price for these OVH VMs. It was too cheap to be true. Still, disappointing.

December 23, 2021 08:17 PM

December 20, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: Removing CONFIG_RCU_FAST_NO_HZ

The CONFIG_RCU_FAST_NO_HZ Kconfig option was added many years ago to improve energy efficiency for systems having significant numbers of short bursts of idle time. Prior to the addition of CONFIG_RCU_FAST_NO_HZ, RCU would insist on keeping a given idle CPU's scheduling-clock tick enabled until all of that CPU's RCU callbacks had been invoked. On certain types of battery-powered embedded systems, these few additional scheduling-clock ticks would consume up to 40% of the battery lifetime. The people working on such systems were not amused, and were not shy about letting me know of their dissatisfaction with RCU's life choices. Please note that “letting me know” did not take the form of flaming me on LKML. Instead, they called me on the telephone and yelled at me.

Given that history, why on earth would I even be thinking about removing CONFIG_RCU_FAST_NO_HZ, let alone queuing a patch series intended for the v5.17 merge window???

The reason is that everyone I know of who builds their kernels with CONFIG_RCU_FAST_NO_HZ=y also boots those systems with each and every CPU designated as a rcu_nocbs CPU. With this combination, CONFIG_RCU_FAST_NO_HZ=y is doing nothing but placing a never-taken branch in the fastpath to and from idle. Such systems should therefore run slightly faster and with slightly better battery lifetime if their kernels were instead built with CONFIG_RCU_FAST_NO_HZ=n, which would get rid of that never-taken branch.

But given that battery-powered embedded folks badly wanted CONFIG_RCU_FAST_NO_HZ=y, and given that they are no longer getting any benefit from it, why on earth haven't they noticed?

They have not noticed because rcu_nocbs CPUs do not invoke their own RCU callbacks. This work is instead delegated to a set of per-CPU rcuoc kthreads, with a smaller set of rcuog kthreads managing those callbacks and requesting grace periods as needed. By default, these rcuoc and rcuog kthreads are not bound, which allows the scheduler (and for that matter, the systems administrator) to take both performance and energy efficiency into account and to run those kthreads wherever is appropriate at any given time. In contrast, non-rcu_nocbs CPUs will always run their own callbacks, even if that means powering up an inconveniently placed portion of the system at an inconvenient time. This includes CONFIG_RCU_FAST_NO_HZ=y kernels, whose only advantage is that they power up inconveniently placed portions of systems at inconvenient times only 25% as often as would a non-rcu_nocbs CPU in a CONFIG_RCU_FAST_NO_HZ=n kernel.

In short, the rcu_nocbs CPUs' practice of letting the scheduler decide where to run the callbacks is especially helpful on asymmetric systems (AKA big.LITTLE systems), as shown by data collected by Dietmar Eggeman and Robin Randhawa. This point is emphasized by the aforementioned fact that everyone I know of who builds their kernels with CONFIG_RCU_FAST_NO_HZ=y also boots those systems with each and every CPU designated as a rcu_nocbs CPU.

So if no one is getting any benefit from building their kernels with CONFIG_RCU_FAST_NO_HZ=y, why keep that added complexity in the Linux kernel? Why indeed, and hence the patch series intended for the v5.17 merge window.

So if you know of someone who is getting significant benefit from CONFIG_RCU_FAST_NO_HZ=y who could not get that benefit from booting with rcu_nocbs CPUs, this would be a most excellent time to let me know!

December 20, 2021 10:14 PM