Kernel Planet

January 26, 2023

Paul E. Mc Kenney: What Does It Mean To Be An RCU Implementation?

Under Construction

A correspondent closed out 2022 by sending me an off-list email asking whether or not a pair of Rust crates (rcu_clean and left_right) were really implementations of read-copy update (RCU).  At first glance, this is a pair of simple yes/no questions that one should be able to answer off the cuff.

What Is An RCU?

Except that there is quite a variety of RCU implementations in the wild.  Even if we remain within the cozy confines of the Linux kernel, we have: (1) The original "vanilla" RCU, (2) Sleepable RCU (SRCU), (3) Tasks RCU, (4) Tasks Rude RCU, and Tasks Trace RCU.  These differ not just in performance characteristics, in fact, it is not in general possible to mechanically convert (say) SRCU to RCU.  Let's look at an overview of their properties.  (For more detail, see the 2019 LWN article.)

Vanilla RCU

Vanilla RCU has quite a variety of bells and whistles:

Sleepable RCU (SRCU)

SRCU has a similar variety of bells and whistles, but some important differences.  The most important difference is that SRCU supports multiple domains, each represented by an srcu_struct structure.  A reader in one domain does not block a grace period in another domain.  In contrast, RCU is global in nature, with exactly one domain.  On the other hand, the price SRCU pays for this flexibility is reduced amortization of grace-period overhead.

Tasks RCU

Tasks RCU was designed specially to handle the trampolines used in Linux-kernel tracing.

Tasks Rude RCU

By design, Tasks RCU does not wait for idle tasks.  Something about them never doing any voluntary context switches on CPUs that remain idle for long periods of time.  So trampoline that might be involved in tracing of code within the idle loop need something else, and that something is Tasks Rude RCU.

Tasks Trace RCU

Both Tasks RCU and Tasks Rude RCU disallow sleeping while executing in a given trampoline.  Some BPF programs need to sleep, hence Tasks Trace RCU.

DYNIX/ptx rclock

The various Linux examples are taken from a code base in which RCU has been under active development for more than 20 years, which might yield an overly stringent set of criteria.  In contrast, the 1990s DYNIX/ptx implementation of RCU (called "rclock" for "read-copy lock") was only under active development for about five years.  The implementation was correspondingly minimal:

Perhaps this can form the basis of an RCU classification system, though some translation will no doubt be required to bridge from C to Rust.

RCU Classification and Rust RCU Crates

Except that the first RCU crate, rcu_clean, throws a monkey wrench into the works.  It does not have any grace-period primitives, but instead a clean() function that takes a reference to a RCU-protected data item.  The user invokes this at some point in the code where it is known that there are no readers, either within this thread or anywhere else.  In true Rust fashion, in some cases, the compiler is able to prove the presence or absence of readers and issue a diagnostic when needed.  The documentation notes that the addition of grace periods (also known as "epochs") would allow greater accuracy.

This sort of thing is not unprecedented.  The userspace RCU library has long had an rcu_quiescent_state() function that can be invoked from a given thread when that particular thread is in a quiescent state, and thus cannot have references to any RCU-protected object.  However, rcu_clean takes this a step further by having no RCU grace-period mechanism at all.

Nevertheless, rcu_clean could be used to implement the add-only list RCU use case, so it is difficult to argue that is not an RCU implementation.  But it is clearly a very primitive implementation.  That said, primitive implementations do have their place, for example:

The left_right crate definitely uses RCU in the guise of epochs, and it can be used for at least some of the things that RCU can be used for.  It does have a single-writer restriction, though as the documentation says, you could use a Mutex to serialize at least some multi-writer use cases.  In addition, it has long been known that RCU use cases involving only a single writer thread permit wait-free updaters as well as wait-free readers.

One might argue that the fact that the left_right crate uses RCU means that it cannot possibly be itself an implementation of RCU.  Except that in the Linux kernel, RCU Tasks uses vanilla RCU, RCU Tasks Trace uses SRCU, and previous versions of SRCU used vanilla RCU.  So let's give the left_right crate the benefit of the doubt, at least for the time being, but with the understanding that it might eventually instead be classified as an RCU use case rather than an RCU implementation.

Here is a prototype classification system:

  1. Are there explicit RCU read-side markers?  Of the Linux-kernel RCU implementations, RCU Tasks and RCU Tasks Rude lack such markers.  Given the Rust borrow checker, it is hard to imagine an implementation without such markers, but feel free to prove me wrong.
  2. Are grace periods computed automatically?  (If not, as in rcu_clean, none of the remaining questions apply.)
  3. Are there synchronous grace-period-wait APIs?  All of the Linux-kernel implementations do, and left_right also looks to.
  4. Are there asynchronous grace-period-wait APIs?  If so, are there callback-wait APIs?All of the Linux-kernel implementations do, but left_right does not appear to.  Providing them seems doable, but might result in more than two copies of recently-updated data structures.
  5. Are there polled grace-period-wait APIs?  The Linux-kernel RCU and SRCU implementations do.
  6. Are there multiple grace-period domains?  The Linux-kernel SRCU implementation does.

But does this classification scheme work for your favorite RCU implementation?  What about your favorite RCU use case?

History

January 26, 2023 01:26 AM

January 23, 2023

Matthew Garrett: Build security with the assumption it will be used against your friends

Working in information security means building controls, developing technologies that ensure that sensitive material can only be accessed by people that you trust. It also means categorising people into "trustworthy" and "untrustworthy", and trying to come up with a reasonable way to apply that such that people can do their jobs without all your secrets being available to just anyone in the company who wants to sell them to a competitor. It means ensuring that accounts who you consider to be threats shouldn't be able to do any damage, because if someone compromises an internal account you need to be able to shut them down quickly.

And like pretty much any security control, this can be used for both good and bad. The technologies you develop to monitor users to identify compromised accounts can also be used to compromise legitimate users who management don't like. The infrastructure you build to push updates to users can also be used to push browser extensions that interfere with labour organisation efforts. In many cases there's no technical barrier between something you've developed to flag compromised accounts and the same technology being used to flag users who are unhappy with certain aspects of management.

If you're asked to build technology that lets you make this sort of decision, think about whether that's what you want to be doing. Think about who can compel you to use it in ways other than how it was intended. Consider whether that's something you want on your conscience. And then think about whether you can meet those requirements in a different way. If they can simply compel one junior engineer to alter configuration, that's very different to an implementation that requires sign-offs from multiple senior developers. Make sure that all such policy changes have to be clearly documented, including not just who signed off on it but who asked them to. Build infrastructure that creates a record of who decided to fuck over your coworkers, rather than just blaming whoever committed the config update. The blame trail should never terminate in the person who was told to do something or get fired - the blame trail should clearly indicate who ordered them to do that.

But most importantly: build security features as if they'll be used against you.

comment count unavailable comments

January 23, 2023 10:44 AM

January 22, 2023

Kernel Podcast: S2E1 – 2023/01/21

Prologue

This is the pilot episode for what will become season 2 of the Linux Kernel Podcast. Back in 2008-2009 I recorded a daily “kernel podcast” that summarized the happenings of the Linux Kernel Mailing List (LKML). Eventually, daily became a little too much, and the podcast went weekly, followed by…not. This time around, I’m not committing to any specific cadence – let’s call it “periodic” (every few weeks). In each episode, I will aim to broadly summarize the latest happenings in the “plumbing” of the Linux kernel, and occasionally related bits of userspace “plumbing” (glibc, systemd, etc.), as well as impactful toolchain changes that enable new features or rebaseline requirements. I welcome your feedback. Please let me know what you think about the format, as well as what you would like to see covered in future episodes. I’m going to play with some ideas over time. These may include “deep diving” into topics of interest to a broader audience. Keep in mind that this podcast is not intended to editorialize, but only to report on what is happening. Both this author, and others, have their own personal opinions, but this podcast aims to focus only on the facts, regardless of who is involved, or their motives.”

On with the show.

For the week ending January 21st 2023, I’m Jon Masters and this is the Linux Kernel Podcast. 

Summary

The latest stable kernel is Linux 6.1.7, released by Greg K-H on January 18th 2023.

The latest mainline (development) kernel is 6.2-rc4, released on January 15th 2023.

Long Term Stable 6.1?

The “stable” kernel series is maintained by Greg K-H (Kroah-Hartman), who posts hundreds of patches with fixes to each Linus kernel. This is where the “.7” comes in on top of Linux 6.1. Such stable patches are maintained between kernel releases, so when 6.2 is released, it will become the next “stable” kernel. Once every year or so, Greg will choose a kernel to be the next “Long Term Stable” (LTS) kernel that will receive even more patches, potentially for many years at a time. Back in October, Kaiwan N Billimoria (author of a book titled “Linux Kernel Programming”), seeking a baseline for the next edition, asked if 6.1 would become the next LTS kernel. A great amount of discussion has followed, with Greg responding to a recent ping by saying, “You tell me please. How has your testing gone for 6.1 so far? Does it work properly for you? Are you and/or your company willing to test out the -rc releases and provide feedback if it works or not for your systems?” and so on. This motivated various others to pile on with comments about their level of testing, though I haven’t seen an official 6.1 LTS as of yet.

Linux 6.2 progress

Linus noted in his 6.2-rc4 announcement mail that this came “with pretty much everybody back from winter holidays, and so things should be back to normal. And you can see that in the size, this is pretty much bang in the middle of a regular rc size for this time in the merge window.” The “merge window” is the period of time during which disruptive changes are allowed to be merged (typically the first two weeks of a kernel cycle prior to the first “RC”) so Linus means to refer to a “cycle” and not “merge window” in his announcement.

Speaking of Linux 6.2, it counts among new features additional support for Rust. Linux 6.1 had added initial Rust patches capable of supporting a “hello world” kernel module (but not much more). 6.2 adds support for accessing certain kernel data structures (such as “task_struct”, the per-task/process structure) and handles converting C-style structure “objects” with collections of (possibly null pointers) into the “memory safe” structures understood by Rust. As usual, Linux Weekly News (LWN) has a great article going into much more detail.

Ongoing Development

Richard Guy Briggs posted the 6th version of a patch series titled “fanotify: Allow user space to pass back additional audit info”, which “defines a new flag (FAN_INFO) and new extensions that define additional information which are appended after the response structure returned from user space on a permission event”. This allows audit logging to much more usefully capture why a policy allowed (or disallowed) certain access. The idea is to “enable the creation of tools that can suggest changes to the policy similar to how audit2allow can help refine labeled security”.

Maximillian Luz posted a patch series titled “firmware: Add support for Qualcomm UEFI Secure Application” that allows regular UEFI applications to access EFI variable via proxy calls to the “UEFI Secure Application” (uefisecapp) running in Q’s “secure world” implementation of Arm Trustzone. He has tested using this on a variety of tables, including a Surface Pro X. The application interface was reverse engineer from the Windows QcTrEE8180.sys driver.

Kees Cook requested a stable kernel backport of support for “oops_limit”, a new kernel feature that seeks to limit the number of “oopses” allowed before a kernel will “panic”. An “oops” is what happens when the kernel attempts to access a null pointer reference. Normal application software will crash (with a “segmentation fault”) when this happens. Inside the kernel, the access is caught (provided it happened while in process context), and the associated (but perhaps unrelated) userspace task (process) is killed in the process of generating an “oops” with a backtrace. The kernel may at that moment leak critical resources associated with the process, such as file handles, memory areas, or locks. These aren’t cleaned up. Consequently, it is possible that repeated oopses can be generated by an attacker and used for privilege escalation. The “oops_limit” patches mitigate this by limiting the number of such oopses allowed before the kernel will give up and “panic” (properly crash, and reboot, depending on config).

Vegard Nossum posted version 3 of a patch series titled “kmod: harden user namespaces with new kernel.ns_modules_allowed syscall”, which seeks to “reduce the attack surface and block exploits by ensuring that user namespaces cannot trigger module (auto-)loading”.

Arseniy Lesin reposted an RFC (Request For Comments) of a “SIGOOM Proposal” that would seek to enable the kernel to send a signal whenever a task (process) was in danger of being killed by the “OOM” (Out Of Memory) killer due to consuming too much anonymous (regular) memory. Willy Tarreau and Ted Ts’o noted that we were actually essentially out of space for new signals, and so rather than declaring a new “SIGOOM”, it would be better to allow a process to select which of the existing signals should be used for this process when it registered to receive such notifications. Arseniy said they would follow up with patches that followed this approach.

Architectures

On the architecture front, Mark Brown posted the 4th version of a patch series enabling support for Arm’s SME (Scalable Matrix Extension) version 2 and 2.1. Huang Ying posted patches enabling “migrate_pages()” (which moves memory between NUMA nodes – memory chips specific to e.g. a certain socket in a server) to support batching of the new(er) memory “folios”, rather than doing them one at a time. Batching allows associated TLB invalidation (tearing down the MMU’s understanding of active virtual to physical addresses) to be batched, which is important on Intel systems using IPIs (Inter-Processor-Interrupts), which are reduced by 99.1% during the associated testing, increasing pages migrated per second on a 2P server by 291.7%.

Xin Li posted version 6 of a patch series titled “x86: Enable LKGS instruction”. The “LKGS instruction is introduced with Intel FRED (flexible return and event delivery) specification. As LKGS is independent of FRED, we enable it as a standalone feature”. LKGS (which is an abbreviation of “load into IA32_KERNEL_GS_BASE”) “behaves like the MOV to GS instruction except that it loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor cache.” This means that an Operating System can perform the necessary work to context switch a user-level thread by updating IA32_KERNEL_GS_BASE and avoiding an explicit set of balanced calls to SWAPGS. This is part of the broader “FRED” architecture defined by Intel in the Flexible Return and Event Delivery (FRED) Specification.

David E. Box posted version 2 of a patch series titled “Extend Intel On Demand (SDSi) support, noting that “Intel Software Defined Silicon (SDSi) is now known as Intel On Demand”. These patches enable support for the Intel feature intended to allow users to load signed payloads into their CPUs to turn on certain features after purchasing a system. This might include (for example) certain accelerators present in future chips that could be enabled as needed, similar to how certain automobiles now include subscription-locked heated seats and other features.

Meanwhile, Anup Patel posted patches titled “RISC-V KVM virtualize AIA CSRs” that enable support for the new AIA (Advanced Interrupt Architecture), which replaces the legacy “PLIC”, and Sia Jee Heng posted patches that enable “RISC-V Hibernation Support”.

Final words

A number of conferences are returning in 2023, including the Linux Storage, Filesystem, Memory Management, and BPF (LSF/MM/BPF) Summit, which will be held from May 8 to May 10 at the Vancouver Convention Center. Josef Bacik noted that the CFP was now open.

Don’t forget to give me your feedback on this pilot episode! jcm@jonmasters.org.

January 22, 2023 04:16 AM

January 19, 2023

Dave Airlie (blogspot): vulkan video decoding: anv status update

After hacking the Intel media-driver and ffmpeg I managed to work out how the anv hardware mostly works now for h264 decoding.

I've pushed a branch [1] and a MR[2] to mesa. The basics of h264 decoding are working great on gen9 and compatible hardware. I've tested it on my one Lenovo WhiskeyLake laptop.

I have ported the code to hasvk as well, and once we get moving on this I'll polish that up and check we can h264 decode on IVB/HSW devices.

The one feature I know is missing is status reporting, radv can't support that from what I can work out due to firmware, but anv should be able to so I might dig into that a bit.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-decode

[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20782

January 19, 2023 03:53 AM

January 18, 2023

Matthew Garrett: PKCS#11. hardware keystores, and Apple frustrations

There's a bunch of ways you can store cryptographic keys. The most obvious is to just stick them on disk, but that has the downside that anyone with access to the system could just steal them and do whatever they wanted with them. At the far end of the scale you have Hardware Security Modules (HSMs), hardware devices that are specially designed to self destruct if you try to take them apart and extract the keys, and which will generate an audit trail of every key operation. In between you have things like smartcards, TPMs, Yubikeys, and other platform secure enclaves - devices that don't allow arbitrary access to keys, but which don't offer the same level of assurance as an actual HSM (and are, as a result, orders of magnitude cheaper).

The problem with all of these hardware approaches is that they have entirely different communication mechanisms. The industry realised this wasn't ideal, and in 1994 RSA released version 1 of the PKCS#11 specification. This defines a C interface with a single entry point - C_GetFunctionList. Applications call this and are given a structure containing function pointers, with each entry corresponding to a PKCS#11 function. The application can then simply call the appropriate function pointer to trigger the desired functionality, such as "Tell me how many keys you have" and "Sign this, please". This is both an example of C not just being a programming language and also of you having to shove a bunch of vendor-supplied code into your security critical tooling, but what could possibly go wrong.

(Linux distros work around this problem by using p11-kit, which is a daemon that speaks d-bus and loads PKCS#11 modules for you. You can either speak to it directly over d-bus, or for apps that only speak PKCS#11 you can load a module that just transports the PKCS#11 commands over d-bus. This moves the weird vendor C code out of process, and also means you can deal with these modules without having to speak the C ABI, so everyone wins)

One of my work tasks at the moment is helping secure SSH keys, ensuring that they're only issued to appropriate machines and can't be stolen afterwards. For Windows and Linux machines we can stick them in the TPM, but Macs don't have a TPM as such. Instead, there's the Secure Enclave - part of the T2 security chip on x86 Macs, and directly integrated into the M-series SoCs. It doesn't have anywhere near as many features as a TPM, let alone an HSM, but it can generate NIST curve elliptic curve keys and sign things with them and that's good enough. Things are made more complicated by Apple only allowing keys to be used by the app that generated them, so it's hard for applications to generate keys on behalf of each other. This can be mitigated by using CryptoTokenKit, an interface that allows apps to present tokens to the systemwide keychain. Although this is intended for allowing a generic interface for access to such tokens (kind of like PKCS#11), an app can generate its own keys in the Secure Enclave and then expose them to other apps via the keychain through CryptoTokenKit.

Of course, applications then need to know how to communicate with the keychain. Browsers mostly do so, and Apple's version of SSH can to an extent. Unfortunately, that extent is "Retrieve passwords to unlock on-disk keys", which doesn't help in our case. PKCS#11 comes to the rescue here! Apple ship a module called ssh-keychain.dylib, a PKCS#11 module that's intended to allow SSH to use keys that are present in the system keychain. Unfortunately it's not super well maintained - it got broken when Big Sur moved all the system libraries into a cache, but got fixed up a few releases later. Unfortunately every time I tested it with our CryptoTokenKit provider (and also when I retried with SecureEnclaveToken to make sure it wasn't just our code being broken), ssh would tell me "provider /usr/lib/ssh-keychain.dylib returned no slots" which is not especially helpful. Finally I realised that it was actually generating more debug output, but it was being sent to the system debug logs rather than the ssh debug output. Well, when I say "more debug output", I mean "Certificate []: algorithm is not supported, ignoring it", which still doesn't tell me all that much. So I stuck it in Ghidra and searched for that string, and the line above it was

iVar2 = __auth_stubs::_objc_msgSend(uVar7,"isEqual:",*(undefined8*)__got::_kSecAttrKeyTypeRSA);

with it immediately failing if the key isn't RSA. Which it isn't, since the Secure Enclave doesn't support RSA. Apple's PKCS#11 module appears incapable of making use of keys generated on Apple's hardware.

There's a couple of ways of dealing with this. The first, which is taken by projects like Secretive, is to implement the SSH agent protocol and have SSH delegate key management to that agent, which can then speak to the keychain. But if you want this to work in all cases you need to implement all the functionality in the existing ssh-agent, and that seems like a bunch of work. The second is to implement a PKCS#11 module, which sounds like less work but probably more mental anguish. I'll figure that out tomorrow.

comment count unavailable comments

January 18, 2023 05:26 AM

January 17, 2023

Dave Airlie (blogspot): vulkan video decoding: av1 (yes av1) status update

Needless to say h264/5 weren't my real goals in life for video decoding. Lynne and myself decided to see what we could do to drive AV1 decode forward by creating our own extensions called VK_MESA_video_decode_av1. This is a radv only extension so far, and may expose some peculiarities of AMD hardware/firmware.

Lynne's blog entry[1] has all the gory details, so go read that first. (really read it first).

Now that you've read and understood all that, I'll just rant here a bit. Figuring out the DPB management and hw frame ref and curr_pic_idx fields was a bit of a nightmare. I spent a few days hacking up a lot of wrong things before landing on the thing we agreed was the least wrong which was having the ffmpeg code allocate a frame index in the same fashion as the vaapi radeon implementation did. I had another hacky solution that involved overloading the slotIndex value to mean something that wasn't DPB slot index, but it wasn't really any better. I think there may be something about the hw I don't understand so hopefully we can achieve clarity later.

[1] https://lynne.ee/vk_mesa_video_decode_av1.html

January 17, 2023 07:54 AM

January 15, 2023

Matthew Garrett: Blogging and microblogging

Long-term Linux users may remember that Alan Cox used to write an online diary. This was before the concept of a "Weblog" had really become a thing, and there certainly weren't any expectations around what one was used for - while now blogging tends to imply a reasonably long-form piece on a specific topic, Alan was just sitting there noting small life concerns or particular technical details in interesting problems he'd solved that day. For me, that was fascinating. I was trying to figure out how to get into kernel development, and was trying to read as much LKML as I could to figure out how kernel developers did stuff. But when you see discussion on LKML, you're frequently missing the early stages. If an LKML patch is a picture of an owl, I wanted to know how to draw the owl, and most of the conversations about starting in kernel development were very "Draw two circles. Now draw the rest of the owl". Alan's musings gave me insight into the thought processes involved in getting from "Here's the bug" to "Here's the patch" in ways that really wouldn't have worked in a more long-form medium.

For the past decade or so, as I moved away from just doing kernel development and focused more on security work instead, Twitter's filled a similar role for me. I've seen people just dumping their thought process as they work through a problem, helping me come up with effective models for solving similar problems. I've learned that the smartest people in the field will spend hours (if not days) working on an issue before realising that they misread something back at the beginning and that's helped me feel like I'm not unusually bad at any of this. It's helped me learn more about my peers, about my field, and about myself.

Twitter's now under new ownership that appears to think all the worst bits of Twitter were actually the good bits, so I've mostly bailed to the Fediverse instead. There's no intrinsic length limit on posts there - Mastodon defaults to 500 characters per post, but that's configurable per instance. But even at 500 characters, it means there's more room to provide thoughtful context than there is on Twitter, and what I've seen so far is more detailed conversation and higher levels of meaningful engagement. Which is great! Except it also seems to discourage some of the posting style that I found so valuable on Twitter - if your timeline is full of nuanced discourse, it feels kind of rude to just scream "THIS FUCKING PIECE OF SHIT IGNORES THE HIGH ADDRESS BIT ON EVERY OTHER WRITE" even though that's exactly the sort of content I'm there for.

And, yeah, not everything has to be for me. But I worry that as Twitter's relevance fades for the people I'm most interested in, we're replacing it with something that's not equivalent - something that doesn't encourage just dropping 50 characters or so of your current thought process into a space where it can be seen by thousands of people. And I think that's a shame.

comment count unavailable comments

January 15, 2023 10:40 PM

January 10, 2023

Matthew Garrett: Integrating Linux with Okta Device Trust

I've written about bearer tokens and how much pain they cause me before, but sadly wishing for a better world doesn't make it happen so I'm making do with what's available. Okta has a feature called Device Trust which allows to you configure access control policies that prevent people obtaining tokens unless they're using a trusted device. This doesn't actually bind the tokens to the hardware in any way, so if a device is compromised or if a user is untrustworthy this doesn't prevent the token ending up on an unmonitored system with no security policies. But it's an incremental improvement, other than the fact that for desktop it's only supported on Windows and MacOS, which really doesn't line up well with my interests.

Obviously there's nothing fundamentally magic about these platforms, so it seemed fairly likely that it would be possible to make this work elsewhere. I spent a while staring at the implementation using Charles Proxy and the Chrome developer tools network tab and had worked out a lot, and then Okta published a paper describing a lot of what I'd just laboriously figured out. But it did also help clear up some points of confusion and clarified some design choices. I'm not going to give a full description of the details (with luck there'll be code shared for that before too long), but here's an outline of how all of this works. Also, to be clear, I'm only going to talk about the desktop support here - mobile is a bunch of related but distinct things that I haven't looked at in detail yet.

Okta's Device Trust (as officially supported) relies on Okta Verify, a local agent. When initially installed, Verify authenticates as the user, obtains a token with a scope that allows it to manage devices, and then registers the user's computer as an additional MFA factor. This involves it generating a JWT that embeds a number of custom claims about the device and its state, including things like the serial number. This JWT is signed with a locally generated (and hardware-backed, using a TPM or Secure Enclave) key, which allows Okta to determine that any future updates from a device claiming the same identity are genuinely from the same device (you could construct an update with a spoofed serial number, but you can't copy the key out of a TPM so you can't sign it appropriately). This is sufficient to get a device registered with Okta, at which point it can be used with Fastpass, Okta's hardware-backed MFA mechanism.

As outlined in the aforementioned deep dive paper, Fastpass is implemented via multiple mechanisms. I'm going to focus on the loopback one, since it's the one that has the strongest security properties. In this mode, Verify listens on one of a list of 10 or so ports on localhost. When you hit the Okta signin widget, choosing Fastpass triggers the widget into hitting each of these ports in turn until it finds one that speaks Fastpass and then submits a challenge to it (along with the URL that's making the request). Verify then constructs a response that includes the challenge and signs it with the hardware-backed key, along with information about whether this was done automatically or whether it included forcing the user to prove their presence. Verify then submits this back to Okta, and if that checks out Okta completes the authentication.

Doing this via loopback from the browser has a bunch of nice properties, primarily around the browser providing information about which site triggered the request. This means the Verify agent can make a decision about whether to submit something there (ie, if a fake login widget requests your creds, the agent will ignore it), and also allows the issued token to be cross-checked against the site that requested it (eg, if g1thub.com requests a token that's valid for github.com, that's a red flag). It's not quite at the same level as a hardware WebAuthn token, but it has many of the anti-phishing properties.

But none of this actually validates the device identity! The entire registration process is up to the client, and clients are in a position to lie. Someone could simply reimplement Verify to lie about, say, a device serial number when registering, and there'd be no proof to the contrary. Thankfully there's another level to this to provide stronger assurances. Okta allows you to provide a CA root[1]. When Okta issues a Fastpass challenge to a device the challenge includes a list of the trusted CAs. If a client has a certificate that chains back to that, it can embed an additional JWT in the auth JWT, this one containing the certificate and signed with the certificate's private key. This binds the CA-issued identity to the Fastpass validation, and causes the device to start appearing as "Managed" in the Okta device management UI. At that point you can configure policy to restrict various apps to managed devices, ensuring that users are only able to get tokens if they're using a device you've previously issued a certificate to.

I've managed to get Linux tooling working with this, though there's still a few drawbacks. The main issue is that the API only allows you to register devices that declare themselves as Windows or MacOS, followed by the login system sniffing browser user agent and only offering Fastpass if you're on one of the officially supported platforms. This can be worked around with an extension that spoofs user agent specifically on the login page, but that's still going to result in devices being logged as a non-Linux OS which makes interpreting the logs more difficult. There's also no ability to choose which bits of device state you log: there's a couple of existing integrations, and otherwise a fixed set of parameters that are reported. It'd be lovely to be able to log arbitrary material and make policy decisions based on that.

This also doesn't help with ChromeOS. There's no real way to automatically launch something that's bound to localhost (you could probably make this work using Crostini but there's no way to launch a Crostini app at login), and access to hardware-backed keys is kind of a complicated topic in ChromeOS for privacy reasons. I haven't tried this yet, but I think using an enterprise force-installed extension and the chrome.enterprise.platformKeys API to obtain a device identity cert and then intercepting requests to the appropriate port range on localhost ought to be enough to do that? But I've literally never written any Javascript so I don't know. Okta supports falling back from the loopback protocol to calling a custom URI scheme, but once you allow that you're also losing a bunch of the phishing protection, so I'd prefer not to take that approach.

Like I said, none of this prevents exfiltration of bearer tokens once they've been issued, and there's still a lot of ecosystem work to do there. But ensuring that tokens can't be issued to unmanaged machines in the first place is still a step forwards, and with luck we'll be able to make use of this on Linux systems without relying on proprietary client-side tooling.

(Time taken to code this implementation: about two days, and under 1000 lines of new code. Time taken to figure out what the fuck to write: rather a lot longer)

[1] There's also support for having Okta issue certificates, but then you're kind of back to the "How do I know this is my device" situation

comment count unavailable comments

January 10, 2023 05:48 AM

January 08, 2023

James Bottomley: Using SIP to Replace Mobile and Land Lines

If you read more than a few articles in my blog you’ve probably figured out that I’m pretty much a public cloud Luddite: I run my own cloud (including my own email server) and don’t really have much of my data in any public cloud. I still have public cloud logins: everyone wants to share documents with Google nowadays, but Google regards people who don’t use its services “properly” with extreme prejudice and I get my account flagged with a security alert quite often when I try to log in.

However, this isn’t about my public cloud phobia, it’s about the evolution of a single one of my services: a cloud based PBX. It will probably come as no surprise that the PBX I run is Asterisk on Linux but it may be a surprise that I’ve been running it since the early days (since 1999 to be exact). This is the story of why.

I should also add that the motivation for this article is that I’m unable to get a discord account: discord apparently has a verification system that requires a phone number and explicitly excludes any VOIP system, which is all I have nowadays. This got me to thinking that my choices must be pretty unusual if they’re so pejoratively excluded by a company whose mission is to “Create Space for Everyone to find Belonging”. I’m sure the suspicion that this is because Discord the company also offers VoIP services and doesn’t like the competition is unworthy.

Early Days

I’ve pretty much worked remotely in the US all my career. In the 90s this meant having three phone lines (These were actually physical lines into the house): one for the family, one for work and one for the modem. When DSL finally became a thing and we were running a business, the modem was replaced by a fax machine. The minor annoyance was knowing which line was occupied but if line 1 is the house and line 2 the office, it’s not hard. The big change was unbundling. This meant initially the call costs to the UK through the line provider skyrocketed and US out of state rates followed. The way around this was to use unbundled providers via dial-around (a prefix number), but finding the cheapest was hard and the rates changed almost monthly. I needed a system that could add the current dial-around prefix for the relevant provider automatically. The solution: asterisk running on a server in the basement with two digium FX cards for the POTS lines (fax facility now being handled by asterisk) and Aastra 9113i SIP phones wired over the house ethernet with PoE injectors. Some fun jiggery pokery with asterisk busy lamp feature allowed the lights on the SIP phone to indicate busy lines and extensions.conf could be programmed to keep the correct dial-around prefix. For a bonus, asterisk can be programmed to do call screening, so now if the phone system doesn’t recognize your number you get told we don’t accept solicitation calls and to hang up now, otherwise press 0 to ring the house phone … and we’ve had peaceful dinner times ever after. It was also somewhat useful to have each phone in the house on its own PBX extension so people could call from the living room to my office without having to yell.

Enter SIP Trunking

While dial-arounds worked successfully for a few years, they always ended with problems (usually signalled by a massive phone bill) and a new dial-around was needed. However by 2007 several companies were offering SIP trunking over the internet. The one I chose (Localphone, a UK based company) was actually a successful ring back provider before moving into SIP. They offered a pay as you go service with phone termination in whatever country you were calling. The UK and US rates were really good, so suddenly the phone bills went down and as a bonus they gave me a free UK incoming number (called a DID – Direct Inward Dialing) which family and friends in the UK could call us on at local UK rates. Pretty much every call apart from local ones was now being routed over the internet, although most incoming calls, apart for those from the UK, were still over the POTS lines.

The beginning of Mobile (For Me)

I was never really a big consumer of mobile phones, but that all changed in 2009 when google presented all kernel developers with a Nexus One. Of course, they didn’t give us SIM cards to go with it, so my initial experiments were all over wifi. I soon had CyanogenMod installed and found a SIP client called Sipdroid. This allowed me to install my Nexus One as a SIP extension on the house network. SIP calls over 2G data were not very usable (the bandwidth was too low), but implementing multiple codecs and speex support got it to at least work (and is actually made me an android developer … scratching my own itch again). The bandwidth problems on 2G evaporated on 3G and SIP became really usable (although I didn’t have a mobile “plan”, I did use pay-as-you-go SIMs while travelling). It already struck me that all you really needed the mobile network for was data and then all calls could simply travel to a SIP provider. When LTE came along it seemed to be confirming this view because IP became the main communication layer.

I suppose I should add that I used the Nexus One long beyond its design life, even updating its protocol stack so it kept working. I did this partly because it annoyed certain people to see me with an old phone (I have a set of friends who were very amused by this and kept me supplied with a stock of Nexus One phones in case my old one broke) but mostly because of inertia and liking small phones.

SIP Becomes My Only Phone Service

In 2012, thanks to a work assignment, we relocated from the US to London. Since these moves take a while, I relocated the in-house PBX machine to a dedicated server in Los Angeles (my nascent private cloud), ditched the POTS connections and used the UK incoming number as our primary line that could be delivered to us while we were in temporary accommodation as well as when we were in our final residence in London. This did have the somewhat inefficient result that when you called from the downstairs living room to the upstairs office, the call was routed over an 8,000 mile round trip from London to Los Angeles and back, but thanks to internet latency improvements, you couldn’t really tell. The other problem was that the area code I’d chosen back in 2007 was in Whitby, some 200 Miles north of London but fortunately this didn’t seem to be much of an issue except for London Pizza delivery places who steadfastly refused to believe we lived locally.

When the time came in 2013 to move back to Seattle in the USA, the adjustment was simply made by purchasing a 206 area code DID and plugging it into the asterisk system and continued using a fully VoIP system based in Los Angeles. Although I got my incoming UK number for free, being an early service consumer, renting DIDs now costs around $1 per month depending on your provider.

SIP and the Home Office

I’ve worked remotely all my career (even when in London). However, I’ve usually worked for a company with a physical office setup and that means a phone system. Most corporate PBX’s use SIP under the covers or offer a SIP connector. So, by dint of finding the PBX administrator I’ve usually managed to get a SIP extension that will simply plug into my asterisk PBX. Using correct dial plan routing (and a prefix for outbound calling), the office number usually routes to my mobile and desk phone, meaning I can make and receive calls from my office number wherever in the world I happen to be. For those who want to try this at home, the trick is to find the phone system administrator; if you just ask the IT department, chances are you’ll simply get a blanket “no” because they don’t understand it might be easy to do and definitely don’t want to find out.

Evolution to Fully SIP (Data Only) Mobile

Although I said above that I maintained a range of in-country Mobile SIMs, this became less true as the difficulty in running in-country SIMs increased (most started to insist you add cash or use them fairly regularly). When COVID hit in 2020, and I had no ability to travel, my list of in-country SIMs was reduced to one from 3 UK largely because they allowed you to keep your number provided you maintained a balance (and they had a nice internet roaming agreement which meant you paid UK data rates in a nice range of countries). The big problem giving up a mobile number was no text messaging when travelling (SMS). For years I’ve been running a xmpp server, but the subset of my friends who have xmpp accounts has always been under 10% so it wasn’t very practical (actually, this is somewhat untrue because I wrote an xmpp to google chat bridge but the interface became very impedance mismatched as Google moved to rich media).

The major events that got me to move away from xmpp and the Nexus One were the shutdown of the 3G network in the US and the viability of the Matrix federated chat service (the Matrix android client relied on too many modern APIs ever to be backported to the version of android that ran on the Nexus One). Of the available LTE phones, I chose the Pixel-3 as the smallest and most open one with the best price/performance (and rapidly became acquainted with the fact that only some of them can actually be rooted) and LineageOS 17.1 (Android 10). The integration of SIP with the Dialer is great (I can now use SIP on the car’s bluetooth, yay!) but I rapidly ran into severe bugs in the Google SIP implementation (which hasn’t been updated for years). I managed to find and fix all the bugs (or at least those that affected me most, repositories here; all beginning with android_ and having the jejb-10 branch) but that does now mean I’m stuck on Android 10 since Google ripped SIP out in Android 12.

For messaging I adopted matrix (Apart from the Plumbers Matrix problem, I haven’t really written about it since matrix on debian testing just works out of the box) and set up bridges to Signal, Google Chat, Slack and WhatsApp (The WhatsApp one requires you be running WhatsApp on your phone, but I run mine on an Android VM in my cloud) all using the 3 UK Sim number where they require a mobile number confirmation. The final thing I did was to get a universal roaming data SIM and put it in my phone, meaning I now rely on matrix for messaging and SIP for voice when I travel because the data SIM has no working mobile number at all (either for voice or SMS). In many ways, this is no hardship: I never really had a permanent SMS number when travelling because of the use of in-country SIMs, so no-one has a number for me they rely on for SMS.

Conclusion and Problems

Although I implied above I can’t receive SMS, that’s not quite true: one of my VOIP numbers does accept SMS inbound and is able to send outbound, the problem is that it doesn’t come over the SIP MESSAGE protocol, but instead goes to a web page in the provider backend, making it inconvenient to use and meaning I have to know the message is coming (although I do use it for things like Delta Boarding passes, which only send the location of the web page to receive pkpasses over SMS). However, this isn’t usually a problem because most people I know have moved on from SMS to rich messaging over one of the protocols I have (and if one came along with a new protocol, well I can install a bridge for that).

In terms of SIP over an IP substrate giving rise to unbundled services, I could claim to be half right, since most of the modern phone like services have a SIP signalling core. However, the unbundling never really came about: the silo provider just moved from landline to mobile (or a mobile resale service like Google Fi). Indeed, today, if you give anyone your US phone number they invariably assume it is a mobile (and then wonder why you don’t reply to their SMS messages). This mobile assumption problem can be worked around by emphasizing “it’s a landline” every time you give out your VOIP number, but people don’t always retain the information.

So what about the future? I definitely still like the way my phone system works … having a single number for the house which any household member can answer from anywhere and side numbers for travelling really suits me, and I have the technical skills to maintain it indefinitely (provided the SIP trunking providers sill exist), but I can see the day coming where the Discord intolerance of non-siloed numbers is going to spread and most silos will require non-VOIP phone numbers with the same prejudice, thus locking people who don’t comply out in much the same way as it’s happening with email now; however, hopefully that day for VoIP is somewhat further off.

January 08, 2023 07:25 PM

January 07, 2023

Matthew Garrett: Changing firmware config that doesn't want to be changed

Update: There's actually a more detailed writeup of this here that I somehow missed. Original entry follows:

Today I had to deal with a system that had an irritating restriction - a firmware configuration option I really wanted to be able to change appeared as a greyed out entry in the configuration menu. Some emails revealed that this was a deliberate choice on the part of the system vendor, so that seemed to be that. Thankfully in this case there was a way around that.

One of the things UEFI introduced was a mechanism to generically describe firmware configuration options, called Visual Forms Representation (or VFR). At the most straightforward level, this lets you define a set of forms containing questions, with each question associated with a value in a variable. Questions can be made dependent upon the answers to other questions, so you can have options that appear or disappear based on how other questions were answered. An example in this language might be something like:
CheckBox Prompt: "Console Redirection", Help: "Console Redirection Enable or Disable.", QuestionFlags: 0x10, QuestionId: 53, VarStoreId: 1, VarStoreOffset: 0x39, Flags: 0x0
In which question 53 asks whether console redirection should be enabled or disabled. Other questions can then rely on the answer to question 53 to influence whether or not they're relevant (eg, if console redirection is disabled, there's no point in asking which port it should be redirected to). As a checkbox, if it's set then the value will be set to 1, and 0 otherwise. But where's that stored? Earlier we have another declaration:
VarStore GUID: EC87D643-EBA4-4BB5-A1E5-3F3E36B20DA9, VarStoreId: 1, Size: 0xF4, Name: "Setup"
A UEFI variable called "Setup" and with GUID EC87D643-EBA4-4BB5-A1E5-3F3E36B20DA9 is declared as VarStoreId 1 (matching the declaration in the question) and is 0xf4 bytes long. The question indicates that the offset for that variable is 0x39. Rewriting Setup-EC87D643-EBA4-4BB5-A1E5-3F3E36B20DA9 with a modified value in offset 0x39 will allow direct manipulation of the config option.

But how do we get this data in the first place? VFR isn't built into the firmware directly - instead it's turned into something called Intermediate Forms Representation, or IFR. UEFI firmware images are typically in a standardised format, and you can use UEFITool to extract individual components from that firmware. If you use UEFITool to search for "Setup" there's a good chance you'll be able to find the component that implements the setup UI. Running IFRExtractor-RS against it will then pull out any IFR data it finds, and decompile that into something resembling the original VFR. And now you have the list of variables and offsets and the configuration associated with them, even if your firmware has chosen to hide those options from you.

Given that a bunch of these config values may be security relevant, this seems a little concerning - what stops an attacker who has access to the OS from simply modifying these variables directly? UEFI avoids this by having two separate stages of boot, one where the full firmware ("Boot Services") is available, and one where only a subset ("Runtime Services") is available. The transition is triggered by the OS calling ExitBootServices, indicating the handoff from the firmware owning the hardware to the OS owning the hardware. This is also considered a security boundary - before ExitBootServices everything running has been subject to any secure boot restrictions, and afterwards applications can do whatever they want. UEFI variables can be flagged as being visible in both Boot and Runtime Services, or can be flagged as Boot Services only. As long as all the security critical variables are Boot Services only, an attacker should never be able to run untrusted code that could alter them.

In my case, the firmware option I wanted to alter had been enclosed in "GrayOutIf True" blocks. But the questions were still defined and the code that acted on those options was still present, so simply modifying the variables while still inside Boot Services gave me what I wanted. Note that this isn't a given! The presence of configuration options in the IFR data doesn't mean that anything will later actually read and make use of that variable - a vendor may have flagged options as unavailable and then removed the code, but never actually removed the config data. And also please do note that the reason stuff was removed may have been that it doesn't actually work, and altering any of these variables risks bricking your hardware in a way that's extremely difficult to recover. And there's also no requirement that vendors use IFR to describe their configuration, so you may not get any help here anyway.

In summary: if you do this you may break your computer. If you don't break your computer, it might not work anyway. I'm not going to help you try to break your computer. And I didn't come up with any of this, I just didn't find it all written down in one place while I was researching it.

comment count unavailable comments

January 07, 2023 09:53 AM

Matthew Garrett: Asking ChatGPT to write my security-sensitive code for me

I have a patchset that makes it safe to use hibernation under Linux even in a secure boot world but it relies on preventing userland from using one of the TPM PCRs[1] that would otherwise be available to it. TPM 2 devices support emulating PCRs in NVRAM, which would obviously be beneficial in not taking away a scare resource, but the docs suggest there's no way to tie a key generation to an NVRAM value which makes this more complicated.

So, why not ask a computer to do this for me? I asked ChatGPT to write me some code that would bind a TPM key to an NVRAM value. After a couple of false starts it told me:To bind an encryption key to the value of a TPM NV index, you can use the TPM2_NV_DefineSpace command to define an NV index with the TPMA_NV_BIND attribute. Woo! That sounds perfect. Only one problem - I can't actually find the TPMA_NV_BIND attribute defined anywhere in the spec. Well, ask the AI again.

Here is the definition of TPMA_NV_BIND from the TPM 2.0 specification:
TPMA_NV_BIND (BIT 17) If SET, the contents of the NV Index are bound to the platform hierarchy.


Ok, that sounds convincing. Where is it in the spec?

The definition of TPMA_NV_BIND is on page 126 of the TPM 2.0 specification, in section 9.7.1 "TPMA_NV".

Huh. Nope. Page 126 says nothing about this. Section 9.7.1 says nothing about this. The TPMA_NV bits are actually described on page 168 of volume 2 of the spec, in section 13.4. Bit 17 is TPMA_NV_OWNERREAD. TPMA_NV_BIND does not appear in the spec anywhere. I told the AI that.

I apologize for the error in my previous response. You are correct that the TPMA_NV_BIND attribute does not appear in the TPM 2.0 specification.

To bind an encryption key to the value of a TPM NV index, you can use the TPM2_NV_DefineSpace command with the TPMA_NV_BIND attribute.


I'm not afraid of being unemployed in the near future.

[1] "Platform Configuration Registers" - a modern TPM has 24 of these. When something security sensitive happens, you generate a hash of that event and pass it to the TPM. The TPM appends that to an existing PCR value and then hashes that concatenated value and sets the PCR to that. This means the PCR value depends not only on the values provided, but also the order they're provided in. Various TPM operations can be made conditional on the PCR values meeting specific criteria.

comment count unavailable comments

January 07, 2023 09:20 AM

January 06, 2023

Linux Plumbers Conference: LPC 2022 Attendee Survey Summary

We had 206 responses to the Linux Plumbers survey in 2022, which, given the total number of in person conference participants of 401, and virtual participants of 320, has provided high confidence in the feedback.   Overall there were about 89% of those registered, either showed up as in person or virtual.   As this was the first time we’ve tried to do this type of hybrid event, the feedback has been essential as we start planning for something similar in 2023.  One piece of input, we’ll definitely be incorporating for next year is to have separate surveys for in person and virtual attendees!  So a heartfelt “thank you” to everyone who participated in this survey and waded through the non relevant questions to share their experience!   

Overall: 91.8% of respondents were positive about the event, with 6.3% as neutral and 1.9% were dissatisfied. 80.1% indicated that the discussions they participated in helped resolve problems.  The BOF track was popular and we’re looking to include it again in 2023.   Due to the fact we were having our first in person since the pandemic started, we did this event as a hybrid event with reduced in person registration compared to prior years, as we were unsure how many would be willing to travel and our venue’s capacity.   The conference sold out of regular tickets very quickly after opening up registration though, so we set up a waiting list.  With some the travel conditions and cancelations, we were able to work through the daunting waiting list, and offer spots to all of those on the list by the conference date.  Venue capacity is something we’re looking closely at for next year and will outline the plan when the CFP opens early this year.

Based on feedback from prior years, we videotaped all of the sessions, and the videos are now posted. There are 195 videos from the conference! The committee has also linked them to the detailed schedule and clicking on the video link in the presentation materials section of any given talk or discussion. 72% of respondents plan to watch them to clarify points and another 10% are planning to watch them to catch up on sessions that they were not able to attend. 

Venue: In general, 45.6% of respondents considered the venue size to be a good match, but a significant portion would have preferred it to be bigger (47%) as well. The room size was considered effective for participation by 78.6% of the respondents.

Content: In terms of track feedback, Linux Plumbers Refereed track and Kernel Summit track were indicated as very relevant by almost all respondents who attended. The BOFs track was positively received and will continue.   The hallway track continues to be regarded as most relevant, and appreciated. We will continue to evaluate options for making private meeting and hack rooms available for groups who need to meet onsite.

Communication:  The emails from the committee continue to be positively received.  We were able to incorporate some of the suggestions from prior surveys, and are continuing to look for options to make the hybrid event communications between in person and virtual attendees work better.  

Events: Our evening events are feeling the pressure from the number of attendees especially with the other factors from the pandemic.   The first night event had more issues than the closing event and we appreciate the constructive suggestions in the write-in comments.  The survey was still positive about the events overall,   so we’ll see what we can do make this part of the “hallway track” more effective for everyone next year.

There were lots of great suggestions to the “what one thing would you like to see changed” question, and the program committee has met to discuss them. Once a venue is secured, we’ll be reviewing them again to see what is possible to implement this coming year.

Thank you again to the participants for their input and help on improving the Linux Plumbers Conference.   The conference is planned to be in North America in the October/November timeframe for 2023.  As soon as we secure a venue, dates and location information will be posted in a blog by the committee chair,  Christian Brauner.

January 06, 2023 05:15 PM

December 29, 2022

Dave Airlie (blogspot): vulkan video encoding: radv update

After the video decode stuff was fairly nailed down, Lynne from ffmpeg nerdsniped^Wtalked me into looking at h264 encoding.

The AMD VCN encoder engine is a very different interface to the decode engine and required a lot of code porting from the radeon vaapi driver. Prior to Xmas I burned a few days on typing that all in, and yesterday I finished typing and moved to debugging the pile of trash I'd just typed in.

Lynne meanwhile had written the initial ffmpeg side implementation, and today we threw them at each other, and polished off a lot of sharp edges. We were rewarded with valid encoded frames.

The code at this point is only doing I-frame encoding, we will work on P/B frames when we get a chance.

There are also a bunch of hacks and workarounds for API/hw mismatches, that I need to consult with Vulkan spec and AMD teams about, but we have a good starting point to move forward from. I'll also be offline for a few days on holidays so I'm not sure it will get much further until mid January.

My branch is [1]. Lynne ffmpeg branch is [2].

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-enc-wip

[2] https://github.com/cyanreg/FFmpeg/tree/vulkan_decode

December 29, 2022 07:22 AM

December 20, 2022

Dave Airlie (blogspot): vulkan video decoding: radv status

I've been working the past couple of weeks with an ffmpeg developer (Lynne) doing Vulkan video decode bringup on radv.

The current status of this work is in a branch[1]. This work is all against the current EXT decode beta extensions in the spec.

Khronos has released the final specs for these extensions. This work is rebased onto the final KHR form and is in a merge request for radv[2].

This contains an initial implementation of H264 and H265 decoding for AMD GPUs from TONGA to NAVI2x. It passes the basic conformance tests but fails some of the more complicated ones, but it has decoded the streams we've been throwing at it using ffmpeg.

Building:

git clone https://gitlab.freedesktop.org/airlied/mesa

git checkout radv-vulkan-video-prelim-decode

mkdir build

meson build -Dvulkan-beta=true -Dvulkan-drivers=amd -Dvideo-codecs=h264dec,h265dec --prefix=<prefix>

cd build

ninja

ninja install

Running:

export VK_ICD_FILENAMES=<prefix>/share/vulkan/icd.d/radeon_icd.x86_64.json
 

For prelim branch[1]:

export RADV_VIDEO_DECODE=1

For merge branch[2]:

export RADV_PERFTEST=video_decode

vulkaninfo

This should show support for VK_KHR_video_queue, VK_KHR_video_decode_queue, VK_EXT_video_decode_h264 and VK_EXT_video_decode_h265.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-prelim-decode

[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20388

December 20, 2022 12:04 AM

December 18, 2022

Matthew Garrett: Off Twitter for a bit

Turns out that linking to several days old public data in order to demonstrate that Elon's jet was broadcasting its tail number in the clear is apparently "posting private information" so for anyone looking for me there I'm actually here

comment count unavailable comments

December 18, 2022 03:25 AM

December 16, 2022

Dave Airlie (blogspot): vulkan video decoding: anv status

After cleaning up the radv stuff I decided to go back and dig into the anv support for H264.

The current status of this work is in a branch[1]. This work is all against the current EXT decode beta extensions in the spec.

This contains an initial implementation of H264 Intel GPUs that anv supports. I've only tested it on Kabylake equivalents so far. It decodes some of the basic streams I've thrown at it from ffmpeg. Now this isn't as far along as the AMD implementation, but I'm also not sure I'm programming the hardware correctly. The Windows DXVA API has 2 ways to decode H264, short and long. I believe but I'm not 100% sure the current Vulkan API is quite close to "short", but the only Intel implementations I've found source for are for "long". I've bridged this gap by writing a slice header parser in mesa, but I think the hw might be capable of taking over that task, and I could in theory dump a bunch of code. But the programming guides for the hw block are a bit vague on some of the details around how "long" works. Maybe at some point someone in Intel can tell me :-)

Building:

git clone https://gitlab.freedesktop.org/airlied/mesa

git checkout anv-vulkan-video-prelim-decode

mkdir build

meson build -Dvulkan-beta=true -Dvulkan-drivers=intel -Dvideo-codecs=h264dec --prefix=<prefix>

cd build

ninja

ninja install

Running:

export VK_ICD_FILENAMES=<prefix>/share/vulkan/icd.d/intel_icd.x86_64.json

export ANV_VIDEO_DECODE=1

vulkaninfo

This should show support for VK_KHR_video_queue, VK_KHR_video_decode_queue, VK_EXT_video_decode_h264.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode

December 16, 2022 10:37 PM

December 15, 2022

Paul E. Mc Kenney: Stupid RCU Tricks: So You Want To Add Kernel-Boot Parameters Behind rcutorture's Back?

A previous post in this series showed how you can use the --bootargs parameter and .boot files to supply kernel boot parameters to the kernels under test.  This works, but it turns out that there is another way, which is often the case with the Linux kernel.  This other way is Masami Hiramatsu's bootconfig facility, which is nicely documented in detail here.  This blog post is a how-to guide on making use of bootconfig when running rcutorture.

The bootconfig facility allows kernel boot parameters to be built into initrd or directly into the kernel itself, this last being the method used here.  This requires that the kernel build system be informed of the parameters.  Suppose that these parameters are placed in a file named /tmp/dump_tree.bootparam as follows:

kernel.rcutree.dump_tree=1
kernel.rcutree.blimit=15

Note well the "kernel." prefix, which is required here.  The other option is an "init." prefix, which would cause the parameter to instead be passed to the init process.

Then the following three Kconfig options inform the build system of this file:

CONFIG_BOOT_CONFIG=y
CONFIG_BOOT_CONFIG_EMBED=y
CONFIG_BOOT_CONFIG_EMBED_FILE="/tmp/dump_tree.bootparam"

The resulting kernel image will then contain the above pair of kernel boot parameters.  Except that you also have to tell the kernel to look for these parameters, which is done by passing in the "bootconfig" kernel boot parameter.  And no, it does not work to add a "kernel.bootconfig" line to the /tmp/dump_tree.bootparam file!  You can instead add it to a .boot file or to the kvm.sh command line like this: "--bootargs bootconfig".

For example, given this command:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 30s --configs TREE05 \
   --bootargs "bootconfig" --trust-make

The resulting console.log file would contain the following text, indicating that these boot parameters had in fact been processed correctly, as indicated by the "Boot-time adjustment of callback invocation limit to 15." and the last three lines that begin with "rcu_node tree layout dump".

-----------
Running RCU self tests
rcu: Preemptible hierarchical RCU implementation.
rcu:     CONFIG_RCU_FANOUT set to non-default value of 6.
rcu:     RCU lockdep checking is enabled.
rcu:     Build-time adjustment of leaf fanout to 6.
rcu:     Boot-time adjustment of callback invocation limit to 15.
rcu:     RCU debug GP pre-init slowdown 3 jiffies.
rcu:     RCU debug GP init slowdown 3 jiffies.
rcu:     RCU debug GP cleanup slowdown 3 jiffies.
Trampoline variant of Tasks RCU enabled.
rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
rcu: rcu_node tree layout dump
rcu:  0:7 ^0
rcu:  0:3 ^0  4:7 ^1
-----------

What happens if you use both CONFIG_BOOT_CONFIG_EMBED_FILE and the --bootargs parameter?  The kernel boot parameters passed to --bootargs will be processed first, followed by those in /tmp/dump_tree.bootparam.  Please note that the semantics of repeated kernel-boot parameters is subsystem-specific, so please also be careful.

The requirement that the "bootconfig" parameter be specified on the normal kernel command line can be an issue in environments where the this command line is not easily modified.  One way of avoiding such issues is to create a Kconfig option that causes the kernel to act as if the "bootconfig" parameter had been specified.  For example, the following -rcu commit does just this with a new CONFIG_BOOT_CONFIG_FORCE Kconfig option:

674b57ddd75e ("bootconfig: Allow forcing unconditional bootconfig processing")

It is important to note that although these embedded kernel-boot parameters show up at the beginning of the "/proc/cmdline" file, they may also be found in isolation in the "/proc/bootconfig" file, for example, like this:

$ cat /proc/bootconfig
kernel.rcutree.dump_tree = 1
kernel.rcutree.blimit = 15

Why do the embedded kernel-boot parameters show up at the beginning of "/proc/cmdline" instead of at the end, given that they are processed after the other parameters?  Because of the possibility of a "--" in the non-embedded traditionally sourced kernel boot parameters, which would make it appear that the embedded kernel-boot parameters were intended for the init process rather than for the kernel.

But what is the point of all this?

Within the context of rcutorture, probably not much.  But there are environments where external means of setting per-kernel-version kernel-boot parameters is inconvenient, and rcutorture is an easy way of testing the embedding of those parameters directly into the kernel image itself.

December 15, 2022 12:59 AM

December 13, 2022

Matthew Garrett: Trying to remove the need to trust cloud providers

First up: what I'm covering here is probably not relevant for most people. That's ok! Different situations have different threat models, and if what I'm talking about here doesn't feel like you have to worry about it, that's great! Your life is easier as a result. But I have worked in situations where we had to care about some of the scenarios I'm going to describe here, and the technologies I'm going to talk about here solve a bunch of these problems.

So. You run a typical VM in the cloud. Who has access to that VM? Well, firstly, anyone who has the ability to log into the host machine with administrative capabilities. With enough effort, perhaps also anyone who has physical access to the host machine. But the hypervisor also has the ability to inspect what's running inside a VM, so anyone with the ability to install a backdoor into the hypervisor could theoretically target you. And who's to say the cloud platform launched the correct image in the first place? The control plane could have introduced a backdoor into your image and run that instead. Or the javascript running in the web UI that you used to configure the instance could have selected a different image without telling you. Anyone with the ability to get a (cleverly obfuscated) backdoor introduced into quite a lot of code could achieve that. Obviously you'd hope that everyone working for a cloud provider is honest, and you'd also hope that their security policies are good and that all code is well reviewed before being committed. But when you have several thousand people working on various components of a cloud platform, there's always the potential for something to slip up.

Let's imagine a large enterprise with a whole bunch of laptops used by developers. If someone has the ability to push a new package to one of those laptops, they're in a good position to obtain credentials belonging to the user of that laptop. That means anyone with that ability effectively has the ability to obtain arbitrary other privileges - they just need to target someone with the privilege they want. You can largely mitigate this by ensuring that the group of people able to do this is as small as possible, and put technical barriers in place to prevent them from pushing new packages unilaterally.

Now imagine this in the cloud scenario. Anyone able to interfere with the control plane (either directly or by getting code accepted that alters its behaviour) is in a position to obtain credentials belonging to anyone running in that cloud. That's probably a much larger set of people than have the ability to push stuff to laptops, but they have much the same level of power. You'll obviously have a whole bunch of processes and policies and oversights to make it difficult for a compromised user to do such a thing, but if you're a high enough profile target it's a plausible scenario.

How can we avoid this? The easiest way is to take the people able to interfere with the control plane out of the loop. The hypervisor knows what it booted, and if there's a mechanism for the VM to pass that information to a user in a trusted way, you'll be able to detect the control plane handing over the wrong image. This can be achieved using trusted boot. The hypervisor-provided firmware performs a "measurement" (basically a cryptographic hash of some data) of what it's booting, storing that information in a virtualised TPM. This TPM can later provide a signed copy of the measurements on demand. A remote system can look at these measurements and determine whether the system is trustworthy - if a modified image had been provided, the measurements would be different. As long as the hypervisor is trustworthy, it doesn't matter whether or not the control plane is - you can detect whether you were given the correct OS image, and you can build your trust on top of that.

(Of course, this depends on you being able to verify the key used to sign those measurements. On real hardware the TPM has a certificate that chains back to the manufacturer and uniquely identifies the TPM. On cloud platforms you typically have to retrieve the public key via the metadata channel, which means you're trusting the control plane to give you information about the hypervisor in order to verify what the control plane gave to the hypervisor. This is suboptimal, even though realistically the number of moving parts in that part of the control plane is much smaller than the number involved in provisioning the instance in the first place, so an attacker managing to compromise both is less realistic. Still, AWS doesn't even give you that, which does make it all rather more complicated)

Ok, so we can (largely) decouple our trust in the VM from having to trust the control plane. But we're still relying on the hypervisor to provide those attestations. What if the hypervisor isn't trustworthy? This sounds somewhat ridiculous (if you can't run a trusted app on top of an untrusted OS, how can you run a trusted OS on top of an untrusted hypervisor?), but AMD actually have a solution for that. SEV ("Secure Encrypted Virtualisation") is a technology where (handwavily) an encryption key is generated when a new VM is created, and the memory belonging to that VM is encrypted with that key. The hypervisor has no access to that encryption key, and any access to memory initiated by the hypervisor will only see the encrypted content. This means that nobody with the ability to tamper with the hypervisor can see what's going on inside the OS (and also means that nobody with physical access can either, so that's another threat dealt with).

But how do we know that the hypervisor set this up, and how do we know that the correct image was booted? SEV has support for a "Launch attestation", a CPU generated signed statement that it booted the current VM with SEV enabled. But it goes further than that! The attestation includes a measurement of what was booted, which means we don't need to trust the hypervisor at all - the CPU itself will tell us what image we were given. Perfect.

Except, well. There's a few problems. AWS just doesn't have any VMs that implement SEV yet (there are bare metal instances that do, but obviously you're building your own infrastructure to make that work). Google only seem to provide the launch measurement via the logging service - and they only include the parsed out data, not the original measurement. So, we still have to trust (a subset of) the control plane. Azure provides it via a separate attestation service, but again it doesn't seem to provide the raw attestation and so you're still trusting the attestation service. For the newest generation of SEV, SEV-SNP, this is less of a big deal because the guest can provide its own attestation. But Google doesn't offer SEV-SNP hardware yet, and the driver you need for this only shipped in Linux 5.19 and Azure's SEV Ubuntu images only offer up to 5.15 at the moment, so making use of that means you're putting your own image together at the moment.

And there's one other kind of major problem. A normal VM image provides a bootloader and a kernel and a filesystem. That bootloader needs to run on something. That "something" is typically hypervisor-provided "firmware" - for instance, OVMF. This probably has some level of cloud vendor patching, and they probably don't ship the source for it. You're just having to trust that the firmware is trustworthy, and we're talking about trying to avoid placing trust in the cloud provider. Azure has a private beta allowing users to upload images that include their own firmware, meaning that all the code you trust (outside the CPU itself) can be provided by the user, and once that's GA it ought to be possible to boot Azure VMs without having to trust any Microsoft-provided code.

Well, mostly. As AMD admit, SEV isn't guaranteed to be resistant to certain microarchitectural attacks. This is still much more restrictive than the status quo where the hypervisor could just read arbitrary content out of the VM whenever it wanted to, but it's still not ideal. Which, to be fair, is where we are with CPUs in general.

(Thanks to Leonard Cohnen who gave me a bunch of excellent pointers on this stuff while I was digging through it yesterday)

comment count unavailable comments

December 13, 2022 09:19 PM

December 12, 2022

Matthew Garrett: Quick update on Pluton and Linux

I've been ridiculously burned out for a while now but I'm taking the month off to recover and that's giving me an opportunity to catch up on a lot of stuff. This has included me actually writing some code to work with the Pluton in my Thinkpad Z13. I've learned some more stuff in the process, but based on everything I know I'd still say that in its current form Pluton isn't a threat to free software.

So, first up: by default on the Z13, Pluton is disabled. It's not obviously exposed to the OS at all, which also means there's no obvious mechanism for Microsoft to push out a firmware update to it via Windows Update. The Windows drivers that bind to Pluton don't load in this configuration. It's theoretically possible that there's some hidden mechanism to re-enable it at runtime, but that code doesn't seem to be in Windows at the moment. I'm reasonably confident that "Disabled" is pretty genuinely disabled.

Second, when enabled, Pluton exposes two separate devices. The first of these has an MSFT0101 identifier in ACPI, which is the ID used for a TPM 2 device. The Pluton TPM implementation doesn't work out of the box with existing TPM 2 drivers, though, because it uses a custom start method. TPM 2 devices commonly use something called a "Command Response Buffer" architecture, where a command is written into a buffer, the TPM is told to do a thing, and the response to the command ends up in another buffer. The mechanism to tell the TPM to do a thing varies, and an ACPI table exposed to the OS defines which of those various things should be used for a given TPM. Pluton systems have a mechanism that isn't defined in the existing version of the spec (1.3 rev 8 at the time of writing), so I had to spend a while staring hard at the Windows drivers to figure out how to implement it. The good news is that I now have a patch that successfully gets the existing Linux TPM driver code work correctly with the Pluton implementation.

The second device has an MSFT0200 identifier, and is entirely not a TPM. The Windows driver appears to be a relatively thin layer that simply takes commands from userland and passes them on to the chip - I haven't found any userland applications that make use of this, so it's tough to figure out what functionality is actually available. But what does seem pretty clear from the code I've looked at is that it's a component that only responds when it's asked - if the OS never sends it any commands, it's not able to do anything.

One key point from this recently published Microsoft doc is that the whole "Microsoft can update Pluton firmware" thing does just seem to be the ability for the OS to push new code to the chip at runtime. That means Microsoft can't arbitrarily push new firmware to the chip - the OS needs to be involved. This is unsurprising, but it's nice to see some stronger confirmation of that.

Anyway. tl;dr - Pluton can (now) be used as a regular TPM. Pluton also exposes some additional functionality which is not yet clear, but there's no obvious mechanism for it to compromise user privacy or restrict what users can run on a Free operating system. The Pluton firmware update mechanism appears to be OS mediated, so users who control their OS can simply choose not to opt in to that.

comment count unavailable comments

December 12, 2022 12:12 PM

December 10, 2022

Matthew Garrett: On-device WebAuthn and what makes it hard to do well

WebAuthn improves login security a lot by making it significantly harder for a user's credentials to be misused - a WebAuthn token will only respond to a challenge if it's issued by the site a secret was issued to, and in general will only do so if the user provides proof of physical presence[1]. But giving people tokens is tedious and also I have a new laptop which only has USB-C but does have a working fingerprint reader and I hate the aesthetics of the Yubikey 5C Nano, so I've been thinking about what WebAuthn looks like done without extra hardware.

Let's talk about the broad set of problems first. For this to work you want to be able to generate a key in hardware (so it can't just be copied elsewhere if the machine is compromised), prove to a remote site that it's generated in hardware (so the remote site isn't confused about what security assertions you're making), and tie use of that key to the user being physically present (which may range from "I touched this object" to "I presented biometric evidence of identity"). What's important here is that a compromised OS shouldn't be able to just fake a response. For that to be possible, the chain between proof of physical presence to the secret needs to be outside the control of the OS.

For a physical security token like a Yubikey, this is pretty easy. The communication protocol involves the OS passing a challenge and the source of the challenge to the token. The token then waits for a physical touch, verifies that the source of the challenge corresponds to the secret it's being asked to respond to the challenge with, and provides a response. At the point where keys are being enrolled, the token can generate a signed attestation that it generated the key, and a remote site can then conclude that this key is legitimately sequestered away from the OS. This all takes place outside the control of the OS, meeting all the goals described above.

How about Macs? The easiest approach here is to make use of the secure enclave and TouchID. The secure enclave is a separate piece of hardware built into either a support chip (for x86-based Macs) or directly on the SoC (for ARM-based Macs). It's capable of generating keys and also capable of producing attestations that said key was generated on an Apple secure enclave ("Apple Anonymous Attestation", which has the interesting property of attesting that it was generated on Apple hardware, but not which Apple hardware, avoiding a lot of privacy concerns). These keys can have an associated policy that says they're only usable if the user provides a legitimate touch on the fingerprint sensor, which means it can not only assert physical presence of a user, it can assert physical presence of an authorised user. Communication between the fingerprint sensor and the secure enclave is a private channel that the OS can't meaningfully interfere with, which means even a compromised OS can't fake physical presence responses (eg, the OS can't record a legitimate fingerprint press and then send that to the secure enclave again in order to mimic the user being present - the secure enclave requires that each response from the fingerprint sensor be unique). This achieves our goals.

The PC space is more complicated. In the Mac case, communication between the biometric sensors (be that TouchID or FaceID) occurs in a controlled communication channel where all the hardware involved knows how to talk to the other hardware. In the PC case, the typical location where we'd store secrets is in the TPM, but TPMs conform to a standardised spec that has no understanding of this sort of communication, and biometric components on PCs have no way to communicate with the TPM other than via the OS. We can generate keys in the TPM, and the TPM can attest to those keys being TPM-generated, which means an attacker can't exfiltrate those secrets and mimic the user's token on another machine. But in the absence of any explicit binding between the TPM and the physical presence indicator, the association needs to be up to code running on the CPU. If that's in the OS, an attacker who compromises the OS can simply ask the TPM to respond to an challenge it wants, skipping the biometric validation entirely.

Windows solves this problem in an interesting way. The Windows Hello Enhanced Signin doesn't add new hardware, but relies on the use of virtualisation. The agent that handles WebAuthn responses isn't running in the OS, it's running in another VM that's entirely isolated from the OS. Hardware that supports this model has a mechanism for proving its identity to the local code (eg, fingerprint readers that support this can sign their responses with a key that has a certificate that chains back to Microsoft). Additionally, the secrets that are associated with the TPM can be held in this VM rather than in the OS, meaning that the OS can't use them directly. This means we have a flow where a browser asks for a WebAuthn response, that's passed to the VM, the VM asks the biometric device for proof of user presence (including some sort of random value to prevent the OS just replaying that), receives it, and then asks the TPM to generate a response to the challenge. Compromising the OS doesn't give you the ability to forge the responses between the biometric device and the VM, and doesn't give you access to the secrets in the TPM, so again we meet all our goals.

On Linux (and other free OSes), things are less good. Projects like tpm-fido generate keys on the TPM, but there's no secure channel between that code and whatever's providing proof of physical presence. An attacker who compromises the OS may not be able to copy the keys to their own system, but while they're on the compromised system they can respond to as many challenges as they like. That's not the same security assertion we have in the other cases.

Overall, Apple's approach is the simplest - having binding between the various hardware components involved means you can just ignore the OS entirely. Windows doesn't have the luxury of having as much control over what the hardware landscape looks like, so has to rely on virtualisation to provide a security barrier against a compromised OS. And in Linux land, we're fucked. Who do I have to pay to write a lightweight hypervisor that runs on commodity hardware and provides an environment where we can run this sort of code?


[1] As I discussed recently there are scenarios where these assertions are less strong, but even so

comment count unavailable comments

December 10, 2022 10:41 AM

December 09, 2022

Matthew Garrett: End-to-end encrypted messages need more than libsignal

(Disclaimer: I'm not a cryptographer, and I do not claim to be an expert in Signal. I've had this read over by a couple of people who are so with luck there's no egregious errors, but any mistakes here are mine)

There are indications that Twitter is working on end-to-end encrypted DMs, likely building on work that was done back in 2018. This made use of libsignal, the reference implementation of the protocol used by the Signal encrypted messaging app. There seems to be a fairly widespread perception that, since libsignal is widely deployed (it's also the basis for WhatsApp's e2e encryption) and open source and has been worked on by a whole bunch of cryptography experts, choosing to use libsignal means that 90% of the work has already been done. And in some ways this is true - the security of the protocol is probably just fine. But there's rather more to producing a secure and usable client than just sprinkling on some libsignal.

(Aside: To be clear, I have no reason to believe that the people who were working on this feature in 2018 were unaware of this. This thread kind of implies that the practical problems are why it didn't ship at the time. Given the reduction in Twitter's engineering headcount, and given the new leadership's espousal of political and social perspectives that don't line up terribly well with the bulk of the cryptography community, I have doubts that any implementation deployed in the near future will get all of these details right)

I was musing about this last night and someone pointed out some prior art. Bridgefy is a messaging app that uses Bluetooth as its transport layer, allowing messaging even in the absence of data services. The initial implementation involved a bunch of custom cryptography, enabling a range of attacks ranging from denial of service to extracting plaintext from encrypted messages. In response to criticism Bridgefy replaced their custom cryptographic protocol with libsignal, but that didn't fix everything. One issue is the potential for MITMing - keys are shared on first communication, but the client provided no mechanism to verify those keys, so a hostile actor could pretend to be a user, receive messages intended for that user, and then reencrypt them with the user's actual key. This isn't a weakness in libsignal, in the same way that the ability to add a custom certificate authority to a browser's trust store isn't a weakness in TLS. In Signal the app key distribution is all handled via Signal's servers, so if you're just using libsignal you need to implement the equivalent yourself.

The other issue was more subtle. libsignal has no awareness at all of the Bluetooth transport layer. Deciding where to send a message is up to the client, and these routing messages were spoofable. Any phone in the mesh could say "Send messages for Bob here", and other phones would do so. This should have been a denial of service at worst, since the messages for Bob would still be encrypted with Bob's key, so the attacker would be able to prevent Bob from receiving the messages but wouldn't be able to decrypt them. However, the code to decide where to send the message and the code to decide which key to encrypt the message with were separate, and the routing decision was made before the encryption key decision. An attacker could send a message saying "Route messages for Bob to me", and then another saying "Actually lol no I'm Mallory". If a message was sent between those two messages, the message intended for Bob would be delivered to Mallory's phone and encrypted with Mallory's key.

Again, this isn't a libsignal issue. libsignal encrypted the message using the key bundle it was told to encrypt it with, but the client code gave it a key bundle corresponding to the wrong user. A race condition in the client logic allowed messages intended for one person to be delivered to and readable by another.

This isn't the only case where client code has used libsignal poorly. The Bond Touch is a Bluetooth-connected bracelet that you wear. Tapping it or drawing gestures sends a signal to your phone, which culminates in a message being sent to someone else's phone which sends a signal to their bracelet, which then vibrates and glows in order to indicate a specific sentiment. The idea is that you can send brief indications of your feelings to someone you care about by simply tapping on your wrist, and they can know what you're thinking without having to interrupt whatever they're doing at the time. It's kind of sweet in a way that I'm not, but it also advertised "Private Spaces", a supposedly secure way to send chat messages and pictures, and that seemed more interesting. I grabbed the app and disassembled it, and found it was using libsignal. So I bought one and played with it, including dumping the traffic from the app. One important thing to realise is that libsignal is just the protocol library - it doesn't implement a server, and so you still need some way to get information between clients. And one of the bits of information you have to get between clients is the public key material.

Back when I played with this earlier this year, key distribution was implemented by uploading the public key to a database. The other end would download the public key, and everything works out fine. And this doesn't sound like a problem, given that the entire point of a public key is to be, well, public. Except that there was no access control on this database, and the filenames were simply phone numbers, so you could overwrite anyone's public key with one of your choosing. This didn't let you cause messages intended for them to be delivered to you, so exploiting this for anything other than a DoS would require another vulnerability somewhere, but there are contrived situations where this would potentially allow the privacy expectations to be broken.

Another issue with this app was its handling of one-time prekeys. When you send someone new a message via Signal, it's encrypted with a key derived from not only the recipient's identity key, but also from what's referred to as a "one-time prekey". Users generate a bunch of keypairs and upload the public half to the server. When you want to send a message to someone, you ask the server for one of their one-time prekeys and use that. Decrypting this message requires using the private half of the one-time prekey, and the recipient deletes it afterwards. This means that an attacker who intercepts a bunch of encrypted messages over the network and then later somehow obtains the long-term keys still won't be able to decrypt the messages, since they depended on keys that no longer exist. Since these one-time prekeys are only supposed to be used once (it's in the name!) there's a risk that they can all be consumed before they're replenished. The spec regarding pre-keys says that servers should consider rate-limiting this, but the protocol also supports falling back to just not using one-time prekeys if they're exhausted (you lose the forward secrecy benefits, but it's still end-to-end encrypted). This implementation not only implemented no rate-limiting, making it easy to exhaust the one-time prekeys, it then also failed to fall back to running without them. Another easy way to force DoS.

(And, remember, a successful DoS on an encrypted communications channel potentially results in the users falling back to an unencrypted communications channel instead. DoS may not break the encrypted protocol, but it may be sufficient to obtain plaintext anyway)

And finally, there's ClearSignal. I looked at this earlier this year - it's avoided many of these pitfalls by literally just being a modified version of the official Signal client and using the existing Signal servers (it's even interoperable with Actual Signal), but it's then got a bunch of other weirdness. The Signal database (I /think/ including the keys, but I haven't completely verified that) gets backed up to an AWS S3 bucket, identified using something derived from a key using KERI, and I've seen no external review of that whatsoever. So, who knows. It also has crash reporting enabled, and it's unclear how much internal state it sends on crashes, and it's also based on an extremely old version of Signal with the "You need to upgrade Signal" functionality disabled.

Three clients all using libsignal in one form or another, and three clients that do things wrong in ways that potentially have a privacy impact. Again, none of these issues are down to issues with libsignal, they're all in the code that surrounds it. And remember that Twitter probably has to worry about other issues as well! If I lose my phone I'm probably not going to worry too much about whether the messages sent through my weird bracelet app being gone forever, but losing all my Twitter DMs would be a significant change in behaviour from the status quo. But that's not an easy thing to do when you're not supposed to have access to any keys! Group chats? That's another significant problem to deal with. And making the messages readable through the web UI as well as on mobile means dealing with another set of key distribution issues. Get any of this wrong in one way and the user experience doesn't line up with expectations, get it wrong in another way and the worst case involves some of your users in countries with poor human rights records being executed.

Simply building something on top of libsignal doesn't mean it's secure. If you want meaningful functionality you need to build a lot of infrastructure around libsignal, and doing that well involves not just competent development and UX design, but also a strong understanding of security and cryptography. Given Twitter's lost most of their engineering and is led by someone who's alienated all the cryptographers I know, I wouldn't be optimistic.

comment count unavailable comments

December 09, 2022 06:22 AM

December 07, 2022

Matthew Garrett: Making an Orbic Speed RC400L autoboot when USB power is attached

As I mentioned a couple of weeks ago, I've been trying to hack an Orbic Speed RC400L mobile hotspot so it'll automatically boot when power is attached. When plugged in it would flash a "Welcome" screen and then switch to a display showing the battery charging - it wouldn't show up on USB, and didn't turn on any networking. So, my initial assumption was that the bootloader was making a policy decision not to boot Linux. After getting root (as described in the previous post), I was able to cat /proc/mtd and see that partition 7 was titled "aboot". Aboot is a commonly used Android bootloader, based on Little Kernel - LK provides the hardware interface, aboot is simply an app that runs on top of it. I was able to find the source code for Quectel's aboot, which is intended to run on the same SoC that's in this hotspot, so it was relatively easy to line up a bunch of the Ghidra decompilation with actual source (top tip: find interesting strings in your decompilation and paste them into github search, and see whether you get a repo back).

Unfortunately looking through this showed various cases where bootloader policy decisions were made, but all of them seemed to result in Linux booting. Patching them and flashing the patched loader back to the hotspot didn't change the behaviour. So now I was confused: it seemed like Linux was loading, but there wasn't an obvious point in the boot scripts where it then decided not to do stuff. No boot logs were retained between boots, which made things even more annoying. But then I realised that, well, I have root - I can just do my own logging. I hacked in an additional init script to dump dmesg to /var, powered it down, and then plugged in a USB cable. It booted to the charging screen. I hit the power button and it booted fully, appearing on USB. I adb shelled in, checked the logs, and saw that it had booted twice. So, we were definitely entering Linux before showing the charging screen. But what was the difference?

Diffing the dmesg showed that there was a major distinction on the kernel command line. The kernel command line is data populated by the bootloader and then passed to the kernel - it's how you provide arguments that alter kernel behaviour without having to recompile it, but it's also exposed to userland by the running kernel so it also serves as a way for the bootloader to pass information to the running userland. The boot that resulted in the charging screen had a androidboot.poweronreason=USB argument, the one that booted fully had androidboot.poweronreason=PWRKEY. Searching the filesystem for androidboot.poweronreason showed that the script that configures USB did not enable USB if poweronreason was USB, and the same string also showed up in a bunch of other applications. The bootloader was always booting Linux, but it was telling Linux why it had booted, and if that reason was "USB" then Linux was choosing not to enable USB and not starting the networking stack.

One approach would be to modify every application that parsed this data and make it work even if the power on reason was "USB". That would have been tedious. It seemed easier to modify the bootloader. Looking for that string in Ghidra showed that it was reading a register from the power management controller and then interpreting that to determine the reason it had booted. In effect, it was doing something like:

boot_reason = read_pmic_boot_reason();
switch(boot_reason) {
case 0x10:
  bootparam = strdup("androidboot.poweronreason=PWRKEY");
  break;
case 0x20:
  bootparam = strdup("androidboot.poweronreason=USB");
  break;
default:
  bootparam = strdup("androidboot.poweronreason=Hard_Reset");
}
Changing the 0x20 to 0xff meant that the USB case would never be detected, and it would fall through to the default. All the userland code was happy to accept "Hard_Reset" as a legitimate reason to boot, and now plugging in USB results in the modem booting to a functional state. Woo.

If you want to do this yourself, dump the aboot partition from your device, and search for the hex sequence "03 02 00 0a 20"". Change the final 0x20 to 0xff, copy it back to the device, and write it to mtdblock7. If it doesn't work, feel free to curse me, but I'm almost certainly going to be no use to you whatsoever. Also, please, do not just attempt to mechanically apply this to other hotspots. It's not going to end well.

comment count unavailable comments

December 07, 2022 06:33 AM

December 03, 2022

Linux Plumbers Conference: That’s a wrap! Thanks everyone for Linux Plumbers 2022

Thank you to everyone that attended Linux Plumbers 2022 both in person or virtually. After two years of being 100% virtual due to the pandemic, we were able to have a very successful hybrid conference, with 418 people registering in-person where 401 attended (96%), and 361 registered virtually and 320 who actually participated online (89%), not counting all those that used the free YouTube service. After two years of being 100% remote, we decided to keep this year’s in-person count lower than normal due to the unknowns caused by the pandemic. To compensate for the smaller venue, we tried something new, and created a virtual attendance as well. We took a different approach than other hybrid conferences, and treated this one as a virtual event with an in-person component, where the in room attendees were simply participants of the virtual event. This required all presentations to be uploaded to Big Blue Button, and the presenters presented through the virtual platform even though they were doing so on stage. This allowed the virtual attendees to be treated as first class citizens of the conference. Although we found this format a success, it wasn’t without technical difficulties, like problems with having no sound in the beginning of the first day, but that’s expected when attempting to do something for the first time. Overall, we found it to be a better experience and will continue to do so in future conferences.

We had a total of 18 microconferences (where patches are already going out on the mailing lists that are results of discussions that happened there), 16 Refereed talks, 8 Kernel Summit talks, 29 Networking and BPF Summit track talks, and 9 Toolchain track talks. There were also 17 birds-of-a-feather talks, where several were added at the last minute to solve issues that have just arrived. Most of these presentations can still be seen on video.

Stay tune for the feedback report of our attendees.

Next year Linux Plumbers will take place in North America (but not necessarily in the United States). We are still locking down on locations. As it is custom for Linux Plumbers to change chairs every year, next year will be chaired by Christian Brauner. It seems we like to have the chair live in another continent than where the conference takes place. We are hoping to find a venue that can hold at least 600 people, where we will be able to increase the number of in-person attendees.

Finally, I want to thank all those that were involved in making Linux Plumbers the best technical conference there is. This would not have happened without the hard work from the planning committee (Alice Ferrazzi, Christian Brauner, David Woodhouse, Guy Lunardi, James Bottomley, Kate Stewart, Mike Rapoport, and Paul E. McKenney), the runners of the Networking and BPF Summit track, the Toolchain track, Kernel Summit, and those that put together the very productive microconferences. I would also like to thank all those that presented as well as those who attended both in-person and virtually. I want to thank our sponsors for their continued support, and hope that this year’s conference was well worth it for them. I want to give special thanks to the Linux Foundation and their staff, who went above and beyond to make this conference run smoothly. They do a lot of work behind the scenes and the planning committee greatly appreciates it.

Before signing off from 2022, I would like to ask if anyone would be interested in volunteering with helping out at next year’s conference? We are especially looking for those that could help on a technical level, as we found running a virtual component along with a live event requires a bit more people than what we currently have. If you are interested, please send an email to contact@linuxplumbersconf.org.

Sincerely,

Steven Rostedt
Linux Plumbers 2022 Conference chair

December 03, 2022 04:55 AM

November 30, 2022

Matthew Garrett: Making unphishable 2FA phishable

One of the huge benefits of WebAuthn is that it makes traditional phishing attacks impossible. An attacker sends you a link to a site that looks legitimate but isn't, and you type in your credentials. With SMS or TOTP-based 2FA, you type in your second factor as well, and the attacker now has both your credentials and a legitimate (if time-limited) second factor token to log in with. WebAuthn prevents this by verifying that the site it's sending the secret to is the one that issued it in the first place - visit an attacker-controlled site and said attacker may get your username and password, but they won't be able to obtain a valid WebAuthn response.

But what if there was a mechanism for an attacker to direct a user to a legitimate login page, resulting in a happy WebAuthn flow, and obtain valid credentials for that user anyway? This seems like the lead-in to someone saying "The Aristocrats", but unfortunately it's (a) real, (b) RFC-defined, and (c) implemented in a whole bunch of places that handle sensitive credentials. The villain of this piece is RFC 8628, and while it exists for good reasons it can be used in a whole bunch of ways that have unfortunate security consequences.

What is the RFC 8628-defined Device Authorization Grant, and why does it exist? Imagine a device that you don't want to type a password into - either it has no input devices at all (eg, some IoT thing) or it's awkward to type a complicated password (eg, a TV with an on-screen keyboard). You want that device to be able to access resources on behalf of a user, so you want to ensure that that user authenticates the device. RFC 8628 describes an approach where the device requests the credentials, and then presents a code to the user (either on screen or over Bluetooth or something), and starts polling an endpoint for a result. The user visits a URL and types in that code (or is given a URL that has the code pre-populated) and is then guided through a standard auth process. The key distinction is that if the user authenticates correctly, the issued credentials are passed back to the device rather than the user - on successful auth, the endpoint the device is polling will return an oauth token.

But what happens if it's not a device that requests the credentials, but an attacker? What if said attacker obfuscates the URL in some way and tricks a user into clicking it? The user will be presented with their legitimate ID provider login screen, and if they're using a WebAuthn token for second factor it'll work correctly (because it's genuinely talking to the real ID provider!). The user will then typically be prompted to approve the request, but in every example I've seen the language used here is very generic and doesn't describe what's going on or ask the user. AWS simply says "An application or device requested authorization using your AWS sign-in" and has a big "Allow" button, giving the user no indication at all that hitting "Allow" may give a third party their credentials.

This isn't novel! Christoph Tafani-Dereeper has an excellent writeup on this topic from last year, which builds on Nestori Syynimaa's earlier work. But whenever I've talked about this, people seem surprised at the consequences. WebAuthn is supposed to protect against phishing attacks, but this approach subverts that protection by presenting the user with a legitimate login page and then handing their credentials to someone else.

RFC 8628 actually recognises this vector and presents a set of mitigations. Unfortunately nobody actually seems to implement these, and most of the mitigations are based around the idea that this flow will only be used for physical devices. Sadly, AWS uses this for initial authentication for the aws-cli tool, so there's no device in that scenario. Another mitigation is that there's a relatively short window where the code is valid, and so sending a link via email is likely to result in it expiring before the user clicks it. An attacker could avoid this by directing the user to a domain under their control that triggers the flow and then redirects the user to the login page, ensuring that the code is only generated after the user has clicked the link.

Can this be avoided? The best way to do so is to ensure that you don't support this token issuance flow anywhere, or if you do then ensure that any tokens issued that way are extremely narrowly scoped. Unfortunately if you're an AWS user, that's probably not viable - this flow is required for the cli tool to perform SSO login, and users are going to end up with broadly scoped tokens as a result. The logs are also not terribly useful.

The infuriating thing is that this isn't necessary for CLI tooling. The reason this approach is taken is that you need a way to get the token to a local process even if the user is doing authentication in a browser. This can be avoided by having the process listen on localhost, and then have the login flow redirect to localhost (including the token) on successful completion. In this scenario the attacker can't get access to the token without having access to the user's machine, and if they have that they probably have access to the token anyway.

There's no real moral here other than "Security is hard". Sorry.

comment count unavailable comments

November 30, 2022 10:08 PM

November 27, 2022

Matthew Garrett: Poking a mobile hotspot

I've been playing with an Orbic Speed, a relatively outdated device that only speaks LTE Cat 4, but the towers I can see from here are, uh, not well provisioned so throughput really isn't a concern (and refurbs are $18, so). As usual I'm pretty terrible at just buying devices and using them for their intended purpose, and in this case it has the irritating behaviour that if there's a power cut and the battery runs out it doesn't boot again when power returns, so here's what I've learned so far.

First, it's clearly running Linux (nmap indicates that, as do the headers from the built-in webserver). The login page for the web interface has some text reading "Open Source Notice" that highlights when you move the mouse over it, but that's it - there's code to make the text light up, but it's not actually a link. There's no exposed license notices at all, although there is a copy on the filesystem that doesn't seem to be reachable from anywhere. The notice tells you to email them to receive source code, but doesn't actually provide an email address.

Still! Let's see what else we can figure out. There's no open ports other than the web server, but there is an update utility that includes some interesting components. First, there's a copy of adb, the Android Debug Bridge. That doesn't mean the device is running Android, it's common for embedded devices from various vendors to use a bunch of Android infrastructure (including the bootloader) while having a non-Android userland on top. But this is still slightly surprising, because the device isn't exposing an adb interface over USB. There's also drivers for various Qualcomm endpoints that are, again, not exposed. Running the utility under Windows while the modem is connected results in the modem rebooting and Windows talking about new hardware being detected, and watching the device manager shows a bunch of COM ports being detected and bound by Qualcomm drivers. So, what's it doing?

Sticking the utility into Ghidra and looking for strings that correspond to the output that the tool conveniently leaves in the logs subdirectory shows that after finding a device it calls vendor_device_send_cmd(). This is implemented in a copy of libusb-win32 that, again, has no offer for source code. But it's also easy to drop that into Ghidra and discover thatn vendor_device_send_cmd() is just a wrapper for usb_control_msg(dev,0xc0,0xa0,0,0,NULL,0,1000);. Sending that from Linux results in the device rebooting and suddenly exposing some more USB endpoints, including a functional adb interface. Although, annoyingly, the rndis interface that enables USB tethering via the modem is now missing.

Unfortunately the adb user is unprivileged, but most files on the system are world-readable. data/logs/atfwd.log is especially interesting. This modem has an application processor built into the modem chipset itself, and while the modem implements the Hayes Command Set there's also a mechanism for userland to register that certain AT commands should be pushed up to userland. These are handled by the atfwd_daemon that runs as root, and conveniently logs everything it's up to. This includes having logged all the communications executed when the update tool was run earlier, so let's dig into that.

The system sends a bunch of AT+SYSCMD= commands, each of which is in the form of echo (stuff) >>/usrdata/sec/chipid. Once that's all done, it sends AT+CHIPID, receives a response of CHIPID:PASS, and then AT+SER=3,1, at which point the modem reboots back into the normal mode - adb is gone, but rndis is back. But the logs also reveal that between the CHIPID request and the response is a security check that involves RSA. The logs on the client side show that the text being written to the chipid file is a single block of base64 encoded data. Decoding it just gives apparently random binary. Heading back to Ghidra shows that atfwd_daemon is reading the chipid file and then decrypting it with an RSA key. The key is obtained by calling a series of functions, each of which returns a long base64-encoded string. Decoding each of these gives 1028 bytes of high entropy data, which is then passed to another function that decrypts it using AES CBC using a key of 000102030405060708090a0b0c0d0e0f and an initialization vector of all 0s. This is somewhat weird, since there's 1028 bytes of data and 128 bit AES works on blocks of 16 bytes. The behaviour of OpenSSL is apparently to just pad the data out to a multiple of 16 bytes, but that implies that we're going to end up with a block of garbage at the end. It turns out not to matter - despite the fact that we decrypt 1028 bytes of input only the first 200 bytes mean anything, with the rest just being garbage. Concatenating all of that together gives us a PKCS#8 private key blob in PEM format. Which means we have not only the private key, but also the public key.

So, what's in the encrypted data, and where did it come from in the first place? It turns out to be a JSON blob that contains the IMEI and the serial number of the modem. This is information that can be read from the modem in the first place, so it's not secret. The modem decrypts it, compares the values in the blob to its own values, and if they match sets a flag indicating that validation has succeeeded. But what encrypted it in the first place? It turns out that the json blob is just POSTed to http://pro.w.ifelman.com/api/encrypt and an encrypted blob returned. Of course, the fact that it's being encrypted on the server with the public key and sent to the modem that decrypted with the private key means that having access to the modem gives us the public key as well, which means we can just encrypt our own blobs.

What does that buy us? Digging through the code shows the only case that it seems to matter is when parsing the AT+SER command. The first argument to this is the serial mode to transition to, and the second is whether this should be a temporary transition or a permanent one. Once parsed, these arguments are passed to /sbin/usb/compositions/switch_usb which just writes the mode out to /usrdata/mode.cfg (if permanent) or /usrdata/mode_tmp.cfg (if temporary). On boot, /data/usb/boot_hsusb_composition reads the number from this file and chooses which USB profile to apply. This requires no special permissions, except if the number is 3 - if so, the RSA verification has to be performed first. This is somewhat strange, since mode 9 gives the same rndis functionality as mode 3, but also still leaves the debug and diagnostic interfaces enabled.

So what's the point of all of this? I'm honestly not sure! It doesn't seem like any sort of effective enforcement mechanism (even ignoring the fact that you can just create your own blobs, if you change the IMEI on the device somehow, you can just POST the new values to the server and get back a new blob), so the best I've been able to come up with is to ensure that there's some mapping between IMEI and serial number before the device can be transitioned into production mode during manufacturing.

But, uh, we can just ignore all of this anyway. Remember that AT+SYSCMD= stuff that was writing the data to /usrdata/sec/chipid in the first place? Anything that's passed to AT+SYSCMD is just executed as root. Which means we can just write a new value (including 3) to /usrdata/mode.cfg in the first place, without needing to jump through any of these hoops. Which also means we can just adb push a shell onto there and then use the AT interface to make it suid root, which avoids needing to figure out how to exploit any of the bugs that are just sitting there given it's running a 3.18.48 kernel.

Anyway, I've now got a modem that's got working USB tethering and also exposes a working adb interface, and I've got root on it. Which let me dump the bootloader and discover that it implements fastboot and has an oem off-mode-charge command which solves the problem I wanted to solve of having the device boot when it gets power again. Unfortunately I still need to get into fastboot mode. I haven't found a way to do it through software (adb reboot bootloader doesn't do anything), but this post suggests it's just a matter of grounding a test pad, at which point I should just be able to run fastboot oem off-mode-charge and it'll be all set. But that's a job for tomorrow.

Edit: Got into fastboot mode and ran fastboot oem off-mode-charge 0 but sadly it doesn't actually do anything, so I guess next is going to involve patching the bootloader binary. Since it's signed with a cert titled "General Use Test Key (for testing only)" it apparently doesn't have secure boot enabled, so this should be easy enough.

comment count unavailable comments

November 27, 2022 12:29 AM

November 06, 2022

Paul E. Mc Kenney: Hiking Hills

A few years ago, I posted on the challenges of maintaining low weight as one ages.  I have managed to  stay near my target weight, with the occasional excursion in either direction, though admittedly more often up than down.  My suspicion that maintaining weight would prove 90% as difficult as losing it has proven to be all too well founded.  As has the observation that exercise is inherently damaging to muscles (see for example here), especially as one's body's ability to repair itself decreases inexorably with age.

It can be helpful to refer back to those old college physics courses.  One helpful formula is the well-worn Newtonian formula for kinetic energy, which is equal to half your mass times the square of your velocity.  Now, the human body does not maintain precisely the same speed while moving (that is after all what bicycles are for), and the faster you are going, the more energy your body must absorb when decreasing your velocity by a set amount on each footfall.  In fact, this amount of energy increases linearly with your average velocity.  So you can reduce the energy absorption (and thus the muscle and joint damage) by decreasing your speed.  And here you were wondering why old people often move much more slowly than do young people!

But moving more slowly decreases the quality of the exercise, for example, requiring more time to gain the same cardiovascular benefits.  One approach is to switch from (say) running to hiking uphill, thus decreasing velocity while increasing exertion.  This works quite well, at least until it comes time to hike back down the hill.

At this point, another formula comes into play, that for potential energy.  The energy released by lowering your elevation is your mass times the force of gravity time the difference in elevation.  With each step you take downhill, your body must dissipate this amount of energy.  Alternatively, you can transfer the potential energy into kinetic energy, but please see the previous discussion.  And again, this is what bicycles are for, at least for those retaining sufficiently fast reflexes to operate them safely under those conditions.  (Not me!!!)

The potential energy can be dissipated by your joints or by your muscles, with muscular dissipation normally being less damaging.  In other words, bend your knee and hip before, during, and after the time that your foot strikes the ground.  This gives your leg muscles more time to dissipate that step's worth of potential energy.  Walking backwards helps by bringing your ankle joint into play and also by increasing the extent to which your hip and knee can flex.  Just be careful to watch where you are going, as falling backwards down a hill is not normally what you want to be doing.  (Me, I walk backwards down the steepest slopes, which allow me to see behind myself just by looking down.  It is also helpful to have someone else watching out for you.)

Also, take small steps.  This reduces the difference in elevation, thus reducing the amount of energy that must be dissipated per step.

But wait!  This also increases the number of steps, so that the effect of reducing your stride cancels out, right?

Wrong.

First, longer stride tends to result in higher velocity, the damaging effects of which were described above.  Second, the damage your muscles incur while dissipating energy is non-linear with both the force that your muscles are exerting and the energy per unit time (also known as "power") that they are dissipating.  To see this, recall that a certain level of force/power will cause your muscle to rupture completely, so that a (say) 10x reduction in force/power results in much greater than a 10x reduction in damage.

These approaches allow you to get good exercise with minimal damage to your body.  Other strategies include the aforementioned bicycling as well as swimming.  Although I am fond of swimming, I recognize that it is my exercise endgame, and that I will therefore need to learn to like it.  But not just yet.

To circle back to the subject of the earlier blog post, one common term in the formulas for both kinetic and potential energy is one's mass.  And I do find hiking easier than it was when I weighed 30 pounds more than I do now.  Should I lose more weight?  On this, I defer to my wife, who is a dietitian.  She assures me that 180 pounds is my target weight.

So here I am!  And here I will endeavor to stay, despite my body's continued fear of the food-free winter that it has never directly experienced.

November 06, 2022 09:15 PM

October 13, 2022

Dave Airlie (blogspot): LPC 2022 Accelerators BOF outcomes summary

 At Linux Plumbers Conference 2022, we held a BoF session around accelerators.

This is a summary made from memory and notes taken by John Hubbard.

We started with defining categories of accelerator devices.

1. single shot data processors, submit one off jobs to a device. (simpler image processors)

2. single-user, single task offload devices (ML training devices)

3. multi-app devices (GPU, ML/inference execution engines)

One of the main points made is that common device frameworks are normally about targeting a common userspace (e.g. mesa for GPUs). Since a common userspace doesn't exist for accelerators, this presents a problem of what sort of common things can be targetted. Discussion about tensorflow, pytorch as being the userspace, but also camera image processing and OpenCL. OpenXLA was also named as a userspace API that might be of interest to use as a target for implementations.

 There was a discussion on what to call the subsystem and where to place it in the tree. It was agreed that the drivers would likely need to use DRM subsystem functionality but having things live in drivers/gpu/drm would not be great. Moving things around now for current drivers is too hard to deal with for backports etc. Adding a new directory for accel drivers would be a good plan, even if they used the drm framework. There was a lot naming discussion, I think we landed on drivers/skynet or drivers/accel (Greg and I like skynet more).

We had a discussion about RAS (Reliability, Availability, Serviceability) which is how hardware is monitored in data centers. GPU and acceleration drivers for datacentre operations define a their own RAS interfaces that get plugged into monitoring systems. This seems like an area that could be standardised across drivers. Netlink was suggested as a possible solution for this area.

Namespacing for devices was brought up. I think the advice was if you ever think you are going to namespace something in the future, you should probably consider creating a namespace for it up front, as designing one in later might prove difficult to secure properly.

We should use the drm framework with another major number to avoid some of the pain points and lifetime issues other frameworks see.

There was discussion about who could drive this forward, and Oded Gabbay from Intel's Habana Labs team was the obvious and best placed person to move it forward, Oded said he would endeavor to make it happen.

This is mostly a summary of the notes, I think we have a fair idea on a path forward we just need to start bringing the pieces together upstream now.

 




October 13, 2022 07:49 PM

October 07, 2022

Paul E. Mc Kenney: Stupid RCU Tricks: CPP Summit Presentation

I had the privilege of presenting Unraveling Fence & RCU Mysteries (C++ Concurrency Fundamentals) to the CPP Summit.  As the title suggests, this covered RCU from a C++ viewpoint.  At the end, there were several excellent questions:

  1. How does the rcu_synchronize() wait-for-readers operation work?
  2. But the rcu_domain class contains lock() and unlock() member functions!!!
  3. Lockless things make me nervous!

There was limited time for questions, and each question's answer could easily have consumed the full 50 minutes alloted for the full talk.  Therefore, I address these questions in the following sections.

How Does rcu_synchronize() Work?

There are a great many ways to make this work.  Very roughly speaking, userspace RCU implementations usually have per-thread counters that are updated by readers and sampled by updaters, with the updaters waiting for all of the counters to reach zero.  There are a large number of pitfalls and optimizations, some of which are covered in the 2012 Transactions On Parallel and Distributed Systems paper (non-paywalled draft).  The most detailed discussion is in the supplementary materials.

More recent information may be found in Section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?

The rcu_domain Class Contains lock() and unlock() Member Functions?

Indeed it does!

But names notwithstanding, these lock() and unlock() member functions need not contain memory-barrier instructions, let alone read-modify-write atomic operations, let alone acquisition and release of actual locks.

So why these misleading names???  These misleading names exist so that the rcu_domain class meets the requirements of Cpp17BasicLockable, which provides RAII capability for C++ RCU readers.  Earlier versions of the RCU proposal for the C++ standard rolled their own RAII capability, but the committee wisely insisted that Cpp17BasicLockable's existing RAII capabilities be used instead.

So it is that rcu_domain::lock() simply enters an RCU read-side critical section and rcu_domain::unlock() exits that critical section.  Yes, RCU read-side critical sections can be nested.

Lockless Things Make Me Nervous!!!

As well they should!

The wise developer will be at least somewhat nervous when implementing lockless code because that nervousness will help motivate the developer to be careful, to test and stress-test carefully, and, when necessary, make good use of formal verification.

In fact, one of the purposes of RCU is to package lockless code so as to make it easier to use.  This presentation dove into one RCU use case, and other materials called out in this CPP Summit presentation looked into many other RCU use cases.

So proper use of RCU should enable developers to be less nervous.  But hopefully not completely lacking in nervousness!  :-)

October 07, 2022 11:05 AM

October 06, 2022

Matthew Garrett: Cloud desktops aren't as good as you'd think

Fast laptops are expensive, cheap laptops are slow. But even a fast laptop is slower than a decent workstation, and if your developers want a local build environment they're probably going to want a decent workstation. They'll want a fast (and expensive) laptop as well, though, because they're not going to carry their workstation home with them and obviously you expect them to be able to work from home. And in two or three years they'll probably want a new laptop and a new workstation, and that's even more money. Not to mention the risks associated with them doing development work on their laptop and then drunkenly leaving it in a bar or having it stolen or the contents being copied off it while they're passing through immigration at an airport. Surely there's a better way?

This is the thinking that leads to "Let's give developers a Chromebook and a VM running in the cloud". And it's an appealing option! You spend far less on the laptop, and the VM is probably cheaper than the workstation - you can shut it down when it's idle, you can upgrade it to have more CPUs and RAM as necessary, and you get to impose all sorts of additional neat security policies because you have full control over the network. You can run a full desktop environment on the VM, stream it to a cheap laptop, and get the fast workstation experience on something that weighs about a kilogram. Your developers get the benefit of a fast machine wherever they are, and everyone's happy.

But having worked at more than one company that's tried this approach, my experience is that very few people end up happy. I'm going to give a few reasons here, but I can't guarantee that they cover everything - and, to be clear, many (possibly most) of the reasons I'm going to describe aren't impossible to fix, they're simply not priorities. I'm also going to restrict this discussion to the case of "We run a full graphical environment on the VM, and stream that to the laptop" - an approach that only offers SSH access is much more manageable, but also significantly more restricted in certain ways. With those details mentioned, let's begin.

The first thing to note is that the overall experience is heavily tied to the protocol you use for the remote display. Chrome Remote Desktop is extremely appealing from a simplicity perspective, but is also lacking some extremely key features (eg, letting you use multiple displays on the local system), so from a developer perspective it's suboptimal. If you read the rest of this post and want to try this anyway, spend some time working with your users to find out what their requirements are and figure out which technology best suits them.

Second, let's talk about GPUs. Trying to run a modern desktop environment without any GPU acceleration is going to be a miserable experience. Sure, throwing enough CPU at the problem will get you past the worst of this, but you're still going to end up with users who need to do 3D visualisation, or who are doing VR development, or who expect WebGL to work without burning up every single one of the CPU cores you so graciously allocated to their VM. Cloud providers will happily give you GPU instances, but that's going to cost more and you're going to need to re-run your numbers to verify that this is still a financial win. "But most of my users don't need that!" you might say, and we'll get to that later on.

Next! Video quality! This seems like a trivial point, but if you're giving your users a VM as their primary interface, then they're going to do things like try to use Youtube inside it because there's a conference presentation that's relevant to their interests. The obvious solution here is "Do your video streaming in a browser on the local system, not on the VM" but from personal experience that's a super awkward pain point! If I click on a link inside the VM it's going to open a browser there, and now I have a browser in the VM and a local browser and which of them contains the tab I'm looking for WHO CAN SAY. So your users are going to watch stuff inside their VM, and re-compressing decompressed video is going to look like shit unless you're throwing a huge amount of bandwidth at the problem. And this is ignoring the additional irritation of your browser being unreadable while you're rapidly scrolling through pages, or terminal output from build processes being a muddy blur of artifacts, or the corner case of "I work for Youtube and I need to be able to examine 4K streams to determine whether changes have resulted in a degraded experience" which is a very real job and one that becomes impossible when you pass their lovingly crafted optimisations through whatever codec your remote desktop protocol has decided to pick based on some random guesses about the local network, and look everyone is going to have a bad time.

The browser experience. As mentioned before, you'll have local browsers and remote browsers. Do they have the same security policy? Who knows! Are all the third party services you depend on going to be ok with the same user being logged in from two different IPs simultaneously because they lost track of which browser they had an open session in? Who knows! Are your users going to become frustrated? Who knows oh wait no I know the answer to this one, it's "yes".

Accessibility! More of your users than you expect rely on various accessibility interfaces, be those mechanisms for increasing contrast, screen magnifiers, text-to-speech, speech-to-text, alternative input mechanisms and so on. And you probably don't know this, but most of these mechanisms involve having accessibility software be able to introspect the UI of applications in order to provide appropriate input or expose available options and the like. So, I'm running a local text-to-speech agent. How does it know what's happening in the remote VM? It doesn't because it's just getting an a/v stream, so you need to run another accessibility stack inside the remote VM and the two of them are unaware of each others existence and this works just as badly as you'd think. Alternative input mechanism? Good fucking luck with that, you're at best going to fall back to "Send synthesized keyboard inputs" and that is nowhere near as good as "Set the contents of this text box to this unicode string" and yeah I used to work on accessibility software maybe you can tell. And how is the VM going to send data to a braille output device? Anyway, good luck with the lawsuits over arbitrarily making life harder for a bunch of members of a protected class.

One of the benefits here is supposed to be a security improvement, so let's talk about WebAuthn. I'm a big fan of WebAuthn, given that it's a multi-factor authentication mechanism that actually does a good job of protecting against phishing, but if my users are running stuff inside a VM, how do I use it? If you work at Google there's a solution, but that does mean limiting yourself to Chrome Remote Desktop (there are extremely good reasons why this isn't generally available). Microsoft have apparently just specced a mechanism for doing this over RDP, but otherwise you're left doing stuff like forwarding USB over IP, and that means that your USB WebAuthn no longer works locally. It also doesn't work for any other type of WebAuthn token, such as a bluetooth device, or an Apple TouchID sensor, or any of the Windows Hello support. If you're planning on moving to WebAuthn and also planning on moving to remote VM desktops, you're going to have a bad time.

That's the stuff that comes to mind immediately. And sure, maybe each of these issues is irrelevant to most of your users. But the actual question you need to ask is what percentage of your users will hit one or more of these, because if that's more than an insignificant percentage you'll still be staffing all the teams that dealt with hardware, handling local OS installs, worrying about lost or stolen devices, and the glorious future of just being able to stop worrying about this is going to be gone and the financial benefits you promised would appear are probably not going to work out in the same way.

A lot of this falls back to the usual story of corporate IT - understand the needs of your users and whether what you're proposing actually meets them. Almost everything I've described here is a corner case, but if your company is larger than about 20 people there's a high probability that at least one person is going to fall into at least one of these corner cases. You're going to need to spend a lot of time understanding your user population to have a real understanding of what the actual costs are here, and I haven't seen anyone do that work before trying to launch this and (inevitably) going back to just giving people actual computers.

There are alternatives! Modern IDEs tend to support SSHing out to remote hosts to perform builds there, so as long as you're ok with source code being visible on laptops you can at least shift the "I need a workstation with a bunch of CPU" problem out to the cloud. The laptops are going to need to be more expensive because they're also going to need to run more software locally, but it wouldn't surprise me if this ends up being cheaper than the full-on cloud desktop experience in most cases.

Overall, the most important thing to take into account here is that your users almost certainly have more use cases than you expect, and this sort of change is going to have direct impact on the workflow of every single one of your users. Make sure you know how much that's going to be, and take that into consideration when suggesting it'll save you money.

comment count unavailable comments

October 06, 2022 07:52 PM

September 28, 2022

James Bottomley: Paying Maintainers isn’t a Magic Bullet

Over the last few years it’s become popular to suggest that open source maintainers should be paid. There are a couple of stated motivations for this, one being that it would improve the security and reliability of the ecosystem (pioneered by several companies like Tidelift) and the others contending that it would be a solution to the maintainer burnout and finally that it would solve the open source free rider problem. The purpose of this blog is to examine each of these in turn to seeing if paying maintainers actually would solve the problem (or, for some, does the problem even exist in the first place).

Free Riders

The free rider problem is simply expressed: There’s a class of corporations which consume open source, for free, as the foundation of their profits but don’t give back enough of their allegedly ill gotten gains. In fact, a version of this problem is as old as time: the “workers” don’t get paid enough (or at all) by the “bosses”; greedy people disproportionately exploit the free but limited resources of the planet. Open Source is uniquely vulnerable to this problem because of the free (as in beer) nature of the software: people who don’t have to pay for something often don’t. Part of the problem also comes from the general philosophy of open source which tries to explain that it’s free (as in freedom) which matters not free (as in beer) and everyone, both producers and consumers should care about the former. In fact, in economic terms, the biggest impact open source has had on industry is from the free (as in beer) effect.

Open Source as a Destroyer of Market Value

Open Source is often portrayed as a “disrupter” of the market, but it’s not often appreciated that a huge part of that disruption is value destruction. Consider one of the older Open Source systems: Linux. As an operating system (when coupled with GNU or other user space software) it competed in the early days with proprietary UNIX. However, it’s impossible to maintain your margin competing against free and the net result was that one by one the existing players were forced out of the market or refocussed on other offerings and now, other than for historical or niche markets, there’s really no proprietary UNIX maker left … essentially the value contained within the OS market was destroyed. This value destruction effect was exploited brilliantly by Google with Android: to enter and disrupt an existing lucrative smart phone market, created and owned by Apple, with a free OS based on Open Source successfully created a load of undercutting handset manufacturers eager to be cheaper than Apple who went on to carve out an 80% market share. Here, the value isn’t completely destroyed, but it has significantly reduced (smart phones going from a huge margin business to a medium to low margin one).

All of this value destruction is achieved by the free (as in beer) effect of open source: the innovator who uses it doesn’t have to pay the full economic cost for developing everything from scratch, they just have to pay the innovation expense of adapting it (such adaptation being made far easier by access to the source code). This effect is also the reason why Microsoft and other companies railed about Open Source being a cancer on intellectual property: because it is1. However, this view is also the product of rigid and incorrect thinking: by destroying value in existing markets, open source presents far more varied and unique opportunities in newly created ones. The cardinal economic benefit of value destruction is that it lowers the barrier to entry (as Google demonstrated with Android) thus opening the market up to new and varied competition (or turning monopoly markets into competitive ones).

Envy isn’t a Good Look

If you follow the above, you’ll see the supposed “free rider” problem is simply a natural consequence of open source being free as in beer (someone is creating a market out of the thing you offered for free precisely because they didn’t have to pay for it): it’s not a problem to be solved, it’s a consequence to be embraced and exploited (if you’re clever enough). Not all of us possess the business acumen to exploit market opportunities like this, but if you don’t, envying those who do definitely won’t cause your quality of life to improve.

The bottom line is that having a mechanism to pay maintainers isn’t going to do anything about this supposed “free rider” problem because the companies that exploit open source and don’t give back have no motivation to use it.

Maintainer Burnout

This has become a hot topic over recent years with many blog posts and support groups devoted to it. From my observation it seems to matter what kind of maintainer you are: If you only have hobby projects you maintain on an as time becomes available basis, it seems the chances of burn out isn’t high. On the other hand, if you’re effectively a full time Maintainer, burn out becomes a distinct possibility. I should point out I’m the former not the latter type of maintainer, so this is observation not experience, but it does seem to me that burn out at any job (not just that of a Maintainer) seems to happen when delivery expectations exceed your ability to deliver and you start to get depressed about the ever increasing backlog and vocal complaints. In industry when someone starts to burn out, the usual way of rectifying it is either lighten the load or provide assistance. I have noticed that full time Maintainers are remarkably reluctant to give up projects (presumably because each one is part of their core value), so helping with tooling to decrease the load is about the only possible intervention here.

As an aside about tooling, from parallels with Industry, although tools correctly used can provide useful assistance, there are sure fire ways to increase the possibility of burn out with inappropriate demands:

It does strike me that some of our much venerated open source systems, like github, have some of these same management anti-patterns, like encouraging Maintainers to chase repository stars to show value, having a daily reminder of outstanding PRs and Issues, showing everyone who visits your home page your contribution records for every project over the last year.

To get back to the main point, again by parallel with Industry, paying people more doesn’t decrease industrial burn out; it may produce a temporary feel good high, but the backlog pile eventually overcomes this. If someone is already working at full stretch at something they want to do giving them more money isn’t going to make them stretch further. For hobby maintainers like me, even if you could find a way to pay me that my current employer wouldn’t object to, I’m already devoting as much time as I can spare to my Maintainer projects, so I’m unlikely to find more (although I’m not going to refuse free money …).

Security and Reliability

Everyone wants Maintainers to code securely and reliably and also to respond to bug reports within a fixed SLA. Obviously usual open source Maintainers are already trying to code securely and reliably and aren’t going to do the SLA thing because they don’t have to (as the licence says “NO WARRANTY …”), so paying them won’t improve the former and if they’re already devoting all the time they can to Maintenance, it won’t achieve the latter either. So how could Security and Reliability be improved? All a maintainer can really do is keep current with current coding techniques (when was the last time someone offered a free course to Maintainers to help with this?). Suggesting to a project that if they truly believed in security they’d rewrite it in Rust from tens of thousands of lines of C really is annoying and unhelpful.

One of the key ways to keep software secure and reliable is to run checkers over the code and do extensive unit and integration testing. The great thing about this is that it can be done as a side project from the main Maintenance task provided someone triages and fixes the issues generated. This latter is key; simply dumping issue reports on an overloaded maintainer makes the overload problem worse and adds to a pile of things they might never get around to. So if you are thinking of providing checker or tester resources, please also think how any generated issues might get resolved without generating more work for a Maintainer.

Business Models around Security and Reliability

A pretty old business model for code checking and testing is to have a distribution do it. The good ones tend to send patches upstream and their business model is to sell the software (or at least a support licence to it) which gives the recipients a SLA as well. So what’s the problem? Mainly the economics of this tried and trusted model. Firstly what you want supported must be shipped by a distribution, which means it must have a big enough audience for a distribution to consider it a fairly essential item. Secondly you end up paying a per instance use cost that’s an average of everything the distribution ships. The main killer is this per instance cost, particularly if you are a hyperscaler, so it’s no wonder there’s a lot of pressure to shift the cost from being per instance to per project.

As I said above, Maintainers often really need more help than more money. One good way to start would potentially be to add testing and checking (including bug fixing and upstreaming) services to a project. This would necessarily involve liaising with the maintainer (and could involve an honorarium) but the object should be to be assistive (and thus scalably added) to what the Maintainer is already doing and prevent the service becoming a Maintainer time sink.

Professional Maintainers

Most of the above analysis assumed Maintainers are giving all the time they have available to the project. However, in the case where a Maintainer is doing the project in their spare time or is an Employee of a Company and being paid partly to work on the project and partly on other things, paying them to become a full time Maintainer (thus leaving their current employment) has the potential to add the hours spent on “other things” to the time spent on the project and would thus be a net win. However, you have to also remember that turning from employee to independent contractor also comes with costs in terms of red tape (like health care, tax filings, accounting and so on), which can become a significant time sink, so the net gain in hours to the project might not be as many as one would think. In an ideal world, entities paying maintainers would also consider this problem and offer services to offload the burden (although none currently seem to consider this). Additionally, turning from part time to full time can increase the problem of burn out, particularly if you spend increasing portions of your newly acquired time worrying about admin issues or other problems associated with running your own consulting business.

Conclusions

The obvious conclusion from the above analysis is that paying maintainers mostly doesn’t achieve it’s stated goals. However, you have to remember that this is looking at the problem thorough the prism of claimed end results. One thing paying maintainers definitely does do is increase the mechanisms by which maintainers themselves make a living (which is a kind of essential existential precursor). Before paying maintainers became a thing, the only real way of making a living as a maintainer was reputation monetization (corporations paid you to have a maintainer on staff or because being a maintainer demonstrated a skill set they needed in other aspects of their business) but now a Maintainer also has the option to turn Professional. Increasing the net ways of rewarding Maintainership therefore should be a net benefit to attracting people into all ecosystems.

In general, I think that paying maintainers is a good thing, but should be the beginning of the search for ways of remunerating Open Source contributors, not the end.

September 28, 2022 08:16 PM

September 25, 2022

Paul E. Mc Kenney: Parallel Programming: September 2022 Update

The v2022.09.25a release of Is Parallel Programming Hard, And, If So, What Can You Do About It? is now available! The double-column version is also available from arXiv.org.

This version boasts an expanded index and API index, and also adds a number of improvements, perhaps most notably boldface for the most pertinent pages for a given index entry, courtesy of Akira Yokosawa. Akira also further improved the new ebook-friendly PDFs, expanded the list of acronyms, updated the build system to allow different perfbook formats to be built concurrently, adjusted for Ghostscript changes, carried out per-Linux-version updates, and did a great deal of formatting and other cleanup.

One of the code samples now use C11 thread-local storage instead of the GCC __thread storage class, courtesy of Elad Lahav. Elad also added support for building code samples on QNX.

Johann Klähn, SeongJae Park, Xuwei Fu, and Zhouyi Zhou provided many welcome fixes throughout the book.

This release also includes a number of updates to RCU, memory ordering, locking, and non-blocking synchronization, as well as additional information on the combined use of synchronization mechanisms.

September 25, 2022 10:43 PM

September 20, 2022

Matthew Garrett: Handling WebAuthn over remote SSH connections

Being able to SSH into remote machines and do work there is great. Using hardware security tokens for 2FA is also great. But trying to use them both at the same time doesn't work super well, because if you hit a WebAuthn request on the remote machine it doesn't matter how much you mash your token - it's not going to work.

But could it?

The SSH agent protocol abstracts key management out of SSH itself and into a separate process. When you run "ssh-add .ssh/id_rsa", that key is being loaded into the SSH agent. When SSH wants to use that key to authenticate to a remote system, it asks the SSH agent to perform the cryptographic signatures on its behalf. SSH also supports forwarding the SSH agent protocol over SSH itself, so if you SSH into a remote system then remote clients can also access your keys - this allows you to bounce through one remote system into another without having to copy your keys to those remote systems.

More recently, SSH gained the ability to store SSH keys on hardware tokens such as Yubikeys. If configured appropriately, this means that even if you forward your agent to a remote site, that site can't do anything with your keys unless you physically touch the token. But out of the box, this is only useful for SSH keys - you can't do anything else with this support.

Well, that's what I thought, at least. And then I looked at the code and realised that SSH is communicating with the security tokens using the same library that a browser would, except it ensures that any signature request starts with the string "ssh:" (which a genuine WebAuthn request never will). This constraint can actually be disabled by passing -O no-restrict-websafe to ssh-agent, except that was broken until this weekend. But let's assume there's a glorious future where that patch gets backported everywhere, and see what we can do with it.

First we need to load the key into the security token. For this I ended up hacking up the Go SSH agent support. Annoyingly it doesn't seem to be possible to make calls to the agent without going via one of the exported methods here, so I don't think this logic can be implemented without modifying the agent module itself. But this is basically as simple as adding another key message type that looks something like:

type ecdsaSkKeyMsg struct {
       Type        string `sshtype:"17|25"`
       Curve       string
       PubKeyBytes []byte
       RpId        string
       Flags       uint8
       KeyHandle   []byte
       Reserved    []byte
       Comments    string
       Constraints []byte `ssh:"rest"`
}
Where Type is ssh.KeyAlgoSKECDSA256, Curve is "nistp256", RpId is the identity of the relying party (eg, "webauthn.io"), Flags is 0x1 if you want the user to have to touch the key, KeyHandle is the hardware token's representation of the key (basically an opaque blob that's sufficient for the token to regenerate the keypair - this is generally stored by the remote site and handed back to you when it wants you to authenticate). The other fields can be ignored, other than PubKeyBytes, which is supposed to be the public half of the keypair.

This causes an obvious problem. We have an opaque blob that represents a keypair. We don't have the public key. And OpenSSH verifies that PubKeyByes is a legitimate ecdsa public key before it'll load the key. Fortunately it only verifies that it's a legitimate ecdsa public key, and does nothing to verify that it's related to the private key in any way. So, just generate a new ECDSA key (ecdsa.GenerateKey(elliptic.P256(), rand.Reader)) and marshal it ( elliptic.Marshal(ecKey.Curve, ecKey.X, ecKey.Y)) and we're good. Pass that struct to ssh.Marshal() and then make an agent call.

Now you can use the standard agent interfaces to trigger a signature event. You want to pass the raw challenge (not the hash of the challenge!) - the SSH code will do the hashing itself. If you're using agent forwarding this will be forwarded from the remote system to your local one, and your security token should start blinking - touch it and you'll get back an ssh.Signature blob. ssh.Unmarshal() the Blob member to a struct like
type ecSig struct {
        R *big.Int
        S *big.Int
}
and then ssh.Unmarshal the Rest member to
type authData struct {
        Flags    uint8
        SigCount uint32
}
The signature needs to be converted back to a DER-encoded ASN.1 structure (eg,
var b cryptobyte.Builder
b.AddASN1(asn1.SEQUENCE, func(b *cryptobyte.Builder) {
        b.AddASN1BigInt(ecSig.R)
        b.AddASN1BigInt(ecSig.S)
})
signatureDER, _ := b.Bytes()
, and then you need to construct the Authenticator Data structure. For this, take the RpId used earlier and generate the sha256. Append the one byte Flags variable, and then convert SigCount to big endian and append those 4 bytes. You should now have a 37 byte structure. This needs to be CBOR encoded (I used github.com/fxamacker/cbor and just called cbor.Marshal(data, cbor.EncOptions{})).

Now base64 encode the sha256 of the challenge data, the DER-encoded signature and the CBOR-encoded authenticator data and you've got everything you need to provide to the remote site to satisfy the challenge.

There are alternative approaches - you can use USB/IP to forward the hardware token directly to the remote system. But that means you can't use it locally, so it's less than ideal. Or you could implement a proxy that communicates with the key locally and have that tunneled through to the remote host, but at that point you're just reinventing ssh-agent.

And you should bear in mind that the default behaviour of blocking this sort of request is for a good reason! If someone is able to compromise a remote system that you're SSHed into, they can potentially trick you into hitting the key to sign a request they've made on behalf of an arbitrary site. Obviously they could do the same without any of this if they've compromised your local system, but there is some additional risk to this. It would be nice to have sensible MAC policies that default-denied access to the SSH agent socket and only allowed trustworthy binaries to do so, or maybe have some sort of reasonable flatpak-style portal to gate access. For my threat model I think it's a worthwhile security tradeoff, but you should evaluate that carefully yourself.

Anyway. Now to figure out whether there's a reasonable way to get browsers to work with this.

comment count unavailable comments

September 20, 2022 02:18 AM

September 19, 2022

Matthew Garrett: Bring Your Own Disaster

After my last post, someone suggested that having employers be able to restrict keys to machines they control is a bad thing. So here's why I think Bring Your Own Device (BYOD) scenarios are bad not only for employers, but also for users.

There's obvious mutual appeal to having developers use their own hardware rather than rely on employer-provided hardware. The user gets to use hardware they're familiar with, and which matches their ergonomic desires. The employer gets to save on the money required to buy new hardware for the employee. From this perspective, there's a clear win-win outcome.

But once you start thinking about security, it gets more complicated. If I, as an employer, want to ensure that any systems that can access my resources meet a certain security baseline (eg, I don't want my developers using unpatched Windows ME), I need some of my own software installed on there. And that software doesn't magically go away when the user is doing their own thing. If a user lends their machine to their partner, is the partner fully informed about what level of access I have? Are they going to feel that their privacy has been violated if they find out afterwards?

But it's not just about monitoring. If an employee's machine is compromised and the compromise is detected, what happens next? If the employer owns the system then it's easy - you pick up the device for forensic analysis and give the employee a new machine to use while that's going on. If the employee owns the system, they're probably not going to be super enthusiastic about handing over a machine that also contains a bunch of their personal data. In much of the world the law is probably on their side, and even if it isn't then telling the employee that they have a choice between handing over their laptop or getting fired probably isn't going to end well.

But obviously this is all predicated on the idea that an employer needs visibility into what's happening on systems that have access to their systems, or which are used to develop code that they'll be deploying. And I think it's fair to say that not everyone needs that! But if you hold any sort of personal data (including passwords) for any external users, I really do think you need to protect against compromised employee machines, and that does mean having some degree of insight into what's happening on those machines. If you don't want to deal with the complicated consequences of allowing employees to use their own hardware, it's rational to ensure that only employer-owned hardware can be used.

But what about the employers that don't currently need that? If there's no plausible future where you'll host user data, or where you'll sell products to others who'll host user data, then sure! But if that might happen in future (even if it doesn't right now), what's your transition plan? How are you going to deal with employees who are happily using their personal systems right now? At what point are you going to buy new laptops for everyone? BYOD might work for you now, but will it always?

And if your employer insists on employees using their own hardware, those employees should ask what happens in the event of a security breach. Whose responsibility is it to ensure that hardware is kept up to date? Is there an expectation that security can insist on the hardware being handed over for investigation? What information about the employee's use of their own hardware is going to be logged, who has access to those logs, and how long are those logs going to be kept for? If those questions can't be answered in a reasonable way, it's a huge red flag. You shouldn't have to give up your privacy and (potentially) your hardware for a job.

Using technical mechanisms to ensure that employees only use employer-provided hardware is understandably icky, but it's something that allows employers to impose appropriate security policies without violating employee privacy.

comment count unavailable comments

September 19, 2022 07:12 AM

September 16, 2022

Dave Airlie (blogspot): LPC 2022 GPU BOF (user console and cgroups)

 At LPC 2022 we had a BoF session for GPUs with two topics.

Moving to userspace consoles:

Currently most mainline distros still use the kernel console, provided by the VT subsystem. We'd like to move to CONFIG_VT=n as the console and vt subsystem have historically been a source of bugs but are also nasty places for locking etc. It also can be the cause of oops going missing when it takes out the panic path with locking bugs stopping other paths from completely processing the oops (like pstore or serial).

 The session started discussing what things would like. Lennart gave a great summary of the work David did a few years ago and the train of thought involved.

Once you think through all the paths and things you want supported, you realise the best user console is going to be one that supports emojis and non-Latin scripts. This probably means you want a lightweight wayland compositor running a fullscreen VTE based terminal. Working back from the consequences of this means you probably aren't going to want this in systemd, and it should be a separate development.

The other area discussed was around the requirements for a panic/emergency kernel console, likely called drmlog, this would just be something to output to the display whenever the kernel panics or during boot before the user console is loaded.

cgroups for GPU

This has been an ongoing topic, where vendors often suggest cgroup patches, and on review told they are too vendor specific and to simplify and try again, never to try again. We went over the problem space of both memory allocation accounting/enforcement and processing restrictions. We all agreed the local memory accounting was probably the easiest place to start doing something. We also agreed that accounting should be implemented and solid before we start digging into enforcement. We had a discussion about where CPU memory should be accounted, especially around integrated vs discrete graphics, since users might care about the application memory usage apart from the GPU objects usage which would be CPU on integrated and GPU on discrete. I believe a follow-up hallway discussion also covered a bunch of this with upstream cgroups maintainer.

September 16, 2022 03:56 PM

September 15, 2022

Matthew Garrett: git signatures with SSH certificates

Last night I complained that git's SSH signature format didn't support using SSH certificates rather than raw keys, and was swiftly corrected, once again highlighting that the best way to make something happen is to complain about it on the internet in order to trigger the universe to retcon it into existence to make you look like a fool. But anyway. Let's talk about making this work!

git's SSH signing support is actually just it shelling out to ssh-keygen with a specific set of options, so let's go through an example of this with ssh-keygen. First, here's my certificate:

$ ssh-keygen -L -f id_aurora-cert.pub
id_aurora-cert.pub:
Type: ecdsa-sha2-nistp256-cert-v01@openssh.com user certificate
Public key: ECDSA-CERT SHA256:(elided)
Signing CA: RSA SHA256:(elided)
Key ID: "mgarrett@aurora.tech"
Serial: 10505979558050566331
Valid: from 2022-09-13T17:23:53 to 2022-09-14T13:24:23
Principals:
mgarrett@aurora.tech
Critical Options: (none)
Extensions:
permit-agent-forwarding
permit-port-forwarding
permit-pty

Ok! Now let's sign something:

$ ssh-keygen -Y sign -f ~/.ssh/id_aurora-cert.pub -n git /tmp/testfile
Signing file /tmp/testfile
Write signature to /tmp/testfile.sig

To verify this we need an allowed signatures file, which should look something like:

*@aurora.tech cert-authority ssh-rsa AAA(elided)

Perfect. Let's verify it:

$ cat /tmp/testfile | ssh-keygen -Y verify -f /tmp/allowed_signers -I mgarrett@aurora.tech -n git -s /tmp/testfile.sig
Good "git" signature for mgarrett@aurora.tech with ECDSA-CERT key SHA256:(elided)


Woo! So, how do we make use of this in git? Generating the signatures is as simple as

$ git config --global commit.gpgsign true
$ git config --global gpg.format ssh
$ git config --global user.signingkey /home/mjg59/.ssh/id_aurora-cert.pub


and then getting on with life. Any commits will now be signed with the provided certificate. Unfortunately, git itself won't handle verification of these - it calls ssh-keygen -Y find-principals which doesn't deal with wildcards in the allowed signers file correctly, and then falls back to verifying the signature without making any assertions about identity. Which means you're going to have to implement this in your own CI by extracting the commit and the signature, extracting the identity from the commit metadata and calling ssh-keygen on your own. But it can be made to work!

But why would you want to? The current approach of managing keys for git isn't ideal - you can kind of piggy-back off github/gitlab SSH key infrastructure, but if you're an enterprise using SSH certificates for access then your users don't necessarily have enrolled keys to start with. And using certificates gives you extra benefits, such as having your CA verify that keys are hardware-backed before issuing a cert. Want to ensure that whoever made a commit was actually on an authorised laptop? Now you can!

I'll probably spend a little while looking into whether it's plausible to make the git verification code work with certificates or whether the right thing is to fix up ssh-keygen -Y find-principals to work with wildcard identities, but either way it's probably not much effort to get this working out of the box.

Edit to add: thanks to this commenter for pointing out that current OpenSSH git actually makes this work already!

comment count unavailable comments

September 15, 2022 09:02 PM

September 13, 2022

Paul E. Mc Kenney: Kangrejos 2022: The Rust for Linux Workshop

I had the honor of attending this workshop, however, this is not a full report. Instead, I am reporting on what I learned and how it relates to my So You Want to Rust the Linux Kernel? blog series that I started in 2021, and of which this post is a member. I will leave more complete coverage of a great many interesting sessions to others (for example, here, here, and here), who are probably better versed in Rust than am I in any case.

Device Communication and Memory Ordering

Andreas Hindborg talked about his work on a Rust-language PCI NVMe driver for the Linux kernel x86 architecture. I focus on the driver's interaction with device firmware in main memory. As is often the case, there is a shared structure in normal memory in which the device firmware reports the status of an I/O request. This structure has a special 16-bit field that is used to report that all of the other fields are now filled in, so that the Linux device driver can now safely access them.

This of course requires some sort of memory ordering, both on the part of the device firmware and on the part of the device driver. One straightforward approach is for the driver to use smp_load_acquire() to access that 16-bit field, and only if the returned value indicates that the other fields have been filled in, access those other fields.

To the surprise of several Linux-kernel developers in attendance, Andreas's driver instead does a volatile load of the entire structure, 16-bit field and all. If the 16-bit field indicates that all the fields have been updated by the firmware, it then does a second volatile load of the entire structure and then proceeds to use the values from this second load. Apparently, the performance degradation from these duplicate accesses, though measurable, is quite small. However, given the possibility of larger structures for which the performance degradation might be more profound, Andreas was urged to come up with a more elaborate Rust construction that would permit separate access to the 16-bit field, thus avoiding the redundant memory traffic.

In a subsequent conversation, Andreas noted that the overall performance degradation of this Rust-language prototype compared to the C-language version is at most 2.61%, and that in one case there is a yet-as-unexplained performance increase of 0.67%. Of course, given that both C and Rust are compiled, statically typed, non-garbage-collected, and handled by the same compiler back-end, it should not be a surprise that they achieve similar performance. An interesting question is whether the larger amount of information available to the Rust compiler will ever allow it to consistently provide better performance than does C.

Rust and Linux-Kernel Filesystems

Wedson Almeida Filho described his work on a Rust implementations of portions of the Linux kernel's 9p filesystem. The part of this effort that caught my attention was use of the Rust language's asynchronous facilities to implement some of the required state machines and use of a Rust-language type-aware packet-decoding facility.

So what do these two features have to do with concurrency in general, let alone memory-ordering in particular?

Not much. Sure, call_rcu() is (most of) the async equivalent of synchronize_rcu(), but that should be well-known and thus not particularly newsworthy.

But these two features might have considerable influence over the success or lack thereof for the Rust-for-Linux effort.

In the past, much of the justification for use of Rust in Linux has been memory safety, based on the fact that 70% of the Linux exploits have involved memory unsafety. Suppose that Rust manages to address the entire 70%, as opposed to relegating (say) half of that to Rust-language unsafe code. Then use of Rust might at best provide a factor-of-three improvement.

Of course, a factor of three is quite good in absolute terms. However, in my experience, at least an order of magnitude is usually required to with high probability motivate the kind of large change represented by the C-to-Rust transition. Therefore, in order for me to rate the Rust-for-Linux projects success as highly probable, another factor of three is required.

One possible source of the additional factor of three might come from the Rust-for-Linux community giving neglected drivers some much-needed attention. Perhaps the 9p work would qualify, but it seems unlikely that the NVMe drivers are lacking attention.

Perhaps Rust async-based state machines and type-aware packet decoding will go some ways towards providing more of the additional factor of three. Or perhaps a single factor of three will suffice in this particular case. Who knows?

“Pinning” for the Linux Kernel in Rust

Benno Lossin and Gary Guo reported on their respective efforts to implement pinning in Rust for the Linux kernel (a more detailed report may be found here).

But perhaps you, like me, are wondering just what pinning is and why the Linux kernel needs it.

It turns out that the Rust language allows the compiler great freedom to rearrange data structures by default, similar to what would happen in C-language code if each and every structure was marked with the Linux kernel's __randomize_layout directive. In addition, by default, Rust-language data can be moved at runtime.

This apparently allows additional optimizations, but it is not particularly friendly to Linux-kernel idioms that take addresses of fields in structures. Such idioms include pretty much all of Linux's linked-list code, as well as things like RCU callbacks. And who knows, perhaps this situation helps to explain recent hostility towards linked lists expressed on social media. ;-)

The Rust language accommodates these idioms by allowing data to be pinned, thus preventing the compiler from moving it around.

However, pinning data has consequences, including the fact that such pinning is visible in Rust's type system. This visibility allows Rust compilers to complain when (for example) list_for_each_entry() is passed an unpinned data structure. Such complaints are important, given that passing unpinned data to primitives expecting pinned data would result in memory unsafety. The code managing the pinning is expected to be optimized away, but there are concerns that it would make itself all too apparent during code review and debugging.

Benno's and Gary's approaches were compared and contrasted, but my guess is that additional work will be required to completely solve this problem. Some attendees would like to see pinning addressed at the Rust-language level rather than at the library level, but time will tell what approach is most appropriate.

RCU Use Cases

Although RCU was not an official topic, for some reason it came up frequently during the hallway tracks that I participated in. Which is probably a good thing. ;-)

One question is exactly how the various RCU use cases should be represented in Rust. Rust's atomic-reference-count facility has been put forward, and it is quite possible that, in combination with liberal use of atomics, it will serve for some of the more popular use cases. Wedson suggested that Revocable covers another group of use cases, and at first glance, it appears that Wedson might be quite right. Still others might be handled by wait-wakeup mechanisms. There are still some use cases to be addressed, but some progress is being made.

Rust and Python

Some people suggested that Rust might take over from Python, adding that Rust solves many problems that Python is said to encounter in large-scale programs, and that Rust should prove more ergonomic than Python. The first is hard to argue with, given that Python seems to be most successful in smaller-scale projects that make good use of Python's extensive libraries. The second seems quite unlikely to me, in fact, it is hard for me to imagine that anything about Rust would seem in any way ergonomic to a dyed-in-the-wool Python developer.

One particularly non-ergonomic attribute of current Rust is likely to be its libraries, which last I checked were considerably less extensive than those of Python. Although the software-engineering shortcomings of large-scale Python projects can and have motivated conversion to Rust, it appears to me that smaller Python projects making heavier use of a wider variety of libraries might need more motivation.

Automatic conversion of Python libraries, anyone? ;-)

Conclusions

I found Kangrejos 2022 to be extremely educational and thought-provoking, and am very glad that I was able to attend. I am still uncertain about Rust's prospects in the Linux kernel, but the session provided some good examples of actual work done and problems solved. Perhaps Rust's async and type-sensitive packet-decoding features will help increase its probability of success, and perhaps there are additional Rust features waiting to be applied, for example, use of Rust iterators to simplify lockdep cycle detection. And who knows? Perhaps future versions of Rust will be able to offer consistent performance advantages. Stranger things have happened!

History

September 19, 2022: Changes based on LPC and LWN reports.

September 13, 2022 07:59 AM

September 12, 2022

Linux Plumbers Conference: Linux Plumbers Conference Free Live Streams Available

As with previous years, we have a free live stream available.  You can find the details here:

https://lpc.events/event/16/page/173-watch-live-free

September 12, 2022 08:26 AM

September 02, 2022

Linux Plumbers Conference: LPC 2022 Open for Last-Minute On-Site Registrations

Against all expectations, we have now worked through the entire waitlist, courtesy of some last-minute cancellations due to late-breaking corporate travel restrictions. We are therefore re-opening general registration. So if you can somehow arrange to be in Dublin on September 12-14 at this late date, please register.

Virtual registration never has closed, and is still open.

Either way, we look forward to seeing you there!

September 02, 2022 03:07 PM

September 01, 2022

Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XIX: Concurrent Computational Models

The September 2022 issue of the Communications of the ACM features an entertaining pair of point/counterpoint articles, with William Dally advocating for concurrent computational models that more accurately model both energy costs and latency here and Uzi Vishkin advocating for incremental modifications to the existing Parallel Random Access Machine (PRAM) model here. Some cynics might summarize Dally's article as “please develop software that makes better use of our hardware so that we can sell you more hardware” while other cynics might summarize Vishkin's article as “please keep funding our research project“. In contrast, I claim that both articles are worth a deeper look. The next section looks at Dally's article, the section after that at Vishkin's article, followed by discussion.

TL;DR: In complete consonance with a depressingly large fraction of 21st-century discourse, they are talking right past each other.

On the Model of Computation: Point: We Must Extend Our Model of Computation to Account for Cost and Location

Dally makes a number of good points. First, he notes that past decades have seen a shift from computation being more expensive than memory references to memory references being more expensive than computation, and by a good many orders of magnitude. Second, he points out that in production, constant factors can be more important than asymptotic behavior. Third, he describes situations where hardware caches are not a silver bullet. Fourth, he calls out the necessity of appropriate computational models for performance programming, though to his credit, he does clearly state that not all programming is performance programming. Fifth, he calls out the global need for reduced energy expended on computation, motivated by projected increases in computation-driven energy demand. Sixth and finally, he advocates for a new computational model that builds on PRAM by (1) assigning different costs to computation and to memory references and (1) making memory-reference costs a function of a distance metric based on locations assigned to processing units and memories.

On the Model of Computation: Counterpoint: Parallel Programming Wall and Multicore Software Spiral: Denial Hence Crisis

Vishkin also makes some interesting points. First, he calls out the days of Moore's-Law frequency scaling, calling for the establishing of a similar “free lunch” scaling in terms of numbers of CPUs. He echoes Dally's point about not all programming being performance programming, but argues that in addition to a “memory wall” and an “energy wall”, there is also a “parallel programming wall” because programming multicore systems is in “simply too difficult”. Second, Vishkin calls on hardware vendors to take more account of ease of use when creating computer systems, including systems with GPUs. Third, Vishkin argues that compilers should take up much of the burden of squeezing performance out of hardware. Fourth, he makes several arguments that concurrent systems should adapt to PRAM rather than adapting PRAM to existing hardware. Fifth and finally, he calls for funding for a “multicore software spiral” that he hopes will bring back free-lunch performance increases driven by Moore's Law, but based on increasing numbers of cores rather than increasing clock frequency.

Discussion

Figure 2.3 of Is Parallel Programming Hard, And, If So, What Can You Do About It? (AKA “perfbook”) shows the “Iron Triangle” of concurrent programming. Vishkin focuses primarily at the upper edge of this triangle, which is primarily about productivity. Dally focuses primarily on the lower left edge of this triangle, which is primarily about performance. Except that Vishkin's points about the cost of performance programming are all too valid, which means that defraying the cost of performance programming in the work-a-day world requires very large numbers of users, which is often accomplished by making performance-oriented concurrent software be quite general, thus amortizing its cost over large numbers of use cases and users. This puts much performance-oriented concurrent software at the lower tip of the iron triangle, where performance meets generality.

One could argue that Vishkin's proposed relegation of performance-oriented code to compilers is one way to break the iron triangle of concurrent programming, but this argument fails because compilers are themselves low-level software (thus at the lower tip of the triangle). Worse yet, there have been many attempts to put the full concurrent-programming burden on the compiler (or onto transactional memory), but rarely to good effect. Perhaps SQL is the exception that proves this rule, but this is not an example that Vishkin calls out.

Both Dally and Vishkin correctly point to the ubiquity of multicore systems, so it only reasonable to look at how these systems are handled in practice. After all, it is only fair to check their apparent assumption of “badly”. And Section 2.3 of perfbook lists several straightforward approaches: (1) Using multiple instances of a sequential application, (2) Using the large quantity of existing parallel software, ranging from operating-system kernels to database servers (SQL again!) to numerical software to image-processing libraries to machine-learning libraries, to name but a few of the areas that Python developers could teach us about, and (3) Applying sequential performance optimizations such that single-threaded performance suffices. Those taking door 1 or door 2 will need only a few parallel performance programmers, and it seems to have proven eminently feasible to have those few to use a sufficiently accurate computational model. The remaining users simply invoke the software produced by these parallel performance programmers, and need not worry much about concurrency. The cases where none of these three doors apply are more challenging, but surmounting the occasional challenge is part of the programming game, not just the parallel-programming game.

One additional historical trend is the sharply decreasing cost of concurrent systems, both in terms of purchase price and energy consumption. Where an entry-level late-1980s parallel system cost millions of dollars and consumed kilowatts, an entry-level 2022 system can be had for less than $100. This means that there is no longer an overwhelming economic imperative to extract every ounce of performance from most year-2022 concurrent systems. For example, much smartphone code attains the required performance while running single-threaded, which means that additional performance from parallelism could well be a waste of energy. Instead, the smartphone automatically powers down the unused hardware, thus providing only the performance that is actually needed, and does so in an energy-efficient manner. The occasional heavy-weight application might turn on all the CPUs, but only such applications need the ministrations of parallel performance programmers and complex computational models.

Thus, Dally and Vishkin are talking past each other, describing different types of software that is produced by different types of developers, who can each use whatever computational model is appropriate to the task at hand.

Which raises the question of whether there might be a one-size-fits-all computational model. I have my doubts. In my 45 years of supporting myself and (later) my family by developing software, I have seen many models, frameworks, languages, and libraries come and go. In contrast, to the best of my knowledge, the speed of light and the sizes of the various types of atoms have remained constant. Don't get me wrong, I do hope that the trend towards vertical stacking of electronics within CPU chips will continue, and I also hope that this trend will result in smaller systems that are correspondingly less constrained by speed-of-light and size-of-atoms limitations. However, the laws of physics appear to be exceedingly stubborn things, and we would therefore be well-advised to work together. Dally's attempts to dictate current hardware conveniences to software developers is about as likely to succeed as is Vishkin's attempts to dictate current software conveniences to hardware developers. Both of these approaches are increasingly likely to fail should the trend towards special-purpose hardware continue full force. Software developers cannot always ignore the laws of physics, and by the same token, hardware developers cannot always ignore the large quantity of existing software and the large number of existing software developers. To be sure, in order to continue to break new ground, both software and hardware developers will need to continue to learn new tricks, but many of these tricks will require coordinated hardware/software effort.

Sadly, there is no hint in either Dally's or Vishkin's article of any attempt to learn what the many successful developers of concurrent software actually do. Thus, it is entirely possible that neither Dally's nor Vishkin's preferred concurrent computational model is applicable to actual parallel programming practice. In fact, in my experience, developers often use the actual hardware and software as the model, driving their development and optimization efforts from actual measurements and from intuitions built up from those measurements. Some might look down on this approach as being both non-portable and non-mathematical, but portability is often achieved simply by testing on a variety of hardware, courtesy of the relatively low prices of modern computer systems. As for mathematics, those of us who appreciate mathematics would do well to admit that it is often much more efficient to let the objective universe carry out the mathematical calculations in its normal implicit and ineffable manner, especially given the complexity of present day hardware and software.

Circling back to the cynics, there are no doubt some cynics that would summarize this blog post as “please download and read my book”. Except that in contrast to the hypothesized cynical summarization of Dally's and Vishkin's articles, you can download and read my book free of charge. Which is by design, given that the whole point is to promulgate information. Cue cynics calling out which of the three of us is the greater fool! ;-)

September 01, 2022 03:06 PM

August 31, 2022

Linux Plumbers Conference: LPC 2022 Evening Events Announcement

Linux Plumbers Conference 2022 will have evening events on Monday September 12 from 19:30 to 22:30 and on Wednesday September 14, again from 19:30 to 22:30. Tuesday is on your own, and we anticipate that a fair number of the people registered both for LPC and OSS EU might choose to attend their evening event on that day. Looking forward to seeing you all in Dublin the week after this coming one!

August 31, 2022 09:49 AM

August 12, 2022

Paul E. Mc Kenney: Stupid SMP Tricks: A Review of Locking Engineering Principles and Hierarchy

Daniel Vetter put together a pair of intriguing blog posts entitled Locking Engineering Principles and Locking Engineering Hierarchy. These appear to be an attempt to establish a set of GPU-wide or perhaps even driver-tree-wide concurrency coding conventions.

Which would normally be none of my business. After all, to establish such conventions, Daniel needs to negotiate with the driver subsystem's developers and maintainers, and I am neither. Except that he did call me out on Twitter on this topic. So here I am, as promised, offering color commentary and the occasional suggestion for improvement, both of Daniel's proposal and of the kernel itself. The following sections review his two posts, and then summarize and amplify suggestions for improvement.

Locking Engineering Principles


Make it Dumb” should be at least somewhat uncontroversial, as this is quite similar to a rather more flamboyant sound bite from my 1970s Mechanical Engineering university coursework. Kernighan's Law asserting that debugging is twice as hard as coding should also be a caution.

This leads us to “Make it Correct”, which many might argue should come first in the list. After all, if correctness is not a higher priority than dumbness (or simplicity, if you prefer), then the do-nothing null solution would always be preferred. And to his credit, Daniel does explicitly state that “simple doesn't necessarily mean correct”.

It is hard to argue with Daniel's choosing to give lockdep place of pride, especially given how many times it has saved me over the years, and not just from deadlock. I never have felt the need to teach lockdep RCU's locking rules, but then again, RCU is much more monolithic than is the GPU subsystem. Perhaps if RCU's locking becomes more ornate, I would also feel the need to acquire RCU's key locks in the intended order at boot time. Give or take the fact that much of RCU can be used very early, even before the call to rcu_init().

The validation point is also a good one, with Daniel calling out might_lock(), might_sleep(), and might_alloc(). I go further and add lockdep_assert_irqs_disabled(), lockdep_assert_irqs_enabled(), and lockdep_assert_held(), but perhaps these additional functions are needed in low-level code like RCU more than they are in GPU drivers.

Daniel's admonishments to avoid reinventing various synchronization wheels of course comprise excellent common-case advice. And if you believe that your code is the uncommon case, you are most likely mistaken. Furthermore, it is hard to argue with “pick the simplest lock that works”.

It is hard to argue with the overall message of “Make it Fast”, especially the bit about the pointlessness of crashing faster. However, a minimum level of speed is absolutely required. For a 1977 example, an old school project required keeping up with the display. If the code failed to keep up with the display, that code was useless. More recent examples include deep sub-second request-response latency requirements in many Internet datacenters. In other words, up to a point, performance is not simply a “Make it Fast” issue, but also a “Make it Correct” issue. On the other hand, once the code meets its performance requirements, any further performance improvements instead falls into the lower-priority “Make it Fast” bucket.

Except that things are not always quite so simple. Not all performance improvements can be optimized into existence late in the game. In fact, if your software's performance requirements are expected to tax your system's resources, it will be necessary to make performance a first-class consideration all the way up to and including the software-architecture level, as discussed here. And, to his credit, Daniel hints at this when he notes that “the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts”. He also helpfully calls out io_uring as one avenue for radical updates.

The “Protect Data, not Code” is good general advice and goes back to Jack Inman's 1985 USENIX paper entitled “Implementing Loosely Coupled Functions on Tightly Coupled Engines”, if not further. However, there are times where what needs to be locked is not data, but perhaps a particular set of state transitions, or maybe a rarely accessed state, which might be best represented by the code for that state. That said, there can be no denying the big-kernel-lock hazards of excessive reliance on code locking. Furthermore, I have only rarely needed abstract non-data locking.

Nevertheless, a body of code as large as the full set of Linux-kernel device drivers should be expected to have a large number of exceptions to the “Protect Data” rule of thumb, reliable though that rule has proven in my own experience. The usual way of handling this is a waiver process. After all, given that every set of safety-critical coding guidelines I have seen has a waiver process, it only seems reasonable that device-driver concurrency coding conventions should also involve a waiver process. But that decision belongs to the device-driver developers and maintainers, not with me.

Locking Engineering Hierarchy


Most of the Level 0: No Locking section should be uncontroversial: Do things the easy way, at least when the easy way works. This approach goes back decades, for but one example, single-threaded user-interface code delegating concurrency to a database management system. And it should be well-understood that complicated things can be made easy (or at least easier) through use of carefully constructed and time-tested APIs, for example, the queue_work() family of APIs called out in this section. One important question for all such APIs is this: “Can someone with access to only the kernel source tree and a browser quickly find and learn how to use the given API?” After all, people are able to properly use the APIs they see elsewhere if they can quickly and easily learn about them.

And yes, given Daniel's comments regarding struct dma_fence later in this section, the answer for SLAB_TYPESAFE_BY_RCU appears to be “no” (but see Documentation/RCU/rculist_nulls.rst). The trick with SLAB_TYPESAFE_BY_RCU is that readers must validate the allocation. A common way of doing this is to have a reference count that is set to the value one immediately after obtaining the object from kmem_struct_alloc() and set to the value zero immediately before passing the object to kmem_struct_free(). And yes, I have a patch queued to fix the misleading text in Documentation/RCU/whatisRCU.rst, and I apologize to anyone who might have attempted to use locking without a reference count. But please also let the record show that there was no bug report.

A reader attempting to obtain a new lookup-based reference to a SLAB_TYPESAFE_BY_RCU object must do something like this:

  1. atomic_inc_not_zero(), which will return false and not do the increment if initial value was zero. Presumably dma_fence_get_rcu() handles this for struct dma_fence.
  2. If the value returned above was false, pretend that the lookup failed. Otherwise, continue with the following steps.
  3. Check the identity of the object. If this turns out to not be the desired object, release the reference (cleaning up if needed) and pretend that the lookup failed. Otherwise, continue. Presumably dma_fence_get_rcu_safe() handles this for struct dma_fence, in combination with its call to dma_fence_put().
  4. Use the object.
  5. Release the reference, cleaning up if needed.
This of course underscores Daniel's point (leaving aside the snark), which is that you should not use SLAB_TYPESAFE_BY_RCU unless you really need to, and even then you should make sure that you know how to use it properly.

But just when do you need to use SLAB_TYPESAFE_BY_RCU? One example is when you need RCU protection on such a hot fastpath that you cannot tolerate RCU-freed objects becoming cold in the CPU caches due to RCU's grace-period delays. Another example is where objects are being allocated and freed at such a high rate that if each and every kmem_cache_free() was subject to grace-period delays, excessive memory would be tied up waiting for grace periods, perhaps even resulting in out-of-memory (OOM) situations.

In addition, carefully constructed semantics are required. Semantics based purely on ordering tend to result in excessive conceptual complexity and thus in confusion. More appropriate semantics tend to be conditional, for example: (1) Objects that remain in existence during the full extent of the lookup are guaranteed to be found, (2) Objects that do not exist at any time during the full extent of the lookup are guaranteed not to be found, and (3) Objects that exist for only a portion of the lookup might or might not be found.

Does struct dma_fence need SLAB_TYPESAFE_BY_RCU? Is the code handling struct dma_fence doing the right thing? For the moment, I will just say that: (1) The existence of dma_fence_get_rcu_safe() gives me at least some hope, (2) Having two different slab allocators stored into different static variables having the same name is a bit code-reader-unfriendly, and (3) The use of SLAB_TYPESAFE_BY_RCU for a structure that contains an rcu_head structure is a bit unconventional, but perhaps these structures are sometimes obtained from kmalloc() instead of from kmem_struct_alloc().

One thing this section missed (or perhaps intentionally omitted): If you do need memory barriers, you are almost always better off using smp_store_release() and smp_load_acquire() than the old-style smp_wmb() and smp_rmb().

The Level 1: Big Dumb Lock certainly summarizes the pain of finding the right scope for your locks. Despite Daniel's suggesting that lockless tricks are almost always the wrong approach, such tricks are in common use. I would certainly agree that randomly hacking lockless tricks into your code will almost always result in disaster. In fact, this process will use your own intelligence against you: The smarter you think you are, the deeper a hole you will have dug for yourself before you realize that you are in trouble.

Therefore, if you think that you need a lockless trick, first actually measure the performance. Second, if there really is a performance issue, do the work to figure out which code and data is actually responsible, because your blind guesses will often be wrong. Third, read documentation, look at code, and talk to people to learn and fully understand how this problem has been solved in the past, then carefully apply those solutions. Fourth, if there is no solution, review your overall design, because an ounce of proper partitioning is worth some tons of clever lockless code. Fifth, if you must use a new solution, verify it beyond all reason. The code in kernel/rcu/rcutorture.c will give you some idea of the level of effort required.

The Level 2: Fine-grained Locking sections provide some excellent advice, guidance, and cautionary tales. Yes, lockdep is a great thing, but there are deadlocks that it does not detect, so some caution is still required.

The yellow-highlighted Locking Antipattern: Confusing Object Lifetime and Data Consistency describes how holding locks across non-memory-style barrier functions, that is, things like flush_work() as opposed to things like smp_mb(), can cause problems, up to and including deadlock. In contrast, whatever other problems memory-barrier functions such as smp_mb() cause, it does take some creativity to add them to a lock-based critical section so as to cause them to participate in a deadlock cycle. I would not consider flush_work() to be a memory barrier in disguise, although it is quite true that a correct high-performance implementation of flush_work() will require careful memory ordering.

I would have expected a rule such as “Don't hold a lock across flush_work() that is acquired in any corresponding workqueue handler”, but perhaps Daniel is worried about future deadlocks as well as present-day deadlocks. After all, if you do hold a lock across a call to flush_work(), perhaps there will soon be some compelling reason to acquire that lock in a workqueue handler, at which point it is game over due to deadlock.

This issue is of course by no means limited to workqueues. For but one example, it is quite possible to generate similar deadlocks by holding spinlocks across calls to del_timer_sync() that are acquired within timer handlers.

And that is why both flush_work() and del_timer_sync() tell lockdep what they are up to. For example, flush_work() invokes __flush_work() which in turn invokes lock_map_acquire() and lock_map_release() on a fictitious lock, and this same fictitious lock is acquired and released by process_one_work(). Thus, if you acquire a lock in a workqueue handler that is held across a corresponding flush_work(), lockdep will complain, as shown in the 2018 commit 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing") by Johannes Berg. Of course, del_timer_sync() uses this same trick, as shown in the 2009 commit 6f2b9b9a9d75 ("timer: implement lockdep deadlock detection"), also by Johannes Berg.

Nevertheless, deadlocks involving flush_work() and workqueue handlers can be subtle. It is therefore worth investing some up-front effort to avoid them.

The orange-highlighted Level 2.5: Splitting Locks for Performance Reasons discusses splitting locks. As the section says, there are complications. For one thing, lock acquisitions are anything but free, having overheads of hundreds or thousands of instructions at best, which means that adding additional levels of finer-grained locks can actually slow things down. Therefore, Daniel's point about prioritizing architectural restructuring over low-level synchronization changes is an extremely good one, and one that is all too often ignored.

As Daniel says, when moving to finer-grained locking, it is necessary to avoid increasing the common-case number of locks being acquired. As one might guess from the tone of this section, this is not necessarily easy. Daniel suggests reader-writer locking, which can work well in some cases, but suffers from performance and scalability limitations, especially in situations involving short read-side critical sections. The fact that readers and writers exclude each other can also result in latency/response-time issues. But again, reader-writer locking can be a good choice in some cases.

The last paragraph is an excellent cautionary tale. Take it from Daniel: Never let userspace dictate the order in which the kernel acquires locks. Otherwise, you, too, might find yourself using wait/wound mutexes. Worse yet, if you cannot set up an two-phase locking discipline in which all required locks are acquired before any work is done, you might find yourself writing deadlock-recovery code, or wounded-mutex recovery code, if you prefer. This recovery code can be surprisingly complex. In fact, one of the motivations for the early 1990s introduction of RCU into DYNIX/ptx was that doing so allowed deletion of many thousands of lines of such recovery code, along with all the yet-undiscovered bugs that code contained.

The red-highlighted Level 3: Lockless Tricks section begins with the ominous sentence “Do not go here wanderer!”

And I agree. After all, if you are using the facilities discussed in this section, namely RCU, atomics, preempt_disable(), local_bh_disable(), local_irq_save(), or the various memory barriers, you had jolly well better not be wandering!!!

Instead, you need to understand what you are getting into and you need to have a map in the form of a principled design and a careful well-validated implementation. And new situations may require additional maps to be made, although there are quite a few well-used maps already in Documentation/rcu and over here. In addition, as Daniel says, algorithmic and architectural fixes can often provide much better results than can lockless tricks applied at low levels of abstraction. Past experience suggests that some of those algorithmic and architectural fixes will involve lockless tricks on fastpaths, but life is like that sometimes.

It is now time to take a look at Daniel's alleged antipatterns.

The “Locking Antipattern: Using RCU” section does have some “interesting” statements:

  1. RCU is said to mix up lifetime and consistency concerns. It is not clear exactly what motivated this statement, but this view is often a symptom of a strictly temporal view of RCU. To use RCU effectively, one must instead take a combined spatio-temporal view, as described here and here.
  2. rcu_read_lock() is said to provide both a read-side critical section and to extend the lifetime of any RCU-protected object. This is true in common use cases because the whole purpose of an RCU read-side critical section is to ensure that any RCU-protected object that was in existence at any time during that critical section remains in existence up to the end of that critical section. It is not clear what distinction Daniel is attempting to draw here. Perhaps he likes the determinism provided by full mutual exclusion. If so, never forget that the laws of physics dictate that such determinism is often surprisingly expensive.
  3. The next paragraph is a surprising assertion that RCU readers' deadlock immunity is a bad thing! Never forget that although RCU can be used to replace reader-writer locking in a great many situations, RCU is not reader-writer locking. Which is a good thing from a performance and scalability viewpoint as well as from a deadlock-immunity viewpoint.
  4. In a properly designed system, locks and RCU are not “papering over” lifetime issues, but instead properly managing object lifetimes. And if your system is not properly designed, then any and all facilities, concurrent or not, are weapons-grade dangerous. Again, perhaps more maps are needed to help those who might otherwise wander into improper designs. Or perhaps existing maps need to be more consistently used.
  5. RCU is said to practically force you to deal with “zombie objects”. Twitter discussions with Daniel determined that such “zombie objects” can no longer be looked up, but are still in use. Which means that many use cases of good old reference counting also force you to deal with zombie objects: After all, removing an object from its search structure does not invalidate the references already held on that object. But it is quite possible to avoid zombie objects for both RCU and for reference counting through use of per-object locks, as is done in the Linux-kernel code that maps from a System-V semaphore ID to the corresponding in-kernel data structure, first described in Section of this paper. In short, if you don't like zombies, there are simple RCU use cases that avoid them. You get RCU's speed and deadlock immunity within the search structure, but full ordering, consistency, and zombie-freedom within the searched-for object.
  6. The last bullet in this section seems to argue that people should upgrade from RCU read-side critical sections to locking or reference counting as quickly as possible. Of course, such locking or reference counting can add problematic atomic-operation and cache-miss overhead to those critical sections, so specific examples would be helpful. One non-GPU example is the System-V semaphore ID example mentioned above, where immediate lock acquisition is necessary to provide the necessary System-V semaphore semantics.
  7. On freely using RCU, again, proper design is required, and not just with RCU. One can only sympathize with a driver dying in synchronize_rcu(). Presumably Daniel means that one of the driver's tasks hung in synchronize_rcu(), in which case there should have been an RCU CPU stall warning message which would point out what CPU or task was stuck in an RCU read-side critical section. It is quite easy to believe that diagnostics could be improved, both within RCU and elsewhere, but if this was intended to be a bug report, it is woefully insufficient.
  8. It is good to see that Daniel found at least one RCU use case that he likes (or at least doesn't hate too intensely), namely xarray lookups combined with kref_get_unless_zero() and kfree_rcu(). Perhaps this is a start, especially if the code following that kref_get_unless_zero() invocation does additional atomic operations on the xarray object, which would hide at least some of the kref_get_unless_zero() overhead.

The following table shows how intensively the RCU API is used by various v5.19 kernel subsystems:

Subsystem   Uses          LoC  Uses/KLoC
---------   ----   ----------  ---------
ipc           91        9,822       9.26
virt          68        9,013       7.54
net         7457    1,221,681       6.10
security     599      107,622       5.57
kernel      1796      423,581       4.24
mm           324      170,176       1.90
init           8        4,236       1.89
block        108       65,291       1.65
lib          319      214,291       1.49
fs          1416    1,470,567       0.96
include      836    1,167,274       0.72
drivers     5596   20,861,746       0.27
arch         546    2,189,975       0.25
crypto         6      102,307       0.06
sound         21    1,378,546       0.02
---------   ----   ----------  ---------
Total      19191   29,396,128       0.65

As you can see, the drivers subsystem has the second-highest total number of RCU API uses, but it also has by far the largest number of lines of code. As a result, the RCU usage intensity in the drivers subsystem is quite low, at about 27 RCU uses per 100,000 lines of code. But this view is skewed, as can be seen by looking more deeply within the drivers subsystem, but leaving out (aside from in the Total line) and drivers containing fewer than 100 instances of the RCU API:

Subsystem           Uses          LoC  Uses/KLoC
---------           ----   ----------  ---------
drivers/target       293       62,532       4.69
drivers/block        355       97,265       3.65
drivers/md           334      147,881       2.26
drivers/infiniband   548      434,430       1.26
drivers/net         2607    4,381,955       0.59
drivers/staging      114      618,763       0.18
drivers/scsi         150    1,011,146       0.15
drivers/gpu          399    5,753,571       0.07
---------           ----   ----------  ---------
Total               5596   20,861,746       0.27

The drivers/infiniband and drivers/net subtrees account for more than half of the RCU usage in the Linux kernel's drivers, and could be argued to be more about networking than about generic device drivers. And although drivers/gpu comes in third in terms of RCU usage, it also comes in first in terms of lines of code, making it one of the least intense users of RCU. So one could argue that Daniel already has his wish, at least within the confines of drivers/gpu.

This data suggests that Daniel might usefully consult with the networking folks in order to gain valuable guidelines on the use of RCU and perhaps atomics and memory barriers as well. On the other hand, it is quite possible that such consultations actually caused some of Daniel's frustration. You see, a system implementing networking must track the state of the external network, and it can take many seconds or even minutes for changes in that external state to propagate to that system. Therefore, expensive synchronization within that system is less useful than one might think: No matter how many locks and mutexes that system acquires, it cannot prevent external networking hardware from being reconfigured or even from failing completely.

Moving to drivers/gpu, in theory, if there are state changes initiated by the GPU hardware without full-system synchronization, networking RCU usage patterns should apply directly to GPU drivers. In contrast, there might be significant benefits from more tightly synchronizing state changes initiated by the system. Again, perhaps the aforementioned System-V semaphore ID example can help in such cases. But to be fair, given Daniel's preference for immediately acquiring a reference to RCU-protected data, perhaps drivers/gpu code is already taking this approach.

The “Locking Antipattern: Atomics” section is best taken point by point:

  1. For good or for ill, the ordering (or lack thereof) of Linux kernel atomics predates C++ atomics by about a decade. One big advantage of the Linux-kernel approach is that all accesses to atomic*_t variables are marked. This is a great improvement over C++, where a sequentially consistent load or store looks just like a normal access to a local variable, which can cause a surprising amount of confusion.
  2. Please please please do not “sprinkle” memory barriers over the code!!! Instead, actually design the required communication and ordering, and then use the best primitives for the job. For example, instead of “smp_mb__before_atomic(); atomic_inc(&myctr); smp_mb__after_atomic();”, maybe you should consider invoking atomic_inc_return(), discarding the return value if it is not needed.
  3. Indeed, some atomic functions operate on non-atomic*_t variables. But in many cases, you can use their atomic*_t counterparts. For example, instead of READ_ONCE(), atomic_read(). Instead of WRITE_ONCE(), atomic_set(). Instead of cmpxchg(), atomic_cmpxchg(). Instead of set_bit(), in many situations, atomic_or(). On the other hand, I will make no attempt to defend the naming of set_bit() and __set_bit().

To Daniel's discussion of “unnecessary trap doors”, I can only agree that reinventing read-write semaphores is a very bad thing.

I will also make no attempt to defend ill-thought-out hacks involving weak references or RCU. Sure, a quick fix to get your production system running is all well and good, but the real fix should be properly designed.

And Daniel makes an excellent argument when he says that if a counter can be protected by an already held lock, that counter should be implemented using normal C-language accesses to normal integral variables. For those situations where no such lock is at hand, there are a lot of atomic and per-CPU counting examples that can be followed, both in the Linux kernel and in Chapter 5 of “Is Parallel Programming Hard, And, If So, What Can You Do About It?”. Again, why unnecessarily re-invent the wheel?

However, the last paragraph, stating that atomic operations should only be used for locking and synchronization primitives in the core kernel is a bridge too far. After all, a later section allows for memory barriers to be used in libraries (at least driver-hacker-proof libraries), so it seems reasonable that atomic operations can also be used in libraries.

Some help is provided by the executable Linux-kernel memory model (LKMM) in tools/memory-model along with the kernel concurrency sanitizer (KCSAN), which is documented in Documentation/dev-tools/kcsan.rst, but there is no denying that these tools currently require significant expertise. Help notwithstanding, it almost always makes a lot of sense to hide complex operations, including complex operations involving concurrency, behind well-designed APIs.

And help is definitely needed, given that there are more than 10,000 invocations of atomic operations in the drivers tree, more than a thousand of which are in drivers/gpu.

There is not much to say about the “Locking Antipattern: preempt/local_irq/bh_disable() and Friends” section. These primitives are not heavily used in the drivers tree. However, lockdep does have enough understanding of these primitives to diagnose misuse of irq-disabled and bh-disabled spinlocks. Which is a good thing, given that there are some thousands of uses of irq-disabled spinlocks in the drivers tree, along with a good thousand uses of bh-disabled spinlocks.

The “Locking Antipattern: Memory Barriers” suggests that memory barriers should be packaged in a library or core kernel service, which is in the common case excellent advice. Again, the executable LKMM and KCSAN can help, but again these tools currently require some expertise. I was amused by Daniel's “I love to read an article or watch a talk by Paul McKenney on RCU like anyone else to get my brain fried properly”, and I am glad that my articles and talks provide at least a little entertainment value, if nothing else. ;-)

Summing up my view of these two blog posts, Daniel recommends that most driver code avoid concurrency entirely. Failing that, he recommends sticking to certain locking and reference-counting use cases, albeit including a rather complex acquire-locks-in-any-order use case. For the most part, he recommends against atomics, RCU, and memory barriers, with a very few exceptions.

For me, reading these blog posts induced great nostalgia, taking me back to my early 1990s days at Sequent, when the guidelines were quite similar, give or take a large number of non-atomically manipulated per-CPU counters. But a few short years later, many Sequent engineers were using atomic operations, and yes, a few were even using RCU. Including one RCU use case in a device driver, though that use case could instead be served by the Linux kernel's synchronize_irq() primitive.

Still, the heavy use of RCU and (even more so) of atomics within the drivers tree, combined with Daniel's distaste for these primitive, suggests that some sort of change might be in order.

Can We Fix This?

Of course we can!!!

But will a given fix actually improve the situation? That is the question.

Reading through this reminded me that I need to take another pass through the RCU documentation. I have queued a commit to fix the misleading wording for SLAB_TYPESAFE_BY_RCU on the -rcu tree: 08f8f09b2a9e ("doc: SLAB_TYPESAFE_BY_RCU uses cannot rely on spinlocks"). I also expect to improve the documentation of reference counting and its relation to SLAB_TYPESAFE_BY_RCU, and will likely find a number of other things in need of improvement.

This is also as good a time as any to announce that I will be holding an an RCU Office Hours birds-of-a-feather session at the 2022 Linux Plumbers Conference, in case that is helpful.

However, the RCU documentation must of necessity remain fairly high level. And to that end, GPU-specific advice about use of xarray, kref_get_unless_zero(), and kfree_rcu() really needs to be Documentation/gpu as opposed to Documentation/RCU. This would allow that advice to be much more specific and thus much more helpful to the GPU developers and maintainers. Alternatively, perhaps improved GPU-related APIs are required in order to confine concurrency to functions designed for that purpose. This alternative approach has the benefit of allowing GPU device drivers to focus more on GPU-specific issues and less on concurrency. On the other hand, given that the GPU drivers comprise some millions of lines of code, this might be easier said than done.

It is all too easy to believe that it is possible to improve the documentation for a number of other facilities that Daniel called on the carpet. At the same time, it is important to remember that the intent of documentation is communication, and that the optimal mode of communication depends on the target audience. At its best, documentation builds a bridge from where the target audience currently is to where they need to go. Which means the broader the target audience, the more difficult it is to construct that bridge. Which in turn means that a given subsystem likely need usage advice and coding standards specific to that subsystem. One size does not fit all.

The SLAB_TYPESAFE_BY_RCU facility was called on the carpet, and perhaps understandably so. Would it help if SLAB_TYPESAFE_BY_RCU were to be changed so as to allow locks to be acquired on objects that might at any time be passed to kmem_cache_free() and then reallocated via kmem_cache_alloc()? In theory, this is easy: Just have the kmem_cache in question zero pages allocated from the system before splitting them up into objects. Then a given object could have an “initialized” flag, and if that flag was cleared in an object just returned from kmem_struct_alloc(), then and only then would that lock be initialized. This would allow a lock to be acquired (under rcu_read_lock(), of course) on a freed object, and would allow a lock to be held on an object despite its being passed to kmem_struct_free() and returned from kmem_struct_alloc() in the meantime.

In practice, some existing SLAB_TYPESAFE_BY_RCU users might not be happy with the added overhead of page zeroing, so this might require an additional GFP_ flag to allow zeroing on a kmem_cache-by-kmem_cache basis. However, the first question is “Would this really help?”, and answering that question requires feedback developers and maintainers who are actually using SLAB_TYPESAFE_BY_RCU.

Some might argue that the device drivers should all be rewritten in Rust, and cynics might argue that Daniel wrote his pair of blog posts with exactly that thought in mind. I am happy to let those cynics make that argument, especially given that I have already held forth on Linux-kernel concurrency in Rust here. However, a possible desire to rust Linux-kernel device drivers does not explain Daniel's distaste for what he calls “zombie objects” because Rust is in fact quite capable of maintaining references to objects that have been removed from their search structure.

Summary and Conclusions

As noted earlier, reading these blog posts induced great nostalgia, taking me back to my time at Sequent in the early 1990s. A lot has happened in the ensuing three decades, including habitual use of locking in across the industry, and sometimes even correct use of locking.

But will generic developers ever be able to handle more esoteric techniques involving atomic operations and RCU?

I believe that the answer to this question is “yes”, as laid out in my 2012 paper Beyond Expert-Only Parallel Programming?. As in the past, tooling (including carefully designed APIs), economic forces (including continued ubiquitous multi-core systems), and acculturation (assisted by a vast quantity of open-source software) have done the trick, and I see no reason why these trends will not continue.

But what happens in drivers/gpu is up to the GPU developers and maintainers!

References


Atomic Operations and Memory Barriers, though more description than reference:

  1. Documentation/atomic_t.txt
  2. Documentation/atomic_bitops.txt
  3. Documentation/memory-barriers.txt
  4. Sometimes the docbook header is on the x86 arch_ function, for example, arch_atomic_inc() in arch/x86/include/asm/atomic.h rather than atomic_inc().
  5. The LKMM references below can also be helpful.

Kernel Concurrency Sanitizer (KCSAN):

  1. Documentation/dev-tools/kcsan.rst
  2. Finding race conditions with KCSAN
  3. Concurrency bugs should fear the big bad data-race detector (part 1)
  4. Concurrency bugs should fear the big bad data-race detector (part 2)
  5. Detecting missing memory barriers with KCSAN

Linux-Kernel Memory Model (LKMM):

  1. tools/memory-model, including its Documentation subdirectory.
  2. A formal kernel memory-ordering model (part 1)
  3. A formal kernel memory-ordering model (part 2)
  4. Who's afraid of a big bad optimizing compiler?
  5. Calibrating your fear of big bad optimizing compilers
  6. Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel (non-paywalled extended )edition

Read-Copy Update (RCU):

  1. Documentation/RCU
  2. The RCU API, 2019 edition
  3. Unraveling RCU-Usage Mysteries (Fundamentals)
  4. Unraveling RCU-Usage Mysteries (Additional Use Cases)
  5. Sections 9.5 and 9.6 of “Is Parallel Programming Hard, And, If So, What Can You Do About It?
  6. Many other references, some of which are listed here.

August 12, 2022 08:06 PM

August 03, 2022

Daniel Vetter: Locking Engineering Hierarchy

The first part of this series covered principles of locking engineering. This part goes through a pile of locking patterns and designs, from most favourable and easiest to adjust and hence resulting in a long term maintainable code base, to the least favourable since hardest to ensure it works correctly and stays that way while the code evolves. For convenience even color coded, with the dangerous levels getting progressively more crispy red indicating how close to the burning fire you are! Think of it as Dante’s Inferno, but for locking.

As a reminder from the intro of the first part, with locking engineering I mean the art of ensuring that there’s sufficient consistency in reading and manipulating data structures, and not just sprinkling mutex_lock() and mutex_unlock() calls around until the result looks reasonable and lockdep has gone quiet.

Level 0: No Locking

The dumbest possible locking is no need for locking at all. Which does not mean extremely clever lockless tricks for a “look, no calls to mutex_lock()” feint, but an overall design which guarantees that any writers cannot exist concurrently with any other access at all. This removes the need for consistency guarantees while accessing an object at the architectural level.

There’s a few standard patterns to achieve locking nirvana.

Locking Pattern: Immutable State

The lesson in graphics API design over the last decade is that immutable state objects rule, because they both lead to simpler driver stacks and also better performance. Vulkan instead of the OpenGL with it’s ridiculous amount of mutable and implicit state is the big example, but atomic instead of legacy kernel mode setting or Wayland instead of the X11 are also built on the assumption that immutable state objects are a Great Thing (tm).

The usual pattern is:

  1. A single thread fully constructs an object, including any sub structures and anything else you might need. Often subsystems provide initialization helpers for objects that driver can subclass through embedding, e.g. drm_connector_init() for initializing a kernel modesetting output object. Additional functions can set up different or optional aspects of an object, e.g. drm_connector_attach_encoder() sets up the invariant links to the preceding element in a kernel modesetting display chain.

  2. The fully formed object is published to the world, in the kernel this often happens by registering it under some kind of identifier. This could be a global identifier like register_chrdev() for character devices, something attached to a device like registering a new display output on a driver with drm_connector_register() or some struct xarray in the file private structure. Note that this step here requires memory barriers of some sort. If you hand roll the data structure like a list or lookup tree with your own fancy locking scheme instead of using existing standard interfaces you are on a fast path to level 3 locking hell. Don’t do that.

  3. From this point on there are no consistency issues anymore and all threads can access the object without any locking.

Locking Pattern: Single Owner

Another way to ensure there’s no concurrent access is by only allowing one thread to own an object at a given point of time, and have well defined handover points if that is necessary.

Most often this pattern is used for asynchronously processing a userspace request:

  1. The syscall or IOCTL constructs an object with sufficient information to process the userspace’s request.

  2. That object is handed over to a worker thread with e.g. queue_work().

  3. The worker thread is now the sole owner of that piece of memory and can do whatever it feels like with it.

Again the second step requires memory barriers, which means if you hand roll your own lockless queue you’re firmly in level 3 territory and won’t get rid of the burned in red hot afterglow in your retina for quite some time. Use standard interfaces like struct completion or even better libraries like the workqueue subsystem here.

Note that the handover can also be chained or split up, e.g. for a nonblocking atomic kernel modeset requests there’s three asynchronous processing pieces involved:

Locking Pattern: Reference Counting

Users generally don’t appreciate if the kernel leaks memory too much, and cleaning up objects by freeing their memory and releasing any other resources tends to be an operation of the very much mutable kind. Reference counting to the rescue!

Note that this scheme falls apart when released objects are put into some kind of cache and can be resurrected. In that case your cleanup code needs to somehow deal with these zombies and ensure there’s no confusion, and vice versa any code that resurrects a zombie needs to deal the wooden spikes the cleanup code might throw at an inopportune time. The worst example of this kind is SLAB_TYPESAFE_BY_RCU, where readers that are only protected with rcu_read_lock() may need to deal with objects potentially going through simultaneous zombie resurrections, potentially multiple times, while the readers are trying to figure out what is going on. This generally leads to lots of sorrow, wailing and ill-tempered maintainers, as the GPU subsystem has and continues to experience with struct dma_fence.

Hence use standard reference counting, and don’t be tempted by the siren of trying to implement clever caching of any kind.

Level 1: Big Dumb Lock

It would be great if nothing ever changes, but sometimes that cannot be avoided. At that point you add a single lock for each logical object. An object could be just a single structure, but it could also be multiple structures that are dynamically allocated and freed under the protection of that single big dumb lock, e.g. when managing GPU virtual address space with different mappings.

The tricky part is figuring out what is an object to ensure that your lock is neither too big nor too small:

Ideally, your big dumb lock would always be right-sized everytime the requirements on the datastructures changes. But working magic 8 balls tend to be on short supply, and you tend to only find out that your guess was wrong when the pain of the lock being too big or too small is already substantial. The inherent struggles of resizing a lock as the code evolves then keeps pushing you further away from the optimum instead of closer. Good luck!

Level 2: Fine-grained Locking

It would be great if this is all the locking we ever need, but sometimes there’s functional reasons that force us to go beyond the single lock for each logical object approach. This section will go through a few of the common examples, and the usual pitfalls to avoid.

But before we delve into details remember to document in kerneldoc with the inline per-member kerneldoc comment style once you go beyond a simple single lock per object approach. It’s the best place for future bug fixers and reviewers - meaning you - to find the rules for how at least things were meant to work.

Locking Pattern: Object Tracking Lists

One of the main duties of the kernel is to track everything, least to make sure there’s no leaks and everything gets cleaned up again. But there’s other reasons to maintain lists (or other container structures) of objects.

Now sometimes there’s a clear parent object, with its own lock, which could also protect the list with all the objects, but this does not always work:

Simplicity should still win, therefore only add a (nested) lock for lists or other container objects if there’s really no suitable object lock that could do the job instead.

Locking Pattern: Interrupt Handler State

Another example that requires nested locking is when part of the object is manipulated from a different execution context. The prime example here are interrupt handlers. Interrupt handlers can only use interrupt safe spinlocks, but often the main object lock must be a mutex to allow sleeping or allocating memory or nesting with other mutexes.

Hence the need for a nested spinlock to just protect the object state shared between the interrupt handler and code running from process context. Process context should generally only acquire the spinlock nested with the main object lock, to avoid surprises and limit any concurrency issues to just the singleton interrupt handler.

Locking Pattern: Async Processing

Very similar to the interrupt handler problems is coordination with async workers. The best approach is the single owner pattern, but often state needs to be shared between the worker and other threads operating on the same object.

The naive approach of just using a single object lock tends to deadlock:

start_processing(obj)
{
	mutex_lock(&obj->lock);
	/* set up the data for the async work */;
	schedule_work(&obj->work);
	mutex_unlock(&obj->lock);
}

stop_processing(obj)
{
	mutex_lock(&obj->lock);
	/* clear the data for the async work */;
	cancel_work_sync(&obj->work);
	mutex_unlock(&obj->lock);
}

work_fn(work)
{
	obj = container_of(work, work);

	mutex_lock(&obj->lock);
	/* do some processing */
	mutex_unlock(&obj->lock);
}

Do not worry if you don’t spot the deadlock, because it is a cross-release dependency between the entire work_fn() and cancel_work_sync() and these are a lot trickier to spot. Since cross-release dependencies are a entire huge topic on themselves I won’t go into more details, a good starting point is this LWN article.

There’s a bunch of variations of this theme, with problems in different scenarios:

Like with interrupt handlers the clean solution tends to be an additional nested lock which protects just the mutable state shared with the work function and nests within the main object lock. That way work can be cancelled while the main object lock is held, which avoids a ton of races. But without holding the sublock that work_fn() needs, which avoids the deadlock.

Note that in some cases the superior lock doesn’t need to exist, e.g. struct drm_connector_state is protected by the single owner pattern, but drivers might have some need for some further decoupled asynchronous processing, e.g. for handling the content protect or link training machinery. In that case only the sublock for the mutable driver private state shared with the worker exists.

Locking Pattern: Weak References

Reference counting is a great pattern, but sometimes you need be able to store pointers without them holding a full reference. This could be for lookup caches, or because your userspace API mandates that some references do not keep the object alive - we’ve unfortunately committed that mistake in the GPU world. Or because holding full references everywhere would lead to unreclaimable references loops and there’s no better way to break them than to make some of the references weak. In languages with a garbage collector weak references are implemented by the runtime, and so no real worry. But in the kernel the concept has to be implemented by hand.

Since weak references are such a standard pattern struct kref has ready-made support for them. The simple approach is using kref_put_mutex() with the same lock that also protects the structure containing the weak reference. This guarantees that either the weak reference pointer is gone too, or there is at least somewhere still a strong reference around and it is therefore safe to call kref_get(). But there are some issues with this approach:

The much better approach is using kref_get_unless_zero(), together with a spinlock for your data structure containing the weak reference. This looks especially nifty in combination with struct xarray.

obj_find_in_cache(id)
{
	xa_lock();
	obj = xa_find(id);
	if (!kref_get_unless_zero(&obj->kref))
		obj = NULL;
	xa_unlock();

	return obj;
}

With this all the issues are resolved:

With both together the locking does no longer leak beyond the lookup structure and it’s associated code any more, unlike with kref_put_mutex() and similar approaches. Thankfully kref_get_unless_zero() has become the much more popular approach since it was added 10 years ago!

Locking Antipattern: Confusing Object Lifetime and Data Consistency

We’ve now seen a few examples where the “no locking” patterns from level 0 collide in annoying ways when more locking is added to the point where we seem to violate the principle to protect data, not code. It’s worth to look at this a bit closer, since we can generalize what’s going on here to a fairly high-level antipattern.

The key insight is that the “no locking” patterns all rely on memory barrier primitives in disguise, not classic locks, to synchronize access between multiple threads. In the case of the single owner pattern there might also be blocking semantics involved, when the next owner needs to wait for the previous owner to finish processing first. These are functions like flush_work() or the various wait functions like wait_event() or wait_completion().

Calling these barrier functions while holding locks commonly leads to issues:

For these reasons try as hard as possible to not hold any locks, or as few as feasible, when calling any of these memory barriers in disguise functions used to manage object lifetime or ownership in general. The antipattern here is abusing locks to fix lifetime issues. We have seen two specific instances thus far:

We will see some more, but the antipattern holds in general as a source of troubles.

Level 2.5: Splitting Locks for Performance Reasons

We’ve looked at a pile of functional reasons for complicating the locking design, but sometimes you need to add more fine-grained locking for performance reasons. This is already getting dangerous, because it’s very tempting to tune some microbenchmark just because we can, or maybe delude ourselves that it will be needed in the future. Therefore only complicate your locking if:

Only then make your future maintenance pain guaranteed worse by applying more tricky locking than the bare minimum necessary for correctness. Still, go with the simplest approach, often converting a lock to its read-write variant is good enough.

Sometimes this isn’t enough, and you actually have to split up a lock into more fine-grained locks to achieve more parallelism and less contention among threads. Note that doing so blindly will backfire because locks are not free. When common operations still have to take most of the locks anyway, even if it’s only for short time and in strict succession, the performance hit on single threaded workloads will not justify any benefit in more threaded use-cases.

Another issue with more fine-grained locking is that often you cannot define a strict nesting hierarchy, or worse might need to take multiple locks of the same object or lock class. I’ve written previously about this specific issue, and more importantly, how to teach lockdep about lock nesting, the bad and the good ways.

One really entertaining story from the GPU subsystem, for bystanders at least, is that we really screwed this up for good by defacto allowing userspace to control the lock order of all the objects involved in an IOCTL. Furthermore disjoint operations should actually proceed without contention. If you ever manage to repeat this feat you can take a look at the wait-wound mutexes. Or if you just want some pretty graphs, LWN has an old article about wait-wound mutexes too.

Level 3: Lockless Tricks

Do not go here wanderer!

Seriously, I have seen a lot of very fancy driver subsystem locking designs, I have not yet found a lot that were actually justified. Because only real world, non-contrived performance issues can ever justify reaching for this level, and in almost all cases algorithmic or architectural fixes yield much better improvements than any kind of (locking) micro-optimization could ever hope for.

Hence this is just a long list of antipatterns, so that people who have not yet a grumpy expression permanently chiseled into their facial structure know when they’re in trouble.

Note that this section isn’t limited to lockless tricks in the academic sense of guaranteed constant overhead forward progress, meaning no spinning or retrying anywhere at all. It’s for everything which doesn’t use standard locks like struct mutex, spinlock_t, struct rw_semaphore, or any of the others provided in the Linux kernel.

Locking Antipattern: Using RCU

Yeah RCU is really awesome and impressive, but it comes at serious costs:

All together all freely using RCU achieves is proving that there really is no bottom on the code maintainability scale. It is not a great day when your driver dies in synchronize_rcu() and lockdep has no idea what’s going on, and I’ve seen such days.

Personally I think in driver subsystem the most that’s still a legit and justified use of RCU is for object lookup with struct xarray and kref_get_unless_zero(), and cleanup handled entirely by kfree_rcu(). Anything more and you’re very likely chasing a rabbit down it’s hole and have not realized it yet.

Locking Antipattern: Atomics

Firstly, Linux atomics have two annoying properties just to start:

Those are a lot of unnecessary trap doors, but the real bad part is what people tend to build with atomic instructions:

In short, unless you’re actually building a new locking or synchronization primitive in the core kernel, you most likely do not want to get seen even looking at atomic operations as an option.

Locking Antipattern: preempt/local_irq/bh_disable() and Friends …

This one is simple: Lockdep doesn’t understand them. The real-time folks hate them. Whatever it is you’re doing, use proper primitives instead, and at least read up on the LWN coverage on why these are problematic what to do instead. If you need some kind of synchronization primitive - maybe to avoid the lifetime vs. consistency antipattern pitfalls - then use the proper functions for that like synchronize_irq().

Locking Antipattern: Memory Barriers

Or more often, lack of them, incorrect or imbalanced use of barriers, badly or wrongly or just not at all documented memory barriers, or …

Fact is that exceedingly most kernel hackers, and more so driver people, have no useful understanding of the Linux kernel’s memory model, and should never be caught entertaining use of explicit memory barriers in production code. Personally I’m pretty good at spotting holes, but I’ve had to learn the hard way that I’m not even close to being able to positively prove correctness. And for better or worse, nothing short of that tends to cut it.

For a still fairly cursory discussion read the LWN series on lockless algorithms. If the code comments and commit message are anything less rigorous than that it’s fairly safe to assume there’s an issue.

Now don’t get me wrong, I love to read an article or watch a talk by Paul McKenney on RCU like anyone else to get my brain fried properly. But aside from extreme exceptions this kind of maintenance cost has simply no justification in a driver subsystem. At least unless it’s packaged in a driver hacker proof library or core kernel service of some sorts with all the memory barriers well hidden away where ordinary fools like me can’t touch them.

Closing Thoughts

I hope you enjoyed this little tour of progressively more worrying levels of locking engineering, with really just one key take away:

Simple, dumb locking is good locking, since with that you have a fighting chance to make it correct locking.

Thanks to Daniel Stone and Jason Ekstrand for reading and commenting on drafts of this text.

August 03, 2022 12:00 AM

July 29, 2022

Linux Plumbers Conference: LPC 2022 Schedule is posted!

 

The schedule for when the miniconferences and tracks are going to occur is now posted at: https://lpc.events/event/16/timetable/#all

The runners for the miniconferences will be adding more details to each of their schedules over the coming weeks.

The Linux Plumbers Refereed track schedule and Kernel Summit schedule is now available at: https://lpc.events/event/16/timetable/#all.detailed

The leads for the networking and toolchain tracks will be adding more details to each of their schedules over the coming weeks, as well.

For those that are registered as in person, you are free to continue to submit Birds of a Feather(BOF) sessions. They will be allocated space in the BOF rooms on a first come, first serve basis. Please note that the BOFs will not be recorded.

We’re looking forward to a great 3 days of presentations and discussions. We hope you can join us either in-person or virtually!

July 29, 2022 03:50 PM

July 28, 2022

Matthew Garrett: UEFI rootkits and UEFI secure boot

Kaspersky describes a UEFI-implant used to attack Windows systems. Based on it appearing to require patching of the system firmware image, they hypothesise that it's propagated by manually dumping the contents of the system flash, modifying it, and then reflashing it back to the board. This probably requires physical access to the board, so it's not especially terrifying - if you're in a situation where someone's sufficiently enthusiastic about targeting you that they're reflashing your computer by hand, it's likely that you're going to have a bad time regardless.

But let's think about why this is in the firmware at all. Sophos previously discussed an implant that's sufficiently similar in some technical details that Kaspersky suggest they may be related to some degree. One notable difference is that the MyKings implant described by Sophos installs itself into the boot block of legacy MBR partitioned disks. This code will only be executed on old-style BIOS systems (or UEFI systems booting in BIOS compatibility mode), and they have no support for code signatures, so there's no need to be especially clever. Run malicious code in the boot block, patch the next stage loader, follow that chain all the way up to the kernel. Simple.

One notable distinction here is that the MBR boot block approach won't be persistent - if you reinstall the OS, the MBR will be rewritten[1] and the infection is gone. UEFI doesn't really change much here - if you reinstall Windows a new copy of the bootloader will be written out and the UEFI boot variables (that tell the firmware which bootloader to execute) will be updated to point at that. The implant may still be on disk somewhere, but it won't be run.

But there's a way to avoid this. UEFI supports loading firmware-level drivers from disk. If, rather than providing a backdoored bootloader, the implant takes the form of a UEFI driver, the attacker can set a different set of variables that tell the firmware to load that driver at boot time, before running the bootloader. OS reinstalls won't modify these variables, which means the implant will survive and can reinfect the new OS install. The only way to get rid of the implant is to either reformat the drive entirely (which most OS installers won't do by default) or replace the drive before installation.

This is much easier than patching the system firmware, and achieves similar outcomes - the number of infected users who are going to wipe their drives to reinstall is fairly low, and the kernel could be patched to hide the presence of the implant on the filesystem[2]. It's possible that the goal was to make identification as hard as possible, but there's a simpler argument here - if the firmware has UEFI Secure Boot enabled, the firmware will refuse to load such a driver, and the implant won't work. You could certainly just patch the firmware to disable secure boot and lie about it, but if you're at the point of patching the firmware anyway you may as well just do the extra work of installing your implant there.

I think there's a reasonable argument that the existence of firmware-level rootkits suggests that UEFI Secure Boot is doing its job and is pushing attackers into lower levels of the stack in order to obtain the same outcomes. Technologies like Intel's Boot Guard may (in their current form) tend to block user choice, but in theory should be effective in blocking attacks of this form and making things even harder for attackers. It should already be impossible to perform attacks like the one Kaspersky describes on more modern hardware (the system should identify that the firmware has been tampered with and fail to boot), which pushes things even further - attackers will have to take advantage of vulnerabilities in the specific firmware they're targeting. This obviously means there's an incentive to find more firmware vulnerabilities, which means the ability to apply security updates for system firmware as easily as security updates for OS components is vital (hint hint if your system firmware updates aren't available via LVFS you're probably doing it wrong).

We've known that UEFI rootkits have existed for a while (Hacking Team had one in 2015), but it's interesting to see a fairly widespread one out in the wild. Protecting against this kind of attack involves securing the entire boot chain, including the firmware itself. The industry has clearly been making progress in this respect, and it'll be interesting to see whether such attacks become more common (because Secure Boot works but firmware security is bad) or not.

[1] As we all remember from Windows installs overwriting Linux bootloaders
[2] Although this does run the risk of an infected user booting another OS instead, and being able to see the implant

comment count unavailable comments

July 28, 2022 10:19 PM

July 27, 2022

Daniel Vetter: Locking Engineering Principles

For various reasons I spent the last two years way too much looking at code with terrible locking design and trying to rectify it, instead of a lot more actual building cool things. Symptomatic that the last post here on my neglected blog is also a rant on lockdep abuse.

I tried to distill all the lessons learned into some training slides, and this two part is the writeup of the same. There are some GPU specific rules, but I think the key points should apply to at least apply to kernel drivers in general.

The first part here lays out some principles, the second part builds a locking engineering design pattern hierarchy from the most easiest to understand and maintain to the most nightmare inducing approaches.

Also with locking engineering I mean the general problem of protecting data structures against concurrent access by multiple threads and trying to ensure that each sufficiently consistent view of the data it reads and that the updates it commits won’t result in confusion. Of course it highly depends upon the precise requirements what exactly sufficiently consistent means, but figuring out these kind of questions is out of scope for this little series here.

Priorities in Locking Engineering

Designing a correct locking scheme is hard, validating that your code actually implements your design is harder, and then debugging when - not if! - you screwed up is even worse. Therefore the absolute most important rule in locking engineering, at least if you want to have any chance at winning this game, is to make the design as simple and dumb as possible.

1. Make it Dumb

Since this is the key principle the entire second part of this series will go through a lot of different locking design patterns, from the simplest and dumbest and easiest to understand, to the most hair-raising horrors of complexity and trickiness.

Meanwhile let’s continue to look at everything else that matters.

2. Make it Correct

Since simple doesn’t necessarily mean correct, especially when transferring a concept from design to code, we need guidelines. On the design front the most important one is to design for lockdep, and not fight it, for which I already wrote a full length rant. Here I will only go through the main lessons: Validating locking by hand against all the other locking designs and nesting rules the kernel has overall is nigh impossible, extremely slow, something only few people can do with any chance of success and hence in almost all cases a complete waste of time. We need tools to automate this, and in the Linux kernel this is lockdep.

Therefore if lockdep doesn’t understand your locking design your design is at fault, not lockdep. Adjust accordingly.

A corollary is that you actually need to teach lockdep your locking rules, because otherwise different drivers or subsystems will end up with defacto incompatible nesting and dependencies. Which, as long as you never exercise them on the same kernel boot-up, much less same machine, wont make lockdep grumpy. But it will make maintainers very much question why they are doing what they’re doing.

Hence at driver/subsystem/whatever load time, when CONFIG_LOCKDEP is enabled, take all key locks in the correct order. One example for this relevant to GPU drivers is in the dma-buf subsystem.

In the same spirit, at every entry point to your library or subsytem, or anything else big, validate that the callers hold up the locking contract with might_lock(), might_sleep(), might_alloc() and all the variants and more specific implementations of this. Note that there’s a huge overlap between locking contracts and calling context in general (like interrupt safety, or whether memory allocation is allowed to call into direct reclaim), and since all these functions compile away to nothing when debugging is disabled there’s really no cost in sprinkling them around very liberally.

On the implementation and coding side there’s a few rules of thumb to follow:

3. Make it Fast

Speed doesn’t matter if you don’t understand the design anymore in the future, you need simplicity first.

Speed doesn’t matter if all you’re doing is crashing faster. You need correctness before speed.

Finally speed doesn’t matter where users don’t notice it. If you micro-optimize a path that doesn’t even show up in real world workloads users care about, all you’ve done is wasted time and committed to future maintenance pain for no gain at all.

Similarly optimizing code paths which should never be run when you instead improve your design are not worth it. This holds especially for GPU drivers, where the real application interfaces are OpenGL, Vulkan or similar, and there’s an entire driver in the userspace side - the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts.

The big example here is GPU address patch list processing at command submission time, which was necessary for old hardware that completely lacked any useful concept of a per process virtual address space. But that has changed, which means virtual addresses can stay constant, while the kernel can still freely manage the physical memory by manipulating pagetables, like on the CPU. Unfortunately one driver in the DRM subsystem instead spent an easy engineer decade of effort to tune relocations, write lots of testcases for the resulting corner cases in the multi-level fastpath fallbacks, and even more time handling the impressive amounts of fallout in the form of bugs and future headaches due to the resulting unmaintainable code complexity …

In other subsystems where the kernel ABI is the actual application contract these kind of design simplifications might instead need to be handled between the subsystem’s code and driver implementations. This is what we’ve done when moving from the old kernel modesetting infrastructure to atomic modesetting. But sometimes no clever tricks at all help and you only get true speed with a radically revamped uAPI - io_uring is a great example here.

Protect Data, not Code

A common pitfall is to design locking by looking at the code, perhaps just sprinkling locking calls over it until it feels like it’s good enough. The right approach is to design locking for the data structures, which means specifying for each structure or member field how it is protected against concurrent changes, and how the necessary amount of consistency is maintained across the entire data structure with rules that stay invariant, irrespective of how code operates on the data. Then roll it out consistently to all the functions, because the code-first approach tends to have a lot of issues:

The big antipattern of how you end up with code centric locking is to protect an entire subsystem (or worse, a group of related subsystems) with a single huge lock. The canonical example was the big kernel lock BKL, that’s gone, but in many cases it’s just replaced by smaller, but still huge locks like console_lock().

This results in a lot of long term problems when trying to adjust the locking design later on:

For these reasons big subsystem locks tend to live way past their justified usefulness until code maintenance becomes nigh impossible: Because no individual bugfix is worth the task to really rectify the design, but each bugfix tends to make the situation worse.

From Principles to Practice

Stay tuned for next week’s installment, which will cover what these principles mean when applying to practice: Going through a large pile of locking design patterns from the most desirable to the most hair raising complex.

July 27, 2022 12:00 AM

July 20, 2022

Dave Airlie (blogspot): lavapipe Vulkan 1.3 conformant

The software Vulkan renderer in Mesa, lavapipe, achieved official Vulkan 1.3 conformance. The official entry in the table is  here . We can now remove the nonconformant warning from the driver. Thanks to everyone involved!

July 20, 2022 12:42 AM

July 12, 2022

Matthew Garrett: Responsible stewardship of the UEFI secure boot ecosystem

After I mentioned that Lenovo are now shipping laptops that only boot Windows by default, a few people pointed to a Lenovo document that says:

Starting in 2022 for Secured-core PCs it is a Microsoft requirement for the 3rd Party Certificate to be disabled by default.

"Secured-core" is a term used to describe machines that meet a certain set of Microsoft requirements around firmware security, and by and large it's a good thing - devices that meet these requirements are resilient against a whole bunch of potential attacks in the early boot process. But unfortunately the 2022 requirements don't seem to be publicly available, so it's difficult to know what's being asked for and why. But first, some background.

Most x86 UEFI systems that support Secure Boot trust at least two certificate authorities:

1) The Microsoft Windows Production PCA - this is used to sign the bootloader in production Windows builds. Trusting this is sufficient to boot Windows.
2) The Microsoft Corporation UEFI CA - this is used by Microsoft to sign non-Windows UEFI binaries, including built-in drivers for hardware that needs to work in the UEFI environment (such as GPUs and network cards) and bootloaders for non-Windows.

The apparent secured-core requirement for 2022 is that the second of these CAs should not be trusted by default. As a result, drivers or bootloaders signed with this certificate will not run on these systems. This means that, out of the box, these systems will not boot anything other than Windows[1].

Given the association with the secured-core requirements, this is presumably a security decision of some kind. Unfortunately, we have no real idea what this security decision is intended to protect against. The most likely scenario is concerns about the (in)security of binaries signed with the third-party signing key - there are some legitimate concerns here, but I'm going to cover why I don't think they're terribly realistic.

The first point is that, from a boot security perspective, a signed bootloader that will happily boot unsigned code kind of defeats the point. Kaspersky did it anyway. The second is that even a signed bootloader that is intended to only boot signed code may run into issues in the event of security vulnerabilities - the Boothole vulnerabilities are an example of this, covering multiple issues in GRUB that could allow for arbitrary code execution and potential loading of untrusted code.

So we know that signed bootloaders that will (either through accident or design) execute unsigned code exist. The signatures for all the known vulnerable bootloaders have been revoked, but that doesn't mean there won't be other vulnerabilities discovered in future. Configuring systems so that they don't trust the third-party CA means that those signed bootloaders won't be trusted, which means any future vulnerabilities will be irrelevant. This seems like a simple choice?

There's actually a couple of reasons why I don't think it's anywhere near that simple. The first is that whenever a signed object is booted by the firmware, the trusted certificate used to verify that object is measured into PCR 7 in the TPM. If a system previously booted with something signed with the Windows Production CA, and is now suddenly booting with something signed with the third-party UEFI CA, the values in PCR 7 will be different. TPMs support "sealing" a secret - encrypting it with a policy that the TPM will only decrypt it if certain conditions are met. Microsoft make use of this for their default Bitlocker disk encryption mechanism. The disk encryption key is encrypted by the TPM, and associated with a specific PCR 7 value. If the value of PCR 7 doesn't match, the TPM will refuse to decrypt the key, and the machine won't boot. This means that attempting to attack a Windows system that has Bitlocker enabled using a non-Windows bootloader will fail - the system will be unable to obtain the disk unlock key, which is a strong indication to the owner that they're being attacked.

The second is that this is predicated on the idea that removing the third-party bootloaders and drivers removes all the vulnerabilities. In fact, there's been rather a lot of vulnerabilities in the Windows bootloader. A broad enough vulnerability in the Windows bootloader is arguably a lot worse than a vulnerability in a third-party loader, since it won't change the PCR 7 measurements and the system will boot happily. Removing trust in the third-party CA does nothing to protect against this.

The third reason doesn't apply to all systems, but it does to many. System vendors frequently want to ship diagnostic or management utilities that run in the boot environment, but would prefer not to have to go to the trouble of getting them all signed by Microsoft. The simple solution to this is to ship their own certificate and sign all their tooling directly - the secured-core Lenovo I'm looking at currently is an example of this, with a Lenovo signing certificate. While everything signed with the third-party signing certificate goes through some degree of security review, there's no requirement for any vendor tooling to be reviewed at all. Removing the third-party CA does nothing to protect the user against the code that's most likely to contain vulnerabilities.

Obviously I may be missing something here - Microsoft may well have a strong technical justification. But they haven't shared it, and so right now we're left making guesses. And right now, I just don't see a good security argument.

But let's move on from the technical side of things and discuss the broader issue. The reason UEFI Secure Boot is present on most x86 systems is that Microsoft mandated it back in 2012. Microsoft chose to be the only trusted signing authority. Microsoft made the decision to assert that third-party code could be signed and trusted.

We've certainly learned some things since then, and a bunch of things have changed. Third-party bootloaders based on the Shim infrastructure are now reviewed via a community-managed process. We've had a productive coordinated response to the Boothole incident, which also taught us that the existing revocation strategy wasn't going to scale. In response, the community worked with Microsoft to develop a specification for making it easier to handle similar events in future. And it's also worth noting that after the initial Boothole disclosure was made to the GRUB maintainers, they proactively sought out other vulnerabilities in their codebase rather than simply patching what had been reported. The free software community has gone to great lengths to ensure third-party bootloaders are compatible with the security goals of UEFI Secure Boot.

So, to have Microsoft, the self-appointed steward of the UEFI Secure Boot ecosystem, turn round and say that a bunch of binaries that have been reviewed through processes developed in negotiation with Microsoft, implementing technologies designed to make management of revocation easier for Microsoft, and incorporating fixes for vulnerabilities discovered by the developers of those binaries who notified Microsoft of these issues despite having no obligation to do so, and which have then been signed by Microsoft are now considered by Microsoft to be insecure is, uh, kind of impolite? Especially when unreviewed vendor-signed binaries are still considered trustworthy, despite no external review being carried out at all.

If Microsoft had a set of criteria used to determine whether something is considered sufficiently trustworthy, we could determine which of these we fell short on and do something about that. From a technical perspective, Microsoft could set criteria that would allow a subset of third-party binaries that met additional review be trusted without having to trust all third-party binaries[2]. But, instead, this has been a decision made by the steward of this ecosystem without consulting major stakeholders.

If there are legitimate security concerns, let's talk about them and come up with solutions that fix them without doing a significant amount of collateral damage. Don't complain about a vendor blocking your apps and then do the same thing yourself.

[Edit to add: there seems to be some misunderstanding about where this restriction is being imposed. I bought this laptop because I'm interested in investigating the Microsoft Pluton security processor, but Pluton is not involved at all here. The restriction is being imposed by the firmware running on the main CPU, not any sort of functionality implemented on Pluton]

[1] They'll also refuse to run any drivers that are stored in flash on Thunderbolt devices, which means eGPU setups may be more complicated, as will netbooting off Thunderbolt-attached NICs
[2] Use a different leaf cert to sign the new trust tier, add the old leaf cert to dbx unless a config option is set, leave the existing intermediate in db

comment count unavailable comments

July 12, 2022 08:25 AM

July 09, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Rust

Linux Plumbers Conference 2022 is pleased to host the Rust MC

Rust is a systems programming language that is making great strides in becoming the next big one in the domain.

Rust for Linux aims to bring it into the kernel since it has a key property that makes it very interesting to consider as the second language in the kernel: it guarantees no undefined behavior takes place (as long as unsafe code is sound). This includes no use-after-free mistakes, no double frees, no data races, etc.

This microconference intends to cover talks and discussions on both Rust for Linux as well as other non-kernel Rust topics.

Possible Rust for Linux topics:

Possible Rust topics:

Please come and join us in the discussion about Rust in the Linux ecosystem.

We hope to see you there!

July 09, 2022 01:50 PM

July 08, 2022

Matthew Garrett: Lenovo shipping new laptops that only boot Windows by default

I finally managed to get hold of a Thinkpad Z13 to examine a functional implementation of Microsoft's Pluton security co-processor. Trying to boot Linux from a USB stick failed out of the box for no obvious reason, but after further examination the cause became clear - the firmware defaults to not trusting bootloaders or drivers signed with the Microsoft 3rd Party UEFI CA key. This means that given the default firmware configuration, nothing other than Windows will boot. It also means that you won't be able to boot from any third-party external peripherals that are plugged in via Thunderbolt.

There's no security benefit to this. If you want security here you're paying attention to the values measured into the TPM, and thanks to Microsoft's own specification for measurements made into PCR 7, switching from booting Windows to booting something signed with the 3rd party signing key will change the measurements and invalidate any sealed secrets. It's trivial to detect this. Distrusting the 3rd party CA by default doesn't improve security, it just makes it harder for users to boot alternative operating systems.

Lenovo, this isn't OK. The entire architecture of UEFI secure boot is that it allows for security without compromising user choice of OS. Restricting boot to Windows by default provides no security benefit but makes it harder for people to run the OS they want to. Please fix it.

comment count unavailable comments

July 08, 2022 06:49 AM

July 07, 2022

Vegard Nossum: Stigmergy in programming

Ants are known to leave invisible pheromones on their paths in order to inform both themselves and their fellow ants where to go to find food or signal that a path leads to danger. In biology, this phenomenon is known as stigmergy: the act of modifying your environment to manipulate the future behaviour of yourself or others. From the Wikipedia article:

Stigmergy (/ˈstɪɡmərdʒi/ STIG-mər-jee) is a mechanism of indirect coordination, through the environment, between agents or actions. The principle is that the trace left in the environment by an individual action stimulates the performance of a succeeding action by the same or different agent. Agents that respond to traces in the environment receive positive fitness benefits, reinforcing the likelihood of these behaviors becoming fixed within a population over time.

For ants in particular, stigmergy is useful as it alleviates the need for memory and more direct communication; instead of broadcasting a signal about where a new source of food has been found, you can instead just leave a breadcrumb trail of pheromones that will naturally lead your community to the food.

We humans also use stigmergy in a lot of ways: most notably, we write things down. From post-it notes posted on the fridge to remind ourselves to buy more cheese to writing books that can potentially influence the behaviour of a whole future generation of young people.

Let's face it: We don't have infinite brains and we need to somehow alleviate the need to remember everything. If you remember the movie Memento, the protagonist Leonard has lost his ability to form new long-term memories and relies on stigmergy to inform his future actions; everything that's important he writes down in a place he's sure to come across it again when needed. His most important discoveries he turns into tattoos that he cannot lose or avoid seeing when he wakes up in the morning.

Perhaps a biologist would object and say this is stretching the definition of stigmergy, but I contest that it fits: leaving a trace in the environment in order to stimulate a future action.

For stigmergy to be effective, it must be placed in the right location so that whoever comes across it will perform the correct action at that time. If we return briefly to the shopping list example, we typically keep the list close to the fridge because that is often where we are when we need to write something down -- or when we go to check what we need to buy.

Let's take an example from computing: Have you ever seen a line at the top of a file that says "AUTOMATICALLY GENERATED; DON'T MODIFY THIS"? Well, that's stigmergy. Somebody made sure that line would be there in order to influence the behaviour of whomever came across that file. A little note from the past placed in the environment to manipulate future actions.

In programming, stigmergy mainly manifests as comments scattered throughout the code -- the most common form is perhaps leaving a comment to explain what a piece of code is there to do, where we know somebody will find it and, hopefully, be able to make use of it. Another one is leaving a "TODO" comment where something isn't quite finished -- you may not know that a piece of code isn't handling some corner case just by glancing at it, but a "TODO" comment stands out and may even contain enough information to complete the implementation. In other cases, we see the opposite: "here be dragons"-type comments instructing the reader not to change something, perhaps because the code is known to be complicated, complex, brittle, or prone to breaking.

Stigmergy is a powerful idea, and once you are aware of it you can consciously make use of it to help yourself and others down the line. We're not robotic ants and we can make deliberate choices regarding when, where, and how we modify our environment in order to most effectively influence future behaviour.

July 07, 2022 01:47 PM

July 06, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Power Management and Thermal Control

Linux Plumbers Conference 2022 is pleased to host the Power Management and Thermal Control Microconference

The Power Management and Thermal Control microconference focuses on frameworks related to power management and thermal control, CPU and device power-management mechanisms, and thermal-control methods. In particular, we are interested in extending the energy-efficient scheduling concept beyond the energy-aware scheduling (EAS), improving the thermal control framework in the kernel to cover more use cases and making system-wide suspend (and power management in general) more robust.

The goal is to facilitate cross-framework and cross-platform discussions that can help improve energy-awareness and thermal control in Linux.

Suggested topics:

Please come and join us in the discussion about keeping your systems cool.

We hope to see you there!

July 06, 2022 05:07 PM

July 03, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: System Boot and Security

Linux Plumbers Conference 2022 is pleased to host the System Boot and Security Microconference

In the fourth year in a row, System Boot and Security microconference is are going to bring together people interested in the firmware, bootloaders, system boot, security, etc., and discuss all these topics. This year we would particularly like to focus on better communication and closer cooperation between different Free Software and Open Source projects. In the past we have seen that the lack of cooperation’s between projects very often delays introduction of very interesting and important features with TrenchBoot being very prominent example.

The System Boot and Security MC is very important to improve such communication and cooperation, but it is not limited to this kind of problems. We would like to encourage all stakeholders to bring and discuss issues that they encounter in the broad sense of system boot and security.

Expected topics:

Please come and join us in the discussion about how to keep your system secure from the very boot.

We hope to see you there!

July 03, 2022 02:39 PM

June 30, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: VFIO/IOMMU/PCI

Linux Plumbers Conference 2022 is pleased to host the VFIO/IOMMU/PCI Microconference

The PCI interconnect specification, the devices that implement it, and the system IOMMUs that provide memory and access control to them are nowadays a de-facto standard for connecting high-speed components, incorporating more and more features such as:

These features are aimed at high-performance systems, server and desktop computing, embedded and SoC platforms, virtualization, and ubiquitous IoT devices.

The kernel code that enables these new system features focuses on coordination between the PCI devices, the IOMMUs they are connected to and the VFIO layer used to manage them (for userspace access and device passthrough) with related kernel interfaces and userspace APIs to be designed in-sync and in a clean way for all three sub-systems.

The VFIO/IOMMU/PCI micro-conference focuses on the kernel code that enables these new system features that often require coordination between the VFIO, IOMMU and PCI sub-systems.

Tentative topics include (but not limited to):

Come and join us in the discussion in helping Linux keep up with the new features being added to the PCI interconnect specification.

We hope to see you there !

June 30, 2022 04:56 PM

June 27, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Android

Linux Plumbers Conference 2022 is pleased to host the Android Microconference

Continuing in the same direction as last year, this year’s Android microconference will be an opportunity to foster collaboration between the Android and Linux kernel communities. Discussions will be centered on the goal of ensuring that both the Android and Linux development moves in a lockstep fashion going forward.

Projected topics:

Please come and join us in the discussion of making Android a better partner with Linux.

We hope to see you there!

June 27, 2022 01:55 PM

June 24, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Open Printing

Linux Plumbers Conference 2022 is pleased to host the Open Printing Microconference

OpenPrinting has been improving the way we print in Linux. Over the years we have changed many conventional ways of printing and scanning. Over the last few years we have been emphasizing on the fact that driverless print and scan has made life easier however this does not make us stop improving. Every day we are trying to design new ways of printing to make your printing and scanning experience better than that of today.

Proposed Topics :

Please come and join us in the discussion to bring Linux printing, scanning and fax a better experience.

We hope to see you there!

June 24, 2022 09:20 PM

Kees Cook: finding binary differences

As part of the continuing work to replace 1-element arrays in the Linux kernel, it’s very handy to show that a source change has had no executable code difference. For example, if you started with this:

struct foo {
    unsigned long flags;
    u32 length;
    u32 data[1];
};

void foo_init(int count)
{
    struct foo *instance;
    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
    ...
    instance = kmalloc(bytes, GFP_KERNEL);
    ...
};

And you changed only the struct definition:

-    u32 data[1];
+    u32 data[];

The bytes calculation is going to be incorrect, since it is still subtracting 1 element’s worth of space from the desired count. (And let’s ignore for the moment the open-coded calculation that may end up with an arithmetic over/underflow here; that can be solved separately by using the struct_size() helper or the size_mul(), size_add(), etc family of helpers.)

The missed adjustment to the size calculation is relatively easy to find in this example, but sometimes it’s much less obvious how structure sizes might be woven into the code. I’ve been checking for issues by using the fantastic diffoscope tool. It can produce a LOT of noise if you try to compare builds without keeping in mind the issues solved by reproducible builds, with some additional notes. I prepare my build with the “known to disrupt code layout” options disabled, but with debug info enabled:

$ KBF="KBUILD_BUILD_TIMESTAMP=1970-01-01 KBUILD_BUILD_USER=user KBUILD_BUILD_HOST=host KBUILD_BUILD_VERSION=1"
$ OUT=gcc
$ make $KBF O=$OUT allmodconfig
$ ./scripts/config --file $OUT/.config \
        -d GCOV_KERNEL -d KCOV -d GCC_PLUGINS -d IKHEADERS -d KASAN -d UBSAN \
        -d DEBUG_INFO_NONE -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
$ make $KBF O=$OUT olddefconfig

Then I build a stock target, saving the output in “before”. In this case, I’m examining drivers/scsi/megaraid/:

$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/before
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/before/

Then I patch and build a modified target, saving the output in “after”:

$ vi the/source/code.c
$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/after
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/after/

And then run diffoscope:

$ diffoscope $OUT/before/ $OUT/after/

If diffoscope output reports nothing, then we’re done. 🥳

Usually, though, when source lines move around other stuff will shift too (e.g. WARN macros rely on line numbers, so the bug table may change contents a bit, etc), and diffoscope output will look noisy. To examine just the executable code, the command that diffoscope used is reported in the output, and we can run it directly, but with possibly shifted line numbers not reported. i.e. running objdump without --line-numbers:

$ ARGS="--disassemble --demangle --reloc --no-show-raw-insn --section=.text"
$ for i in $(cd $OUT/before && echo *.o); do
        echo $i
        diff -u <(objdump $ARGS $OUT/before/$i | sed "0,/^Disassembly/d") \
                <(objdump $ARGS $OUT/after/$i  | sed "0,/^Disassembly/d")
done

If I see an unexpected difference, for example:

-    c120:      movq   $0x0,0x800(%rbx)
+    c120:      movq   $0x0,0x7f8(%rbx)

Then I'll search for the pattern with line numbers added to the objdump output:

$ vi <(objdump --line-numbers $ARGS $OUT/after/megaraid_sas_fp.o)

I'd search for "0x0,0x7f8", find the source file and line number above it, open that source file at that position, and look to see where something was being miscalculated:

$ vi drivers/scsi/megaraid/megaraid_sas_fp.c +329

Once tracked down, I'd start over at the "patch and build a modified target" step above, repeating until there were no differences. For example, in the starting example, I'd also need to make this change:

-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = sizeof(*instance) + sizeof(u32) * count;

Though, as hinted earlier, better yet would be:

-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = struct_size(instance, data, count);

But sometimes adding the helper usage will add binary output differences since they're performing overflow checking that might saturate at SIZE_MAX. To help with patch clarity, those changes can be done separately from fixing the array declaration.

© 2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

June 24, 2022 08:11 PM

June 22, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: CPU Isolation

Linux Plumbers Conference 2022 is pleased to host the CPU Isolation Microconference

CPU Isolation is an ability to shield workloads with extreme latency or performance requirements from interruptions (also known as Operating System noise) provided by a close combination of several kernel and userspace components. An example of such workloads are DPDK use cases in Telco/5G where even the shortest interruption can cause packet losses, eventually leading to exceeding QoS requirements.

Despite considerable improvements in the last few years towards implementing full CPU Isolation (nohz_full, rcu_nocb, isolcpus, etc.), there are issues to be addressed, as it is still relatively simple to highlight sources of OS noise just by running synthetic workloads mimicking polling (always running) type of application similar to the ones mentioned above.

There were recent improvements and discussions about CPU isolation features on LKML, and tools such as osnoise tracer and rtla osnoise improved the CPU isolation analysis. Nevertheless, this is an ongoing process, and discussions are needed to speed up solutions for existing issues and to improve the existing tools and methods.

The purpose of CPU Isolation MC is to get together to discuss open problems, most notably: how to improve the identification of OS noise sources, how to track them publicly and how to tackle the sources of noise that have already been identified.

A non exhaustive list of potential topics is:

Please come and join us in the discussion about CPU isolation.

We hope to see you there!

June 22, 2022 02:56 AM

Linux Plumbers Conference: Registration Still Sold Out, But There is Now a Waitlist

Because we ran out of places so fast, we are setting up a waitlist for in-person registration (virtual attendee places are still available). Please fill in this form and try to be clear about your reasons for wanting to attend. This year we’re giving waitlist priority to new attendees and people expected to contribute content. We expect to be able to accept our first group of attendees from the waitlist in mid July.

June 22, 2022 01:22 AM

June 18, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: IoTs a 4-Letter Word

Linux Plumbers Conference 2022 is pleased to host the IoT Microconference

The IoT microconference is back for its fourth year and our Open Source HW / SW / FW communities are productizing Linux and Zephyr in ways that we have never seen before.

A lot has happened in the last year to discuss and bring forward:

Each of the above items were large efforts made by Linux centric communities actively pushing the bounds of what is possible in IoT.

Whether you are an apprentice or master, we welcome you to bring your plungers and join us for a deep dive into the pipework of Linux IoT!

We hope to see you there!

June 18, 2022 02:37 PM

June 15, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Real-time and Scheduling

Linux Plumbers Conference 2022 is pleased to host the Real-time and Scheduling Microconference

The real-time and scheduling micro-conference joins these two intrinsically connected communities to discuss the next steps together.

Over the past decade, many parts of PREEMPT_RT have been included in the official Linux codebase. Examples include real-time mutexes, high-resolution timers, lockdep, ftrace, RCU_PREEMPT, threaded interrupt handlers and more. The number of patches that need integration has been significantly reduced, and the rest is mature enough to make their way into mainline Linux.

The scheduler is the core of Linux performance. With different topologies and workloads, it is not an easy task to give the user the best experience possible, from low latency to high throughput, and from small power-constrained devices to HPC.

This year’s topics to be discussed include:

Please come and join us in the discussion of controlling what tasks get to run on your machine and when.

We hope to see you there!

June 15, 2022 04:10 PM