Kernel Planet

May 27, 2022

Dave Airlie (blogspot): lavapipe Vulkan 1.2 conformant

The software Vulkan renderer in Mesa, lavapipe, has achieved official Vulkan 1.2 conformance. The non-obvious entry in the table is here.

Thanks to all the Mesa team who helped achieve this. Shout-outs to Mike of Zink fame, who drove a bunch of pieces over the line, and to Roland, who helped review some of the funkier changes.

We will be submitting for 1.3 conformance soon; there are just a few things to iron out.

May 27, 2022 07:58 PM

Paul E. McKenney: Stupid RCU Tricks: Is RCU Watching?

It is just as easy to ask why RCU wouldn't be watching all the time. After all, you never know when you might need to synchronize!

Unfortunately, an eternally watchful RCU is impractical in the Linux kernel due to energy-efficiency considerations. The problem is that if RCU watches an idle CPU, RCU needs that CPU to execute instructions. And making an idle CPU unnecessarily execute instructions (for a rather broad definition of the word “unnecessary”) will terminally annoy a great many people in the battery-powered embedded world. And for good reason: Making RCU avoid watching idle CPUs can provide 30-40% increases in battery lifetime.

In this, CPUs are not all that different from people. Interrupting someone who is deep in thought can cause them to lose 20 minutes of work. Similarly, when a CPU is deeply idle, asking it to execute instructions will consume not only the energy required for those instructions, but also much more energy to work its way out of that deep idle state, and then to return back to that deep idle state.

And this is why CPUs must tell RCU to stop watching them when they go idle. This allows RCU to ignore them completely, in particular, to refrain from asking them to execute instructions.

In some kernel configurations, RCU also ignores portions of the kernel's entry/exit code, that is, the last bits of kernel code before switching to userspace and the first bits of kernel code after switching away from userspace. This happens only in kernels built with CONFIG_NO_HZ_FULL=y, and even then only on CPUs mentioned in the CPU list passed to the nohz_full kernel parameter. This enables carefully configured HPC applications and CPU-bound real-time applications to get near-bare-metal performance from such CPUs, while still having the entire Linux kernel at their beck and call. Because RCU is not watching such applications, the scheduling-clock interrupt can be turned off entirely, thus avoiding disturbing such performance-critical applications.

But if RCU is not watching a given CPU, rcu_read_lock() has no effect on that CPU, which can come as a nasty surprise to the corresponding RCU read-side critical section, which naively expected to be able to safely traverse an RCU-protected data structure. This can be a trap for the unwary, which is why kernels built with CONFIG_PROVE_LOCKING=y (lockdep) complain bitterly when rcu_read_lock() is invoked on CPUs that RCU is not watching.

But suppose that you have code using RCU that is invoked both from deep within the idle loop and from normal tasks.

Back in the day, this was not much of a problem. True to its name, the idle loop was not much more than a loop, and the deep architecture-specific code on the kernel entry/exit paths had no need of RCU. This has changed, especially with the advent of idle drivers and governors, to say nothing of tracing. So what can you do?

First, you can invoke rcu_is_watching(), which will return true if RCU is watching. And, as you might expect, lockdep uses this function to figure out when it should complain bitterly. So suppose that you invoke rcu_is_watching() and it helpfully returns false, indicating that you cannot invoke rcu_read_lock() and friends. What now?

You could do what the v5.18 Linux kernel's kernel_text_address() does, which can be abbreviated as follows:

no_rcu = !rcu_is_watching();
if (no_rcu)
        rcu_nmi_enter(); // Make RCU watch!!!
do_rcu_traversals();
if (no_rcu)
        rcu_nmi_exit(); // Return RCU to its prior watchfulness state.

If your code is not so performance-critical, you can do what the arm64 implementation of cpu_suspend() does:

RCU_NONIDLE(__cpu_suspend_exit());

This macro forces RCU to watch while it executes its argument as follows:

#define RCU_NONIDLE(a) \
        do { \
                rcu_irq_enter_irqson(); \
                do { a; } while (0); \
                rcu_irq_exit_irqson(); \
        } while (0)

The rcu_irq_enter_irqson() and rcu_irq_exit_irqson() functions are essentially wrappers around the aforementioned rcu_nmi_enter() and rcu_nmi_exit().

Although RCU_NONIDLE() is more compact than the approach used by kernel_text_address(), it is still annoying to have to decorate your code. And this is why Peter Zijlstra has been reworking the various idle loops to cause RCU to be watching a much greater fraction of their code. This might well be an ongoing process as the idle loops continue gaining functionality, but Peter's good work thus far at least makes RCU watch the idle governors and a much larger fraction of the idle loop's trace events. When combined with the kernel entry/exit work by Peter, Thomas Gleixner, Mark Rutland, and many others, it is hoped that the functions not watched by RCU will all eventually be decorated with something like noinstr, for example:

static noinline noinstr unsigned long rcu_dynticks_inc(int incby)
{
        return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
}

We don't need to worry about exactly what this function does. For this blog entry, it is enough to know that its noinstr tag prevents tracing this function, making it less problematic for RCU to not be watching it.

What exactly are you prohibited from doing while RCU is not watching you?

As noted before, RCU readers are a no-go. If you try invoking rcu_read_lock(), rcu_read_unlock(), rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), or rcu_read_unlock_sched() while rcu_is_watching() would return false, lockdep will complain.

On the other hand, using SRCU (srcu_read_lock() and srcu_read_unlock()) is just fine, as is RCU Tasks Trace (rcu_read_lock_trace() and rcu_read_unlock_trace()). RCU Tasks Rude does not have explicit read-side markers, but anything that disables preemption acts as an RCU Tasks Rude reader no matter what rcu_is_watching() would return at the time.
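For example, an SRCU reader looks the same whether or not RCU is watching the CPU it runs on. Here is a minimal sketch (the srcu_struct and the reader function are illustrative, not taken from any particular driver):

#include <linux/srcu.h>

DEFINE_STATIC_SRCU(my_srcu); // Illustrative SRCU domain, not a real kernel symbol.

static void my_reader(void)
{
        int idx;

        idx = srcu_read_lock(&my_srcu); // Legal even where rcu_is_watching() returns false.
        /* ... traverse data whose updaters use synchronize_srcu(&my_srcu) ... */
        srcu_read_unlock(&my_srcu, idx);
}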

RCU Tasks is an odd special case. Like RCU Tasks Rude, RCU Tasks has implicit read-side markers, which are any region of non-idle-task kernel code that does not do a voluntary context switch (the idle tasks being handled by RCU Tasks Rude). Except that in kernels built with CONFIG_PREEMPTION=n and without any of RCU's test suite, the RCU Tasks API maps to plain old RCU. This means that code not watched by RCU is ignored by the remapped RCU Tasks in such kernels. Given that RCU Tasks ignores the idle tasks, this affects only user entry/exit code in kernels built with CONFIG_NO_HZ_FULL=y, and even then, only on CPUs mentioned in the list of CPUs given to the nohz_full kernel boot parameter, but it can nevertheless be a trap for the unwary.

In post-v5.18 mainline, you can build your kernel with CONFIG_FORCE_TASKS_RCU=y, in which case RCU Tasks will always be built into your kernel, avoiding this trap.

In summary, energy-efficiency, battery-lifetime, and application-performance/latency concerns force RCU to avert its gaze from idle CPUs, and, in kernels built with CONFIG_NO_HZ_FULL=y, also from CPUs on the low-level kernel entry/exit code paths. Fortunately, the recent changes have allowed RCU to be watching more code, but this being the kernel, corner cases will always be with us. This corner-case code from which RCU averts its gaze requires the special handling described in this post.

May 27, 2022 12:43 AM

May 26, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Containers and Checkpoint/Restore

Linux Plumbers Conference 2022 is pleased to host the Containers and Checkpoint/Restore Microconference

The Containers and Checkpoint/Restore Microconference focuses on both userspace and kernel related work. The microconference targets the wider container ecosystem, ideally with participants from all major container runtimes as well as init system developers.

Potential discussion topics include:

Please come and join the discussion centered on what holds “The Cloud” together.

We hope to see you there!

May 26, 2022 09:55 AM

May 23, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Kernel Testing & Dependability

Linux Plumbers Conference 2022 is pleased to host the Kernel Testing & Dependability Microconference

The Kernel Testing & Dependability Microconference focuses on advancing the state of testing of the Linux kernel and testing on Linux in general. The main purpose is to improve software quality and dependability for applications that require predictability and trust. The microconference aims to create connections between folks working on similar projects, and to help individual projects make progress.

This microconference is a merge of the Testing and Fuzzing and the Kernel Dependability and Assurance microconferences into a single session. There was a lot of overlap in the topics and attendees of these MCs, and combining the two tracks will promote collaboration between all the interested communities and people.

The Microconference is open to all topics related to testing on Linux, not necessarily in the kernel space.

Please come and join us in the discussion on how we can assure that Linux becomes the most trusted and dependable software in the world!

We hope to see you there!

May 23, 2022 09:23 AM

May 20, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: linux/arch

Linux Plumbers Conference 2022 is pleased to host the linux/arch Microconference

The linux/arch microconference aims to bring architecture maintainers together in one room to discuss how the code in arch/ can be improved, consolidated, and generalized.

Potential topics for the discussion are:

Please come and join us in the discussion about improving architectures' integration with generic kernel code!

We hope to see you there!

May 20, 2022 08:48 AM

May 17, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Confidential Computing

Linux Plumbers Conference 2022 is pleased to host the Confidential Computing Microconference.

The Confidential Computing Microconference brings together plumbers enabling secure execution features across hypervisors, firmware, and the Linux kernel, through low-level user space, up to container runtimes.

Good progress was made on a couple of topics since last year, but enabling Confidential Computing in the Linux ecosystem is an ongoing process, and there are still many problems to solve. The most important ones are:

Please come and join us in the discussion for solutions to the open problems for supporting these technologies!

We hope to see you there!

May 17, 2022 02:56 PM

May 16, 2022

Matthew Garrett: Can we fix bearer tokens?

Last month I wrote about how bearer tokens are just awful, and a week later Github announced that someone had managed to exfiltrate bearer tokens from Heroku that gave them access to, well, a lot of Github repositories. This has inevitably resulted in a whole bunch of discussion about a number of things, but people seem to be largely ignoring the fundamental issue that maybe we just shouldn't have magical blobs that grant you access to basically everything even if you've copied them from a legitimate holder to Honest John's Totally Legitimate API Consumer.

To make it clearer what the problem is here, let's use an analogy. You have a safety deposit box. To gain access to it, you simply need to be able to open it with a key you were given. Anyone who turns up with the key can open the box and do whatever they want with the contents. Unfortunately, the key is extremely easy to copy - anyone who is able to get hold of your keyring for a moment is in a position to duplicate it, and then they have access to the box. Wouldn't it be better if something could be done to ensure that whoever showed up with a working key was someone who was actually authorised to have that key?

To achieve that we need some way to verify the identity of the person holding the key. In the physical world we have a range of ways to achieve this, from simply checking whether someone has a piece of ID that associates them with the safety deposit box all the way up to invasive biometric measurements that supposedly verify that they're definitely the same person. But computers don't have passports or fingerprints, so we need another way to identify them.

When you open a browser and try to connect to your bank, the bank's website provides a TLS certificate that lets your browser know that you're talking to your bank instead of someone pretending to be your bank. The spec allows this to be a bi-directional transaction - you can also prove your identity to the remote website. This is referred to as "mutual TLS", or mTLS, and a successful mTLS transaction ends up with both ends knowing who they're talking to, as long as they have a reason to trust the certificate they were presented with.

That's actually a pretty big constraint! We have a reasonable model for the server - it's something that's issued by a trusted third party and it's tied to the DNS name for the server in question. Clients don't tend to have stable DNS identity, and that makes the entire thing sort of awkward. But, thankfully, maybe we don't need to? We don't need the client to be able to prove its identity to arbitrary third party sites here - we just need the client to be able to prove it's a legitimate holder of whichever bearer token it's presenting to that site. And that's a much easier problem.

Here's the simple solution - clients generate a TLS cert. This can be self-signed, because all we want to do here is be able to verify whether the machine talking to us is the same one that had a token issued to it. The client contacts a service that's going to give it a bearer token. The service requests mTLS auth without being picky about the certificate that's presented. The service embeds a hash of that certificate in the token before handing it back to the client. Whenever the client presents that token to any other service, the service ensures that the mTLS cert the client presented matches the hash in the bearer token. Copy the token without copying the mTLS certificate and the token gets rejected. Hurrah hurrah hats for everyone.
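As a rough sketch of that final check (illustrative only; the function name and the choice of a SHA-256 digest over the DER-encoded certificate are assumptions here, not a description of any particular service), the service just hashes the presented mTLS client certificate and compares it against the hash embedded in the token:

#include <stdbool.h>
#include <stddef.h>
#include <openssl/crypto.h>
#include <openssl/sha.h>

/* cert_der/cert_len: the client certificate presented on this mTLS connection.
 * token_cert_hash: the SHA-256 digest embedded in the bearer token at issuance. */
static bool token_bound_to_cert(const unsigned char *cert_der, size_t cert_len,
                                const unsigned char token_cert_hash[SHA256_DIGEST_LENGTH])
{
        unsigned char presented_hash[SHA256_DIGEST_LENGTH];

        SHA256(cert_der, cert_len, presented_hash);

        /* Constant-time compare; a token shown without its certificate fails here. */
        return CRYPTO_memcmp(presented_hash, token_cert_hash, SHA256_DIGEST_LENGTH) == 0;
}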

Well except for the obvious problem that if you're in a position to exfiltrate the bearer tokens you can probably just steal the client certificates and keys as well, and now you can pretend to be the original client and this is not adding much additional security. Fortunately pretty much everything we care about has the ability to store the private half of an asymmetric key in hardware (TPMs on Linux and Windows systems, the Secure Enclave on Macs and iPhones, either a piece of magical hardware or Trustzone on Android) in a way that avoids anyone being able to just steal the key.

How do we know that the key is actually in hardware? Here's the fun bit - it doesn't matter. If you're issuing a bearer token to a system then you're already asserting that the system is trusted. If the system is lying to you about whether or not the key it's presenting is hardware-backed then you've already lost. If it lied and the system is later compromised then sure all your apes get stolen, but maybe don't run systems that lie and avoid that situation as a result?

Anyway. This is covered in RFC 8705 so why aren't we all doing this already? From the client side, the largest generic issue is that TPMs are astonishingly slow in comparison to doing a TLS handshake on the CPU. RSA signing operations on TPMs can take around half a second, which doesn't sound too bad, except your browser is probably establishing multiple TLS connections to subdomains on the site it's connecting to and performance is going to tank. Fixing this involves doing whatever's necessary to convince the browser to pipe everything over a single TLS connection, and that's just not really where the web is right at the moment. Using EC keys instead helps a lot (~0.1 seconds per signature on modern TPMs), but it's still going to be a bottleneck.

The other problem, of course, is that ecosystem support for hardware-backed certificates is just awful. Windows lets you stick them into the standard platform certificate store, but the docs for this are hidden in a random PDF in a Github repo. Macs require you to do some weird bridging between the Secure Enclave API and the keychain API. Linux? Well, the standard answer is to do PKCS#11, and I have literally never met anybody who likes PKCS#11 and I have spent a bunch of time in standards meetings with the sort of people you might expect to like PKCS#11 and even they don't like it. It turns out that loading a bunch of random C bullshit that has strong feelings about function pointers into your security critical process is not necessarily something that is going to improve your quality of life, so instead you should use something like this and just have enough C to bridge to a language that isn't secretly plotting to kill your pets the moment you turn your back.

And, uh, obviously none of this matters at all unless people actually support it. Github has no support at all for validating the identity of whoever holds a bearer token. Most issuers of bearer tokens have no support for embedding holder identity into the token. This is not good! As of last week, all three of the big cloud providers support virtualised TPMs in their VMs - we should be running CI on systems that can do that, and tying any issued tokens to the VMs that are supposed to be making use of them.

So sure this isn't trivial. But it's also not impossible, and making this stuff work would improve the security of, well, everything. We literally have the technology to prevent attacks like Github suffered. What do we have to do to get people to actually start working on implementing that?


May 16, 2022 07:48 AM

May 09, 2022

Rusty Russell: Pickhardt Payments Implementation: Finding μ!

So, I’ve finally started implementing Pickhardt Payments in Core Lightning (#cln) and there are some practical complications beyond the paper which are worth noting for others who consider this!

In particular, the cost function in the paper cleverly combines the probability of success, with the fee charged by the channel, giving a cost function of:

-log((ce + 1 - fe) / (ce + 1)) + μ · fe · fee(e)

Which is great: bigger μ means fees matter more, smaller means they matter less. And the paper suggests various ways of adjusting them if you don’t like the initial results.

But, what’s a reasonable μ value? 1? 1000? 0.00001? Since the left term is the negative log of a probability, and the right is a value in millisats, it’s deeply unclear to me!

So it’s useful to look at the typical ranges of the first term, and the typical fees (the rest of the second term which is not μ), using stats from the real network.

If we want these two terms to be equal, we get:

-log((ce + 1 - fe) / (ce + 1)) = μ · fe · fee(e)
=> μ = -log((ce + 1 - fe) / (ce + 1)) / (fe · fee(e))

Let’s assume that fee(e) is the median fee: 51 parts per million. I chose to look at amounts of 1sat, 10sat, 100sat, 1000sat, 10,000sat, 100,000sat and 1M sat, and calculated the μ values for each channel. It turns out that, for almost all those values, the 10th percentile μ value is 0.125 times the median, and the 90th percentile μ value is 12.5 times the median, though for 1M sats it’s 0.21x and 51x, which probably reflects that the median fee is not 51 for these channels!

Nonetheless, this suggests we can calculate the “expected μ” using the median capacity of channels we could use for a payment (i.e. those with capacity >= amount), and the median feerate of those channels. We can then bias it by a factor of 10 or so either way, to reasonably promote certainty over fees or vice versa.

So, in the internal API for the moment I accept a frugality factor, generally 0.1 (not frugal, prefer certainty to fees) to 10 (frugal, prefer fees to certainty), and derive μ:

μ = -log((median_capacity_msat + 1 - amount_msat) / (median_capacity_msat + 1)) * frugality / (median_fee + 1)

The median is selected only from the channels with capacity > amount, and the +1 on the median_fee covers the case where median fee turns out to be 0 (such as in one of my tests!).

Note that it’s possible to try to send a payment larger than any channel in the network, using MPP. This is a corner case, where you generally care less about fees, so I set median_capacity_msat in the “no channels” case to amount_msat, and the resulting μ is really large, but at that point you can’t be fussy about fees!
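Putting that together as code (a sketch that simply restates the formula above, not the actual Core Lightning implementation; the function and parameter names are mine):

#include <math.h>

/* frugality ~0.1 prefers certainty over fees, ~10 prefers fees over certainty.
 * If no channel has capacity >= amount, the caller passes
 * median_capacity_msat == amount_msat, which makes mu very large. */
static double derive_mu(double median_capacity_msat, double amount_msat,
                        double median_fee, double frugality)
{
        return -log((median_capacity_msat + 1 - amount_msat)
                    / (median_capacity_msat + 1))
               * frugality / (median_fee + 1);
}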

May 09, 2022 05:37 AM

May 02, 2022

Brendan Gregg: Brendan@Intel.com

I'm thrilled to be joining Intel to work on the performance of everything, apps to metal, with a focus on cloud computing. It's an exciting time to be joining: The geeks are back with [Pat Gelsinger] and [Greg Lavender] as the CEO and CTO; new products are launching including the Sapphire Rapids processor; there are more competitors, which will drive innovation and move the whole industry forward more quickly; and Intel are building new fabs on US soil. It's a critical time to join, and an honour to do so as an Intel fellow, based in Australia.

My dream is to turn computer performance analysis into a science, one where we can completely understand the performance of everything: of applications, libraries, kernels, hypervisors, firmware, and hardware. These were the opening words of my 2019 [AWS re:Invent talk], which I followed by demonstrating rapid on-the-fly dynamic instrumentation of the Intel wireless driver. With the growing complexities of our industry, both hardware and software offerings, it has become increasingly challenging to find the root causes for system performance problems. I dream of solving this: to be able to observe everything and to provide complete answers to any performance question, for any workload, any operating system, and any hardware type.

A while ago I began exploring the idea of building this performance analysis capability for a major cloud, and use it to help find performance improvements to make that cloud industry-leading. The question was: Which cloud should I make number one? I'm grateful for the companies who explored this idea with me and offered good opportunities. I wasn't thinking Intel to start with, but after spending much time talking to Greg and other Intel leaders, I realized the massive opportunity and scope of an Intel role: I can work on new performance and debugging technologies for everything from apps down to silicon, across all xPUs (CPUs, GPUs, IPUs, etc.), and have a massive impact on the world. It was the most challenging of the options before me, just as Netflix was when I joined it, and at that point it gets hard for me to say no. I want to know if I can do this. Why climb the highest mountain?

Intel is a deeply technical company and a leader in high performance computing, and I'm excited to be working with others who have a similar appetite for deep technical work. I'll also have the opportunity to hire and mentor staff and build a team of the world's best performance and debugging engineers. My work will still involve hands-on engineering, but this time with resources.

I was reminded of this while interviewing with other companies: One interviewer who had studied my work asked "How many staff report to you?" "None." He kept returning to this question. I got the feeling that he didn't actually believe it, and thought if he asked enough times I'd confess to having a team. (I'm reminded of my days at Sun Microsystems, where the joke was that I must be a team of clones to get so much done – so I made a [picture].) People have helped me build things (I have detailed Acknowledgement lists in my books), but I've always been an individual contributor. I now have an opportunity at Intel to grow further in my career, and to help grow other people. Teaching others is another passion of mine – it's what drives me to write books, create training materials, and write blog posts here.

I'll still be working on eBPF and other open source projects, and I'm glad that Intel's leadership has committed to continue supporting [open source]. Intel has historically been a top contributor to the Linux kernel, and supports countless other projects. Many of the performance tools I've worked on are open source, in particular eBPF, which plays an important role for understanding the performance of everything. eBPF is in development for Windows as well (not just Linux where it originated).

For cloud computing performance, I'll be working on projects such as the [Intel DevCloud], run by [Markus Flierl], Corporate VP, General Manager, Intel Developer Cloud Platforms. I know Markus from Sun and I'm glad to be working for him at Intel.

While at Netflix in the last few years I've had more regular meetings with Intel than any other company, to the point where we had started joking that I already worked for Intel. I've found them to be not only the deepest technical company – capable of analysis and debugging at a mind-blowing atomic depth – but also professional and a pleasure to work with. This became especially clear after I recently worked with another hardware vendor, who were initially friendly and supportive but after evaluations of their technology went poorly became bullying and misleading. You never know a company (or person) until you see them on their worst day. Over the years I've seen Intel on good days and bad, and they have always been professional and respectful, and work hard to do right by the customer.

A bonus of the close relationship between Intel and Netflix, and my focus on cloud computing at Intel, is that I'll likely continue to help the performance of the Netflix cloud, as well as other clouds (that might mean you!). I'm looking forward to meeting new people in this bigger ecosystem, making computers faster everywhere, and understanding the performance of everything.

The title of this post is indeed my email address (I also used to be Brendan@Sun.com).

[picture]: https://www.brendangregg.com/Images/brendan_clones2006.jpg
[open source]: https://www.linkedin.com/pulse/open-letter-ecosystem-pat-gelsinger
[Pat Gelsinger]: https://twitter.com/PGelsinger
[Greg Lavender]: https://twitter.com/GregL_Intel
[Intel DevCloud]: https://www.intel.com/content/www/us/en/developer/tools/devcloud/overview.html
[AWS re:Invent talk]: https://www.youtube.com/watch?v=16slh29iN1g
[Markus Flierl]: https://www.linkedin.com/in/markus-flierl-375185/

May 02, 2022 12:00 AM

April 17, 2022

Matthew Garrett: The Freedom Phone is not great at privacy

The Freedom Phone advertises itself as a "Free speech and privacy first focused phone". As documented on the features page, it runs ClearOS, an Android-based OS produced by Clear United (or maybe one of the bewildering array of associated companies, we'll come back to that later). It's advertised as including Signal, but what's shipped is not the version available from the Signal website or any official app store - instead it's this fork called "ClearSignal".

The first thing to note about ClearSignal is that the privacy policy link from that page 404s, which is not a great start. The second thing is that it has a version number of 5.8.14, which is strange because upstream went from 5.8.10 to 5.9.0. The third is that, despite Signal being GPL 3, there's no source code available. So, I grabbed jadx and started looking for differences between ClearSignal and the upstream 5.8.10 release. The results were, uh, surprising.

First up is that they seem to have integrated ACRA, a crash reporting framework. This feels a little odd - in the absence of a privacy policy, it's unclear what information this gathers or how it'll be stored. Having a piece of privacy software automatically uploading information about what you were doing in the event of a crash with no notification other than a toast that appears saying "Crash Report" feels a little dubious.

Next is that Signal (for fairly obvious reasons) warns you if your version is out of date and eventually refuses to work unless you upgrade. ClearSignal has dealt with this problem by, uh, simply removing that code. The MacOS version of the desktop app they provide for download seems to be derived from a release from last September, which for an Electron-based app feels like a pretty terrible idea. Weirdly, for Windows they link to an official binary release from February 2021, and for Linux they tell you how to use the upstream repo properly. I have no idea what's going on here.

They've also added support for network backups of your Signal data. This involves the backups being pushed to an S3 bucket using credentials that are statically available in the app. It's ok, though, each upload has some sort of nominally unique identifier associated with it, so it's not trivial to just download other people's backups. But, uh, where does this identifier come from? It turns out that Clear Center, another of the Clear family of companies, employs a bunch of people to work on a ClearID[1], some sort of decentralised something or other that seems to be based on KERI. There's an overview slide deck here which didn't really answer any of my questions and as far as I can tell this is entirely lacking any sort of peer review, but hey it's only the one thing that stops anyone on the internet being able to grab your Signal backups so how important can it be.

The final thing, though? They've extended Signal's invitation support to encourage users to get others to sign up for Clear United. There's an exposed API endpoint called "get_user_email_by_mobile_number" which does exactly what you'd expect - if you give it a registered phone number, it gives you back the associated email address. This requires no authentication. But it gets better! The API to generate a referral link to send to others sends the name and phone number of everyone in your phone's contact list. There does not appear to be any indication that this is going to happen.

So, from a privacy perspective, I'm going to go with things being some distance from ideal. But what's going on with all these Clear companies anyway? They all seem to be related to Michael Proper, who founded the Clear Foundation in 2009. They are, perhaps unsurprisingly, heavily invested in blockchain stuff, while Clear United also appears to be some sort of multi-level marketing scheme which has a membership agreement that includes the somewhat astonishing claim that:

Specifically, the initial focus of the Association will provide members with supplements and technologies for:

9a. Frequency Evaluation, Scans, Reports;

9b. Remote Frequency Health Tuning through Quantum Entanglement;

9c. General and Customized Frequency Optimizations;


- there's more discussion of this and other weirdness here. Clear Center, meanwhile, has a Chief Physics Officer? I have a lot of questions.

Anyway. We have a company that seems to be combining blockchain and MLM, has some opinions about Quantum Entanglement, bases the security of its platform on a set of novel cryptographic primitives that seem to have had no external review, has implemented an API that just hands out personal information without any authentication and an app that appears more than happy to upload all your contact details without telling you first, has failed to update this app to keep up with upstream security updates, and is violating the upstream license. If this is their idea of "privacy first", I really hate to think what their code looks like when privacy comes further down the list.

[1] Pointed out to me here


April 17, 2022 12:23 AM

April 15, 2022

Brendan Gregg: Netflix End of Series 1

A large and unexpected opportunity has come my way outside of Netflix that I've decided to try. Netflix has been the best job of my career so far, and I'll miss my colleagues and the culture.


[Images: offer letter logo (2014); flame graphs (2014); eBPF tools (2014-2019); PMC analysis (2017); my pandemic-abandoned desk (2020); office wall]
I joined Netflix in 2014, a company at the forefront of cloud computing with an attractive [work culture]. It was the most challenging job among those I interviewed for. On the Netflix Java/Linux/EC2 stack there were no working mixed-mode flame graphs, no production safe dynamic tracer, and no PMCs: all tools I used extensively for advanced performance analysis. How would I do my job? I realized that this was a challenge I was best suited to fix. I could help not only Netflix but all customers of the cloud.

Since then I've done just that. I developed the original JVM changes to allow [mixed-mode flame graphs], I pioneered using [eBPF for observability] and helped develop the [front-ends and tools], and I worked with Amazon to get [PMCs enabled] and developed tools to use them. Low-level performance analysis is now possible in the cloud, and with it I've helped Netflix save a very large amount of money, mostly from service teams using flame graphs. There is also now a flourishing industry of observability products based on my work.

Apart from developing tools, much of my time has been spent helping teams with performance issues and evaluations. The Netflix stack is more diverse than I was expecting, and is explained in detail in the [Netflix tech blog]: the production cloud is AWS EC2, Ubuntu Linux, Intel x86, mostly Java with some Node.js (and other languages), microservices, Cassandra (storage), EVCache (caching), Spinnaker (deployment), Titus (containers), Apache Spark (analytics), Atlas (monitoring), FlameCommander (profiling), and at least a dozen more applications and workloads (but no 3rd party agents in the BaseAMI). The Netflix CDN runs FreeBSD and NGINX (not Linux: I published a Netflix-approved [footnote] in my last book to explain why). This diverse environment has always provided me with interesting things to explore, to understand, analyze, debug, and improve.

I've also used and helped develop many other technologies for debugging, primarily perf, Ftrace, eBPF (bcc and bpftrace), PMCs, MSRs, Intel vTune, and of course, [flame graphs] and [heat maps]. Martin Spier and I also created [Flame Scope] while at Netflix, to analyze perturbations and variation in profiles.

I've also had the chance to do other types of work. For 18 months I joined the [CORE SRE team] rotation, and was the primary contact for Netflix outages. It was difficult and fascinating work. I've also created internal training materials and classes, apart from my books. I've worked with awesome colleagues not just in cloud engineering, but also in open connect, studio, DVD, NTech, talent, immigration, HR, PR/comms, legal, and most recently ANZ content.

Last time I quit a job, I wanted to share publicly the reasons why I left, but I ultimately did not. I've since been asked many times why I resigned that job (not unlike [The Prisoner]) along with much speculation (none true). I wouldn't want the same thing happening here, and having people wondering if something bad happened at Netflix that caused me to leave: I had a great time and it's a great company!

I'm thankful for the opportunities and support I've had, especially from my former managers [Coburn] and [Ed]. I'm also grateful for the support for my work by other companies, technical communities, social communities (Twitter, HackerNews), conference organizers, and all who have liked my work, developed it further, and shared it with others. Thank you.

I hope my last two books, [Systems Performance 2nd Ed] and [BPF Performance Tools], serve Netflix well in my absence and everyone else who reads them. I'll still be posting here in my next job. More on that soon...

[work culture]: /blog/2017-11-13/brilliant-jerks.html
[work culture2]: https://www.slideshare.net/kevibak/netflix-culture-deck-77978007
[Systems Performance 2nd Ed]: /systems-performance-2nd-edition-book.html
[BPF Performance Tools]: /bpf-performance-tools-book.html
[mixed-mode flame graphs]: http://techblog.netflix.com/2015/07/java-in-flames.html
[eBPF for observability]: /blog/2015-05-15/ebpf-one-small-step.html
[front-ends and tools]: /Perf/bcc_tracing_tools.png
[front-ends and tools2]: /blog/2019-01-01/learn-ebpf-tracing.html
[PMCs enabled]: /blog/2017-05-04/the-pmcs-of-ec2.html
[CORE SRE team]: /blog/2016-05-04/srecon2016-perf-checklists-for-sres.html
[footnote]: /blog/images/2022/netflixcdn.png
[Coburn]: https://www.linkedin.com/in/coburnw/
[Ed]: https://www.linkedin.com/in/edwhunter/
[Netflix tech blog]: https://netflixtechblog.com/
[Why Don't You Use ...]: /blog/2022-03-19/why-dont-you-use.html
[The Prisoner]: https://en.wikipedia.org/wiki/The_Prisoner
[flame graphs]: /flamegraphs.html
[heat maps]: /heatmaps.html
[Flame Scope]: https://netflixtechblog.com/netflix-flamescope-a57ca19d47bb

April 15, 2022 12:00 AM

April 09, 2022

Brendan Gregg: TensorFlow Library Performance

A while ago I helped a colleague, Vadim, debug a performance issue with TensorFlow in an unexpected location. I thought this was a bit interesting so I've been meaning to share it; here's a rough post of the details.

## 1. The Expert's Eye

Vadim had spotted something unusual in this CPU flamegraph (redacted); do you see it?:

I'm impressed he found it so quickly, but then if you look at enough flame graphs the smaller unusual patterns start to jump out. In this case there's an orange tower (kernel code) that's unusual, and the cause is what I've highlighted here: 10% of total CPU time in page faults. At Netflix, 10% of CPU time somewhere unexpected can be a large costly issue across thousands of server instances. We'll use flame graphs to chase down the 1%s.

## 2. Why is it Still Faulting?

Browsing the stack trace shows these are from __memcpy_avx_unaligned(). Well, at least that makes sense: memcpy would be faulting in data segment mappings. But this process had been up and running for hours, and shouldn't still be doing so much page fault growth of its RSS. You see that early on when segments and the heap are fresh and don't have mappings yet, but after hours they are mostly faulted in (mapped to physical memory) and you see the faults dwindle.

Sometimes processes page fault a lot because they call mmap()/munmap() to add/remove memory. I used my eBPF tools to trace them ([mmapsnoop.py]) but there was no activity. So how is it still page faulting? Is it doing madvise() and dropping memory? A search of madvise in the flame graph showed it was 0.8% of CPU, so it definitely was, and madvise() was calling zap_page_range() which was calling the faults. (Click on the [flame graph] and try Ctrl-F searching for "madvise" and zooming in.)

## 3. Premature Optimization

I read the kernel code related to madvise() and zap_page_range() in mm/madvise.c. That showed it calls zap_page_range() for the MADV_DONTNEED flag. (I could have also traced sys_madvise() flags using kprobes/tracepoints.) This seemed to be a premature optimization gone bad: the allocator was calling dontneed on memory pages that it did in fact need. The dontneed dropped the virtual-to-physical mapping, which would soon afterwards cause a page fault to bring the mapping back.
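The refault pattern is easy to reproduce in isolation. Here's a minimal userspace sketch (illustrative only, nothing to do with the TensorFlow or jemalloc code) that page-faults the same range twice:

#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL * 1024 * 1024;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        memset(p, 1, len);              // first touch: faults every page in
        madvise(p, len, MADV_DONTNEED); // drops the mappings (zap_page_range())
        memset(p, 1, len);              // faults every page in all over again
        munmap(p, len);
        return 0;
}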
## 4. Allocator Issue

I suggested looking into the allocator, and Vadim said it was jemalloc, a configure option. He rebuilt with glibc, and the problem was fixed! Here's the fixed flame graph:

Initial testing showed only a 3% win (which can be verified by the flame graphs). We were hoping for 10%!

[mmapsnoop.py]: https://github.com/brendangregg/bpf-perf-tools-book/blob/master/originals/Ch07_Memory/mmapsnoop.py
[flame graph]: /blog/images/2022/cpuflamegraph.tensorflow0-red.svg

April 09, 2022 12:00 AM

April 05, 2022

Matthew Garrett: Bearer tokens are just awful

As I mentioned last time, bearer tokens are not super compatible with a model in which every access is verified to ensure it's coming from a trusted device. Let's talk about that in a bit more detail.

First off, what is a bearer token? In its simplest form, it's simply an opaque blob that you give to a user after an authentication or authorisation challenge, and then they show it to you to prove that they should be allowed access to a resource. In theory you could just hand someone a randomly generated blob, but then you'd need to keep track of which blobs you've issued and when they should be expired and who they correspond to, so frequently this is actually done using JWTs which contain some base64 encoded JSON that describes the user and group membership and so on and then have a signature associated with them so whenever the user presents one you can just validate the signature and then assume that the contents of the JSON are trustworthy.

One thing to note here is that the crypto is purely between whoever issued the token and whoever validates the token - as far as the server is concerned, any client who can just show it the token is just fine as long as the signature is verified. There's no way to verify the client's state, so one of the core ideas of Zero Trust (that we verify that the client is in a trustworthy state on every access) is already violated.

Can we make things not terrible? Sure! We may not be able to validate the client state on every access, but we can validate the client state when we issue the token in the first place. When the user hits a login page, we do state validation according to whatever policy we want to enforce, and if the client violates that policy we refuse to issue a token to it. If the token has a sufficiently short lifetime then an attacker is only going to have a short period of time to use that token before it expires and then (with luck) they won't be able to get a new one because the state validation will fail.

Except! This is fine for cases where we control the issuance flow. What if we have a scenario where a third party authenticates the client (by verifying that they have a valid token issued by their ID provider) and then uses that to issue their own token that's much longer lived? Well, now the client has a long-lived token sitting on it. And if anyone copies that token to another device, they can now pretend to be that client.

This is, sadly, depressingly common. A lot of services will verify the user, and then issue an oauth token that'll expire some time around the heat death of the universe. If a client system is compromised and an attacker just copies that token to another system, they can continue to pretend to be the legitimate user until someone notices (which, depending on whether or not the service in question has any sort of audit logs, and whether you're paying any attention to them, may be once screenshots of your data show up on Twitter).

This is a problem! There's no way to fit a hosted service that behaves this way into a Zero Trust model - the best you can say is that a token was issued to a device that was, around that time, apparently trustworthy, and now it's some time later and you have literally no idea whether the device is still trustworthy or if the token is still even on that device.

But wait, there's more! Even if you're nowhere near doing any sort of Zero Trust stuff, imagine the case of a user having a bunch of tokens from multiple services on their laptop, and then they leave their laptop unlocked in a cafe while they head to the toilet and whoops it's not there any more, better assume that someone has access to all the data on there. How many services has our opportunistic new laptop owner gained access to as a result? How do we revoke all of the tokens that are sitting there on the local disk? Do you even have a policy for dealing with that?

There isn't a simple answer to all of these problems. Replacing bearer tokens with some sort of asymmetric cryptographic challenge to the client would at least let us tie the tokens to a TPM or other secure enclave, and then we wouldn't have to worry about them being copied elsewhere. But that wouldn't help us if the client is compromised and the attacker simply keeps using the compromised client. The entire model of simply proving knowledge of a secret being sufficient to gain access to a resource is inherently incompatible with a desire for fine-grained trust verification on every access, but I don't see anything changing until we have a standard for third party services to be able to perform that trust verification against a customer's policy.

Still, at least this means I can just run weird Android IoT apps through mitmproxy, pull the bearer token out of the request headers and then start poking the remote API with curl. It may all be broken, but it's also got me a bunch of bug bounty credit, so, it;s impossible to say if its bad or not,

(Addendum: this suggestion that we solve the hardware binding problem by simply passing all the network traffic through some sort of local enclave that could see tokens being set and would then sequester them and reinject them into later requests is OBVIOUSLY HORRIFYING and is also probably going to be at least three startup pitches by the end of next week)


April 05, 2022 06:59 AM

Kees Cook: security things in Linux v5.10

Previously: v5.9

Linux v5.10 was released in December, 2020. Here’s my summary of various security things that I found interesting:

AMD SEV-ES
While guest VM memory encryption with AMD SEV has been supported for a while, Joerg Roedel, Thomas Lendacky, and others added register state encryption (SEV-ES). This means it’s even harder for a VM host to reconstruct a guest VM’s state.

x86 static calls
Josh Poimboeuf and Peter Zijlstra implemented static calls for x86, which operates very similarly to the “static branch” infrastructure in the kernel. With static branches, an if/else choice can be hard-coded, instead of being run-time evaluated every time. Such branches can be updated too (the kernel just rewrites the code to switch around the “branch”). All these principles apply to static calls as well, but they’re for replacing indirect function calls (i.e. a call through a function pointer) with a direct call (i.e. a hard-coded call address). This eliminates the need for Spectre mitigations (e.g. RETPOLINE) for these indirect calls, and avoids a memory lookup for the pointer. For hot-path code (like the scheduler), this has a measurable performance impact. It also serves as a kind of Control Flow Integrity implementation: an indirect call got removed, and the potential destinations have been explicitly identified at compile-time.
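To make the API concrete, here is a minimal sketch (the my_hook and do_default names are made up for illustration):

#include <linux/static_call.h>

static int do_default(int x) // Illustrative default target.
{
        return x + 1;
}

DEFINE_STATIC_CALL(my_hook, do_default); // Defines the call site, initially aimed at do_default().

static int caller(int x)
{
        return static_call(my_hook)(x); // Emitted as a direct call: no pointer load, no retpoline.
}

static void retarget(int (*new_fn)(int))
{
        static_call_update(my_hook, new_fn); // Patches the call site(s) to the new target.
}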

network RNG improvements
In an effort to improve the pseudo-random number generator used by the network subsystem (for things like port numbers and packet sequence numbers), Linux’s home-grown pRNG has been replaced by the SipHash round function, and perturbed by (hopefully) hard-to-predict internal kernel states. This should make it very hard to brute force the internal state of the pRNG and make predictions about future random numbers just from examining network traffic. Similarly, ICMP’s global rate limiter was adjusted to avoid leaking details of network state, as a start to fixing recent DNS Cache Poisoning attacks.

SafeSetID handles GID
Thomas Cedeno improved the SafeSetID LSM to handle group IDs (which required teaching the kernel about which syscalls were actually performing setgid.) Like the earlier setuid policy, this lets the system owner define an explicit list of allowed group ID transitions under CAP_SETGID (instead of to just any group), providing a way to keep the power of granting this capability much more limited. (This isn’t complete yet, though, since handling setgroups() is still needed.)

improve kernel’s internal checking of file contents
The kernel provides LSMs (like the Integrity subsystem) with details about files as they’re loaded. (For example, loading modules, new kernel images for kexec, and firmware.) There wasn’t very good coverage for cases where the contents were coming from things that weren’t files. To deal with this, new hooks were added that allow the LSMs to introspect the contents directly, and to do partial reads. This will give the LSMs much finer grain visibility into these kinds of operations.

set_fs removal continues
With the earlier work landed to free the core kernel code from set_fs(), Christoph Hellwig made it possible for set_fs() to be optional for an architecture. Subsequently, he then removed set_fs() entirely for x86, riscv, and powerpc. These architectures will now be free from the entire class of “kernel address limit” attacks that only needed to corrupt a single value in struct thread_info.

sysfs_emit() replaces sprintf() in /sys
Joe Perches tackled one of the most common bug classes with sprintf() and snprintf() in /sys handlers by creating a new helper, sysfs_emit(). This will handle the cases where kernel code was not correctly dealing with the length results from sprintf() calls, which might lead to buffer overflows in the PAGE_SIZE buffer that /sys handlers operate on. With the helper in place, it was possible to start the refactoring of the many sprintf() callers.
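A typical conversion in a show handler looks like this (a sketch; my_value and the attribute are placeholders, not real kernel symbols):

#include <linux/device.h>
#include <linux/sysfs.h>

static int my_value; // Placeholder state for the example.

static ssize_t my_value_show(struct device *dev,
                             struct device_attribute *attr, char *buf)
{
        /* Previously: return sprintf(buf, "%d\n", my_value);
         * sysfs_emit() knows the buffer is PAGE_SIZE and clamps the output. */
        return sysfs_emit(buf, "%d\n", my_value);
}
static DEVICE_ATTR_RO(my_value);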

nosymfollow mount option
Mattias Nissler and Ross Zwisler implemented the nosymfollow mount option. This entirely disables symlink resolution for the given filesystem, similar to other mount options where noexec disallows execve(), nosuid disallows setid bits, and nodev disallows device files. Quoting the patch, it is “useful as a defensive measure for systems that need to deal with untrusted file systems in privileged contexts.” (i.e. for when /proc/sys/fs/protected_symlinks isn’t a big enough hammer.) Chrome OS uses this option for its stateful filesystem, as symlink traversal has been a common attack-persistence vector.
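For illustration, the option can be set from userspace either with "mount -o nosymfollow" or via the mount(2) flag; here is a sketch of the latter (the device and mountpoint paths are placeholders, and MS_NOSYMFOLLOW only appears in uapi headers from Linux 5.10 onward):

#include <stdio.h>
#include <sys/mount.h>

#ifndef MS_NOSYMFOLLOW
#define MS_NOSYMFOLLOW 256 /* from linux/mount.h; older libc headers may lack it */
#endif

int main(void)
{
        /* Placeholder device and mountpoint. */
        if (mount("/dev/sdb1", "/mnt/untrusted", "ext4", MS_NOSYMFOLLOW, NULL)) {
                perror("mount");
                return 1;
        }
        return 0;
}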

ARMv8.5 Memory Tagging Extension support
Vincenzo Frascino added support to arm64 for the coming Memory Tagging Extension, which will be available for ARMv8.5 and later chips. It provides 4 bits of tags (covering multiples of 16 byte spans of the address space). This is enough to deterministically eliminate all linear heap buffer overflow flaws (1 tag for “free”, and then rotate even values and odd values for neighboring allocations), which is probably one of the most common bugs being currently exploited. It also makes use-after-free and over/under indexing much more difficult for attackers (but still possible if the target’s tag bits can be exposed). Maybe some day we can switch to 128 bit virtual memory addresses and have fully versioned allocations. But for now, 16 tag values is better than none, though we do still need to wait for anyone to actually be shipping ARMv8.5 hardware.

fixes for flaws found by UBSAN
The work to make UBSAN generally usable under syzkaller continues to bear fruit, with various fixes all over the kernel for stuff like shift-out-of-bounds, divide-by-zero, and integer overflow. Seeing these kinds of patches land reinforces the rationale of shifting the burden of these kinds of checks to the toolchain: these run-time bugs continue to pop up.

flexible array conversions
The work on flexible array conversions continues. Gustavo A. R. Silva and others continued to grind on the conversions, getting the kernel ever closer to being able to enable the -Warray-bounds compiler flag and clear the path for saner bounds checking of array indexes and memcpy() usage.
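The conversions themselves are mechanical; a typical before/after (an illustrative struct, not a specific kernel patch) looks like:

/* Before: zero-length trailing array (GNU extension), invisible to bounds checking. */
struct report_old {
        unsigned int len;
        unsigned char data[0];
};

/* After: a C99 flexible array member, which -Warray-bounds and struct_size()
 * can reason about, e.g. kzalloc(struct_size(r, data, n), GFP_KERNEL). */
struct report {
        unsigned int len;
        unsigned char data[];
};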

That’s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.11.

© 2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

April 05, 2022 12:01 AM

March 31, 2022

Matthew Garrett: ZTA doesn't solve all problems, but partial implementations solve fewer

Traditional network access controls work by assuming that something is trustworthy based on some other factor - for example, if a computer is on your office network, it's trustworthy because only trustworthy people should be able to gain physical access to plug something in. If you restrict access to your services to requests coming from trusted networks, then you can assert that it's coming from a trusted device.

Of course, this isn't necessarily true. A machine on your office network may be compromised. An attacker may obtain valid VPN credentials. Someone could leave a hostile device plugged in under a desk in a meeting room. Trust is being placed in devices that may not be trustworthy.

A Zero Trust Architecture (ZTA) is one where a device is granted no inherent trust. Instead, each access to a service is validated against some policy - if the policy is satisfied, the access is permitted. A typical implementation involves granting each device some sort of cryptographic identity (typically a TLS client certificate) and placing the protected services behind a proxy. The proxy verifies the device identity, queries another service to obtain the current device state (we'll come back to that in a moment), compares the state against a policy and either passes the request through to the service or rejects it. Different services can have different policies (eg, you probably want a lax policy around whatever's hosting the documentation for how to fix your system if it's being refused access to something for being in the wrong state), and if you want you can also tie it to proof of user identity in some way.
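As a very rough sketch of that per-request decision (everything here is illustrative: the field names, the ten-minute freshness window, and where the state comes from are assumptions, not any vendor's implementation):

#include <stdbool.h>
#include <string.h>
#include <time.h>

struct device_state {
        char   cert_fingerprint[65]; // hex SHA-256 of the enrolled client certificate
        time_t last_reported;        // when the device last reported its state
        bool   monitoring_running;   // endpoint monitoring still alive?
        bool   marked_lost;          // inventory says lost or stolen
};

static bool policy_allows(const struct device_state *st, const char *presented_fp)
{
        if (!st || strcmp(st->cert_fingerprint, presented_fp) != 0)
                return false;                        // unknown or mismatched identity
        if (st->marked_lost || !st->monitoring_running)
                return false;                        // inventory or runtime state revokes trust
        if (time(NULL) - st->last_reported > 10 * 60)
                return false;                        // state data too stale to trust
        return true;                                 // proxy passes the request through
}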

From a user perspective, this is entirely transparent. The proxy is made available on the public internet, DNS for the services points to the proxy, and every time your users try to access the service they hit the proxy instead and (if everything's ok) gain access to it no matter which network they're on. There's no need to connect to a VPN first, and there's no worries about accidentally leaking information over the public internet instead of over a secure link.

It's also notable that traditional solutions tend to be all-or-nothing. If I have some services that are more sensitive than others, the only way I can really enforce this is by having multiple different VPNs and only granting access to sensitive services from specific VPNs. This obviously risks combinatorial explosion once I have more than a couple of policies, and it's a terrible user experience.

Overall, ZTA approaches provide more security and an improved user experience. So why are we still using VPNs? Primarily because this is all extremely difficult. Let's take a look at an extremely recent scenario. A device used by customer support technicians was compromised. The vendor in question has a solution that can tie authentication decisions to whether or not a device has a cryptographic identity. If this was in use, and if the cryptographic identity was tied to the device hardware (eg, by being generated in a TPM), the attacker would not simply be able to obtain the user credentials and log in from their own device. This is good - if the attacker wanted to maintain access to the service, they needed to stay on the device in question. This increases the probability of the monitoring tooling on the compromised device noticing them.

Unfortunately, the attacker simply disabled the monitoring tooling on the compromised device. If device state was being verified on each access then this would be noticed before too long - the last data received from the device would be flagged as too old, and the requests would no longer satisfy any reasonable access control policy. Instead, the device was assumed to be trustworthy simply because it could demonstrate its identity. There's an important point here: just because a device belongs to you doesn't mean it's a trustworthy device.

So, if ZTA approaches are so powerful and user-friendly, why aren't we all using one? There's a few problems, but the single biggest is that there's no standardised way to verify device state in any meaningful way. Remote Attestation can both prove device identity and the device boot state, but the only product on the market that does much with this is Microsoft's Device Health Attestation. DHA doesn't solve the broader problem of also reporting runtime state - it may be able to verify that endpoint monitoring was launched, but it doesn't make assertions about whether it's still running. Right now, people are left trying to scrape this information from whatever tooling they're running. The absence of any standardised approach to this problem means anyone who wants to deploy a strong ZTA has to integrate with whatever tooling they're already running, and that then increases the cost of migrating to any other tooling later.

But even device identity is hard! Knowing whether a machine should be given a certificate or not depends on knowing whether or not you own it, and inventory control is a surprisingly difficult problem in a lot of environments. It's not even just a matter of whether a machine should be given a certificate in the first place - if a machine is reported as lost or stolen, its trust should be revoked. Your inventory system needs to tie into your device state store in order to ensure that your proxies drop access.

And, worse, all of this depends on you being able to put stuff behind a proxy in the first place! If you're using third-party hosted services, that's a problem. In the absence of a proxy, trust decisions are probably made at login time. It's possible to tie user auth decisions to device identity and state (eg, a self-hosted SAML endpoint could do that before passing through to the actual ID provider), but that's still going to end up providing a bearer token of some sort that can potentially be exfiltrated, and will continue to be trusted even if the device state becomes invalid.

ZTA doesn't solve all problems, and there isn't a clear path to it doing so without significantly greater industry support. But a complete ZTA solution is significantly more powerful than a partial one. Verifying device identity is a step on the path to ZTA, but in the absence of device state verification it's only a step.

March 31, 2022 11:06 PM

March 23, 2022

Matthew Garrett: AMD's Pluton implementation seems to be controllable

I've been digging through the firmware for an AMD laptop with a Ryzen 6000 that incorporates Pluton for the past couple of weeks, and I've got some rough conclusions. Note that these are extremely preliminary and may not be accurate, but I'm going to try to encourage others to look into this in more detail. For those of you at home, I'm using an image from here, specifically version 309. The installer is happy to run under Wine, and if you tell it to "Extract" rather than "Install" it'll leave a file sitting in C:\DRIVERS\ASUS_GA402RK_309_BIOS_Update_20220322235241 which seems to have an additional 2K of header on it. Strip that and you should have something approximating a flash image.
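If you want to follow along, something like the following should work (the input file name is a placeholder, and I'm assuming "2K" means 2048 bytes):

# Strip the 2048-byte vendor header to get an approximate flash image,
# then look for UTF-16 (little-endian) strings in it.
dd if=extracted_bios_update of=flash.bin bs=2048 skip=1
strings -e l flash.bin | less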

Looking for UTF16 strings in this reveals something interesting:

Pluton (HSP) X86 Firmware Support
Enable/Disable X86 firmware HSP related code path, including AGESA HSP module, SBIOS HSP related drivers.
Auto - Depends on PcdAmdHspCoreEnable build value
NOTE: PSP directory entry 0xB BIT36 have the highest priority.
NOTE: This option will NOT put HSP hardware in disable state, to disable HSP hardware, you need setup PSP directory entry 0xB, BIT36 to 1.
// EntryValue[36] = 0: Enable, HSP core is enabled.
// EntryValue[36] = 1: Disable, HSP core is disabled then PSP will gate the HSP clock, no further PSP to HSP commands. System will boot without HSP.

"HSP" here means "Hardware Security Processor" - a generic term that refers to Pluton in this case. This is a configuration setting that determines whether Pluton is "enabled" or not - my interpretation of this is that it doesn't directly influence Pluton, but disables all mechanisms that would allow the OS to communicate with it. In this scenario, Pluton has its firmware loaded and could conceivably be functional if the OS knew how to speak to it directly, but the firmware will never speak to it itself. I took a quick look at the Windows drivers for Pluton and it looks like they won't do anything unless the firmware wants to expose Pluton, so this should mean that Windows will do nothing.

So what about the reference to "PSP directory entry 0xB BIT36 have the highest priority"? The PSP is the AMD Platform Security Processor - it's an ARM core on the CPU package that boots before the x86. The PSP firmware lives in the same flash image as the x86 firmware, so the PSP looks for a header that points it towards the firmware it should execute. This gives a pointer to a "directory" - a list of different object types and where they're located in flash (there's a description of this for slightly older AMDs here). Type 0xb is treated slightly specially. Where most types contain the address of where the actual object is, type 0xb contains a 64-bit value that's interpreted as enabling or disabling various features - something AMD calls "soft fusing" (Intel have something similar that involves setting bits in the Firmware Interface Table). The PSP looks at the bits that are set here and alters its behaviour. If bit 36 is set, the PSP tells Pluton to turn itself off and will no longer send any commands to it.

So, we have two mechanisms to disable Pluton - the PSP can tell it to turn itself off, or the x86 firmware can simply never speak to it or admit that it exists. Both of these imply that Pluton has started executing before it's shut down, so it's reasonable to wonder whether it can still do stuff. In the image I'm looking at, there's a blob starting at 0x0069b610 that appears to be firmware for Pluton - it contains chunks that appear to be the reference TPM2 implementation, and it broadly decompiles as valid ARM code. It should be viable to figure out whether it can do anything in the face of being "disabled" via either of the above mechanisms.
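If you want to poke at that blob yourself, carving it out of the image is straightforward (file names are placeholders, and since the blob's length isn't known here this just takes everything from the offset onwards):

# Carve out the (presumed) Pluton firmware blob for inspection.
dd if=flash.bin of=pluton_fw.bin bs=4k iflag=skip_bytes skip=$((0x0069b610))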

Unfortunately for me, the system I'm looking at does set bit 36 in the 0xb entry - as a result, Pluton is disabled before x86 code starts running and I can't investigate further in any straightforward way. The implication that the user-controllable mechanism for disabling Pluton merely disables x86 communication with it rather than turning it off entirely is a little concerning, although (assuming Pluton is behaving as a TPM rather than having an enhanced set of capabilities) skipping any firmware communication means the OS has no way to know what happened before it started running even if it has a mechanism to communicate with Pluton without firmware assistance. In that scenario it'd be viable to write a bootloader shim that just faked up the firmware measurements before handing control to the OS.

The bit 36 disabling mechanism seems more solid? Again, it should be possible to analyse the Pluton firmware to determine whether it actually pays attention to a disable command being sent. But even if it chooses to ignore that, if the PSP is in a position to just cut the clock to Pluton, it's not going to be able to do a lot. At that point we're trusting AMD rather than trusting Microsoft, but given that you're also trusting AMD to execute the code you're giving them to execute, it's hard to avoid placing trust in them.

Overall: I'm reasonably confident that systems that ship with Pluton disabled via setting bit 36 in the soft fuses are going to disable it sufficiently hard that the OS can't do anything about it. Systems that give the user an option to enable or disable it are a little less clear in that respect, and it's possible (but not yet demonstrated) that an OS could communicate with Pluton anyway. However, if that's true, and if the firmware never communicates with Pluton itself, the user could install a stub loader in UEFI that mimics the firmware behaviour and leaves the OS thinking everything was good when it absolutely is not.

So, assuming that Pluton in its current form on AMD has no capabilities outside those we know about, the disabling mechanisms are probably good enough. It's tough to make a firm statement on this before I have access to a system that doesn't just disable it immediately, so stay tuned for updates.

March 23, 2022 08:42 AM

March 19, 2022

Brendan Gregg: Why Don't You Use ...

Working for a famous tech company, I get asked a lot "Why don't you use technology X?" X may be an application, programming language, operating system, hypervisor, processor, or tool. It may be because:

- It performs poorly.
- It is too expensive.
- It is not open source.
- It lacks features.
- It lacks a community.
- It lacks debug tools.
- It has serious bugs.
- It is poorly documented.
- It lacks timely security fixes.
- It lacks subject matter expertise.
- It's developed for the wrong audience.
- Our custom internal solution is good enough.
- Its longevity is uncertain: Its startup may be dead or sold soon.
- We know under NDA of a better solution.
- We know other bad things under NDA.
- Key contributors told us it was doomed.
- It made sense a decade ago but doesn't today.
- It made false claims in articles/blogs/talks and has lost credibility.
- It tolerates brilliant jerks and has no effective CoC.
- Our lawyers won't accept its T&Cs or license.

It's rarely because we don't know about it or don't understand it. Sometimes we do know about it, but have yet to find time to check it out. For big technical choices, it is often the result of an internal evaluation that involved various teams, and for some combination of the above reasons. Companies typically do not share these internal product evaluations publicly.

It's easier to say what you do use and why, than what you don't use and why not.
I used to be the outsider asking the big companies, and their silence would drive me nuts. Do they not know about this technology? Why wouldn't they use it?... I finally get it now, having seen it from the other side. They are usually well aware of various technologies but have reasons not to use them which they usually won't share.

## Why companies won't say why

**Private reasons**:

- It is too expensive.
- We know under NDA of a better solution.
- We know other bad things under NDA.
- Key contributors told us it was doomed.

These have all happened to me. Technology X may be too expensive because we're using another technology with a special discount that's confidential. If the reasons are under NDA, then I also cannot share them. In one case I was interested in a technology, only to have the CEO of the company that developed it tell me, under NDA, that he was abandoning it. They had not announced this publicly. I've also privately chatted with key technology contributors at conferences who are looking for something different to work on, because they believe their own technology is doomed. At some companies, the reason, whatever it is, may just be considered competitive knowledge and thus confidential.

**Complicated reasons**:

- It performs poorly.
- Our lawyers won't accept its T&Cs or license ([example]).
- Our custom internal solution is good enough.
- It made sense a decade ago but doesn't today.
- It tolerates brilliant jerks and has no effective CoC.
- It lacks subject matter expertise.
- It's developed for the wrong audience.

These are complicated and time-consuming to explain properly, and it may not be a good use of engineering time. If you rush a quick answer instead, you can put the company in a worse position than by saying nothing: that of providing a _weak_ explanation. Those vested in the technology will find it easy to attack your response – and have more time, energy, and interest in doing so – which can make your company look worse.

I've found one of these reasons is common: discovering that a product was built without subject matter expertise. I've seen startups that do nothing new and nothing well in a space that's crying for new or better solutions. I usually find out later that the development team has no prior domain experience. It's complicated to explain, even to the development team, as they lack the background to best understand why they missed the mark.

A common reason I've seen at smaller companies is that an internal solution is good enough. Adopting new technologies isn't "free": It takes (their limited) engineering time away from other projects, it adds technical debt, and it may add security risk (agents running as root). In this case, the other technology just doesn't have enough features or improved performance to justify a switch.

**Safer reasons**:

- It is poorly documented.
- It lacks features.
- It lacks debug tools.
- It has serious bugs.
- etc.

Those are objective and easy to discuss, right? Sort of. There are usually multiple reasons not to use a product, which might include all three: "safe", "complicated", and "private." It can be misleading to only mention the safe reasons. Also, if you do, and they get fixed, people will feel let down when you still don't switch.

**Bad reasons**:

- Not invented here syndrome.
- Pride/ego.
- Corporate politics.
- etc.

I didn't include them in the above list, but there can be some poor reasons behind technology choices.
At large tech companies (with many staff and their collective expertise) I'd expect that most technical choices have valid reasons, even if the company won't share them; poor reasons are the exception and not the rule. People assume it's poor technical choices ten times more often than it actually is. Just as an example, I've been asked many times why my company uses FreeBSD for its CDN, with the assumption that there must be some dumb reason we didn't choose Linux. No, we do regular Linux vs FreeBSD production tests (which I've helped analyze) and FreeBSD is faster for that workload. For a long time we just avoided this question. I finally got approval to mention it in a footnote in my last book (SysPerf2 page 124).

The only sure way to find out why a company doesn't use a given technology is to join that company and then ask the right team. Otherwise, try asking questions that are safer to answer, for example: "Are you aware of technology X?" and "Are you aware that technology X claims to be better at Y?". This can at least rule out ignorance.

Update: This was discussed on [Hacker News], which brought up some other reasons, including:

- Engineers don't know X and can't hire X engineers (certainly true at smaller companies).
- The cost of switching outweighs the benefits.
- Legacy.

And I liked the comment about "...random drive-by comments where the person is really just advertising x, rather than being genuinely curious."

[example]: /blog/2014-05-17/free-as-in-we-own-your-ip.html
[Hacker News]: https://news.ycombinator.com/item?id=30755861

March 19, 2022 12:00 AM

March 15, 2022

Lucas De Marchi: Tracepoints and kernel modules

While debugging, it’s invaluable to be able to trace functions in the Linux kernel. There are several tools for that task: perf, ftrace (with helper tools like trace-cmd), bpftrace, etc.

For my own debugging of kernel internals, and in the last few years to debug the i915 module for Intel graphics, I usually use tracepoints or dynamic tracepoints (with perf probe) as the trigger for whatever I want to trace.

One question I get sometimes is “how do I trace the initial probe of a kernel module?”. At that point you (usually) don’t have the module loaded yet and, as a consequence, the symbols of that module are not yet available for use with tracepoints. For example:

# trace-cmd list -e '^i915.*'
#

One thing to realize is that a lot of the time we don’t want to debug the module loading, but rather the module binding to a device. These are two separate things that happen in a cascade of events. It probably also adds to the confusion that the userspace tools to load modules are not named very consistently (modprobe and insmod), and they both end up calling the [f]init_module syscall.

The various buses available in the kernel have mechanisms for stopping the autoprobe (and hence the binding to a device) when a module is loaded. With i915, all we have to do is set a few things in /sys/bus/pci/:

# echo 0 > /sys/bus/pci/drivers_autoprobe
# modprobe i915

With the i915 module loaded, but not attached to any device, we are ready to attach to any tracepoint or create new dynamic probes. Let’s suppose I want to debug the i915_driver_probe() function, and any function called by it during initialization. This is one of the functions that initialize the GPU in i915, called when we are binding to a device.

F=i915_driver_probe

# Load the module without letting it bind to any device yet
echo 0 > /sys/bus/pci/drivers_autoprobe
modprobe i915

cd /sys/kernel/debug/tracing
# Clear any previous trace data
cat /dev/null > trace
# Graph $F and everything it calls, except _printk, up to 10 levels deep
echo $F > set_graph_function
echo _printk > set_graph_notrace
echo 10 > max_graph_depth
echo function_graph > current_tracer
echo 1 > tracing_on

# Bind the device manually and save the resulting trace
echo 0000:03:00.0 | tee /sys/bus/pci/drivers/i915/bind
cp trace /tmp/trace.txt

echo 0 > tracing_on
cat /dev/null > trace

With the snippet above we will start tracing whenever the i915_driver_probe() function is executed. We also set a few additional parameters: the maximum call depth to graph, and disabling graphing of printk() since it’s usually not very interesting. Depending on the size of your trace you may also need to increase buffer_size_kb to avoid the ring buffer rotating, or have something pumping data out of the ring buffer to a file.
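For example (the value and the destination file here are arbitrary; adjust them to your workload):

# Give the per-CPU ring buffers more room (value is in KiB per CPU)
echo 65536 > buffer_size_kb
# ...or stream events out continuously instead of copying "trace" at the end
cat trace_pipe > /tmp/trace.txt &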

Even after we enable tracing by echoing 1 to the tracing_on file, nothing happens, as we are not automatically binding to the device. In the snippet above we bind it manually, by writing the PCI slot to the /sys/bus/pci/drivers/i915/bind file. We should then start seeing the function being traced:

# head /tmp/trace.txt
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
  3)               |  i915_driver_probe [i915]() {
  3)               |    __devm_drm_dev_alloc() {
  3)               |      __kmalloc() {
  3)               |        kmalloc_order_trace() {
  3)               |          kmalloc_order() {
  3)               |            __alloc_pages() {

Another very useful thing to do is to attach to a specific tracepoint. For example, to trace MMIO reads/writes we can use the commands below. This time I’m using trace-cmd directly and omitting the device bind, which we should do after starting the recording:

# trace-cmd record -e i915:i915_reg_rw
Hit Ctrl^C to stop recording
^C
# trace-cmd report
...
tee-2239  [009]   778.693172: i915_reg_rw:          read reg=0x51000, len=4, val=(0xc1900, 0x0)
tee-2239  [009]   778.693174: i915_reg_rw:          write reg=0xc6204, len=4, val=(0x10251000, 0x0)
tee-2239  [009]   778.693796: i915_reg_rw:          read reg=0xa26c, len=4, val=(0x4, 0x0)
tee-2239  [009]   778.693805: i915_reg_rw:          read reg=0xd00, len=4, val=(0x14, 0x0)
tee-2239  [009]   778.693808: bprint:               intel_gt_init_clock_frequency: 0000:03:00.0 Using clock frequency: 19200kHz, period: 53ns, wrap: 113816ms
tee-2239  [009]   778.693817: i915_reg_rw:          read reg=0x913c, len=4, val=(0xff00, 0x0)
tee-2239  [009]   778.693825: i915_reg_rw:          read reg=0x9144, len=4, val=(0xff00, 0x0)
tee-2239  [009]   778.693908: i915_reg_rw:          read reg=0x9134, len=4, val=(0xff, 0x0)
tee-2239  [009]   778.693918: i915_reg_rw:          read reg=0x9118, len=4, val=(0xd02, 0x0)
tee-2239  [009]   778.693933: i915_reg_rw:          read reg=0x9140, len=4, val=(0x30005, 0x0)
tee-2239  [009]   778.693941: i915_reg_rw:          read reg=0x911c, len=4, val=(0x3000000, 0x0)

March 15, 2022 05:00 PM

March 12, 2022

Linux Plumbers Conference: Proposed Microconferences

We are pleased to announce the first batch of proposed microconferences at the Linux Plumbers Conference (LPC) 2022:

The call for microconference proposals will close on Saturday, 2 April 2022. The slots are filling up fast, so we strongly encourage everyone thinking about submitting a microconference to do so as soon as possible!

LPC 2022 is currently planned to take place in Dublin, Ireland from 12 September to 14 September. For details about the location and co-location with other events, see our website and social media for updates.

We do hope that LPC 2022 will be mainly an in-person event. Ideally, microconference runners should be willing and able to attend in person.

March 12, 2022 01:03 PM

February 17, 2022

Dave Airlie (blogspot): optimizing llvmpipe vertex/fragment processing.

Around 2 years ago, while I was working on tessellation support for llvmpipe and running the heaven benchmark on my Ryzen, I noticed that heaven, despite running slowly, wasn't saturating all the cores. I dug in a bit, and found that llvmpipe, despite threading the rasterization, fragment shading and blending stages, never did anything else while those were happening.

I dug into the code, as I clearly remembered seeing a concept of a "scene" that all the primitives were binned into and then dispatched from. It turned out the "scene" was always executed synchronously.

At the time I wrote support to allow multiple scenes to exist, so that while one scene was executing, the vertex shading and binning for the next scene could proceed and be queued up. For heaven at the time I saw some places where it would build 36 scenes. However, heaven was still 1fps with tess, regressions in other areas were rampant, and I mostly left the patches in a branch.

The reason so many things were broken by the patches was that large parts of llvmpipe, and also lavapipe, weren't ready for async pipeline processing. The concept of a fence after the pipeline finished was there, but wasn't used properly everywhere. A lot of operations assumed there was nothing going on behind the scenes, so they never fenced. Lots of things like queries broke due to the fact that a query would always be ready in the old model, but now query availability could return unavailable like on a real hw driver. Resource tracking existed but was incomplete, so knowing when to flush wasn't always accurate. Presentation was broken due to incorrect waiting, for both GL and Lavapipe. Lavapipe needed semaphore support that actually did something, as apps use semaphores between the render and present pipeline pieces.

Mesa CI recently got some paraview traces added to it, and I was doing some perf traces with them. Paraview is a data visualization tool, and it generates vertex-heavy workloads, as opposed to compositors and even games. It turned out binning was most of the overhead, and I realized the overlapping series could help this sort of workload. I dusted off the patch series and nailed down all the issues.

Emma Anholt ran some benchmarks on the results with the paraview traces and got

  • pv-waveletvolume fps +13.9279% +/- 4.91667% (n=15)
  • pv-waveletcountour fps +67.8306% +/- 11.4762% (n=3)
which seems like a good return on the investment.

I've got it all lined up in a merge request and it doesn't break CI anymore, so hopefully I'll get it landed in the next while, once I clean up any misc bits.

February 17, 2022 03:33 AM

February 11, 2022

Linux Plumbers Conference: CFP Open – Refereed Presentations

The Call for Refereed Presentation Proposals for the 2022 edition of the Linux Plumbers Conference (LPC) is now open.  We plan to hold LPC in Dublin, Ireland on September 12-14 in conjunction with The Linux Foundation Open Source Summit. 

If an in-person conference should prove to be impossible due to the circumstances at that time, Linux Plumbers will switch to a virtual-only conference. Submitters should ideally be able to give their presentation in person if circumstances permit, although presenting remotely will be possible in either case. Please see our website or social media for regular updates.

Refereed Presentations are 45 minutes in length and should focus on a specific aspect of the “plumbing” in a Linux system. Examples of Linux plumbing include core kernel subsystems, init systems, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problem statements, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

The Refereed Presentations track will be running throughout all three days of the conference. Note that the current Linux Plumbers Refereed track may overlap with the Open Source Summit.

Linux Plumbers Conference Program Committee members will be reviewing all submitted proposals.  High-quality submissions that cannot be accepted due to the limited number of slots will be forwarded to both the Open Source Summit and to organizers of suitable Linux Plumbers Microconferences for further consideration.

To submit a Refereed Track Presentation proposal, follow the instructions here [1].

Submissions are due on or before 11:59PM UTC on Sunday, June 12, 2022.

[1] https://lpc.events/event/16/abstracts/

February 11, 2022 07:24 PM

February 04, 2022

Linux Plumbers Conference: CFP Open – Microconferences

We are pleased to announce the call for papers (cfp) for microconferences at the Linux Plumbers Conference (LPC) 2022.

LPC 2022 is currently planned to take place in Dublin, Ireland from 12 September to 14 September. For details about the location and co-location with other events, see our website and social media for updates.

We do hope that LPC 2022 will be mainly an in-person event. Ideally, microconference runners should be willing and able to attend in person.

As the name suggests, LPC is concerned with Linux plumbing encompassing topics from kernel and userspace. A microconference is a set of sessions organized around a particular topic. The topic can be a kernel subsystem or a specific problem area in either kernel or userspace.

A microconference is supposed to be research and development in action and an abstract for a microconference should be thought of as a set of research questions and problem statements.

The sessions in each microconference are expected to address specific problems and should generate new ideas, solutions, and patches. Sessions should be focussed on discussion. Presentations should always aim to aid or kick off a discussion. If your presentation feels like a talk, we would recommend considering a submission to the LPC refereed track instead.

In past years microconferences were organized around topics such as security, scalability, energy efficiency, toolchains, containers, printing, system boot, Android, scheduling, filesystems, tracing, or real-time. The LPC microconference track is open to a wide variety of topics as long as it is focussed, concerned with interesting problems, and is related to open source and the wider Linux ecosystem. We are happy about a wide range of topics!

A microconference submission should outline the overall topic and list key people and problems which can be discussed. The list of problems and specific topics in a microconference can be continuously updated until fairly late. This will allow microconferences to cover topics that pop up after submission and to address new developments or problems.

Microconferences that have been at previous LPCs should list results and accomplishments in the submission and should make sure to cover follow-up work and new topics.

After a microconference has been accepted, microconference organizers are expected to write a short blogpost for the LPC website to announce and advertise their topic.

February 04, 2022 03:48 PM

February 03, 2022

Pete Zaitcev: Cura on Fedora is dead, use Slic3r

Was enjoying my Prusa i3S for a few months, but had to use my Lulzbot Mini today, and it was something else.

In the past, I used the cura-lulzbot package. It went through difficult times, with a Russian take-over and Qtfication. But I persisted in suffering, because, well, it was turnkey and I was a complete novice.

So, I went to install Cura on Fedora 35, and found that package cura-lulzbot is gone. Probably failed to build, and with spot no longer at Red Hat, nobody was motivated enough to keep it going.

The "cura" package is the Ultimaker Cura. It's an absolute dumpster fire of pretend open source. Tons of plug-ins, but basic materials are missing. I print in BASF Ultrafuse ABS, but the nearest available material is the PC/ABS mix.

The material problem is fixable with configuration, but a more serious problem is that the UI is absolutely bonkers with crazy flashing - and it does not work. They have menus that cannot be reached: as I move the cursor into a submenu, it disappears. Something is seriously broken in Qt on F35.

BTW, OpenSCAD suffers from incorrect refresh too on F35. It's super annoying, but at least it works, mostly.

Fortunately, "dnf remove cura" also removes 743 trash packages that it pulls in.

Then, I installed Slic3r, and that turned out to be pretty dope. It's a well put together package, and it has a graphical UI nowadays, operation of which is mostly bug-free and makes sense.

However, my first print popped off. As it turned out, Lulzbot requires the initial sequence that auto-levels it, and I missed that. I could extract it from my old dot files, but in the end I downloaded a settings package from Lulzbot website.

February 03, 2022 12:55 AM

January 20, 2022

Linux Plumbers Conference: Welcome to the 2022 Linux Plumbers Conference

Planning for the 2022 Linux Plumbers Conference is well underway. The hope is to be in Dublin co-located with OSS EU (although with hopefully non-overlapping dates). However, the Linux Foundation is still negotiating for a suitable venue so we can’t fully confirm the location yet.

There is an outside (and hopefully receding) chance that we may have to go back to being fully on-line this year, but if that happens, we’ll be sure to alert you through the usual channels of this blog and twitter.

January 20, 2022 05:55 PM

January 17, 2022

Matthew Garrett: Boot Guard and PSB have user-hostile defaults

Compromising an OS without it being detectable is hard. Modern operating systems support the imposition of a security policy or the launch of some sort of monitoring agent sufficiently early in boot that even if you compromise the OS, you're probably going to have left some sort of detectable trace[1]. You can avoid this by attacking the lower layers - if you compromise the bootloader then it can just hotpatch a backdoor into the kernel before executing it, for instance.

This is avoided via one of two mechanisms. Measured boot (such as TPM-based Trusted Boot) makes a tamper-proof cryptographic record of what the system booted, with each component in turn creating a measurement of the next component in the boot chain. If a component is tampered with, its measurement will be different. This can be used to either prevent the release of a cryptographic secret if the boot chain is modified (for instance, using the TPM to encrypt the disk encryption key), or can be used to attest the boot state to another device which can tell you whether you're safe or not. The other approach is verified boot (such as UEFI Secure Boot), where each component in the boot chain verifies the next component before executing it. If the verification fails, execution halts.

In both cases, each component in the boot chain measures and/or verifies the next. But something needs to be the first link in this chain, and traditionally this was the system firmware. Which means you could tamper with the system firmware and subvert the entire process - either have the firmware patch the bootloader in RAM after measuring or verifying it, or just load a modified bootloader and lie about the measurements or ignore the verification. Attackers had already been targeting the firmware (Hacking Team had something along these lines, although this was pre-secure boot so just dropped a rootkit into the OS), and given a well-implemented measured and verified boot chain, the firmware becomes an even more attractive target.

Intel's Boot Guard and AMD's Platform Secure Boot attempt to solve this problem by moving the validation of the core system firmware to an (approximately) immutable environment. Intel's solution involves the Management Engine, a separate x86 core integrated into the motherboard chipset. The ME's boot ROM verifies a signature on its firmware before executing it, and once the ME is up it verifies that the system firmware's bootblock is signed using a public key that corresponds to a hash blown into one-time programmable fuses in the chipset. What happens next depends on policy - it can either prevent the system from booting, allow the system to boot to recover the firmware but automatically shut it down after a while, or flag the failure but allow the system to boot anyway. Most policies will also involve a measurement of the bootblock being pushed into the TPM.

AMD's Platform Secure Boot is slightly different. Rather than the root of trust living in the motherboard chipset, it's in AMD's Platform Security Processor which is incorporated directly onto the CPU die. Similar to Boot Guard, the PSP has ROM that verifies the PSP's own firmware, and then that firmware verifies the system firmware signature against a set of blown fuses in the CPU. If that fails, system boot is halted. I'm having trouble finding decent technical documentation about PSB, and what I have found doesn't mention measuring anything into the TPM - if this is the case, PSB only implements verified boot, not measured boot.

What's the practical upshot of this? The first is that you can't replace the system firmware with anything that doesn't have a valid signature, which effectively means you're locked into firmware the vendor chooses to sign. This prevents replacing the system firmware with either a replacement implementation (such as Coreboot) or a modified version of the original implementation (such as firmware that disables locking of CPU functionality or removes hardware allowlists). In this respect, enforcing system firmware verification works against the user rather than benefiting them.
Of course, it also prevents an attacker from doing the same thing, but while this is a real threat to some users, I think it's hard to say that it's a realistic threat for most users.

The problem is that vendors are shipping with Boot Guard and (increasingly) PSB enabled by default. In the AMD case this causes another problem - because the fuses are in the CPU itself, a CPU that's had PSB enabled is no longer compatible with any motherboards running firmware that wasn't signed with the same key. If a user wants to upgrade their system's CPU, they're effectively unable to sell the old one. But in both scenarios, the user's ability to control what their system is running is reduced.

As I said, the threat that these technologies seek to protect against is real. If you're a large company that handles a lot of sensitive data, you should probably worry about it. If you're a journalist or an activist dealing with governments that have a track record of targeting people like you, it should probably be part of your threat model. But otherwise, the probability of you being hit by a purely userland attack is so ludicrously high compared to you being targeted this way that it's just not a big deal.

I think there's a more reasonable tradeoff than where we've ended up. Tying things like disk encryption secrets to TPM state means that if the system firmware is measured into the TPM prior to being executed, we can at least detect that the firmware has been tampered with. In this case nothing prevents the firmware being modified, there's just a record in your TPM that it's no longer the same as it was when you encrypted the secret. So, here's what I'd suggest:

1) The default behaviour of technologies like Boot Guard or PSB should be to measure the firmware signing key and whether the firmware has a valid signature into PCR 7 (the TPM register that is also used to record which UEFI Secure Boot signing key is used to verify the bootloader).
2) If the PCR 7 value changes, the disk encryption key release will be blocked, and the user will be redirected to a key recovery process. This should include remote attestation, allowing the user to be informed that their firmware signing situation has changed.
3) Tooling should be provided to switch the policy from merely measuring to verifying, and users at meaningful risk of firmware-based attacks should be encouraged to make use of this tooling

This would allow users to replace their system firmware at will, at the cost of having to re-seal their disk encryption keys against the new TPM measurements. It would provide enough information that, in the (unlikely for most users) scenario that their firmware has actually been modified without their knowledge, they can identify that. And it would allow users who are at high risk to switch to a higher security state, and for hardware that is explicitly intended to be resilient against attacks to have different defaults.
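To make the re-sealing step concrete, here's a rough sketch using tpm2-tools of sealing a disk encryption key against the current PCR 7 value (file names are arbitrary, and option syntax varies a little between tpm2-tools versions):

# Create a primary key, a policy bound to PCR 7, and a sealed object under it.
tpm2_createprimary -C o -c primary.ctx
tpm2_createpolicy --policy-pcr -l sha256:7 -L pcr7.policy
tpm2_create -C primary.ctx -L pcr7.policy -i disk-key.bin -u seal.pub -r seal.priv
tpm2_load -C primary.ctx -u seal.pub -r seal.priv -c seal.ctx
# Unsealing only succeeds while PCR 7 still matches the value it was sealed against.
tpm2_unseal -c seal.ctx -p pcr:sha256:7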

This is frustratingly close to possible with Boot Guard, but I don't think it's quite there. Before you've blown the Boot Guard fuses, the Boot Guard policy can be read out of flash. This means that you can drop a Boot Guard configuration into flash telling the ME to measure the firmware but not prevent it from running. But there are two problems remaining:

1) The measurement is made into PCR 0, and PCR 0 changes every time your firmware is updated. That makes it a bad default for sealing encryption keys.
2) It doesn't look like the policy is measured before being enforced. This means that an attacker can simply reflash modified firmware with a policy that disables measurement and then make a fake measurement that makes it look like the firmware is ok.

Fixing this seems simple enough - the Boot Guard policy should always be measured, and measurements of the policy and the signing key should be made into a PCR other than PCR 0. If an attacker modified the policy, the PCR value would change. If an attacker modified the firmware without modifying the policy, the PCR value would also change. People who are at high risk would run an app that would blow the Boot Guard policy into fuses rather than just relying on the copy in flash, and enable verification as well as measurement. Now if an attacker tampers with the firmware, the system simply refuses to boot and the attacker doesn't get anything.

Things are harder on the AMD side. I can't find any indication that PSB supports measuring the firmware at all, which obviously makes this approach impossible. I'm somewhat surprised by that, and so wouldn't be surprised if it does do a measurement somewhere. If it doesn't, there's a rather more significant problem - if a system has a socketed CPU, and someone has sufficient physical access to replace the firmware, they can just swap out the CPU as well with one that doesn't have PSB enabled. Under normal circumstances the system firmware can detect this and prompt the user, but given that the attacker has just replaced the firmware we can assume that they'd do so with firmware that doesn't decide to tell the user what just happened. In the absence of better documentation, it's extremely hard to say that PSB actually provides meaningful security benefits.

So, overall: I think Boot Guard protects against a real-world attack that matters to a small but important set of targets. I think most of its benefits could be provided in a way that still gave users control over their system firmware, while also permitting high-risk targets to opt-in to stronger guarantees. Based on what's publicly documented about PSB, it's hard to say that it provides real-world security benefits for anyone at present. In both cases, what's actually shipping reduces the control people have over their systems, and should be considered user-hostile.

[1] Assuming that someone's both turning this on and actually looking at the data produced

January 17, 2022 04:37 AM

January 09, 2022

Matthew Garrett: Pluton is not (currently) a threat to software freedom

At CES this week, Lenovo announced that their new Z-series laptops would ship with AMD processors that incorporate Microsoft's Pluton security chip. There's a fair degree of cynicism around whether Microsoft have the interests of the industry as a whole at heart or not, so unsurprisingly people have voiced concerns about Pluton allowing for platform lock-in and future devices no longer booting non-Windows operating systems. Based on what we currently know, I think those concerns are understandable but misplaced.

But first it's helpful to know what Pluton actually is, and that's hard because Microsoft haven't actually provided much in the way of technical detail. The best I've found is a discussion of Pluton in the context of Azure Sphere, Microsoft's IoT security platform. This, in association with the block diagrams on page 12 and 13 of this slidedeck, suggest that Pluton is a general purpose security processor in a similar vein to Google's Titan chip. It has a relatively low powered CPU core, an RNG, and various hardware cryptography engines - there's nothing terribly surprising here, and it's pretty much the same set of components that you'd find in a standard Trusted Platform Module of the sort shipped in pretty much every modern x86 PC. But unlike Titan, Pluton seems to have been designed with the explicit goal of being incorporated into other chips, rather than being a standalone component. In the Azure Sphere case, we see it directly incorporated into a Mediatek chip. In the Xbox Series devices, it's incorporated into the SoC. And now, we're seeing it arrive on general purpose AMD CPUs.

Microsoft's announcement says that Pluton can be shipped in three configurations: as the Trusted Platform Module; as a security processor used for non-TPM scenarios like platform resiliency; or OEMs can choose to ship with Pluton turned off. What we're likely to see to begin with is the former - Pluton will run firmware that exposes a Trusted Computing Group compatible TPM interface. This is almost identical to the status quo. Microsoft have required that all Windows certified hardware ship with a TPM for years now, but for cost reasons this is often not in the form of a separate hardware component. Instead, both Intel and AMD provide support for running the TPM stack on a component separate from the main execution cores on the system - for Intel, this TPM code runs on the Management Engine integrated into the chipset, and for AMD on the Platform Security Processor that's integrated into the CPU package itself.

So in this respect, Pluton changes very little; the only difference is that the TPM code is running on hardware dedicated to that purpose, rather than alongside other code. Importantly, in this mode Pluton will not do anything unless the system firmware or OS ask it to. Pluton cannot independently block the execution of any other code - it knows nothing about the code the CPU is executing unless explicitly told about it. What the OS can certainly do is ask Pluton to verify a signature before executing code, but the OS could also just verify that signature itself. Windows can already be configured to reject software that doesn't have a valid signature. If Microsoft wanted to enforce that they could just change the default today, there's no need to wait until everyone has hardware with Pluton built-in.

The two things that seem to cause people concerns are remote attestation and the fact that Microsoft will be able to ship firmware updates to Pluton via Windows Update. I've written about remote attestation before, so won't go into too many details here, but the short summary is that it's a mechanism that allows your system to prove to a remote site that it booted a specific set of code. What's important to note here is that the TPM (Pluton, in the scenario we're talking about) can't do this on its own - remote attestation can only be triggered with the aid of the operating system. Microsoft's Device Health Attestation is an example of remote attestation in action, and the technology definitely allows remote sites to refuse to grant you access unless you booted a specific set of software. But there are two important things to note here: first, remote attestation cannot prevent you from booting whatever software you want, and second, as evidenced by Microsoft already having a remote attestation product, you don't need Pluton to do this! Remote attestation has been possible since TPMs started shipping over two decades ago.

The other concern is Microsoft having control over the firmware updates. The context here is that TPMs are not magically free of bugs, and sometimes these can have security consequences. One example is Infineon TPMs producing weak RSA keys, a vulnerability that could be rectified by a firmware update to the TPM. Unfortunately these updates had to be issued by the device manufacturer rather than Infineon being able to do so directly. This meant users had to wait for their vendor to get around to shipping an update, something that might not happen at all if the machine was sufficiently old. From a security perspective, being able to ship firmware updates for the TPM without them having to go through the device manufacturer is a huge win.

Microsoft's obviously in a position to ship a firmware update that modifies the TPM's behaviour - there would be no technical barrier to them shipping code that resulted in the TPM just handing out your disk encryption secret on demand. But Microsoft already control the operating system, so they already have your disk encryption secret. There's no need for them to backdoor the TPM to give them something that the TPM's happy to give them anyway. If you don't trust Microsoft then you probably shouldn't be running Windows, and if you're not running Windows Microsoft can't update the firmware on your TPM.

So, as of now, Pluton running firmware that makes it look like a TPM just isn't a terribly interesting change to where we are already. It can't block you running software (either apps or operating systems). It doesn't enable any new privacy concerns. There's no mechanism for Microsoft to forcibly push updates to it if you're not running Windows.

Could this change in future? Potentially. Microsoft mention another use-case for Pluton "as a security processor used for non-TPM scenarios like platform resiliency", but don't go into any more detail. At this point, we don't know the full set of capabilities that Pluton has. Can it DMA? Could it play a role in firmware authentication? There are scenarios where, in theory, a component such as Pluton could be used in ways that would make it more difficult to run arbitrary code. It would be reassuring to hear more about what the non-TPM scenarios are expected to look like and what capabilities Pluton actually has.

But let's not lose sight of something more fundamental here. If Microsoft wanted to block free operating systems from new hardware, they could simply mandate that vendors remove the ability to disable secure boot or modify the key databases. If Microsoft wanted to prevent users from being able to run arbitrary applications, they could just ship an update to Windows that enforced signing requirements. If they want to be hostile to free software, they don't need Pluton to do it.

(Edit: it's been pointed out that I kind of gloss over the fact that remote attestation is a potential threat to free software, as it theoretically allows sites to block access based on which OS you're running. There's various reasons I don't think this is realistic - one is that there's just way too much variability in measurements for it to be practical to write a policy that's strict enough to offer useful guarantees without also blocking a number of legitimate users, and the other is that you can just pass the request through to a machine that is running the appropriate software and have it attest for you. The fact that nobody has actually bothered to use remote attestation for this purpose even though most consumer systems already ship with TPMs suggests that people generally agree with me on that)

January 09, 2022 02:35 AM

January 04, 2022

Pete Zaitcev: PyPI is not trustworthy

I was dealing with a codebase S at work that uses a certain Python package N (I'll name it in the end, because its identity is so odious that it will distract from the topic at hand). Anyhow, S failed tests because N didn't work on my Fedora 35. That happened because S installed N with pip(1), which pulls from PyPI, and the archive at PyPI contained broken code.

The code for N in its source repository was fine, only PyPI was bad.
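For anyone who wants to reproduce that sort of comparison for their own dependencies, a rough sketch (the package name and paths below are placeholders) is to pull exactly what PyPI serves and diff it against a checkout of the source repository:

# Download the sdist exactly as PyPI serves it, then compare it with the repo.
pip download --no-binary :all: --no-deps PACKAGE -d /tmp/pypi-sdist
tar xf /tmp/pypi-sdist/PACKAGE-*.tar.gz -C /tmp/pypi-sdist
diff -r /tmp/pypi-sdist/PACKAGE-*/ /path/to/source/checkout/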

When I tried to find out what happened, it turned out that there is no audit trail for the code in PyPI. In addition, it is not possible to contact the listed maintainers of N on PyPI, and there is no way to report the problem: the problem tracking system of PyPI is plastered with warnings not to use it for problems with packages, only for problems with PyPI software itself.

By fuzzy-matching provided personal details with git logs, I was able to contact the maintainers. To my great surprise, two out of three even responded, but they disclaimed any knowledge of what went on.

So, an unknown entity was able to insert code into a package at PyPI, and pip(1) was downloading it for years. This only came to light because the inserted code failed on my Fedora test box.

At this point I can only conclude that PyPI is not trustworthy.

Oh, yeah. The package N is actually nose. I am aware that it was dead and unmaintained, and nobody should be using it anymore, least of all S. I'm working on it.

January 04, 2022 08:34 PM

December 31, 2021

Matthew Garrett: Update on Linux hibernation support when lockdown is enabled

Some time back I wrote up a description of my proposed (and implemented) solution for making hibernation work under Linux even within the bounds of the integrity model. It's been a while, so here's an update.

The first is that localities just aren't an option. It turns out that they're optional in the spec, and TPMs are entirely permitted to say they don't support them. The only time they're likely to work is on platforms that support DRTM implementations like TXT. Most consumer hardware doesn't fall into that category, so we don't get to use that solution. Unfortunate, but, well.

The second is that I'd ignored an attack vector. If the kernel is configured to restrict access to PCR 23, then yes, an attacker is never able to modify PCR 23 to be in the same state it would be if hibernation were occurring and the key certification data will fail to validate. Unfortunately, an attacker could simply boot into an older kernel that didn't implement the PCR 23 restriction, and could fake things up there (yes, this is getting a bit convoluted, but the entire point here is to make this impossible rather than just awkward). Once PCR 23 was in the correct state, they would then be able to write out a new swap image, boot into a new kernel that supported the secure hibernation solution, and have that resume successfully in the (incorrect) belief that the image was written out in a secure environment.

This felt like an awkward problem to fix. We need to be able to distinguish between the kernel having modified the PCRs and userland having modified the PCRs, and we need to be able to do this without modifying any kernels that have already been released[1]. The normal approach to determining whether an event occurred in a specific phase of the boot process is to "cap" the PCR - extend it with a known value that indicates a transition between stages of the boot process. Any events that occur before the cap event must have occurred in the previous stage of boot, and since the final PCR value depends on the order of measurements and not just the contents of those measurements, if a PCR is capped before userland runs, userland can't fake the same PCR value afterwards. If Linux capped a PCR before userland started running, we'd be able to place a measurement there before the cap occurred and then prove that that extension occurred before userland had the opportunity to interfere. We could simply place a statement that the kernel supported the PCR 23 restrictions there, and we'd be fine.

Unfortunately Linux doesn't currently do this, and adding support for doing so doesn't fix the problem - if an attacker boots a kernel that doesn't cap a PCR, they can just cap it themselves from userland. So, we're faced with the same problem: booting an older kernel allows the system to be placed in an identical state to the current kernel, and a fake hibernation image can be written out. Solving this required a PCR that was being modified after kernel code was running, but before userland was started, even with existing kernels.

Thankfully, there is one! PCR 5 is defined as containing measurements related to boot management configuration and data. One of the measurements it contains is the result of the UEFI ExitBootServices() call. ExitBootServices() is called at the transition from the UEFI boot environment to the running OS, and the kernel contains code that executes before it. So, if we measure an assertion regarding whether or not we support restricted access to PCR 23 into PCR 5 before we call ExitBootServices(), this will prevent userspace from spoofing us (because userspace will only be able to extend PCR 5 after the firmware extended PCR 5 in response to ExitBootServices() being called). Obviously this depends on the firmware actually performing the PCR 5 extension when ExitBootServices() is called, but if firmware's out of spec then I don't think there's any real expectation of it being secure enough for any of this to buy you anything anyway.
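For the curious, both the PCR values and the PCR 5 extension made around ExitBootServices() time are visible from userspace with tpm2-tools (assuming a TPM 2.0 system with the usual sysfs event log; paths and tool versions may differ):

# Read the PCRs involved here.
tpm2_pcrread sha256:5,23
# The measurement log shows the individual PCR 5 events, including the
# "Exit Boot Services Invocation" action measured by the firmware.
tpm2_eventlog /sys/kernel/security/tpm0/binary_bios_measurements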

My current tree is here, but there's a couple of things I want to do before submitting it, including ensuring that the key material is wiped from RAM after use (otherwise it could potentially be scraped out and used to generate another image afterwards) and, uh, actually making sure this works (I no longer have the machine I was previously using for testing, and switching my other dev machine over to TPM 2 firmware is proving troublesome, so I need to pull another machine out of the stack and reimage it).

[1] The linear nature of time makes feature development much more frustrating

December 31, 2021 03:36 AM

December 24, 2021

Paul E. Mc Kenney: Parallel Programming: December 2021 Update

It is past time for another release of Is Parallel Programming Hard, And, If So, What Can You Do About It?. But first, what is the difference between an edition and a release?

The main difference is the level of validation. For example, during the several months leading up to the second edition, I read the entire book, fixing issues as I found them. So where an edition is a completed work, a release is primarily for the benefit of people who would like to see a recent snapshot of this book, but without having to install and run LaTeX and its dependencies.

Having cleared that up, here are the big-animal changes since the second edition:


  1. A lot of detailed work to make the new ebook-sized PDF usable, along with other formatting improvements, courtesy of Akira Yokosawa, SeongJae Park, and Balbir Singh.
  2. The nq build now places quick quizzes at the end of each chapter, courtesy of Akira Yokosawa.
  3. There are a few new sections in the “Locking” chapter, the “Putting It All Together” chapter, the “Advanced Synchronization: Memory Ordering” chapter, and the “Important Questions” appendix.
  4. There is now more validation information associated with much of the sample code throughout the book.
  5. The “RCU Usage” section has received a much-needed upgrade.
  6. There is now an index and a list of acronyms, courtesy of Akira Yokosawa.
  7. A new install_latex_package.sh script, courtesy of SeongJae Park.
  8. Greatly improved error handling, including scripts that check for common LaTeX formatting mistakes, courtesy of Akira Yokosawa.


Yang Lu and Zhouyi Zhou are translating the Second Edition to Chinese, and would greatly appreciate additional help.

Everyone mentioned above contributed a great many wordsmithing fixes, as did Chin En Lin, Elad Lahav, Zhouyi Zhou, and GitHub user “rootbeer”. A grateful “thank you” to everyone who contributed!

December 24, 2021 08:22 PM

December 23, 2021

Pete Zaitcev: Adventures in tech support

OVH was pestering me about migrating my VPS from its previous range to the new (and more expensive) one. I finally agreed to that. The VM was migrated to the new host, and it launched with no networking. Not entirely unexpected, but it gets better.

The root cause is the DHCP server at OVH returning a lease with a /32 netmask. In that situation, it's not possible to add a default route in the usual way, because the next hop is outside of the netmask.
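There is a manual workaround for a /32 lease: tell the kernel the gateway is reachable on-link before adding the default route. A sketch, with made-up TEST-NET addresses and an assumed interface name:

# Hypothetical addresses: the lease gives 192.0.2.10/32, the gateway is 192.0.2.1.
ip addr add 192.0.2.10/32 dev eth0
ip route add 192.0.2.1 dev eth0              # make the gateway reachable on-link
ip route add default via 192.0.2.1 dev eth0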

Seems like a simple enough problem, so I filed a ticket in OVH support, basically saying "your DHCP server supplies incorrect netmask, please fix it." Their ultimate answer was, and I quote:

Please note that the VPS IP and our failover IPs are a static IPs, configuring the DHCP server may cause a network issues.

I suppose I knew what I was buying when I saw the price for these OVH VMs. It was too cheap to be true. Still, disappointing.

December 23, 2021 08:17 PM

December 20, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: Removing CONFIG_RCU_FAST_NO_HZ

The CONFIG_RCU_FAST_NO_HZ Kconfig option was added many years ago to improve energy efficiency for systems having significant numbers of short bursts of idle time. Prior to the addition of CONFIG_RCU_FAST_NO_HZ, RCU would insist on keeping a given idle CPU's scheduling-clock tick enabled until all of that CPU's RCU callbacks had been invoked. On certain types of battery-powered embedded systems, these few additional scheduling-clock ticks would consume up to 40% of the battery lifetime. The people working on such systems were not amused, and were not shy about letting me know of their dissatisfaction with RCU's life choices. Please note that “letting me know” did not take the form of flaming me on LKML. Instead, they called me on the telephone and yelled at me.

Given that history, why on earth would I even be thinking about removing CONFIG_RCU_FAST_NO_HZ, let alone queuing a patch series intended for the v5.17 merge window???

The reason is that everyone I know of who builds their kernels with CONFIG_RCU_FAST_NO_HZ=y also boots those systems with each and every CPU designated as a rcu_nocbs CPU. With this combination, CONFIG_RCU_FAST_NO_HZ=y is doing nothing but placing a never-taken branch in the fastpath to and from idle. Such systems should therefore run slightly faster and with slightly better battery lifetime if their kernels were instead built with CONFIG_RCU_FAST_NO_HZ=n, which would get rid of that never-taken branch.

But given that battery-powered embedded folks badly wanted CONFIG_RCU_FAST_NO_HZ=y, and given that they are no longer getting any benefit from it, why on earth haven't they noticed?

They have not noticed because rcu_nocbs CPUs do not invoke their own RCU callbacks. This work is instead delegated to a set of per-CPU rcuoc kthreads, with a smaller set of rcuog kthreads managing those callbacks and requesting grace periods as needed. By default, these rcuoc and rcuog kthreads are not bound, which allows the scheduler (and, for that matter, the systems administrator) to take both performance and energy efficiency into account and to run those kthreads wherever is appropriate at any given time. In contrast, non-rcu_nocbs CPUs will always run their own callbacks, even if that means powering up an inconveniently placed portion of the system at an inconvenient time. This includes CONFIG_RCU_FAST_NO_HZ=y kernels, whose only advantage is that they power up inconveniently placed portions of systems at inconvenient times only 25% as often as would a non-rcu_nocbs CPU in a CONFIG_RCU_FAST_NO_HZ=n kernel.

In short, the rcu_nocbs CPUs' practice of letting the scheduler decide where to run the callbacks is especially helpful on asymmetric systems (AKA big.LITTLE systems), as shown by data collected by Dietmar Eggemann and Robin Randhawa. This point is emphasized by the aforementioned fact that everyone I know of who builds their kernels with CONFIG_RCU_FAST_NO_HZ=y also boots those systems with each and every CPU designated as a rcu_nocbs CPU.

So if no one is getting any benefit from building their kernels with CONFIG_RCU_FAST_NO_HZ=y, why keep that added complexity in the Linux kernel? Why indeed, and hence the patch series intended for the v5.17 merge window.

So if you know of someone who is getting significant benefit from CONFIG_RCU_FAST_NO_HZ=y who could not get that benefit from booting with rcu_nocbs CPUs, this would be a most excellent time to let me know!

December 20, 2021 10:14 PM

December 13, 2021

Vegard Nossum: Using C-Reduce to debug LaTeX errors

My wife is currently writing her HDR thesis (in France, this is an "accreditation to supervise research"). As part of this, she asked me if it would be possible to split her bibliography into two parts: one containing her own publications and another for the rest of her references.

After a tiny bit of searching, I found this stackoverflow answer: https://tex.stackexchange.com/a/407363

However, the answer uses the biblatex package, and my wife was using plain BibTeX. (Dun dun duuun!)

No matter, we can probably switch to biblatex, right? We only had about 6k lines of LaTeX source code and 3k lines worth of BibTeX data, how hard could it be?

To cut a slightly long story short, I ended up using the biber backend; see https://tex.stackexchange.com/a/25702/ for a great overview of the various LaTeX bibliography tools and packages. Biber can use the same bibliography (.bib) file, but has a few differences in how it does its processing and default formatting. Which in turn means that slightly different bits of your bibliography file end up getting output in the final document.

Because of the way that LaTeX bibliographies work, the data from your .bib file doesn't end up getting output directly into your document -- there is a roundabout way where you first have to run LaTeX (pdflatex in my case) to find out which references are used, then you have to run biber to (I think) extract the relevant references from your .bib into an auxiliary .bbl file, and then finally you run pdflatex once more to have it actually include the data from the .bbl into your "References" section.

The bottom line is: This whole process means that if you have a weird .bib entry that perhaps contains some special bit of markup that cannot be used in the final context where the bibliography appears in the output document, you will get errors. Errors that just point to your main LaTeX file where your \printbibliography command is. Which, for a .bib file 3k lines long, is rather inconvenient to debug.

So what do you do when your pdflatex run ends with something like this:

[]\OT1/bch/m/n/12 BSI. |   []. AIS 20 / AIS
! Extra }, or forgotten \endgroup.
\UL@stop ...z@ \else \UL@putbox \fi \else \egroup
\egroup \UL@putbox \fi \if...
l.474

?

Enter C-Reduce...

C-Reduce

According to its website, "C-Reduce is a tool that takes a large C, C++, or OpenCL file that has a property of interest (such as triggering a compiler bug) and automatically produces a much smaller C/C++ file that has the same property".

But how is that relevant here, given that we are dealing with LaTeX? Well, it turns out that C-Reduce can also work with non-C/C++ files, meaning that we now have a way to "reduce" our document (or, well, bibliography file) until it contains ONLY the bits that are actually causing us problems.

The way C-Reduce works is that it takes two inputs: an "interestingness" test (which is really just a shell script) and the file that you would like to reduce. The interestingness test should return either 0 (success) or 1 (failure) depending on whether the document C-Reduce gave it has the property you are searching for.

In our case, the property we want is that LaTeX prints the error we originally encountered. We can find all those errors simply by grepping the pdflatex log file. Note that the first pdflatex run, as well as the biber run, will both succeed without errors, as the error only appears when the bibliography is actually printed in the final document:

$ pdflatex main
[...]
$ biber main
[...]
$ pdflatex -interaction=nonstopmode main
[...]
$ grep -A1 '^!' main.log
! Extra }, or forgotten \endgroup.
\UL@stop ...z@ \else \UL@putbox \fi \else \egroup
--
! Extra }, or forgotten \endgroup.
\UL@stop ... \UL@putbox \fi \else \egroup \egroup
--
! Missing } inserted.
<inserted text>
--
! Missing } inserted.
<inserted text>
--
! Undefined control sequence.
\namepartfamily ->\addtext
--
! Undefined control sequence.
<argument> \addtext

Since we want the errors to remain the same, we can make our interestingness test check that this output remains stable. A quick way to do that is to just hash the output of the command above and ensure the hash doesn't change:

$ grep -A1 '^!' main.log | sha1sum
8ab121373e6b0232f8789f093db4bf20f3bb32c9 -

In the interestingness test shell script we'd then put:

[ "$(grep -A1 '^!' main.log | sha1sum | cut -d ' ' -f 1)" == "8ab121373e6b0232f8789f093db4bf20f3bb32c9" ] || exit 1

This will succeed when the grep output is what we expect -- and return 1 when it changes.

It can be worth playing with different combinations of grep options until the output captures exactly the errors you care about.

If there are contextual clues that should remain the same (for example the []\OT1/bch/m/n/12 BSI. | []. AIS 20 / AIS line in the original error I got), then you can adjust the grep command accordingly.

Multi-file projects

C-Reduce only knows how to reduce a single file at a time, which poses a small problem for our multi-file project. However, it's merely a small problem, and it's easy to solve. C-Reduce will start your interestingness test shell script in a new (temporary) directory every time, so all we need to do is to copy in the extra files at the start of the script. In my case I only needed the main .tex file (as the file I was minimizing was the .bib file, and C-Reduce will take care to get that one for you on its own):

# get whatever extra files you need to build
cp /home/vegard/hdr/main.tex .

That said, it can be worthwhile to hand-optimize your document a little bit at the start to reduce the compilation times of files that you know are irrelevant to the error and which won't be reduced by C-Reduce. In my particular case, chapters were split out into separate files and it was easy enough to comment out the lines that said \input{chapter1}, etc. -- meaning that we don't actually need C-Reduce to compile the full document every run; I already knew it was a problem with the line that said \printbibliography right at the end of the document. However, removing the citations meant that the printed bibliography would be empty as well, so I also had to add \nocite{*}, which includes all bibliography entries whether they are cited or not.

Running C-Reduce

Putting it all together:

$ cat test.sh
#! /bin/bash

# error out by default
set -e

# get whatever extra files you need to build
cp /home/vegard/hdr/main.tex .

# try to compile the document
pdflatex main
biber main
# this run is expected to report errors, so keep "set -e" from aborting here
pdflatex -interaction=nonstopmode main || true

# check that the original errors are still present
[ "$(grep -A1 '^!' main.log | sha1sum | cut -d ' ' -f 1)" == "8ab121373e6b0232f8789f093db4bf20f3bb32c9" ] || exit 1

We can then run C-Reduce with:

creduce --not-c test.sh bibliography.bib

After about 20 minutes, the 3,400-line bibliography.bib had been reduced down to about 47 lines where it was quite easy to spot the problems by hand: \addtext around an author name, a stray ~ in a journal name, and a stray # in a month name.

Conclusion

C-Reduce was not made for LaTeX or BibTeX, but was surprisingly efficient at locating hard-to-find sources of compilation errors. It's true that writing interestingness tests can be unintuitive (AKA "Why is my testcase empty?"). Fortunately, I've used C-Reduce quite a bit in the past for C and C++ so it was straightforward to see how to apply it to this particular problem.

One interesting thing to note is that we didn't ask the tool to fix our problem, quite the opposite: We asked it to remove as much as possible that didn't have anything to do with the errors we were seeing, effectively isolating the problem to just the problematic few lines of code.

In general I think isolation is a very powerful debugging technique. It brings clarity to a problem where you can only see the symptoms. That's why stackoverflow generally asks for "MWEs" (Minimal Working Examples) -- remove confounding variables and everything that is immaterial to the problem at hand; get to the essence of the thing.

On Twitter, some people pointed out a couple of other tools that are like C-Reduce in that they can also minimize files/testcases.

I didn't try either of these tools for this specific problem, but I have used halfempty in the past and it's a good tool that's worth getting familiar with. A few years ago I did a simple benchmark of C-Reduce vs. halfempty on C++ source and -- without putting too much into this simplistic comparison -- I think the main takeaway was that halfempty seems to have the potential to be faster when run on fewer cores.

December 13, 2021 12:12 PM

December 03, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: Creating Branches For the -rcu Tree

Several people have expressed interest in how I go about creating the topic branches in the -rcu tree. So when I created branches earlier this week, I actually kept track.

But why bother with topic branches? In my case, the reason is to allow anyone reviewing RCU patches to focus on a particular topic and to be able to keep that topic “in cache” while reviewing all of that topic's patches. Even more important, it makes it easy for reviewers to completely ignore patches that are not of interest to them.

Preparation

If you intend to follow along by actually executing the commands (highly recommended), first run “git checkout dev.2021.11.30b” in your clone of the -rcu tree.

Before creating branches, you need some essential tools. First, you absolutely must be able to see what you are doing. For this, there is the gitk tool, or, for those forswearing GUIs, the tig tool. If you cannot see what you are doing, your picture of your repository will diverge from that of git, and git always wins. If you can tolerate GUIs at all, install gitk. It will change your life.

Once you have gitk or tig installed, use it to review the commits that you are looking to organize into branches. This allows checking for bugs and also helps identify potential topics. Pro tip: Expand the tool's window to the full height of the screen. Another pro tip: Restrict gitk to the commits of interest, in my case by running “gitk v5.16-rc1..&”.

From here on out, I will just say “gitk”. If you are smart enough to use a non-GUI tool like tig, you are smart enough to translate. ;–)

Sorting Commits Into Topic Branches

The next step is to run this script, placing the output in a file:

#!/bin/bash

git log --pretty=format:"%h %s" $1 | \
        awk '{ print NR, "pick " $0 }' | \
        sort -k1nr

I have this script in my bin directory under the name git-rebase-prep.sh, so I simply type:

git-rebase-prep.sh v5.16-rc1..dev.2021.11.30b > /tmp/rebase.txt

The first five lines of this file are as follows:

126 pick 12637ec9a1505 tools/memory-model:  Document locking corner cases
125 pick 5665cde49ec08 tools/memory-model: Make judgelitmus.sh note timeouts
124 pick 2e8007e79af5b tools/memory-model: Make cmplitmushist.sh note timeouts
123 pick 7a31810649915 tools/memory-model: Make judgelitmus.sh identify bad macros
122 pick 4e11469f67f2e tools/memory-model: Make judgelitmus.sh detect hard deadlocks

The first column is the commit number, with lower numbers meaning newer commits (that is, numbered in “git log” order), but the commits are ordered with the older commits first (that is, in the order that “git rebase” expects them). The second column is the commit's SHA-1 ID, and the remaining columns are the commit-log subject line.

Edit this file in your choice of editor, but in a window that is the full height of your monitor. Adjust your editor so that you have access to two different regions of this file, for example, in vim, type control-W followed by the letter “s”. The upper pane will eventually have topic-branch names each followed by the commits in that topic branch.

This time was a bit of a special case because this past merge window's change to the CONFIG_PREEMPT_DYNAMIC Kconfig option broke all non-preemptible CONFIG_SMP=n torture-test scenarios. (The default choice of CONFIG_PREEMPT_DYNAMIC=y forces CONFIG_PREEMPTION=y, which breaks tests of Tiny RCU and Tiny SRCU.) Therefore, the first “topic branch” is a single commit that adds CONFIG_PREEMPT_DYNAMIC=n to all affected torture-test scenarios. All RCU-related topic branches are then placed on top of this commit, thus allowing each topic branch to be torture-tested separately.

This means that the first few lines of the “/tmp/rebase.txt” file are as follows:

Underneath:
26 pick 37cf3c8c1ebda rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios

@@@

And this line has been removed from the remainder of this file. The “@@@” is a marker separating the topic-branched commits from those still waiting to be classified.

From my scan of the commits, I tentatively created the following topic branches, so that the first few lines of the “/tmp/rebase.txt” file are now as follows:

Underneath:
26 pick 37cf3c8c1ebda rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios

clocksource.2021.11.30c
doc.2021.11.30c
exp.2021.11.30c
fastnohz.2021.11.30c
fixes.2021.11.30c
nocb.2021.11.30c
nolibc.2021.11.30c
rcutorture.2021.11.30c
srcu.2021.11.30c
tasks.2021.11.30c
torture.2021.11.30c
torturescript.2021.11.30c
Merge above.
kcsan.2021.11.30c (merge)
lkmm.2021.11.30c (merge, then merge lkmm-dev)
clocksource.2021.11.30c (merge)
@@@

In other words, we will cherry-pick the first commit, rebase 11 branches on top of it (clocksource.2021.11.30c through torturescript.2021.11.30c, that is, the commits subject to some form of torture testing), and then merge the last ten of these branches together. The kcsan commits will be rebased on top of v5.16-rc1, as will the lkmm commits (but with the lkmm-dev commits rebased on top of the lkmm commits). Each of the kcsan, lkmm, and lkmm-dev branches will be merged separately on top of the ten-way merge point, and finally the clocksource.2021.11.30c branch will be merged on top of all of that. There are always a few commits not ready for the next merge window or that are to be upstreamed elsewhere, and these will be rebased on top of the clocksource.2021.11.30c merge point.

Please feel free to run “gitk v5.16-rc1..dev.2021.11.30c” in order to get a firm picture of the desired state.

The next step is to go through the remaining commits and assign them to branches. This process brought home the fact that there are no kcsan commits this time around (but there were plenty last time and likely will be plenty more next time), so I removed the “kcsan.2021.11.30c (merge)” line from /tmp/rebase.txt. In addition, there were not all that many commits combined for the rcutorture.2021.11.30c and torture.2021.11.30c topic branches, so I merged them all into torture.2021.11.30c. This is why the numbers in the first column of each commit line in /tmp/rebase.txt are important: In some cases, it is necessary to rearrange the topic branches, and it is sometimes important to get the commits in the correct order.

At this point, the /tmp/rebase.txt file looks (almost) like this.

Time to start actually creating the topic branches!

Create Topic Branches

We need two gitk instances, one for the old state (“gitk v5.16-rc1..dev.2021.11.30b”) and another for the new state which we will launch shortly:

git checkout v5.16-rc1
git cherry-pick 37cf3c8c1ebda
gitk v5.16-rc1..&

The first of the above commands put us at the desired v5.16-rc1 base commit for the topic branches, which just so happens to be where dev.2021.11.30b is based. Important safety tip: Don't try creating branches while also moving the pile of commits to some other base commit at the same time. There are just too many things that can go wrong when you try that.

The second of the above commands rebases (or “cherry-picks”) the aforementioned CONFIG_PREEMPT_DYNAMIC-compensation commit. The final command launches the other gitk instance that will show us the evolving state of the topic branches.

Note that my rebased CONFIG_PREEMPT_DYNAMIC-compensation commit happened to have the SHA-1 commit ID 8c0abfd6d2f6b0221194241ac2908751a2a0385f. If you are following along, you will need to change the git rebase argument for --onto to the corresponding SHA-1 commit ID in your tree. If you instead use my ID, everything will work, except that you will be rebasing on my rebased CONFIG_PREEMPT_DYNAMIC-compensation commit instead of your own. Rebasing some on mine and some on yours will probably actually work, but I leave the decision as to whether to carry out that experiment to your better judgment.

I suggest placing the old-state gitk at the left of your screen, the rebase.txt and your command window in the middle, and the new-state gitk at the right of your screen. Being able to see everything all the time is key to successful creation of topic branches.

The next set of commands rebases the clocksource-related commits on top of the rebased CONFIG_PREEMPT_DYNAMIC-compensation commit:

git branch clocksource.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a clocksource.2021.11.30c

The -i argument to the “git rebase” command specifies interactive mode, which will place you in an editor with a very long series of commits. In vim, type “c}” to delete all of those commits and enter vim insertion mode. Then copy and paste the two lines following clocksource.2021.11.30c from the rebase.txt file. Hit the “escape” key to exit vim insertion mode and then strip the commit numbers from all of the lines, for example, by using the “:1,.s/^[0-9]* //” vim command. Write out the file and exit the editor (in vim, your choice of ZZ, :x, :wq, and probably others as well), at which point git will commence rebasing.

Don't forget to hit the <F5> key in the new gitk window after the completion of each rebase command.

The next several branches are constructed in a similar manner:

git branch doc.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a doc.2021.11.30c
git branch exp.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a exp.2021.11.30c
git branch fastnohz.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a fastnohz.2021.11.30c
git branch fixes.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a fixes.2021.11.30c
git branch nocb.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a nocb.2021.11.30c
git branch nolibc.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a nolibc.2021.11.30c

For each “git rebase” command, copy and paste the corresponding section of the rebase.txt file. Don't forget to hit <F5> in the new-state gitk window after running each “git rebase” command.

Remember that “almost” I mentioned above about the rebase.txt file? We now get to one of the deviations. At this point, I belatedly realized that there was only one SRCU commit (441a467cd9793 “srcu: Prevent redundant __srcu_read_unlock() wakeup”) and that it would be better to add that single commit to the fixes.2021.11.30c topic branch:

git checkout fixes.2021.11.30c
git cherry-pick 441a467cd9793

Note that I ignored commit ordering in this case. It is safe to do so because this is the only commit in the whole pile that touches kernel/rcu/srcutiny.c, it touches no other file, and there are no other types of dependencies against the other commits. Sometimes you get lucky!

Next, the rest of the RCU-related branches:

git branch tasks.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a tasks.2021.11.30c
git branch torture.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a torture.2021.11.30c
git branch torturescript.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a torturescript.2021.11.30c

Once again, copy and paste the commits from the corresponding section of the rebase.txt file and once again do not forget to hit <F5> after each “git rebase” command.

It is now time to do a merge! Unfortunately, I messed up the copy-and-paste. Twice. And then forgot that “git reset --hard” takes the current branch with it. Again, more than once. Then I accidentally included clocksource.2021.11.30c in the merge.

Perhaps building branches late in the day was a mistake, but here are the mistaken commands:

git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
        fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
        torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git checkout -B torturescript.2021.11.30c 90b21bcfb2846625c4f2e4a0bf5969543cef8ba7
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
        fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
        torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git checkout -B torturescript.2021.11.30c 90b21bcfb2846625c4f2e4a0bf5969543cef8ba7
git checkout 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
        fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
        torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f

If nothing else, this sorry story shows that you can recover from mistakes. And making sure to hit <F5> in the new-state gitk means that all the commits are right there, enabling you to piece things back together.

But I suggest that you instead execute the following single command:

git checkout 8c0abfd6d2f6b0221194241ac2908751a2a0385f

Then hit <F5> in the new-state gitk window and verify that each topic branch has its branch name in place, that none of those branch names are in bold face, and that none of the little per-commit circles are yellow, except for the v5.16-rc1 commit. If all of that is in place, it is time to do the merge:

git merge doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c fixes.2021.11.30c \
        nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c torture.2021.11.30c \
        torturescript.2021.11.30c

This will put you into your editor, where you can fill out a commit log for the merge. This is what I did, documenting the topics:

Merge branches 'doc.2021.11.30c', 'exp.2021.11.30c', 'fastnohz.2021.11.30c', 'fixes.2021.11.30c', 'nocb.2021.11.30c', 'nolibc.2021.11.30c', 'tasks.2021.11.30c', 'torture.2021.11.30c' and 'torturescript.2021.11.30c' into HEAD
    
doc.2021.11.30c: Documentation updates.
exp.2021.11.30c: Expedited-grace-period fixes.
fastnohz.2021.11.30c: Remove CONFIG_RCU_FAST_NO_HZ.
fixes.2021.11.30c: Miscellaneous fixes.
nocb.2021.11.30c: No-CB CPU updates.
nolibc.2021.11.30c: Tiny in-kernel library updates.
tasks.2021.11.30c: RCU-tasks updates, including update-side scalability.
torture.2021.11.30c: Torture-test in-kernel module updates.
torturescript.2021.11.30c: Torture-test scripting updates.

The text starting with “Merge branches” that extends up to but not including the blank line is a single long line, just so you know.

Yet again, be sure to hit <F5> in the new-state gitk to see the results of the merge. Copy the SHA-1 commit ID of this merge point somewhere, keeping in mind that clicking on a commit in gitk places that commit's SHA-1 ID in your clipboard. Alternatively, you can use bash variables: “mergepoint=`git rev-parse HEAD`”.

The next task is to rebase the various Linux-kernel memory model (LKMM) commits, as always substituting the commits from the corresponding portions of the rebase.txt file:

git branch lkmm.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto v5.16-rc1 d069e38e66dbfeca541d7a2f9790ef8e8d3c911a lkmm.2021.11.30c
git branch lkmm-dev.2021.11.30c d069e38e66dbfeca541d7a2f9790ef8e8d3c911a
git rebase --onto c438b7d860b4c1acb4ebff6d8d946d593ca5fe1e v5.16-rc1 lkmm-dev.2021.11.30c

Don't forget <F5>!

The next step is to check out the previous merge point (you will need to substitute the SHA-1 ID recorded above) and do the remaining merges:

git checkout 32e5555b62e6a5f7304b13470cd1881a49a38d18
git merge lkmm.2021.11.30c
git merge lkmm-dev.2021.11.30c
git merge clocksource.2021.11.30c

Finally, rebase the five remaining commits from the “On top:” section of the rebase.txt file:

git branch dev.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto ce410b77746063e1c03f6e5bf85d68cca3e1c7ea \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a dev.2021.11.30c

Alternatively, you might instead choose to check out the new branch and then cherry-pick the five commits as follows:

git checkout -b dev.2021.11.30c
git cherry-pick 2a1d7ed8553da ca507a89ffbb9 0ebafffe561ac e824a258c37d4 74f780ac279b0

This checkout-and-cherry-pick operation is completely equivalent to the prior branch-and-rebase operation.

Either way, as always, don't forget to hit <F5> in your new-state gitk.

Check Your Work!

And finally, the most important thing is to check your work:

git diff dev.2021.11.30b

If these diffs are non-empty, there was some mistake somewhere in the process.

Possible Complications

This topic-branch exercise went quite smoothly and took about an hour to complete. Sometimes life is harder:

  1. There might be rebase and/or merge conflicts. I normally take this as a hint that my assignment of commits to topic branches was suboptimal. The commit numbers assigned by the git-rebase-prep.sh script are quite helpful when rejuggling commits. Yes, sometimes the resulting conflicts must be resolved manually, but that is a blog post for another day.
  2. One of the topic branches might depend on another. In this case, I rebase the dependent topic branch on top of the topic branch that it depends on. Looking at it another way, sometimes the topic branches are arranged in series instead of in parallel.
  3. A commit might depend on a number of topics, and the rest of the topics might depend on that commit. This is rare, but it does happen. The trick in this case is to create and merge the topic branches that the commit depends on, cherry-pick that commit on top of the resulting merge point, and then create and merge the remaining topic branches on top of that commit.

If you made it all the way down here, thank you for your time and attention, and I hope that this material was helpful. If you remember nothing else from this blog post, let it be gitk (or tig, as the case may be). Being able to see what you are doing allows you to learn to git much faster and allows you to safely take on much more complicated git-related tasks.

December 03, 2021 01:46 AM

November 24, 2021

Dave Airlie (blogspot): video decode: crossing the streams

I was interested in how much work a vaapi on top of vulkan video proof of concept would be.

My main reason for being interested is actually video encoding: there is no good vulkan video encoding demo yet, and I'm not experienced enough in the area to write one, but I can hack stuff. I think it is probably easier to hack a vaapi encode to vulkan video encode than write a demo app myself.

With that in mind I decided to see what decode would look like first. I talked to Mike B (most famous zink author) before he left for holidays, then I ignored everything he told me and wrote a super hack.

This morning I convinced zink vaapi on top of anv with iris GL doing the presents in mpv to show me some useful frames of video. However zink vaapi on anv with zink GL is failing miserably (well, green jellyfish).

I'm not sure how much more I'll push on the decode side at this stage, I really wanted it to validate the driver side code, and I've found a few bugs in there already.

The WIP hacks are at [1]. I might push on to the encode side and see if I can work out what it entails, though the encode spec work is a lot more changeable at the moment.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/zink-video-wip

November 24, 2021 03:39 AM

November 15, 2021

Dave Airlie (blogspot): h264: more AMD hw worked on

Previously I mentioned having AMD VCN h264 support. Today I added initial support for the older UVD engine[1]. This is found on chips from Vega back to SI.

I've only tested it on my Vega so far.

I also worked out the "correct" answer to how to send the reset command correctly; however, the nvidia player I'm using as a demo doesn't do things that way yet, so I've forked it for now[2].

The answer is to use vkCmdControlVideoCodingKHR to send a reset the first time a session is used. However, I can't see how the app is meant to know this is necessary, so I've asked the appropriate people.

The initial anv branch I mentioned last week is now here[3].


[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-uvd-h264

[2] https://github.com/airlied/vk_video_samples/tree/radv-fixes

[3] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode

November 15, 2021 05:17 AM

November 12, 2021

Dave Airlie (blogspot): h264 video decoding: i-frames strike back

Last week I mentioned I had the basics of h264 decode using the proposed vulkan video on radv. This week I attempted to do the same thing with Intel's Mesa vulkan driver "anv".

Now I'd previously unsuccessfully tried to get vaapi on crocus working but got sidetracked back into other projects. The Intel h264 decoder hasn't changed a lot across the ivb/hsw/gen8/gen9 era. I ported what I had from crocus to anv and started trying to get something to decode on my WhiskeyLake.

I wrote the code pretty early on and figured out all the things I had to send to the hardware.

The first anv-side bridge to cross was that Vulkan exposes an H264 picture-level decode API, which means you get handed the encoded slice data. However, to program the Intel hw you need to decode the slice header, so I wrote a slice header decoder in some common code. The other thing you need to give the Intel hw is the number of bits in the slice header, which in some encoding schemes is rounded to bytes and in some isn't. Slice headers also have a 3-byte header on them, which the Intel hardware wants you to discard or skip before handing the data over.

Once I'd fixed up that sort of thing in anv + crocus, I started getting grey I-frames decoded, with later B/P frames using the grey frames as references, so you'd see this kinda weird motion.

That was, I think, 3 days ago. I have stared at this intently for those 3 days, blaming everything from bitstream encoding to rechecking all my packets (not enough times though). I had someone else verify they could see grey frames.

Today, after a long discussion about possibilities, I was randomly comparing a frame from the intel-vaapi-driver and one from crocus, and I spotted a packet header that the docs say is 34 dwords long, but that intel-vaapi-driver was encoding as only 16 dwords. I switched crocus to explicitly state a 16-dword length and started seeing my I-frames.

Now the B/P frames still have issues. I don't think I'm getting the ref frames logic right yet, but it felt like a decent win after 3 days of staring at it.

The crocus code is [1]. The anv code isn't cleaned up enough to post a pointer to yet, enterprising people might find it. Next week I'll clean it all up, and then start to ponder upstream paths and shared code for radv + anv. Then h265 maybe.

[1]https://gitlab.freedesktop.org/airlied/mesa/-/tree/crocus-media-wip

November 12, 2021 03:04 AM

November 05, 2021

Dave Airlie (blogspot): What do you know about video decoding/encoding?

A few weeks ago I watched Victor's excellent talk on Vulkan Video. This made me question my skills in this area. I'm pretty vague on video processing hardware, and I really have no understanding of H264 or any of the standards. I've been loosely following the Vulkan video group inside of Khronos, but I can't say I've understood it or been useful.

radeonsi has a gallium vaapi driver that talks to the firmware-driven encoder on the hardware; surely copying what it is programming can't be that hard. I got an mpv/vaapi setup running and tested some videos on that setup just to get comfortable. I looked at what sort of data was being pushed about.

The thing is, the firmware is doing all the work here; the driver is mostly just responsible for taking semi-parsed h264 bitstream data structures and handing them to the fw API in memory buffers. Then the resulting decoded image should be magically in a buffer.

I then got the demo nvidia video decoder application mentioned in Victor's talk.

I ported the code to radv in a couple of days, but then began a long journey into the unknown. The firmware is quite particular about exactly what it wants and when it wants it. After fixing some interactions with the video player, I started to dig.

Now vaapi and DXVA (Windows) are context-based APIs. This means they are like OpenGL, where you create a context, do a bunch of work, and tear it down; the driver does all the hw queuing of commands internally. All the state is held in the context. Vulkan is a command-buffer-based API. The application records command buffers and then enqueues those command buffers to the hardware itself.

So the vaapi driver works like this for a video

create hw ctx, flush, decode, flush, decode, flush, decode, flush, decode, flush, destroy hw ctx, flush

However Vulkan wants things to be more like

Create Session, record command buffer with (begin, decode, end) send to hw, (begin, decode, end), send to hw, End Session

There is no way at the Create/End session time to submit things to the hardware.

After a week or two of hair removal and insightful irc chats I stumbled over a decent enough workaround to avoid the hw dying and managed to decode a H264 video of some jellyfish.

The work is based on a bunch of other stuff, and is in no way suitable for upstreaming yet, not to mention the Vulkan specification is only beta/provisional so can't be used anywhere outside of development.

The preliminary code is in my gitlab repo here[1]. It has a start on h265 decode, but it's not working at all yet, and I think the h264 code is a bit hangy randomly.

I'm not sure where this is going yet, but it was definitely an interesting experiment.

[1]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-prelim-decode

November 05, 2021 07:31 AM

November 03, 2021

Paul E. Mc Kenney: What Memory Model Should the Rust Language Use?

UNDER CONSTRUCTION



This blog post discusses a few alternative Rust-language memory models. I hope that this discussion is of value to the Rust community, but in the end, it is their language, so it is also their choice of memory model.

This discussion takes the Rust fearless-concurrency viewpoint, tempered by discussions I have observed and participated in while creating this blog series. Different members of that community of course have different viewpoints, and thus might reasonably advocate for different choices. Those who know me will understand that these viewpoints differ significantly from my own. However, my viewpoint is dictated by my long-standing privilege of living at the edge of what is possible in terms of performance, scalability, real-time response, energy efficiency, robustness, and much else besides. Where I live, significant levels of fear are not just wise, but also necessary for survival. To borrow an old saying from aviation, there are old pilots, and there are bold pilots, but there are no old bold pilots.

Nevertheless, I expect that my more than three decades of experience with concurrency, my work on the C/C++ memory model (memory_order_consume notwithstanding), and my role as lead maintainer of the Linux-kernel memory model (LKMM) will help provide a good starting point for the more work-a-day situation that I believe that the Rust community wishes to support.

But Doesn't Rust Already Have a Memory Model?

Kinda sorta?

Some in academia have assumed that Rust's memory model will be based on that of C/C++, for example, see here. And the Rustonomicon agrees, though the author of that text does not appear to be very happy about it:

Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not due to this model being particularly excellent or easy to understand. Indeed, this model is quite complex and known to have several flaws. Rather, it is a pragmatic concession to the fact that everyone is pretty bad at modeling atomics. At very least, we can benefit from existing tooling and research around the C/C++ memory model. (You'll often see this model referred to as "C/C++11" or just "C11". C just copies the C++ memory model; and C++11 was the first version of the model but it has received some bugfixes since then.)

Trying to fully explain the model in this book is fairly hopeless. It's defined in terms of madness-inducing causality graphs that require a full book to properly understand in a practical way. If you want all the nitty-gritty details, you should check out the C++ specification. Still, we'll try to cover the basics and some of the problems Rust developers face.

The C++ memory model is fundamentally about trying to bridge the gap between the semantics we want, the optimizations compilers want, and the inconsistent chaos our hardware wants. We would like to just write programs and have them do exactly what we said but, you know, fast. Wouldn't that be great?

I would argue that compiler optimizations are the source of much more inconsistent chaos than hardware would ever dream of providing, but that argument has been going on for many years and I doubt that we will be able to settle it here.

Leaving that argument aside, entertaining though it might be, the wording of this passage of the Rustonomicon certainly invites alternatives to the C/C++ memory model. Let's see what we can do.

Where to Start?

Let's first eliminate a number of candidate memory models. Rust's presumed portability goals rule out any of the hardware memory models. Rust's growing connection to the deep embedded world rules out the memory models of dynamic languages such as Java and Javascript. The growing number of memory models derived from the C/C++ memory model are represented by the actual C/C++ memory model, so we will not consider those variants separately.

This leaves the aforementioned C/C++ memory model and of course LKMM, the latter inspired by the Rust community's ambitions within the Linux kernel. Because this blog series focuses on the Linux kernel, this post will start with LKMM, move to the C/C++ memory model, and end with a specific recommendation.

Linux-Kernel Memory Model (LKMM)

This section will consider aspects of LKMM, starting with the most fear-inducing aspects and moving to the more work-a-day aspects.

Control dependencies are not the most fearless portion of LKMM, in fact, in my role as lead maintainer of LKMM, I need people using control dependencies not to merely feel fearful, but instead to feel downright terrified. Control dependencies are fragile and easy for control-dependency-oblivious compiler optimizations to destroy. Therefore, for the time being, control dependencies should be excluded from any Rust memory model. Perhaps the day will come when we have a good way to communicate the intent of control dependencies to compilers, but today is not that day.

Address and data dependencies carried by pointers are not free of risk, but they do provide dependency-oblivious compilers significantly less scope for destructive optimizations. They should thus require much less fear than do control dependencies, little though that might be saying. Nevertheless, address and data dependencies are mainly used in conjunction with RCU, and we do not yet have a known good way of expressing all RCU use cases within the confines of Rust's ownership model. Therefore, address and data dependencies should also be excluded from Rust's memory model until such time as significant RCU use cases are handled or some other clear and present use case militates for address and data dependencies. It would be increasingly reasonable to permit control and data dependencies within Rust unsafe mode should use of RCU within Rust increase, keeping in mind that things like epoch-based reclamation (EBR) are particular classes of implementations of RCU.
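For illustration, the following is a minimal sketch of the Linux-kernel idiom in question, in which an address dependency is carried from the rcu_dereference() load of an RCU-protected pointer to the subsequent field access. The structure and variable names are hypothetical, chosen only for this example:

struct foo {
        int a;
};
static struct foo __rcu *gp;    /* RCU-protected global pointer. */

int read_foo(void)
{
        struct foo *p;
        int ret = -1;

        rcu_read_lock();
        p = rcu_dereference(gp);        /* Load the pointer... */
        if (p)
                ret = p->a;             /* ...then the dependent field. */
        rcu_read_unlock();
        return ret;
}

The hardware orders the load of p->a after the load of gp by virtue of that address dependency, which is exactly the ordering that dependency-oblivious tooling and compilers cannot be trusted to preserve.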

At first glance, it seems entirely reasonable to countenance use of READ_ONCE() and WRITE_ONCE() within prospective Linux-kernel Rust code, but this post is discussing Rust in general, not just Rust in the Linux kernel. And the fact is that address, data, and control dependencies are the only things that prevent the Linux kernel's use of READ_ONCE() and WRITE_ONCE() from resulting in out-of-thin-air (OOTA) behavior. Except that these operations (along with the Linux kernel's unordered atomic read-modify-write (RMW) operations) are implemented so as to prevent the compiler from undertaking code-motion optimizations that might otherwise reorder such operations. Furthermore, all of the underlying hardware memory models of which I am aware preserve dependency orderings. Therefore, one might expect that these unordered operations might reasonably be part of a Rust memory model.

Unfortunately, one important benefit of a memory model is the tooling that analyzes concurrent code fragments, and if this tooling is to exclude OOTA behaviors, it is absolutely necessary for that tooling to understand dependencies. Except that we have already excluded such dependencies from the Rust memory model.

Therefore, the Rust memory model should restrict its support for Linux-kernel atomic operations to those that provide ordering. These would be the value-returning non-relaxed read-modify-write (RMW) atomic operations along with the _acquire() and _release() variants of non-value-returning RMW atomic operations. It might also make sense to allow combinations of unordered RMW operations with combination memory barriers, for example, atomic_inc() followed by smp_mb__after_atomic(), but it would make even more sense to wrapper these in combination as a single Rust-accessible primitive. This combined Rust primitive would no longer be unordered, and could thus be included as an ordered unit in the Rust memory model. Alternatively, unordered atomics might be relegated to Rust's unsafe mode.
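As a concrete illustration of that wrappering, such a combined primitive might look like the following sketch. The name atomic_inc_mb() is hypothetical, not an existing Linux-kernel API:

/*
 * Hypothetical wrapper combining an unordered RMW operation with a
 * combination memory barrier, so that the pair can be exposed to Rust
 * as a single fully ordered primitive.
 */
static inline void atomic_inc_mb(atomic_t *v)
{
        atomic_inc(v);                  /* Unordered increment... */
        smp_mb__after_atomic();         /* ...made fully ordered. */
}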

It is hard to imagine a useful Rust memory model that excludes locking.

Thus, starting from LKMM, we arrive at a model that supports ordered atomic operations and locking, possibly including unordered atomics in unsafe mode.

C/C++ Memory Model

This section will instead start from the C/C++ memory model.

Because memory_order_relaxed accesses can yield OOTA results, they seem inconsistent with Rust's fearless-concurrency goals. These cannot actually occur in practice with any implementation that I am aware of, but fearless concurrency requires accurate tools, and this accuracy requires excluding even the theoretical possibility of OOTA results. Another reasonable approach would permit use of memory_order_relaxed only within Rust unsafe code.

memory_order_consume is primarily useful in conjunction with RCU. In addition, in all implementations that I am aware of, memory_order_consume is simply promoted to memory_order_acquire. It therefore seems unnecessary to include memory_order_consume within a Rust memory model. As before, a reasonable alternative would permit its use only within Rust unsafe code.

In contrast, memory_order_acquire, memory_order_release, and memory_order_acq_rel, are all easily analyzed and have been heavily used in practice for decades. Although the celebrated memory_order_seq_cst (sequentially consistent) memory ordering can be difficult to analyze, its strong ordering is absolutely required by some use cases, including a sadly large fraction of concurrent algorithms published by academics and industrial researchers. In addition, there is a decades-long tradition of proofs and tooling handling sequential consistency, difficulties aside. Use of all four of these orderings should therefore be permitted from safe Rust code.
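For reference, here is the classic message-passing pattern that memory_order_release and memory_order_acquire are designed for, expressed with C11 atomics; the variable names are illustrative only:

#include <stdatomic.h>
#include <stdbool.h>

static int payload;
static atomic_bool ready;

/* Producer: fill in the payload, then publish it with a release store. */
void produce(int v)
{
        payload = v;
        atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: this acquire load pairs with the release store above, so
 * once ready is observed to be true, payload is guaranteed visible. */
int consume(void)
{
        while (!atomic_load_explicit(&ready, memory_order_acquire))
                ;       /* spin until the flag is set */
        return payload;
}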

With the C/C++ memory model as it was with LKMM, it is hard to imagine a useful Rust memory model that excludes locking.

Thus, starting from the C/C++ memory model, we also arrive at a model that supports ordered atomic operations and locking, possibly along with unordered atomic operations in unsafe mode.

Recommendation

The fact that starting with LKMM got us to roughly the same place as did starting with the C/C++ memory model should give us considerable confidence in that destination. Why the “roughly”? Because there are subtle differences, as can be seen by comparing the C and C++ standards to LKMM.

One reaction to this situation would be to go through these memory models one corner case at a time, and for each corner case making a choice for Rust, now and forever. Perhaps a better approach is to pick one and adapt as needed based on real situations as they arise. At this time, there are far more projects living within the C/C++ memory model than within LKMM. Therefore, despite my role as lead maintainer of LKMM, it is my painful duty to recommend the following, based on the C/C++ memory model:

  1. Locking may be used from both safe and unsafe modes.
  2. Atomic operations using memory_order_acquire, memory_order_release, memory_order_acq_rel, and memory_order_seq_cst may be used from both safe and unsafe modes.
  3. Atomic operations using memory_order_relaxed and memory_order_consume may be used only from unsafe mode.
  4. All atomic operations in Rust code should be marked, that is, Rust should avoid following the C/C++ practice of interpreting an unmarked mention of an atomic variable as a memory_order_seq_cst access to that variable. Requiring marking allows accesses to concurrently accessed shared variables to be identified at a glance, and also makes the choice of any default memory_order value much less pressing.

This provides explicit support for the operations and orderings that have a long history of heavy use and for which analysis tools have long been available. It also provides provisional support for operations and orderings that, while also having a long history of heavy use, are beyond the current state of the art for accurate and complete analysis.

Other Options

One could argue that memory_order_relaxed also has many simple use cases, for but one example, constructing distributed counters. However, given the difficulty verifying its full range of use cases, unsafe mode seems the safest bet at the moment. Should any of the current efforts to more straightforwardly verify memory_order_relaxed accesses bear fruit, then perhaps memory_order_relaxed might be permitted in Rust safe mode.
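As a sketch of the distributed-counter use case mentioned above, consider per-thread counters updated with relaxed atomics and summed on read; the names and fixed thread count are assumptions for illustration only:

#include <stdatomic.h>

#define NR_COUNTER_THREADS 16

static atomic_long counters[NR_COUNTER_THREADS];

/* Each thread increments only its own slot; relaxed ordering suffices
 * because no other data is published via these counters. */
void count_event(int tid)
{
        atomic_fetch_add_explicit(&counters[tid], 1, memory_order_relaxed);
}

/* Readers sum all slots, accepting an approximate, possibly stale total. */
long read_count(void)
{
        long sum = 0;

        for (int i = 0; i < NR_COUNTER_THREADS; i++)
                sum += atomic_load_explicit(&counters[i], memory_order_relaxed);
        return sum;
}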

Finally, one could argue that memory_order_consume should be excluded entirely, rather than simply being relegated to unsafe mode. However, some Rust libraries already use RCU internally, and the ability to flag RCU pointer traversals could prove helpful should Rust someday fully support address and data dependencies.

In contrast, as noted earlier, the other four memory_order enum members are heavily used and routinely analyzed. It is therefore reasonable to permit use of memory_order_acquire, memory_order_release, memory_order_acq_rel, and memory_order_seq_cst within Rust safe code.

What Does This Mean for the Linux Kernel?

As it turns out, not much.

Please keep in mind that the Linux kernel currently interoperates with the full range of memory models used by arbitrary userspace code written in arbitrary languages when running on a wide range of hardware memory models. Any adjustments required are currently handled by architecture specific code, for example, in the system-call layer or in exception entry/exit code. Strong ordering is also provided deeper within the Linux kernel, with but one example being the context-switch code, which must provide full ordering when processes migrate from one CPU to another.

It turns out that Rust code in Linux has wrappers around any C code that it might invoke, and it is through these wrappers that Rust code will make use of the non-Rust portions of the Linux kernel. These wrappers will therefore contain calls to whichever memory-ordering primitives might be required at any particular point in Rust's memory-model evolution.

But Does Rust Really Need to Define Its Memory Model Right Now?

This is of course not my decision.

But it is only fair to point out that the longer the Rust community waits, the more the current “kinda sorta” choice of the full C/C++ memory model will become the actual long-term choice. Including those portions of the C/C++ memory model that have proven troublesome, and that are therefore likely to be subject to change.

My recommendation is therefore to adopt the untroublesome portions of the C/C++ memory model in safe mode, and the rest in unsafe mode. This approach allows people to write work-a-day concurrent algorithms in Rust and to be confident that the resulting code will still work in the future. It also allows those needing to live more dangerously to confine their code to unsafe blocks so as to attract an appropriate level of skepticism and attention.

But again, this decision rests not with me, but with the Rust community.

History

November 4, 2021: Unmarked Rust-language accesses are never atomic operations plus some wordsmithing.

November 03, 2021 05:14 PM

November 01, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: Waiting for Grace Periods From NMI Handlers

Suppose that you had a state machine implemented by NMI handlers, and that some of the transitions in this state machine need to wait for an RCU grace period to elapse. How could these state transitions be implemented?

Before we start, let's dispense with a couple of silly options. First, we clearly are not going to invoke synchronize_rcu() from within an NMI handler. This function can sleep, and we cannot sleep even within non-threaded interrupt handlers, let alone within NMI handlers. Second, we are not going to invoke call_rcu() from within an NMI handler. This function disables interrupts to exclude concurrent invocations on a single CPU, and disabling interrupts does nothing to keep NMI handlers from running. Worse yet, when running on rcu_nocbs CPUs (which offload callback invocations to rcuo kthreads), call_rcu() acquires a lock. Therefore, invoking call_rcu() from within an NMI handler would at best result in deadlock, and might also result in corruption of RCU's lists of callbacks.

So what can we do?

One approach would be to use a self-spawning RCU callback, that is, a callback that invokes call_rcu() on itself. (Yes, this is perfectly legal, just as it is with timer handlers. See the function rcu_torture_fwd_prog_cb() in kernel/rcu/rcutorture.c of v5.14 of the Linux kernel for an example.) This callback could also increment a counter that could be checked by the NMI handler. When the NMI handler wished to defer a state transition until the end of a future RCU grace period, it could transition to an additional “waiting for RCU” state while recording the value of that counter. Later, when the counter had advanced sufficiently, a subsequent NMI handler could complete the transition to the desired state.
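Here is a minimal sketch of such a self-spawning callback; the names nmi_gp_seq, nmi_gp_rh, and nmi_gp_start() are hypothetical, and setup/teardown details are omitted:

static atomic_t nmi_gp_seq;             /* Incremented once per grace period. */
static struct rcu_head nmi_gp_rh;

static void nmi_gp_cb(struct rcu_head *rhp)
{
        atomic_inc(&nmi_gp_seq);        /* Another grace period has elapsed. */
        call_rcu(rhp, nmi_gp_cb);       /* Re-arm for the next grace period. */
}

/* Start the self-spawning callback, for example, at initialization time. */
static void nmi_gp_start(void)
{
        call_rcu(&nmi_gp_rh, nmi_gp_cb);
}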

Of course, it is not sufficient to wait for the counter to advance just once. The reason is that the initial NMI might have occurred just before the RCU callback executed, and the next NMI might happen just afterwards. Yes, the counter has advanced, but almost no time has elapsed, much less a full RCU grace period. It is instead necessary to wait for the counter to advance by two, and also to have the needed memory barriers in place.

But there is a better way that makes use of RCU's existing counters and memory barriers. RCU provides these four functions for this purpose, two of which are usable from NMI handlers:


  1. get_state_synchronize_rcu(), which was first used in v4.10 in 2015, returns a “cookie” that can be checked later. SRCU provides a similar get_state_synchronize_srcu() API.
  2. start_poll_synchronize_rcu() returns a cookie as well, but also ensures that the needed RCU grace period gets scheduled. Unfortunately, this last requires locks to be acquired, which precludes its use in an NMI handler. SRCU provides a similar start_poll_synchronize_srcu() API, which was first used in v5.14 in 2021.
  3. poll_state_synchronize_rcu() takes a cookie from the first two functions and returns true if the corresponding RCU grace period has elapsed. SRCU provides a similar poll_state_synchronize_srcu() API, which was first used in v5.14 in 2021.
  4. cond_synchronize_rcu(), which was first used in v4.10 in 2015, also takes a cookie from the first two functions, but waits (if needed) until the corresponding RCU grace period has elapsed. Unfortunately, the possibility of waiting precludes this function's use in an NMI handler.

So the first NMI handler can invoke get_state_synchronize_rcu() to obtain a cookie, then transition to the additional state. Later NMI handlers residing in this additional state could pass that cookie to poll_state_synchronize_rcu(), completing the transition if this function returns true. On busy systems, RCU grace periods are being initiated by many other things, so that there is almost always a grace period in progress, but if this must work on quiet systems, the aforementioned self-spawning callback could be used to force the issue when needed.
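The following minimal sketch shows how such a state machine might use these two functions. The state names and cookie variable are illustrative, and NMI handlers are assumed to be serialized with respect to each other:

enum nmi_state { NMI_STATE_A, NMI_STATE_WAIT_GP, NMI_STATE_B };

static enum nmi_state nmi_state = NMI_STATE_A;
static unsigned long nmi_gp_cookie;

void example_nmi_handler(void)
{
  switch (nmi_state) {
  case NMI_STATE_A:
    nmi_gp_cookie = get_state_synchronize_rcu();  /* Record grace-period state. */
    nmi_state = NMI_STATE_WAIT_GP;
    break;
  case NMI_STATE_WAIT_GP:
    if (poll_state_synchronize_rcu(nmi_gp_cookie))  /* Full grace period elapsed? */
      nmi_state = NMI_STATE_B;
    break;
  case NMI_STATE_B:
    /* Carry out processing appropriate to the new state. */
    break;
  }
}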

Of course, RCU has made internal use of grace-period polling for a very long time, starting prior to the beginning of the Linux-kernel git repository in 2005.

In short, NMI handlers can now work with both RCU and SRCU grace periods without the need to invent counters or to worry about memory-ordering issues!

November 01, 2021 07:40 PM

October 15, 2021

Paul E. Mc Kenney: Verification Challenges

You would like to do some formal verification of C code? Or you would like a challenge for your formal-verification tool? Either way, here you go!

  1. Stupid RCU Tricks: rcutorture Catches an RCU Bug, also known as Verification Challenge 1.
  2. Verification Challenge 2: RCU NO_HZ_FULL_SYSIDLE.
  3. Verification Challenge 3: cbmc.
  4. Verification Challenge 4: Tiny RCU.
  5. Verification Challenge 5: Uses of RCU.
  6. Verification Challenge 6: Linux-Kernel Tree RCU.
  7. Verification Challenge 7: Heavy Modifications to Linux-Kernel Tree RCU.

October 15, 2021 06:22 PM

October 13, 2021

Paul E. Mc Kenney: TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel

These recommendations assume that the initial Linux-kernel targets for Rust developers are device drivers that do not have unusual performance and scalability requirements, meaning that wrappering of small C-language functions is tolerable. (Please note that most device drivers fit into this category.) They also assume that the main goal is to reduce memory-safety bugs, although other bugs might be addressed as well. Or, Murphy being Murphy, created as well. But that is a risk in all software development, not just Rust in the Linux kernel.

Those interested in getting Rust into Linux-kernel device drivers sooner rather than later should look at the short-term recommendations, while those interested in extending Rust's (and, for that matter, C's) concurrency capabilities might be more interested in the long-term recommendations.

Short-Term Recommendations

The goal here is to allow the Rust language considerable memory-model flexibility while also providing Rust considerable freedom in what it might be used for within the Linux kernel. The recommendations are as follows:

  1. Provide wrappers around the existing Linux kernel's locking, MMIO, I/O-barrier, and I/O-access primitives. If I were doing this work, I would add wrappers incrementally as they were actually needed, but I freely admit that there are benefits to providing a full set of wrappers from the get-go.
  2. Either wrapper READ_ONCE() or be careful to take Alpha's, Itanium's, and ARMv8's requirements into account as needed. The situation with WRITE_ONCE() appears more straightforward. The same considerations apply to the atomic_read() and atomic_set() family of primitives.
  3. Atomic read-modify-write primitives should be made available to Rust programs via wrappers around C code. But who knows? Maybe it is once again time to try out intrinsics. Again, if I were doing the work, I would add wrappers/implementations incrementally.
  4. Although there has been significant discussion surrounding how sequence locking and RCU might be handled by Rust code, more work appears to be needed. For one thing, it is not clear that people proposing solutions are aware of the wide range of Linux-kernel use cases for these primitives. In the meantime, I recommend keeping direct use of sequence locking and RCU in C code, and providing Rust wrappers for the resulting higher-level APIs. In this case, it might be wise to make good use of compiler directives in order to limit Rust's ability to apply code-motion optimizations. However, if you need sequence-locking or RCU Rust-language wrappers, there are a number that have been proposed. (Except that you should first take another look or three at wrappering higher-level APIs. Yes, APIs are hard, but there are huge benefits to proper APIs!)
  5. Control dependencies should be confined to C code. If needed, higher-level APIs whose C-language implementations require control dependencies can be wrappered for Rust use. But my best guess is that it will be some time before Rust code needs control dependencies.

Taking this approach should avoid memory-model-mismatch issues between Rust and C code in the Linux kernel. And discussions indicate that much (and maybe all) of the wrappering work called out in the first three items above has already been done.

Of course, situations that do not fit this set of recommendations can be addressed on a case-by-case basis. I would of course be happy to help. Specifically, in my role as lead Linux-kernel RCU maintainer:

  1. My first reaction to submission of Rust wrappers for the RCU API will be to ask hard questions about the possibility of higher-level APIs.
  2. If higher-level APIs are infeasible, I will look carefully at which RCU use cases are supported by the proposed wrappering. I am working to improve documentation of the range of RCU use cases in order to help submitters to do this as well. (This work will likely be helpful elsewhere as well.)
  3. Again, if higher-level APIs are infeasible, I will look carefully at how the Rust wrappers are helping to find bugs. This will clearly require me to continue learning Rust, as it requires a detailed view of Rust's ownership mechanisms and of how things like reference-counting use cases work around limitations in these mechanisms.

This procedure should help buy the time required for me to learn more about the Rust language and for the Rust community to learn more about RCU and its many use cases.

Long-Term Recommendations

This section takes a more utopian view. What would a perfect Rust sequence-locking implementation/wrapper look like? Here are some off-the-cuff desiderata, none of which are met by the current C-code Linux-kernel implementation:

  1. Use of quantities computed from sequence-lock-protected variables in a failed reader should result in a warning. But please note that reliably associating variables with sequence locks may not be easy. One approach suggested in response to this series is to supply a closure containing the read-side critical section, thus restricting such leakage to (unsafe?) side effects.
  2. Improper access to variables not protected by the sequence lock should result in a warning. But please note that it is quite difficult to define "improper" in this context, let alone detect it in real code.
  3. Data races in failed sequence-lock readers should not cause failures. But please note that this is extremely difficult in general if the data races involve non-volatile C-language accesses in the reader. For example, the compiler would be within its rights to refetch the value after the old value had been checked. (This is why the sequence-locking post suggests marked accesses to sequence-lock-protected variables.)
  4. Data races involving sequence-locking updaters should be detected, for example, via KCSAN.

As noted elsewhere, use of a wrapper around the existing C-language implementation allows C and Rust to use the same sequence lock. This might or might not prove to be important.

Similarly, what would a perfect Rust RCU wrapper look like? Again, here are some off-the-cuff desiderata, similarly unmet by existing C-code Linux-kernel implementations:

  1. Make call_rcu() and friends cause the specified object to make an end-of-grace-period transition in other threads' ownership from readers to unowned. Again, not an immediate transition at the time of call_rcu() invocation, but rather a deferred transition that takes place at the end of some future grace period.
  2. Some Rust notion of type safety would be useful in slab caches flagged as SLAB_TYPESAFE_BY_RCU.
  3. Some Rust notion of existence guarantee would be useful for RCU readers.
  4. Detect pointers to RCU-protected objects being improperly leaked from RCU read-side critical sections, where proper leakage is possible via locks and reference counters. Perhaps an emphasis on closures is at least part of the answer.
  5. Detect mishandling of dependency-carrying pointers returned by rcu_dereference() and friends. Or perhaps introduce some marking of such pointers so that the compiler will avoid breaking the ordering. See the address/data dependency post for a list of C++ working papers moving towards this goal.

There has recently been some work attempting to make the C compiler understand control dependencies. Perhaps Rust could make something useful happen in this area, perhaps by providing markings that allow the compiler to associate the reads with the corresponding control-dependent writes.

Perhaps Rust can better understand more of the ownership schemes that are used within the Linux kernel.

Rationale

Please note that middle-term approaches are likely to be useful, for example, those that simply provide wrappers around the C-language RCU and sequence-locking implementations. However, such approaches also carry risks, especially during that time when the intersection of the set of people deeply understanding Rust with the set of people deeply understanding RCU and sequence locking is the empty set. For example, one risk that became apparent during the effort to add RCU to the C++ standard is that people new to RCU and sequence locking will latch onto the first use case that they encounter and focus solely on that use case. This has also been a recurring issue for concurrency experts who encounter sequence locking and RCU for the first time. This of course means that I need to do a better job of documenting RCU, its use cases, and the relationships between them.

This is not to say that per-use-case work is pointless. In fact, such work can be extremely valuable, if nothing else, in helping to build understanding of the overall problem. It is also quite possible that current RCU and sequence-locking use cases will resemble those of the future in the same way that the venerable "while" loop resembles the wide panoply of iterator-like constructs found not only in modern languages, but also in C code in the Linux kernel. In other words, perhaps Rust will focus on specific fearless-concurrency-friendly RCU/sequence-locking use cases. Except that the Linux kernel still contains "while" loops, as does a great deal of other software, which suggests that the low-level RCU and sequence-locking APIs will always be required. There will also be questions as to whether a given new-age use case is there to help developers using RCU and sequence locking on the one hand, or whether its main purpose is instead to work around a shortcoming in Rust on the other. "This should be good clean fun!" ;-)

Experience indicates that it will take significant time to sort out all of these issues. We should therefore make this time available by proceeding initially as described in the short-term recommendations.

Criteria for judging proposals include safety, performance, scalability, build time, development cost, maintenance effort, and so on, not necessarily in that order.

History

October 18, 2021: Add "Rationale" section.
October 20, 2021: Add a few expansions and clarifications.
October 21, 2021: RCU-specific recommendations.

October 13, 2021 09:08 PM

October 06, 2021

Paul E. Mc Kenney: Rusting the Linux Kernel: Summary and Conclusions

We have taken a quick trip through history, through a number of the differences between the Linux kernel and the C/C++ memory models, sequence locks, RCU, ownership, zombie pointers, and KCSAN. I give a big "thank you" to everyone who has contributed to this discussion, both publicly and privately. It has been an excellent learning experience for me, and I hope that it has also been helpful to all of you.

To date, Can Rust Code Own Sequence Locks? has proven the most popular by far. Porting Linux-kernel code using sequence locking to Rust turns out to be trickier than one might expect, in part due to the inherently data-racy nature of this synchronization primitive.

So what are those advocating use of Rust within the Linux kernel to do?

The first thing is to keep in mind that the Linux kernel comprises tens of millions of lines of code. One disadvantage of this situation is that the Linux kernel is not going to be ported to much of anything quickly or easily. One corresponding advantage is that it allows Rust-for-Linux developers to carefully pick their battles, focusing first on those portions of the Linux kernel where conversion to Rust might do the most good. Given that order-of-magnitude performance wins are unlikely, the focus is likely to be on proof of concept and on reduced bug rates. Given Rust's heavy focus on bugs stemming from undefined behavior, and given the Linux kernel's use of the -fno-strict-aliasing and -fno-strict-overflow compiler command-line options, bug-reduction choices will need to be made quite carefully. In addition, order-of-magnitude bug-rate differences across the source base provide a high noise floor that makes it more difficult to measure small bug-reduction rates. But perhaps enabling the movement of code into mainline and out of staging can be another useful goal. Or perhaps there is an especially buggy driver out there somewhere that could make good use of some Rust code.

Secondly, careful placement of interfaces between Rust and C code is necessary, especially if there is truth to the rumors that Rust does not inline C functions without the assistance of LTO. In addition, devilish details of Linux-kernel synchronization primitives and dependency handling may further constrain interface boundaries.

Finally, keep in mind that the Linux kernel is not written in standard C, but rather in a dialect that relies on gcc extensions, code-style standards, and external tools. Mastering these extensions, standards, and tools will of course require substantial time and effort, but on the other hand this situation provides precedent for Rust developers to also rely on extensions, style standards, and tools.

But what about the question that motivated this blog in the first place? What memory model should Linux-kernel Rust code use?

For Rust outside of the Linux kernel, the current state of compiler backends and the path of least resistance likely leads to something resembling the C/C++ memory model. Use of Rust in the Linux kernel is therefore not a substitute for continued participation in the C/C++ standards committees, much though some might wish otherwise.

Rust non-unsafe code does not depend much on the underlying memory model, at least assuming that unsafe Rust code and C code avoid undermining the data-race-free assumption of non-unsafe code. The need to preserve this assumption was in fact what inspired the blog post discussing KCSAN.

In contrast, Rust unsafe code must pay close attention to the underlying memory model, and in the case of the Linux kernel, the only reasonable choice is of course the Linux-kernel memory model. That said, any useful memory-ordering tool will also need to pay attention to safe Rust code in order to correctly evaluate outcomes. However, there is substantial flexibility, depending on exactly where the interfaces are placed:


  1. As noted above, forbidding use of Rust unsafe code within the Linux kernel would render this memory-model question moot. Data-race freedom for the win!
  2. Allowing Rust code to access shared variables only via atomic operations with acquire (for loads), release (for stores) or stronger ordering would allow Rust unsafe code to use that subset of the Linux-kernel memory model that mirrors the C/C++ memory model.
  3. Restricting use of sequence locks, RCU, control dependencies, lockless atomics, and non-standard locking (e.g., stores to shared variables within read-side reader-writer-locking critical sections) to C code would allow Rust unsafe code to use that subset of the Linux-kernel memory model that closely (but sadly, not exactly) mirrors the C/C++ memory model.
  4. Case-by-case relaxation of the restrictions called out in the preceding pair of items pulls in the corresponding portions of the Linux-kernel memory model. For example, adding sequence locking, but only in cases where readers access only a limited co-located set of objects might permit some of the simpler Rust sequence-locking implementations to be used (see for example the comments to the sequence-locking post).
  5. Insisting that Rust be able to do everything that Linux-kernel C code currently does pulls in the entirety of the Linux-kernel memory model, including those parts that have motivated many of my years of C/C++ standards-committee fun and excitement.

With the first three options, a Rust compiler adhering to the C/C++ memory model will work just fine for Linux-kernel C code. In contrast, the last two options would require code-style restrictions (preferably automated), just as they are in current Linux-kernel C code. See the recommendations post for more detail.

In short, choose wisely and be very careful what you wish for! ;-)

History

October 7, 2021: Added atomic-operation-only option to memory-model spectrum.
October 8, 2021: Fix typo noted by Miguel Ojeda
October 12, 2021: Self-review.
October 13, 2021: Add reference to recommendations post.

October 06, 2021 09:48 PM

Paul E. Mc Kenney: Can the Kernel Concurrency Sanitizer Own Rust Code?

Given the data-race-freedom guarantees of Rust's non-unsafe code, one might reasonably argue that there is no point in the Kernel Concurrency Sanitizer (KCSAN) analyzing such code. However, the Linux kernel is going to need unsafe Rust code, and although there has been significant progress in formal verification of such code, one of the leading researchers in this area says “we hope to eventually develop formal methods that can be put directly in the hands of programmers”. We would be wise to take careful note of the word “eventually”. Furthermore, even given unanticipated universal acclamation of Rust within the Linux kernel community combined with equally unanticipated advances in C-to-Rust translation capabilities, a significant fraction of the existing tens of millions of lines of Linux-kernel C code will persist for some time to come. Both the unsafe Rust code and the C code can interfere with Rust non-unsafe code, and furthermore there are special cases where safe code can violate unsafe code's assumptions. Therefore, run-time analysis of Rust code (safe code included) is likely to be able to find issues that compile-time analysis cannot.

As a result, Rust seems unlikely to render KCSAN obsolete any time soon. This suggests that Rust code should include KCSAN instrumentation, both via the compiler and via atomic and volatile primitives, as is currently the case for the C compilers and primitives currently in use. The public KCSAN interface is as follows:


  1. __kcsan_check_access(): This function makes a relevant memory access known to KCSAN. It takes a pointer to the object being accessed, the size of the object, and an access type (for example, zero for read, KCSAN_ACCESS_WRITE for write, and KCSAN_ACCESS_ATOMIC for atomic read-modify-write).
  2. The __tsan_{read,write}{1,2,4,8}() family of functions serve the same purpose as __kcsan_check_access(), but the size and access type are implicit in the function name instead of being passed as arguments, which can provide better performance. This family of functions is used by KCSAN-enabled compilers.
  3. ASSERT_EXCLUSIVE_ACCESS() is invoked from C code and causes KCSAN to complain if there is a concurrent access to the specified object. A companion ASSERT_EXCLUSIVE_ACCESS_SCOPED() expands the scope of the concurrent-access complaint to the full extent of the compound statement invoking this macro.
  4. ASSERT_EXCLUSIVE_WRITER() is invoked from C code and causes KCSAN to complain if there is a concurrent write to the specified object. A companion ASSERT_EXCLUSIVE_WRITER_SCOPED() expands the scope of the concurrent-writer complaint to the full extent of the compound statement invoking this macro.
  5. ASSERT_EXCLUSIVE_BITS() is similar to ASSERT_EXCLUSIVE_WRITER(), but focuses KCSAN's attention on changes to bits set in the mask passed as this macro's second argument.

These last three categories of interface members are designed to be directly invoked in order to check concurrency designs. See for example the direct call to ASSERT_EXCLUSIVE_WRITER() from the rcu_cpu_starting() function in Linux-kernel RCU. It would of course be quite useful for these interface members to be available from within Rust code.

KCSAN also provides a data_race() primitive that suppresses data-race diagnostics, but allows the compiler free rein on optimization. This is used in the Linux kernel for things like infrequent diagnostics that are not part of the overall concurrency design.
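As a rough illustration of how these interface members fit into a concurrency design, consider the following sketch, in which the counter, its lock, and the functions are hypothetical:

static DEFINE_SPINLOCK(ctr_lock);
static unsigned long ctr;

void ctr_update(void)
{
  spin_lock(&ctr_lock);
  ASSERT_EXCLUSIVE_WRITER(ctr);  /* KCSAN complains about any concurrent writer. */
  WRITE_ONCE(ctr, ctr + 1);      /* Marked write, so lockless readers are permitted. */
  spin_unlock(&ctr_lock);
}

void ctr_show(void)
{
  pr_info("ctr = %lu\n", data_race(ctr));  /* Diagnostic-only read, KCSAN reports suppressed. */
}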

KCSAN has helped me fix a number of embarrassing bugs in Linux-kernel RCU that might otherwise have inconvenienced end users, so it just might be helpful to people introducing Rust code into the Linux kernel. It is therefore quite encouraging that Marco Elver (KCSAN lead developer and maintainer) points out off-list that the rustc compiler already has sanitizer support, which should suffice not only for KCSAN, but for the equally helpful Kernel Address Sanitizer (KASAN) as well. He also notes that some care is required to pass in the relevant -mllvm options to rustc and to ensure that it is not attempting to link against compiler-rt because doing so is not appropriate when building the Linux kernel. Marco also noted that KCSAN relies on the front-end to properly handle volatile accesses, and Miguel Ojeda kindly confirmed that it does. So there is hope!

For more information on KCSAN and its reason for being:

  1. "Concurrency bugs should fear the big bad data-race detector" (Part 1 and Part 2).
  2. Who's afraid of a big bad optimizing compiler?.
  3. Calibrating your fear of big bad optimizing compilers.
  4. Sections 4.3.4 ("Accessing Shared Variables"), 15.2 ("Tricks and Traps"), and 15.3 ("Compile-Time Consternation") of perfbook.


History

October 7, 2021: Add current state of KCSAN/Rust integration. Clarify Rust module per Gary Guo email.
October 12, 2021: Self-review.
October 28, 2021: Add data_race() primitive.

October 06, 2021 07:11 PM

October 05, 2021

Paul E. Mc Kenney: Will Your Rust Code Survive the Attack of the Zombie Pointers?

Some of the previous posts in this series have been said to be quite difficult, so I figured I owed you all an easy one. And the zombie-pointer problem really does have a trivial solution, at least in the context of the Linux kernel. In other environments, all bets are off.

But first, what on earth is a zombie pointer?

To answer that question, let's start with a last-in-first-out (LIFO) stack. We don't know when this algorithm was invented or who invented it, but a very similar algorithm was described in R. J. Treiber's classic 1986 technical report (IBM Almaden Research Center RJ 5118), and it was referred to as prior art in (expired) US Patent 3,886,525, which was filed in 1973 (hat trick to Maged Michael). Fun though it might be to analyze this code's original IBM assembly-language implementation, let's instead look at a C11 implementation:

 1 struct node_t* _Atomic top;
 2
 3 void list_push(value_t v)
 4 {
 5   struct node_t *newnode = (struct node_t *) malloc(sizeof(*newnode));
 6
 7   set_value(newnode, v);
 8   newnode->next = atomic_load(&top);
 9   do {
10     // newnode->next may have become invalid
11   } while (!atomic_compare_exchange_weak(&top, &newnode->next, newnode));
12 }
13
14 void list_pop_all()
15 {
16   struct node_t *p = atomic_exchange(&top, NULL);
17
18   while (p) {
19     struct node_t *next = p->next;
20      
21     foo(p);
22     free(p);
23     p = next;
24   }
25 }

Those used to the Linux-kernel cmpxchg() atomic operation should keep in mind that when the C11 atomic_compare_exchange_weak() fails, it writes the current value referenced by its first argument to the pointer passed as its second argument. In other words, instead of passing atomic_compare_exchange_weak() the new value, you instead pass it a pointer to the new value. Each time atomic_compare_exchange_weak() fails, it updates the pointed-to new value. This sometimes allows this function to be used in a tight loop, as in line 11 above.

With these atomic_compare_exchange_weak() semantics in mind, let's step through execution of list_push(C) on a stack initially containing objects A and B:

  1. Line 5 allocates memory for the new object C.
  2. Line 7 initializes this newly allocated memory.
  3. Line 8 sets the new object's ->next pointer to the current top-of-stack pointer.
  4. Line 11 invokes atomic_compare_exchange_weak(), which atomically compares the value of newnode->next to the value of top, and if they are equal, stores the pointer newnode into top and returns true. (As noted before, if the comparison is instead not-equal, atomic_compare_exchange_weak() instead stores the value of top into newnode->next and returns false, repeating until such time as the pointers compare equal.)
  5. One way or another, the stack ends up containing objects C, A, and B, ordered from the top of stack down.

Oddly enough, this code would still work even if line 8 were omitted; however, that would result in a high probability that the first call to atomic_compare_exchange_weak() would fail, which would not be so good for common-case performance. It would also result in compile-time complaints about uninitialized variables, so line 8 is doubly unlikely to be omitted in real life.

While we are at it, let's step through list_pop_all():

  1. Line 16 atomically stores NULL into top and returns its previous value. After this line executes, the stack is empty and its previous contents are referenced by local variable p.
  2. Lines 18-24 execute for each object on list p:
    1. Line 19 prefetches a pointer to the next object.
    2. Line 21 passes the current object to function foo().
    3. Line 22 frees the current object.
    4. Line 23 advances to the next object.

Thus far, we have seen no zombie pointers. Let's therefore consider the following sequence of events, with the stack initially containing objects A and B:

  1. Thread 1 starts pushing object C, but is interrupted, preempted, or otherwise impeded just after executing line 8.
  2. Thread 2 pops the entire list and frees all of the objects. Object C's ->next pointer is now invalid because it points to now-freed memory formerly known as object A.
  3. Thread 2 pushes an object, and happens to allocate memory for it at the same address as the old object A, so let's call it A'.
  4. Object C's ->next pointer is still invalid as far as an omniscient compiler is concerned, but from an assembly language viewpoint happens to reference the type-compatible object A'. This pointer is now a "zombie pointer" that has come back from the dead, or that has at least come to have a more entertaining form of invalidity.
  5. Thread 1 resumes and executes the atomic_compare_exchange_weak() on line 11. Now this primitive, like its Linux-kernel counterpart cmpxchg(), operates not on the compiler's idea of what the pointer is, but rather on the actual bits making up that pointer. Therefore, that atomic_compare_exchange_weak() succeeds despite the fact that the object C's ->next pointer is invalid.
  6. Thread 1 has thus completed its push of object C, and the resulting stack looks great from an assembly-language viewpoint. But the pointer from object C to A' is a zombie pointer, and compilers are therefore within their rights to do arbitrarily strange things with it.

As noted earlier, this has a trivial solution within the Linux kernel. Please do not peek too early. ;-)

But what can be done outside of the Linux kernel?

The traditional solution has been to hide the allocator from the compiler. After all, if the compiler cannot see the calls to malloc() and free(), it cannot figure out that a given pointer is invalid (or, in C-standard parlance, indeterminate). Creating your own allocator with its own function names suffices, and back in the day you often needed to create your own allocator if performance and scalability were important to you.

However, this would also deprive Rust of information that it needs to check pointer operations. So maybe Rust needs to know about allocation and deallocation, but needs to be very careful not to pass this information on to the optimizer, assuming that this even makes sense in the context of the implementation of Rust.

Perhaps code working with invalid and zombie pointers needs to be carefully written in C. And perhaps there is some way to write it safely in unsafe Rust.

The final approach is to wait until backend support for safe handling of invalid/zombie pointers arrives, and then add Rust syntax as appropriate. Here is some documentation of current efforts towards adding such support to C++:

  1. CPPCON 2020 presentation
  2. Pointer lifetime-end zap (informational/historical)
  3. Pointer lifetime-end zap proposed solutions

So there is good progress, but it might still be some time before this is supported by either the standard or by compilers.

History

October 6, 2021: Expand description per feedback from Jonathan Corbet
October 12, 2021: Self-review.

October 05, 2021 06:40 PM

October 04, 2021

Paul E. Mc Kenney: How Much of the Kernel Can Rust Own?

Rust concurrency makes heavy use of ownership and borrowing. The purpose of this post is not to give an exposition of Rust's capabilities and limitations in this area, but rather to give a series of examples of ownership in the Linux kernel.

The first example involves Linux-kernel per-CPU variables. In some cases, such variables are protected by per-CPU locks, for example, a number of fields in the per-CPU rcu_data structure are used by the kernel threads that manage grace periods for offloaded callbacks, and these fields are protected by the ->nocb_gp_lock field in the same instance of that same structure. In other cases, access to a given per-CPU variable is permitted only by the corresponding CPU, and even then only if that CPU has disabled preemption. For example, the per-CPU rcu_data structure's ->ticks_this_gp field may be updated only from the corresponding CPU, and only when preemption is disabled. In this particular case, preemption is disabled as a side-effect of having disabled interrupts.

The second example builds on the first. In kernels built with CONFIG_RCU_NOCB_CPU=n, the per-CPU rcu_data structure's ->cblist field may be updated only from the corresponding CPU, and only when preemption is disabled. However, it is also allowed from some other CPU when the corresponding CPU has been taken offline, but only when that other CPU is the one orchestrating the offlining of the corresponding CPU.

(What about kernels built with CONFIG_RCU_NOCB_CPU=y? They must also acquire a ->nocb_lock that is also contained within the per-CPU rcu_data structure.)

The third example moves to the non-per-CPU rcu_node structures, which are arranged into a combining tree that processes events from an arbitrarily large number of CPUs while maintaining bounded lock contention. Each rcu_node structure has a ->qsmask bitmask that tracks which of its children need to report a quiescent state for the current grace period, and a ->gp_tasks pointer that, when non-NULL, references a list of tasks that blocked while in an RCU read-side critical section that blocks the current grace period. The children of each leaf rcu_node structure are the rcu_data structures feeding into that rcu_node structure. Finally, there is a singleton rcu_state structure that contains a ->gp_seq field that numbers the current grace period and also indicates whether or not it is in progress.

We now have enough information to state the ownership rule for the rcu_state structure's ->gp_seq field. This field may be updated only if all of the following hold:


  1. The current CPU holds the root rcu_node structure's ->lock.
  2. All rcu_node structures' ->qsmask fields are zero.
  3. All rcu_node structures' ->gp_tasks pointers are NULL.

This might seem excessively ornate, but it is the natural consequence of RCU's semantics and design:

  1. The next grace period is not permitted to start until after the previous one has ended. This is a design choice that allows RCU to function well in the face of update-side overload from large numbers of CPUs.
  2. A grace period is not allowed to end until all CPUs have been observed in a quiescent state, that is, until all rcu_node structures' ->qsmask fields have become zero.
  3. A grace period is not allowed to end until all tasks that were preempted while executing within a critical section blocking that grace period have resumed and exited their critical sections, that is, until all rcu_node structures' ->gp_tasks pointers have become NULL.
  4. If multiple CPUs would like to start a grace period at the same time, there has to be something that works out which CPU is actually going to do the starting, and the last part of that something is the root rcu_node structure's ->lock.

Trust me, there are far more complex ownership models in the Linux kernel!

The fourth example involves per-CPU data that is protected by a reader-writer lock. A CPU is permitted to both read and update its own data only if it read-holds the lock. A CPU is permitted to read other CPUs' data only if it write-holds the lock. That is right, CPUs are allowed to write when read-holding the lock and are allowed to read while write-holding the lock. Perhaps reader-writer locks should instead have been called exclusive-shared locks, but it is some decades too late now!

The fifth and final example involves multiple locks, for example, when some readers must traverse the data structure from an interrupt handler and others must sleep while traversing that same structure. One way to implement this is to have one interrupt-disabled spinlock (for readers in interrupt handlers) and a mutex (for readers that must sleep during the traversal). Updaters must then hold both locks.
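A minimal sketch of this last example appears below, with illustrative lock names and with the assumption that updaters run in process context with interrupts enabled:

static DEFINE_SPINLOCK(irq_reader_lock);   /* Taken by readers in interrupt handlers. */
static DEFINE_MUTEX(sleeping_reader_lock); /* Taken by readers that might sleep. */

static void example_update(void)
{
  mutex_lock(&sleeping_reader_lock);
  spin_lock_irq(&irq_reader_lock);
  /* ... carry out the update, having excluded both classes of readers ... */
  spin_unlock_irq(&irq_reader_lock);
  mutex_unlock(&sleeping_reader_lock);
}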

So how on earth does the current C-language Linux kernel code keep all this straight???

One indispensable tool is the assertion, which in the Linux kernel includes WARN(), BUG(), and friends. For example, RCU uses these to verify the values of the aforementioned ->qsmask and ->gp_tasks fields during between-grace-periods traversals of the rcu_node combining tree.

Another indispensable tool is lockdep, which, in addition to checking for deadlocks, allows code to verify that conditions are right. A few of the many available lockdep assertions are shown below:

  1. lockdep_assert_held(): Complains if the specified lock is not held by the current CPU/task.
  2. lockdep_assert_held_write(): Complains if the specified reader-writer lock is not write-held by the current CPU/task.
  3. lockdep_assert_held_read(): Complains if the specified reader-writer lock is not read-held by the current CPU/task.
  4. lockdep_assert_none_held_once(): Complains if the current CPU/task holds any locks.
  5. lockdep_assert_irqs_disabled(): Complains if the current CPU has interrupts enabled.
  6. rcu_read_lock_held(): Complains if the current CPU/task is not within an RCU read-side critical section.

Of course, none of these assertions are going to do anything for you unless you make use of them. On the other hand, some of the more complex ownership criteria are going to be quite difficult for compilers or other tooling to intuit without significant help from the developer.
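For example, a function that must be invoked with a given lock held might use lockdep_assert_held() as in the following sketch, in which the foo structure and its update function are hypothetical:

struct foo {
  spinlock_t lock;
  int value;
};

static void foo_update(struct foo *fp, int v)
{
  lockdep_assert_held(&fp->lock);  /* Complain unless the caller holds fp->lock. */
  fp->value = v;
}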

A more entertaining ownership scenario is described in Section 5.4.6 ("Applying Exact Limit Counters") of perfbook. Chapter 8 ("Data Ownership") describes several other ownership scenarios.

History

October 12, 2021: Self-review.

October 04, 2021 11:37 PM

Paul E. Mc Kenney: Can Rust Code Own RCU?

Read-copy update (RCU) vaguely resembles a reader-writer lock [1], but one in which readers do not exclude writers. This change in semantics permits RCU readers to be exceedingly fast and scalable. In fact, in the most aggressive case, rcu_read_lock() and rcu_read_unlock() generate no code, and rcu_dereference() emits but a single load instruction. This most aggressive case is achieved in production via Linux-kernel CONFIG_PREEMPT_NONE=y builds. Additional information on RCU is presented at the end of this post.

Can Rust Ownership be Adapted to RCU?

The fact that RCU readers do not exclude writers opens the door to data races between RCU readers and updaters. This in turn suggests that at least some RCU use cases are likely to pose challenges to Rust's ownership semantics. However, discussions with Wedson Almeida Filho at Linux Plumbers Conference suggested that these semantics might be flexed a little bit, perhaps in a manner similar to the way that Linux-kernel RCU flexes lockdep's semantics. In addition, the data races between the read-side rcu_dereference() primitive and the update-side rcu_assign_pointer() primitive might reasonably be resolved by having the Rust implementation of these two primitives use unsafe mode. Or, perhaps better yet, the Rust implementation of these two primitives might simply wrapper the Linux-kernel primitives in order to get the benefit of existing tools such as the aforementioned lockdep and the sparse static-analysis checker. One downside of this approach is potential performance degradation due to the extra function call implied by the wrappering. Longer term, LTO might eliminate this function call, but if the initial uses of Rust are confined to performance-insensitive device drivers, the extra overhead should not be a problem.

Linux-Kernel Tooling and RCU

Here are a few examples of these Linux-kernel tools:

  1. A pointer can be marked __rcu, which will cause the sparse static-analysis checker to complain if that pointer is accessed directly, instead of via rcu_dereference() and rcu_assign_pointer(). This checking can help developers avoid accidental destruction of the address and data dependencies on which many RCU use cases depend.
  2. In most RCU use cases, the rcu_dereference() primitive that accesses RCU-protected pointers must itself be protected by rcu_read_lock(). The Linux-kernel lockdep facility has therefore been adapted to complain about unprotected uses of rcu_dereference().
  3. Passing the same pointer twice in quick succession to a pair of call_rcu() invocations is just as bad as doing the same with a pair of kfree() invocations: Both situations are a double free. The Linux kernel's debug-objects facility checks for this sort of abuse.

This is by no means a complete list. Although I am by no means claiming that these sorts of tools provide the same degree of safety as does Rust safe mode, it is important to understand that Rust unsafe mode is not to be compared to standard C, but rather to the dialect of C used in the Linux kernel, augmented by the associated tools, processes, and coding guidelines.

RCU Use Cases

There are in fact Rust implementations of variants of RCU, including:

  1. crossbeam-rs
  2. ArcSwapAny
I currently have no opinion on whether or not these could be used within the Linux kernel, and any such opinion is irrelevant. You see, Rust's use of RCU within the Linux kernel would need to access RCU-protected data structures provided by existing C code. In this case, it will not help for Rust code to define its own RCU. It must instead interoperate with the existing Linux-kernel RCU implementations.

But even interoperating with Linux-kernel RCU is not trivial. To help with this task, this section presents a few RCU use cases within the Linux kernel.

The first use case is the textbook example where once an RCU-protected object is exposed to readers, its data fields remain constant. This approach allows Rust's normal read-only mode of ownership to operate as intended. The pointers will of course change as objects are inserted or deleted, but these are handled by rcu_dereference() and rcu_assign_pointer() and friends. These special primitives might be implemented using unsafe Rust code or C code, as the case might be.
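A minimal C-language sketch of this textbook use case follows. The conf structure, its mutex, and the access functions are illustrative rather than taken from existing kernel code:

struct conf {
  int a;
  int b;
};

static struct conf __rcu *cur_conf;
static DEFINE_MUTEX(conf_mutex);

int conf_read_a(void)
{
  struct conf *p;
  int ret = -1;

  rcu_read_lock();
  p = rcu_dereference(cur_conf);
  if (p)
    ret = p->a;  /* Data fields are constant once the object is published. */
  rcu_read_unlock();
  return ret;
}

void conf_update(int a, int b)
{
  struct conf *newp = kzalloc(sizeof(*newp), GFP_KERNEL);
  struct conf *oldp;

  if (!newp)
    return;
  newp->a = a;
  newp->b = b;
  mutex_lock(&conf_mutex);
  oldp = rcu_dereference_protected(cur_conf, lockdep_is_held(&conf_mutex));
  rcu_assign_pointer(cur_conf, newp);
  mutex_unlock(&conf_mutex);
  synchronize_rcu();  /* Wait for pre-existing readers to complete. */
  kfree(oldp);
}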

In-place updates of RCU-protected objects are sometimes handled using the list_replace_rcu() primitive. In this use case, a new version of the object is allocated, the old version is copied to the new version, any required updates are carried out on the new version, and then list_replace_rcu() is used to make the new version available to RCU readers. Readers then see either the old version or the new version, but either way they see an object with constant data fields, again allowing Rust ownership to work as intended.

However, one complication arises for objects removed from an RCU-protected data structure. Such objects are not owned by the updater until after a grace period elapses, and they can in fact be accessed by readers in the meantime. It is not clear to me how Rust would handle this. For purposes of comparison, within the Linux kernel, the Kernel Concurrency Sanitizer (KCSAN) handles this situation using dynamic runtime checking. With this checking enabled, when an updater takes full ownership of a structure before readers are done with it, KCSAN reports a data race. These issues should be most immediately apparent to anyone attempting to create a Rust wrapper for the call_rcu() function.
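For reference, here is a minimal sketch of such a removal. The elem structure, its list, and its lock are illustrative, and the point is that the updater does not gain full ownership of the removed object until the grace period following the call_rcu() invocation has ended:

struct elem {
  struct list_head node;
  struct rcu_head rh;
  int data;
};

static LIST_HEAD(elem_list);
static DEFINE_SPINLOCK(elem_lock);

static void elem_free_cb(struct rcu_head *rhp)
{
  kfree(container_of(rhp, struct elem, rh));  /* Now safe: pre-existing readers are done. */
}

void elem_remove(struct elem *p)
{
  spin_lock(&elem_lock);
  list_del_rcu(&p->node);  /* Concurrent readers might still be referencing p. */
  spin_unlock(&elem_lock);
  call_rcu(&p->rh, elem_free_cb);  /* Defer the free until a grace period elapses. */
}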

Because RCU readers do not exclude RCU updaters, it is possible for an RCU reader to upgrade itself to an updater while traversing an RCU-protected data structure. This is usually handled by a lock embedded in the object itself. From what I understand, the easiest way to apply Rust ownership to these situations is to make the RCU-protected object contain a structure that in turn contains the data protected by the lock. People wishing to apply Rust to these sorts of situations are invited to review the Linux-kernel source code to check for possible complications, for example, cases where readers locklessly read fields that are updated under the protection of the lock, and cases where multiple locks provide one type of protection or another to overlapping subsets of fields in the RCU-protected object.

Again because RCU readers do not exclude RCU updaters, there can also be lockless in-place updates (either from readers or updaters) to RCU-protected objects, for example, to set flags, update statistics, or to provide type-safe memory for any number of lockless algorithms (for example, using SLAB_TYPESAFE_BY_RCU). Rust could presumably use appropriate atomic or volatile operations in this case.

One important special case of RCU readers locklessly reading updated-in-place data is the case where readers cannot be allowed to operate on an object that has been removed from the RCU-protected data structure. These situations normally involve a per-object flag and lock. Updaters acquire the object's lock, remove the object from the structure, set the object's flag, and release the lock. Readers check the flag, and if the flag is set, pretend that they did not find the corresponding object. This class of use cases illustrates how segregating read-side and update-side fields of an RCU-protected object can be non-trivial.
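The following sketch illustrates both the reader-to-updater upgrade described above and the removed-object flag check. The obj structure and the obj_find() and obj_do_something() helpers are hypothetical:

struct obj {
  spinlock_t lock;
  bool dead;  /* Set by updaters when the object is removed. */
};

struct obj *obj_find(int key);         /* Hypothetical RCU-protected lookup. */
void obj_do_something(struct obj *p);  /* Hypothetical per-object update. */

bool obj_use(int key)
{
  struct obj *p;
  bool ret = false;

  rcu_read_lock();
  p = obj_find(key);
  if (p) {
    spin_lock(&p->lock);  /* Upgrade from reader to updater. */
    if (!p->dead) {       /* Pretend removed objects were not found. */
      obj_do_something(p);
      ret = true;
    }
    spin_unlock(&p->lock);
  }
  rcu_read_unlock();
  return ret;
}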

RCU is sometimes used to protect combinations of data structures, sometimes involving nested RCU readers. In some cases, a given RCU read-side critical section might span a great many functions across several translation units.

It is not unusual for RCU use cases to include a search function that is invoked both by readers and updaters. This function will therefore sometimes be within an RCU read-side critical section and other times be protected by the update-side lock (or by some other update-side synchronization mechanism).

Sequence locking is sometimes used in conjunction with RCU, so that RCU protects traversal of the data structure and sequence locking detects updates profound enough to cause problems for the RCU traversals. The poster boy for this use case is in the Linux kernel's directory-entry cache. In this case, certain sequences of rename operations could fool readers into traversing pathnames that never actually existed. Using the sequence lock named rename_lock to protect such traversals allows readers to reject such bogus pathnames. More information on the directory-entry cache is available in Neil Brown's excellent Linux Weekly News series (part 1, part 2, part 3). One key take-away of this use case is that sequence locking is sometimes used to protect arbitrarily large data structures.

Sequence locking can be used to easily provide readers with a consistent view of a data structure, for a very wide range of definitions of "consistent". However, all of these techniques allow updaters to block readers. There are a number of other approaches to read-side consistency both within and across data structures, including the Issaquah Challenge.

One final use case is phased state change, where RCU readers check a state variable and take different actions depending on the current state. Updaters can change the state variable, but must wait for the completion of all readers that might have seen the old value before taking further action. The rcu-sync functionality (rcu_sync_init() and friends) implements a variant of RCU-protected phased state change. This use case is notable in that there are no linked data structures and nothing is ever freed.
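Setting the rcu-sync machinery aside, a minimal sketch of phased state change using vanilla RCU might look as follows, where the state variable and the do_old_thing()/do_new_thing() helpers are hypothetical:

static int new_state;

void do_old_thing(void);  /* Hypothetical old-state action. */
void do_new_thing(void);  /* Hypothetical new-state action. */

void reader(void)
{
  rcu_read_lock();
  if (READ_ONCE(new_state))
    do_new_thing();
  else
    do_old_thing();
  rcu_read_unlock();
}

void switch_to_new_state(void)
{
  WRITE_ONCE(new_state, 1);
  synchronize_rcu();  /* Wait for readers that might have seen the old state. */
  /* It is now safe to take actions incompatible with the old state. */
}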

This is by no means an exhaustive list of RCU use cases. RCU has been in the Linux kernel for almost 20 years, and during that time kernel developers have come up with any number of new ways of applying RCU to concurrency problems.

Rust RCU Options

The most straightforward approach is to provide Rust wrappers for the relevant C-language RCU API members. As noted earlier, wrappering the low-overhead read-side API members will introduce function-call overhead in non-LTO kernels, but this should not be a problem for performance-insensitive device drivers. However, fitting RCU into Rust's ownership model might not be as pretty as one might hope.

A more utopian approach would be to extend Rust's ownership model, one way or another, to understand RCU. One approach would be for Rust to gain "type safe" and "guaranteed existence" ownership modes, which would allow Rust to better diagnose abuses of the RCU API. For example, this might help with pointers to RCU-protected objects being erroneously leaked from RCU read-side critical sections.

So how might a pointer to an RCU-protected object be non-erroneously leaked from an RCU read-side critical section, you ask? One way is to acquire a reference via a reference count within that object before exiting that critical section. Another way is to acquire a lock within that object before exiting that critical section.

So what does the Linux kernel currently do to find this type of pointer leak? One approach is to enable KCSAN and build the kernel with CONFIG_RCU_STRICT_GRACE_PERIOD=y, and to run this on a small system. Nevertheless, this might be one area where Rust could improve Linux-kernel diagnostics, albeit not without effort.

For More Information

Additional information on RCU may be found here:

  1. Linux-kernel documentation, with special emphasis on Linux-kernel RCU's requirements.
  2. The following sections of perfbook discuss other aspects of RCU:

    1. Section 9.5 ("Read-Copy Update").
    2. Section 10.3 ("Read-Mostly Data Structures").
    3. Section 13.5 ("RCU Rescues").
    4. Section 15.4.2 ("RCU [memory-ordering properties ]").
  3. The RCU API, 2019 Edition.
  4. Userspace RCU.
  5. Folly-library RCU.
  6. Section 9.6.3.3 of perfbook lists several other RCU implementations.
  7. Proposed Wording for Concurrent Data Structures: Read-Copy-Update (RCU) (C++ Working Paper).

RCU's semantics ordered by increasing formality:

  1. The "Fundamental Requirements" section of the aforementioned Linux-kernel RCU requirements. Extremely informal, but you have to start somewhere!
  2. RCU Semantics: A First Attempt. Also quite informal, but with a bit more math involved.
  3. Verifying Highly Concurrent Algorithms with Grace (extended version). Formal semantics expressed in separation logic.
  4. Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel. Executable formal semantics expressed in Cat language. This paper also includes a proof that the traditional "wait for all pre-existing readers" semantic is equivalent to these executable formal semantics within the context of the Linux-kernel memory model.
  5. Linux-kernel memory model (LKMM), see especially the Sync-rcu and Sync-srcu relations in file linux-kernel.cat.

Furthermore, significant portions of Linux-kernel RCU have been subjected to mechanical proofs of correctness:

  1. Lihao Liang applied the C Bounded Model Checker (CBMC) to Tree RCU (paper).
  2. Michalis Kokologiannakis applied Nidhugg to Tree RCU (paper).
  3. Lance Roy applied CBMC to Classic SRCU, represented by Linux-kernel commit 418b2977b343 ("rcutorture: Add CBMC-based formal verification for SRCU"). Sadly, the scripting has not aged well.

For more information, see Verification Challenge 6.

For those interested in the userspace RCU library, there have been both manual and mechanical proofs of correctness. A manual proof of correctness of userspace RCU in the context of a linked list has also been carried out.

Endnotes

[1]  At its core, RCU is nothing more nor less than a way of waiting for already-started things to finish. Within the Linux kernel, something to be waited on is delimited by rcu_read_lock() and rcu_read_unlock(). For their part, synchronize_rcu() and call_rcu() do the waiting, synchronously and asynchronously, respectively.

It turns out that you can do quite a lot with this simple capability, including approximating some reader-writer-lock use cases, emulating a number of reference-counting use cases, providing type-safe memory, and providing existence guarantees, this last sometimes being pressed into service as a poor-man's garbage collector. Section 9.3.3 ("RCU Usage") of perfbook provides more detail on these and other RCU use cases.


History

October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add RCU-protected phased state change use case.
October 18, 2021: Add read/update common code and Tasseroti citation.

October 04, 2021 07:41 PM

October 02, 2021

Paul E. Mc Kenney: Can Rust Code Own Sequence Locks?

What Are Sequence Locks?

Sequence locks vaguely resemble reader-writer locks except that readers cannot block writers. If a writer runs concurrently with a reader, that reader is forced to retry its critical section. If a reader successfully completes its critical section (as in there were no concurrent writers), then its accesses will have been data-race free. Of course, an incessant stream of writers could starve all readers, and the Linux-kernel sequence-lock implementation provides extensions to deal with this in cases where incessant writing is a possibility. The usual approach is to instead arrange things so that writers are never incessant.

In the simple case where readers loop until they succeed, with no provisions for incessant writers, a reader looks like this:

do {
  seq = read_seqbegin(&test_seqlock);
  /* read-side access. */
} while (read_seqretry(&test_seqlock, seq));

A writer looks like this:

write_seqlock(&test_seqlock);
  /* Update */
write_sequnlock(&test_seqlock);

The read-side access can run concurrently with the writer's update, which is not consistent with the data-race-free nature of non-unsafe Rust. How can this be handled?

Linux-Kernel Sequence-Lock Use Cases

First, please note that if Rust-language sequence-locking readers are to interoperate with C-language sequence-locking updaters (or vice versa), then the Rust-language and C-language implementations must be compatible. One easy way to achieve such compatibility is to provide Rust code wrappers around the C-language implementation, or perhaps better yet, around higher-level C-language functions making use of sequence locking. Of course, if Rust-language use of sequence locks only ever accesses data that is never accessed by C-language code, then Rust could provide its own implementation of sequence locking. As always, choose carefully!

Next, let's look at a few classes of sequence-locking use cases within the Linux kernel.

One common use case is the textbook case where the read-side loop picks up a few values, all within a single function. For example, sequence locking is sometimes used to fetch a 64-bit value on a 32-bit system (see the sock_read_timestamp() function in include/net/sock.h). The remainder of this section looks at some additional use cases that receive significant Linux-kernel use:

  1. Readers can gather data from multiple sources, including functions in other translation units that are invoked via function pointers. See for example the timer_cs_read() function in arch/sparc/kernel/time_32.c.
  2. Readers often perform significant computations. See for example the badblocks_check() function in block/badblocks.c.
  3. Readers can contain sequence-lock writers for that same sequence lock, as is done in the sdma_flush() function in drivers/infiniband/hw/hfi1/sdma.c. Although one could reasonably argue that this could be structured in other ways, this approach does avoid an added check, which could be valuable on fastpaths.
  4. The call to read_seqretry() is sometimes buried in a helper function, for example, in the sdma_progress() function in drivers/infiniband/hw/hfi1/sdma.h and the follow_dotdot_rcu() function in fs/namei.c.
  5. Readers sometimes use memcpy(), as might be expected by those suggesting that sequence-lock readers should clone/copy the data. See for example the neigh_hh_output() function in include/net/neighbour.h.
  6. Sequence-locking readers are often used in conjunction with RCU readers, as is described in the Can Rust Code Own RCU? posting.

Some of these use cases impose significant constraints on sequence-locking implementations. If the constraints rule out all reasonable Rust implementations, one approach would be to provide different Rust implementations for different use cases. However, it would be far better to avoid further API explosion!

Rust Sequence-Lock Options

One approach would be for all sequence-lock critical sections to be marked unsafe, but this removes at least some of the rationale for switching from C to Rust. In addition, some believe that this approach can pose severe Rust-syntax challenges in some use cases.

Another approach is to require all sequence-lock critical sections to use marked accesses. This avoids the data races and presumably also the need for unsafe, but it prevents the kernel concurrency sanitizer (KCSAN) from locating data races involving sequence-lock write-side critical sections. Or might, if it were not for the use of kcsan_flat_atomic_begin() and kcsan_flat_atomic_end() in sequence-locking read-side primitives and the use of kcsan_nestable_atomic_begin() and kcsan_nestable_atomic_end() in sequence-locking write-side primitives. So perhaps Rust could make use of similar hooks.

Yet another approach is to place the data referenced by readers into a separate structure and to switch back and forth between a pair of instances of such structures. This has been attempted in the Linux kernel with unsatisfactory results. In fact, in those cases where it is feasible to use multiple instances of a separate structure, Linux-kernel RCU is usually a better choice.

One more approach would be to create something like a "tentatively readable" Rust ownership class that could directly handle sequence-locking read-side critical sections. For example, if a quantity computed from a read within a failed critical section were used subsequently, Rust could emit a warning, hopefully without much in the way of false positives. (We can dream, can't we?)

One might instead search the Linux-kernel source tree for uses of sequence locking and note that it is used only by a very few device drivers:

     16 drivers/dma-buf/
      1 drivers/firmware/efi/
      5 drivers/gpu/drm/
      2 drivers/gpu/drm/amd/amdgpu/
      2 drivers/gpu/drm/i915/gem/
     15 drivers/gpu/drm/i915/gt/
      5 drivers/hwmon/
     72 drivers/infiniband/hw/hfi1/
     11 drivers/md/
     15 drivers/net/ethernet/mellanox/mlx4/
     19 drivers/net/ethernet/mellanox/mlx5/core/lib/

These device drivers (or at least those portions of them that make use of sequence locking) could then be relegated to C code.

However, the use of sequence locking has been increasing, and it is likely that it will continue to increase, including within device drivers. Longer term, Rust's ownership model should therefore be extended to cover sequence locking in a natural and safe manner.

For More Information

For more on sequence locking, see the seqlock.rst documentation. Sequence locking is also covered in Sections 9.4 ("Sequence Locking") and 13.4 ("Sequence-Locking Specials") of perfbook.

History

October 6, 2021: Updated to add sectioning, Linux-kernel use cases, and updated KCSAN commentary.
October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add a note discussing possible interoperability between Rust-language and C-language sequence locking.

October 02, 2021 12:16 AM

October 01, 2021

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (OOTA)

Out-of-thin-air (OOTA) values are a troubling theoretical problem, but thus far do not actually occur in practice. In the words of the Bard: "It is a tale told by an idiot, full of sound and fury, signifying nothing."

But what is life without a little sound and fury? Besides, "mere theory" can loom large for those creating the tools required to analyze concurrent software. This post therefore gives a brief introduction to OOTA, and then provides some simple practical solutions, along with some difficult ones.

Here is the canonical OOTA example, involving two threads each having but one statement:

T1: x.store(y.load(mo_relaxed), mo_relaxed);
T2: y.store(x.load(mo_relaxed), mo_relaxed);

Believe it or not, according to the C/C++ memory model, even if both x and y are initially 0, after both threads complete it might be that x==y==42!

LKMM avoids OOTA by respecting dependencies, at least when they are headed by marked accesses and the dependent access is either itself marked or free of data races. For example:

T1: WRITE_ONCE(x, READ_ONCE(y));
T2: WRITE_ONCE(y, READ_ONCE(x));

If x and y are both initially zero, then they are both guaranteed to stay that way!

Great quantities of ink have been spilled on the OOTA problem, but these C++ working papers give a reasonable overview and cite a few of the more prominent papers:

  1. P0422R0: Out-of-Thin-Air Execution is Vacuous. This presents a fixed-point analysis that to the best of my knowledge is the first scheme capable of differentiating OOTA scenarios from straightforward reordering scenarios. Unfortunately, this analysis cannot reasonably be applied to automated tooling. This paper also includes a number of references to other work, which unfortunately either impose unnecessary overhead on weakly ordered systems on the one hand or are well outside of what compiler developers are willing to countenance on the other.
  2. P1916R0: There might not be an elegant OOTA fix. This diabolically clever paper shows how back-propagation of undefined behavior can further complicate the OOTA question.
  3. P2215R0: "Undefined behavior" and the concurrency memory model. This paper looks at ways of addressing the issues raised in P1916R0.
  4. P2055R0: A Relaxed Guide to memory_order_relaxed. This paper presents memory_order_relaxed use cases that are believed to avoid the OOTA problem.

As with address, data, and control dependencies, the trivial solution is for Rust to simply promote all READ_ONCE() operations to smp_load_acquire(), but presumably while leaving the C-language READ_ONCE() untouched. This approach increases overhead, but keeps the people building analysis tools happy, or at least happier. Unfortunately, this trivial solution incurs overhead on weakly ordered systems, overhead that is justified only in theory, never in practice.
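
As a concrete illustration (a sketch, not a proposal for specific primitives), this promotion would transform the earlier two-thread example into something like the following, with each load upgraded to an acquire load:

T1: WRITE_ONCE(x, smp_load_acquire(&y));
T2: WRITE_ONCE(y, smp_load_acquire(&x));

Intuitively, the acquire ordering prevents either store from being ordered before its thread's load, which breaks the cycle that an OOTA outcome would require.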

But if analysis tools are not an issue, then Rust could simply use volatile loads (perhaps even directly using an appropriately Rust-wrapped instance of the Linux kernel's READ_ONCE() primitive), just as the Linux kernel currently does (but keeping architecture-specific requirements firmly in mind). The problem is after all strictly theoretical.

History

October 12, 2021: Self-review. Note that the comments are specific to earlier versions of this blog post.
November 12, 2021: Add Hans Boehm's P2215R0 C++ working paper.

October 01, 2021 11:27 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (Address/Data)

An address dependency involves a load whose return value directly or indirectly determines the address of a later load or store, which results in the earlier load being ordered before the later load or store. A data dependency involves a load whose return value directly or indirectly determines the value stored by a later store, which results in the load being ordered before the store. These are used heavily by RCU. Although they are not quite as fragile as control dependencies, compilers still do not know about them. Therefore, care is still required, as can be seen in the rcu_dereference.rst Linux-kernel coding guidelines. As with control dependencies, address and data dependencies enjoy very low overheads, but unlike control dependencies, they are used heavily in the Linux kernel via rcu_dereference() and friends.
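
As a reminder of exactly what is being ordered, here is a minimal sketch of the address-dependency pattern that rcu_dereference() supports; the structure and the update-side details are invented for illustration:

struct foo {
	int a;
	struct foo __rcu *next;
};
struct foo __rcu *gp;	/* Updated elsewhere via rcu_assign_pointer(). */

int foo_read_a(void)
{
	struct foo *p;
	int ret = 0;

	rcu_read_lock();
	p = rcu_dereference(gp);	/* This load heads the address dependency. */
	if (p)
		ret = p->a;		/* Dependent load: ordered after the load of gp. */
	rcu_read_unlock();

	return ret;
}

A data dependency instead has the load's return value determine the value stored by a later store (for example, storing a value computed from p->a), rather than the address accessed. Rust's options include the following: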


  1. As with control dependencies, the trivial solution is to promote READ_ONCE() to smp_load_acquire(). Unlike with control dependencies, there are only a few such READ_ONCE() instances, and almost all of them are conveniently located in definitions of the rcu_dereference() family of functions. Because the rest of the kernel might not be happy with the increase in overhead due to such a promotion, it would likely be necessary to provide Rust-specific implementations of rcu_dereference() and friends.
  2. Again as with control dependencies, an even more trivial solution is to classify code containing address and data dependencies as core Linux-kernel code that is outside of Rust's scope. However, given that there are some thousands of instances of rcu_dereference() scattered across the Linux kernel, this solution might be a bit more constraining than many Rust advocates might hope for.
  3. Provide Rust wrappers for the rcu_dereference() family of primitives. This does incur additional function-call overhead, but on the other hand, if initial use of Rust is confined to performance-insensitive device drivers, this added overhead is unlikely to be problem.
  4. But if wrapper overhead nevertheless proves problematic, provide higher-level C-language functions that encapsulate the required address and data dependencies, and that do enough work that the overhead of the wrappering for Rust-language use is insignificant.
  5. And yet again as with control dependencies, the best approach from the Linux-kernel-in-Rust developer's viewpoint is for Rust to enforce the address/data-dependency code-style restrictions documented in the aforementioned rcu_dereference.rst. There is some reason to hope that this enforcement would be significantly easier than for control dependencies.
  6. Wait for compiler backends to learn about address and data dependencies. This might take some time, but there is ongoing work along these lines that is described below.

One might hope that C/C++'s memory_order_consume would correctly handle address and data dependencies, and in fact it does. Unfortunately, in all known compilers, it does so by promoting memory_order_consume to memory_order_acquire, which adds overhead just as surely as does smp_load_acquire(). There has been considerable work done over a period of some years towards remedying this situation, including these working papers:

  1. P0371R1: Temporarily discourage memory_order_consume (for some definition of "temporarily").
  2. P0098R1: Towards Implementation and Use of memory_order_consume, which reviews a number of potential remedies.
  3. P0190R4: Proposal for New memory_order_consume Definition, which selects a solution involving marking pointers carrying dependencies. Note that many Linux-kernel developers would likely demand a compiler command-line argument that caused the compiler to act as if all pointers had been so marked. Akshat Garg prototyped marked dependency-carrying pointers in gcc as part of a Google Summer of Code project.
  4. P0750R1: Consume, which proposes carrying dependencies in an extra bit associated with the pointer. This approach is much more attractive to some compiler developers, but many committee members did not love the resulting doubling of pointer sizes.
  5. P0735R1: Interaction of memory_order_consume with release sequences, which addresses an obscure C/C++ memory-model corner case.

There appears to be a recent uptick in interest in a solution to this problem, so there is some hope for progress. However, much more work is needed.

More information on address and data dependencies may be found in Section 15.3.2 ("Address- and Data-Dependency Difficulties") of perfbook.

History

October 12, 2021: Self-review.

October 01, 2021 09:04 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (Control)

At the assembly language level on many weakly ordered architectures, a conditional branch acts as a very weak, very cheap, but very useful memory-barrier instruction. It orders any load whose return value feeds into the condition codes before all stores that execute after the branch instruction completes, whether the branch is taken or not. ARMv8 also has a conditional-move instruction (CSEL) that provides similar ordering.

Because the ordering properties of the conditional branch involve dependencies from the load to the branch and from the branch to the store, and because the branch is a control-flow instruction, this ordering is said to be due to a control dependency.

Because compilers do not understand them, control dependencies are quite fragile, as documented by the many cautionary tales in the Linux kernel's memory-barriers.txt documentation (search for the "CONTROL DEPENDENCIES" heading). But they are very low cost, so they are used on a few critically important fastpaths in the Linux kernel.
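
As a minimal sketch of the sort of pattern documented there (variable names invented for illustration), the load from x is ordered before the store to y by the conditional branch:

int x, y;

void control_dependency_example(void)
{
	if (READ_ONCE(x))		/* This load heads the control dependency. */
		WRITE_ONCE(y, 1);	/* The conditional branch orders this store after the load. */
}

As memory-barriers.txt explains at length, seemingly innocuous changes (for example, storing the same value on both branches of the "if" statement) allow the compiler to break this ordering, which is exactly what makes control dependencies so fragile.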

Rust could deal with control dependencies in a number of ways:


  1. The trivial solution is to promote the loads heading the control dependencies to smp_load_acquire(). This works, but adds instruction overhead on some architectures and needlessly limits compiler optimizations on all architectures (but to be fair, ARMv8 does exactly this when built with link-time optimizations). Another difficulty is identifying (whether manually or automatically) exactly which READ_ONCE() calls need to be promoted.
  2. An even more trivial solution is to classify code containing control dependencies as core Linux-kernel code that is outside of Rust's scope. Because there are very few uses of control dependencies in the Linux kernel, Rust would not lose much by taking this approach. In addition, there is the possibility of creating higher-level C-language primitives containing the needed control dependencies which are then wrappered for Rust-language use.
  3. The best approach from the Linux-kernel-in-Rust developer's viewpoint is for Rust to enforce the code style restrictions documented in memory-barriers.txt. However, there is some chance that this approach might prove to be non-trivial.
  4. Wait for compiler backends to learn about control dependencies. This might be a bit of a wait, especially given the difficulty even defining control dependencies within the current nomenclature of the C/C++ standards.

More information on control dependencies may be found in Section 15.3.3 ("Control-Dependency Calamities") of perfbook.

History

October 12, 2021: Self-review.

October 01, 2021 08:24 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Atomics and Barriers and Locks, Oh My!

One way to reduce the number of occurrences of unsafe in Rust code in Linux is to push the unsafety down into atomic operations, memory barriers, and locking primitives, which are the topic of this post. But first, here are some materials describing LKMM:


  1. P0124R7: Linux-Kernel Memory Model: A C++ standards-committee working paper comparing the C/C++ memory model to LKMM.
  2. Linux Weekly News series on LKMM (Part 1 and Part 2).
  3. The infamous ASPLOS'18 paper entitled Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel (non-paywalled), with a title-based tip of the hat to the irrepressible Mel Gorman.
  4. Chapter 15 of perfbook ("Advanced Synchronization: Memory Ordering").
  5. The Linux kernel's tools/memory-model directory, featuring an executable version of LKMM.

For all of these references, I give a big "Thank You!!!" to my co-authors.

LKMM is not the most complex memory model out there, but neither is it the simplest. In addition, it is in some ways more strict than the C/C++ memory models, which means that strict adherence to coding guidelines is required in order to prevent compiler optimizations from breaking Linux-kernel code. Many of these optimizations are not localized, but are instead scattered hither and yon throughout the compilers, including throughout the compiler backends. The optimizations in the backends are a special challenge to Rust, which seems to take the approach of layering safety on top of (or perhaps within) the compiler frontend. Later posts in this series will look at several pragmatic options available to Rust Linux-kernel code.

There is one piece of good news: Compilers are forbidden from introducing data races, at least into code that is free of undefined behavior.

With all of that out of the way, let's look at Rust's options for dealing with Linux-kernel atomics and barriers and locks.

The first approach is to carefully read the P0124R7: Linux-Kernel Memory Model working paper and even more carefully follow its advice in selecting C/C++ primitives that best match Linux-kernel atomics, barriers, and locks. This approach works well for data whose definition and use is confined to Rust code, and with sufficient care and ongoing attention can also work for atomic operations and memory barriers involving data shared with C code. However, expecting Rust locking primitives to interoperate with Linux-kernel locking primitives might not be a winning strategy. It seems wise to make direct use of the existing Linux-kernel locking primitives, keeping in mind that this means properly wrappering them in order to make Rust ownership work properly (a minimal sketch of such wrappering appears after the list below). Those who doubt the wisdom of wrappering the C-language Linux-kernel locking primitives should consider the following:

  1. Linux-kernel locks are complex and highly optimized. Keeping two implementations is an excellent way to inject profound bugs into the Linux kernel.
  2. Linux-kernel locks are deeply entwined with the lockdep lock dependency checker. The data structures implementing each lock class would need to be shared between C and Rust code, which is another excellent way to inject bugs.
  3. On some architectures, Linux-kernel locks must interact with memory-mapped I/O (MMIO) accesses. Any Rust-language implementation of Linux-kernel locks must therefore be architecture-dependent and must know quite a bit about Linux-kernel MMIO.
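
For example, the Rust side might invoke thin C-language shims along the following lines, so that lockdep, architecture-specific code, and any MMIO interactions all remain within the one existing implementation. This is only a sketch, and the helper names are hypothetical:

void rust_helper_spin_lock(spinlock_t *lock)
{
	spin_lock(lock);	/* Existing primitive: lockdep and architecture details included. */
}

void rust_helper_spin_unlock(spinlock_t *lock)
{
	spin_unlock(lock);
}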

As described in later sections, it might be useful to promote READ_ONCE() to smp_load_acquire() instead of implementing it as a volatile load. It might also be useful to promote WRITE_ONCE() to smp_store_release() instead of implementing it as a volatile store, depending on what sort of data-race analysis Rust provides for unsafe code. There is some C/C++ work in flight towards providing better definitions for volatile operations, but it is still early days for this work.
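
A minimal sketch of what such promotion might look like, using hypothetical Rust-facing wrapper names and leaving the C-language primitives untouched:

#define rust_read_once(x)	smp_load_acquire(&(x))		/* Hypothetical: volatile load promoted to acquire. */
#define rust_write_once(x, v)	smp_store_release(&(x), (v))	/* Hypothetical: volatile store promoted to release. */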

If READ_ONCE() and WRITE_ONCE() are instead to be implemented as volatile operations in Rust, please take care to check the individual architectures that are affected. DEC Alpha requires a full memory-barrier instruction at the end of READ_ONCE(), Itanium requires promotion of volatile loads to acquire loads (but this is carried out by the compiler), and ARMv8 requires READ_ONCE() to be promoted to acquire (but only in CONFIG_LTO=y builds).

Device drivers make heavy use of volatile accesses and memory barriers for MMIO accesses, and Linux-kernel device drivers are no exception. As noted earlier, some architectures require that these accesses interact with locking primitives. Furthermore, there are many device-specific special cases surrounding device control in general and MMIO in particular. Therefore, Rust-language device drivers should access the existing Linux-kernel C-language primitives rather than creating their own, especially to start with. There might well be exceptions to this rule, for example, Rust might be applied to a device driver that is only used by architectures that do not require interaction with locking primitives. But if you write a driver containing Rust-language MMIO primitives, please carefully and prominently document the resulting architecture restrictions.

This suggests another approach, namely to avoid implementing any of these primitives in Rust, and instead to make direct use of the Linux-kernel implementations, as suggested earlier for locking and MMIO primitives. And again, this requires wrappering them for use by Rust code. However, such wrappering introduces another level of function call, potentially for tiny functions. Although it is expected that LTO will successfully inline tiny functions, not all of the world is yet ready for LTO. In the meantime, where feasible, developers should avoid invoking tiny C functions from Rust-language fastpaths.

This being the real world, we should expect that the Rust/C determination will need to be made on a case-by-case basis, with many devils in the details.

History

October 12, 2021: Self-review changes.
October 13, 2021: Add explicit justification for wrappering the Linux kernel's C-language locks and add a few observations about MMIO accesses

October 01, 2021 07:02 PM

Paul E. Mc Kenney: Rust Concurrency Philosophy: A Historical Perspective

At first glance, Rust's concurrency philosophy resembles that of Sequent's DYNIX and DYNIX/ptx in the 1980s and early 1990s: "Lock data, not code" (see Jack Inman's classic USENIX'85 paper "Implementing Loosely Coupled Functions on Tightly Coupled Engines", sadly invisible to search engines). Of course, Sequent lacked Rust's automatic checking, and Sequent's software engineers made much less disciplined use of ownership than Rust fans recommend. Nevertheless, this resemblance has prompted comparisons of Rust's concurrency philosophy with that of DYNIX and DYNIX/ptx.

Interestingly enough, DYNIX and early versions of DYNIX/ptx used compile-time-allocated arrays for almost all of its data structures. You want your kernel to support up to N tasks? Very well, build your kernel to have its array of N task structures. This worked surprisingly well, perhaps because the important concurrent applications of that time had very predictable resource requirements, including numbers of tasks. Nevertheless, as you might expect, this did become quite the configuration nightmare. So why were arrays used in the first place?

To the best of my knowledge, the earliest published complete articulation of the reason appeared in Gamsa et al.'s landmark paper  "Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System". The key point is that you cannot protect a dynamically allocated object with a lock located within that object. The DYNIX arrays avoided deallocation (or, alternatively, provided a straightforward implementation of type-safe memory), thus allowing these objects to be protected with internal locks. Avoiding the need for global locks or reference counters was an important key to the performance and scalability prized by Sequent's customers.
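
A minimal sketch of the problem (the type and its user are invented for illustration): nothing prevents another CPU from freeing the object between the time a reader obtains its pointer and the time it acquires the lock embedded within that object.

struct obj {
	spinlock_t lock;	/* Protects ->data, but lives inside the object it protects. */
	int data;
};

void buggy_update(struct obj *p)
{
	spin_lock(&p->lock);	/* The object containing this lock may already be freed... */
	p->data++;		/* ...making this a use-after-free. */
	spin_unlock(&p->lock);
}

Compile-time-allocated arrays, type-safe memory, and (later) RCU all sidestep this problem because the memory containing the lock either is never freed or cannot be freed while a reader might still be acquiring it.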

This strategy worked less well when Sequent added a distributed lock manager because the required number of locks was not predictable, nor was there a useful upper bound. This problem was solved in part by the addition of RCU, which provided a high-performance and scalable means of resolving races between acquiring a given object's lock and deletion (and subsequent freeing) of that same object. Given that DEC Alpha famously had difficulty with RCU, it is only reasonable to ask how Rust will do with it. Or must the concurrency designs of those portions of the Linux kernel that are to be written in Rust be "fitted" to a Rust-language ProcRustean bed [1]? Those who prize Rust's fearless-concurrency goal above all else might reasonably argue that this ProcRustean bed is in fact a most excellent thing. However, some Linux-kernel maintainers (including this one) might in their turn reasonably argue that within the context of some portions of the Linux kernel, a proper level of fear is a very healthy thing. As is the ability to do one's job in a reasonably straightforward manner!

One way to avoid this ProcRustean bed is to use the Rust unsafe facility, and in fact "unsafe" has been the answer to a disturbingly large number of my questions about Rust [2]. However, use of this facility introduces the possibility of data races, which in turn raises the question of Rust's memory model. Within the Linux kernel, the answer to this question is of course LKMM, or perhaps some reasonable subset of LKMM.

However, in my personal experience, I have most frequently seen Rust being used to rewrite scripts that became performance problems upon being more widely deployed than expected. In some cases, these rewrites greatly improved user experience as well as performance. This means that Rust is heavily used outside of the Linux kernel, which in turn means that LKMM might not be the right answer for Rust in general, though some in the Rust community have come out strongly in favor of extending the ProcRustean bed. But this blog series is focused on Rust for the Linux kernel, so the question of the memory model for Rust in general is out of scope. Again, for Rust in the Linux kernel, some subset of LKMM is clearly the correct memory model.

Many of the following posts in this series cover ways that Rust might work with specific aspects of LKMM, including some wild speculation about how Rust's ownership model might be generalized in a manner similar to the way that the Linux kernel's lockdep checking has been generalized for cross-released locks and for RCU. Readers wishing to learn more about non-ProcRustean concurrency designs are invited to peruse "Is Parallel Programming Hard, And, If So, What Can You Do About It?", hereinafter called "perfbook". Specific chapters and sections of the Second Edition of this book will be cited as appropriate by later posts in this series.

Endnotes

[1]  Making this 1990s-style concurrency scale usually involves hashed arrays of locks. These are often deadlock-prone, but there are heavily used techniques that (mostly) avoid the deadlocks. See Section 7.1.1.6 ("Acquire Needed Locks First") in perfbook for one such technique. However, hashed arrays of locks are prone to scalability problems, especially on multi-socket systems, due to poor locality of reference. See for example Section 10.2.3 ("Hash-Table Performance") for performance results on hash tables, also in perfbook. As a result, most attempts to apply hashed arrays of locks to the Linux kernel resulted in RCU being used instead. The performance and scalability benefits of RCU (and hazard pointers) are shown in Section 10.3 ("Read-Mostly Data Structures"), again in perfbook.
 
[2]  But please note that Rust's unsafe code has only limited undefined-behavior unsafe superpowers:

  1. Dereferencing a raw pointer (and it is the programmer's responsibility to avoid destructive wild-pointer dereferences).
  2. Calling an unsafe function or method.
  3. Accessing or modifying a mutable static variable (and it is the programmer's responsibility to avoid destructive data races).
  4. Implementing an unsafe trait.
  5. Accessing fields of a union (and it is the programmer's responsibility to avoid accesses that invoke undefined behavior, or, alternatively, understand how the compiler at hand reacts to any undefined behavior that might be invoked).

I do find the Rust community's sharp focus on undefined-behavior-induced bugs over other types of bugs to be rather surprising. Perhaps this is because I have not had so much trouble with undefined-behavior-induced bugs. Conspiracy theorists might imagine an unholy alliance between UB-happy developers of compiler optimizations on the one hand and old-school parallel programmers wishing to turn the clock back to the 1990s on the other hand. I am happy to let such theorists imagine such things. ;-)

Besides, more recent discussions have focused on memory safety rather than the full gamut of undefined behavior.


History

October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add note on memory safety specifically rather than undefined behavior in general.

October 01, 2021 06:08 PM

Paul E. Mc Kenney: So You Want to Rust the Linux Kernel?

There has been much discussion of using the Rust language in the Linux kernel (for example, here, here, and here), including at the Kangrejos Rust for Linux Workshop (here, here, and here). The 2021 Linux Plumbers Conference had a number of sessions on this topic, as did the Maintainers Summit. At least two of these sessions mentioned the question of how Rust is to handle the Linux-kernel memory model (LKMM), and I volunteered to write this blog series on this topic.

This series focuses mostly on use cases and opportunities, rather than on any non-trivial solutions. Please note that I am not in any way attempting to dictate or limit Rust's level of ambition. I am instead noting the memory-model consequences of a few potential levels of ambition, ranging from "portions of a few drivers" through "a few drivers" and "some core code", up to and including "the entire kernel". Greater levels of ambition will require greater willingness to accommodate a wider variety of LKMM requirements.

One could instead argue that portions or even all of the Linux kernel should be hammered into the Rust ownership model. On the other hand, might the rumored sudden merge of the ksmbd driver (https://lwn.net/Articles/871098/) have been due to the implicit threat of its being rewritten in Rust? [1] Nevertheless, in cases where Rust is shown to offer particularly desirable advantages, it is quite possible that Rust and some parts of the Linux kernel might meet somewhere in the middle.

These blog posts will therefore present approaches ranging upwards from trivial workarounds. But be warned that some of the high-quality approaches require profound reworking of compiler backends that have thus far failed to spark joy in the hearts of compiler writers. In addition, Rust enjoys considerable use outside of the Linux kernel, for but one example that I have personally observed, as something into which to rewrite inefficient Python scripts. (A megawatt here, a megawatt there, and pretty soon you are talking about real power consumption!) Therefore, there might well be sharp limits beyond which the core Rust developers are unwilling to go.

The remaining posts in this series (along with their modification dates) are as follows:


  1. Rust Concurrency Philosophy: A Historical Perspective (October 13, 2021)
  2. Atomics and Barriers and Locks, Oh My! (October 13, 2021)
  3. Compiler Writers Hate Dependencies (Control) (October 12, 2021)
  4. Compiler Writers Hate Dependencies (Address/Data) (October 12, 2021)
  5. Compiler Writers Hate Dependencies (OOTA) (November 12, 2021)
  6. Can Rust Code Own Sequence Locks? (October 13, 2021)
  7. Can Rust Code Own RCU? (October 18, 2021)
  8. How Much of the Kernel Can Rust Own? (October 12, 2021)
  9. Will Your Rust Code Survive the Attack of the Zombie Pointers? (October 12, 2021)
  10. Can the Kernel Concurrency Sanitizer Own Rust Code? (October 28, 2021)
  11. Summary and Conclusions (October 13, 2021)
  12. TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel (October 21, 2021)
  13. Bonus Post: What Memory Model Should the Rust Language Use? (November 4, 2021)

Please note that this blog series is not a Rust tutorial. Those wanting to learn how to actually program in Rust might start here, here, here, or of course here.

Endnotes


[1]  Some in the Linux-kernel community might be happy with either outcome: (1) The threat of conversion to Rust caused people to push more code into mainline and (2) Out-of-tree code was converted to Rust by Rust advocates and then pushed into mainline. The latter case might need special care for longer-term maintenance of the resulting Rust code, but perhaps the original authors might be persuaded to declare victory, learn Rust, and maintain the code. Who knows? ;-)


History

October 8, 2021: Fix s/LInux/Linux/ typo noted by Miguel Ojeda
October 12, 2021: Self-review, including making it clear that Rust might have use cases other than rewriting inefficient scripts.
October 13, 2021: Add a link to the recommendations post.
October 22, 2021: This blog series is not a Rust tutorial.
November 3, 2021: Add post on memory model for Rust in general.

The October 12 update affected the whole series, for example, removing the "under construction" markings. Summary of significant updates to other posts:

October 01, 2021 12:39 AM

September 30, 2021

James Bottomley: Linux Plumbers Conference Matrix and BBB integration

The recently completed Linux Plumbers Conference (LPC) 2021 used the Big Blue Button (BBB) project again as its audio/video online conferencing platform and Matrix for IM and chat. Why we chose BBB has been discussed previously. However this year we replaced RocketChat with Matrix to achieve federation, allowing non-registered conference attendees to join the chat. Also, based on feedback from our attendees, we endeavored to replace the BBB chat window with a Matrix one so anyone could see and participate in one contemporaneous chat stream within BBB and beyond. This enabled chat to be available before, during and after each session.

One thing that emerged from our initial disaster with Matrix on the first day is that we failed to learn from the experiences of other open source conferences (i.e. FOSDEM, which used Matrix and ran into the same problems). So, an object of this post is to document for posterity what we did and how to repeat it.

Integrating Matrix Chat into BBB

Most of this integration was done by Guy Lunardi.

It turns out that Chat is fairly deeply embedded into BBB, so replacing the existing chat module is hard. Fortunately, BBB also contains an embedded etherpad which is simply produced via an iFrame redirection. So what we did is to disable the BBB chat panel and replace it with a new iFrame based component that opened an embedded Matrix chat client. The client we chose was riot-embedded, which is a relatively recent project but seemed to work reasonably well. The final problem was to pass through user credentials. Up until three days before the conference, we had been happy with the embedded Matrix client simply creating a one-time numbered guest account every time it was opened, but we worried about this being a security risk and so implemented pass through login credentials at the last minute (life’s no fun unless you live dangerously).

Our custom front end for BBB (lpcfe) was created last year by Jon Corbet. It uses a fairly simple email/registration confirmation code for username/password via LDAP. The lpcfe front end Jon created is here git://git.lwn.net/lpcfe.git; it manages the whole of the conference log in process and presents the current and future sessions (with join buttons) according to the timezone of the browser viewing it.

The credentials are passed through directly using extra parameters to BBB (see commit fc3976e “Pass email and regcode through to BBB”). We eventually passed these through using a GET request. Obviously if we were using a secret password, this would be a problem, but since the password was a registration code handed out by a third party, it’s acceptable. I imagine that if anyone wishes to take this work forward, adding native Matrix device/session support in riot-embedded would be better.

The main change to get this working in riot-embedded is here, and the supporting patch to BBB is here.

Note that the Matrix room ID used by the client was added as an extra parameter to the flat text file that drives the conference track layout of lpcfe. All Matrix rooms were created as public (and published) so anyone going to our :lpc.events matrix domain could see and join them.

Setting up Matrix for the Conference

We used the matrix-synapse server and did a standard python venv pip install on Ubuntu of the latest tag. We created around 30+ public rooms: one for each Microconference and track of the conference and some admin and hallway rooms. We used LDAP to feed the authentication portion of lpcfe/Matrix, but we had a problem using email addresses since the standard matrix user name cannot have an ‘@’ symbol in it. Eventually we opted to transform everyone’s email to a matrix compatible form simply by replacing the ‘@’ with a ‘.’, which is why everyone in our conference appeared with ridiculously long matrix user names like @jejb.ibm.com:lpc.events

This ‘@’ to ‘.’ transformation was a huge source of problems due to the unwillingness of engineers to read instructions, so if we do this over again, we’ll do the transformation silently in the login javascript of our Matrix web client. (we did this in riot-embedded but ran out of time to do it in Element web as well).

Because we used LDAP, the actual matrix account for each user was created the first time they logged into our server, so we chose at this point to use auto-join to add everyone to the 30+ LPC Matrix rooms we’d already created. This turned out to be a huge problem.

Testing our Matrix and BBB integration

We tried to organize a “Town Hall” event where we invited lots of people to test out the infrastructure we’d be using for the conference. Because we wanted this to be open, we couldn’t use the pre-registration/LDAP authentication infrastructure so Jon quickly implemented a guest mode (and we didn’t auto join anyone to any Matrix rooms other than the townhall chat).

In the end we got about 220 users to test during which time the Matrix and BBB infrastructure behaved quite well. Based on this test, we chose a 2 vCPU Linode VM for our Matrix server.

What happened on the Day

Come the Monday of the conference, the first problem we ran into was procrastination: the conference registered about 1,000 attendees, of whom about 500 tried to log on about 5 minutes prior to the first session. Since accounts were created and rooms joined upon the first login, this is clearly a huge thundering herd problem of our own making … oops. The Matrix server itself shot up to 100% CPU on the python synapse process and simply stayed there, adding new users at a rate of about one every 30 seconds. All the chat tabs froze because logins were taking ages as well. The first thing we did was to scale the server up to a 16 CPU bare metal system, but that didn’t help because synapse is single threaded … all we got was the matrix synapse python process running at 100% on one of the CPUs, still taking 30 seconds per first log in.

Fixing the First Day problems

The first thing we realized was that we had to multi-thread the synapse server. This is well known, but the issue is also quite well hidden deep in the Matrix documents. It also happens that the Matrix documents are slightly incomplete. The first scaling attempt we tried, simply adding 16 generic worker apps to scale across all our physical CPUs, failed because the Matrix server stopped federating and then the database crashed with “FATAL: remaining connection slots are reserved for non-replication superuser connections”.

Fixing the connection problem (alter system set max_connections = 1000;) triggered a shared memory too small issue which was eventually fixed by bumping the shared buffer segment to 8GB (alter system set shared_buffers=1024000;). I suspect these parameters were way too large, but the Linode we were on had 32GB of main memory, so fine tuning in this emergency didn’t seem a good use of time.

Fixing the worker problem was way more complex. The way Matrix works, you have to use a haproxy to redirect incoming connections to individual workers and you have to ensure that the same worker always services the same transaction (which you achieve by hashing on IP address). We got a lot of advice from FOSDEM on this aspect, but in the end, instead of using an external haproxy, we went for the built in forward proxy load balancing in nginx. The federation problem seems to be that Matrix simply doesn’t work without a federation sender. In the end, we created 15 generic workers and one each of media server, frontend server and federation sender.

Our configuration files are

once you have all the units enabled in systemd, you can then simply do systemctl start/stop matrix-synapse.target

Finally, to fix the thundering herd problem (for people who hadn’t already logged in), we ran through the entire spreadsheet of email/confirmation numbers doing an automatic login using the user management API on the server itself. At this point we had about half the accounts auto created, so this script created the rest.

emaillist=lpc2021-all-attendees.txt
IFS='   '
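
# For each attendee in the tab-separated registration list, derive the BBB
# login from the email address, transform it into the corresponding Matrix
# login ('@' replaced by '.'), and log that account in once via the client
# API so that account creation and room auto-join happen ahead of time.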

while read first last confirmation email; do
    bbblogin=${email/+*@/@}
    matrixlogin=${bbblogin/@/.}
    curl -XPOST -d '{"type":"m.login.password", "user":"'${matrixlogin}'", "password":"'${confirmation}'"}' "http://localhost:8008/_matrix/client/r0/login"
    sleep 1
done < ${emaillist}

The lpc2021-all-attendees.txt is a tab separated text file used to drive the mass mailings to plumbers attendees, but we adapted it to log everyone in to the matrix server.

Conclusion

With the above modifications, the matrix server on a Dedicated 32GB (16 cores) Linode ran smoothly for the rest of the conference. The peak load got to 17 and the peak total CPU usage never got above 70%. Finally, the peak memory usage was around 16GB including cache (so the server was a bit over provisioned).

In the end, 878 of the 944 registered attendees logged into our BBB servers at one time or another and we got a further 100 external matrix users (who may or may not also have had a conference account).

September 30, 2021 04:24 PM

September 26, 2021

Brendan Gregg: The Speed of Time

How long does it take to read the time? How would you _time_ time? These strange questions came to the fore back in 2014 when Netflix was switching services from CentOS Linux to Ubuntu, and I helped debug several weird performance issues including one I'll describe here. While you're unlikely to run into this specific issue anymore, what is interesting is this type of issue and the simple method of debugging it: a pragmatic mix of observability and experimentation tools. I've shared many posts about superpower observability tools, but often humble hacking is just as effective.

A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. A quick check of basic performance statistics showed over 30% higher CPU consumption. What on Earth is Ubuntu doing that results in 30% higher CPU time!?

## 1. CLI tools

The Cassandra systems were EC2 virtual machine (Xen) instances. I logged into one and went through some basic CLI tools to get started (my [60s checklist]). Was there some other program consuming CPU, like a misbehaving Ubuntu service that wasn't in CentOS? top(1) showed that only the Cassandra database was consuming CPU. What about short-lived processes, like a service restarting in a loop? These can be invisible to top(1). My execsnoop(8) tool (back then my [Ftrace version]) showed nothing. It seemed that the extra CPU time really was in Cassandra, but how?

## 2. CPU profile

Understanding CPU time should be easy by comparing [CPU flame graphs]. Since instances of both CentOS and Ubuntu were running in parallel, I could collect flame graphs at the same time (same time-of-day traffic mix) and compare them side by side. The CentOS flame graph:

The Ubuntu flame graph:
Darn, they didn't work. There's no Java stack—there should be a tower of green Java methods—instead there's only a single green frame or two. This is how Java flame graphs looked at the time. Later that year I prototyped the c2 frame pointer fix that became -XX:+PreserveFramePointer, which fixes Java stacks in these profiles.

Even with the broken Java stacks, I noticed a big difference: On Ubuntu, there's a massive amount of CPU time in a libjvm call: os::javaTimeMillis(). 30.14% in the middle of the flame graph. Searching shows it was elsewhere as well for a total of 32.1%. This server is spending about a third of its CPU cycles just checking the time! This was a weird problem to think about: Time itself had now become a resource and target of performance analysis.

The broken Java stacks turned out to be beneficial: They helped group together the os::javaTimeMillis() calls which otherwise might have been scattered on top of different Java code paths, appearing as thin stacks everywhere. If that were the case, I'd have zoomed in to see what they were, then zoomed out and searched for them for the cumulative percent; or flipped the merge order so it's an icicle graph, merging leaf to root. But in this case I didn't have to do anything as it was mostly merged by accident. (This is one of the motivating reasons for switching to a d3 version of flame graphs, as I want the interactivity of d3 to do things like collapse all the Java frames, all the user-mode frames, etc., to expose different groupings like this.)

## 3. Studying the flame graph

os::javaTimeMillis fetches the current time. Browsing the flame graph shows it is calling the gettimeofday(2) syscall which enters the tracesys() and syscall_trace_enter/exit() kernel functions. This gave me two theories: A) Some syscall tracing is enabled in Ubuntu (auditing? apparmor?). B) Fetching time was somehow slower on Ubuntu, which could be a library change or a kernel/clocksource change.

Theory (A) is most likely based on the frame widths in the flame graph. But I'm not completely sure. As a Xen guest, this profile was gathered using perf(1) and the kernel's software cpu-clock soft interrupts, not the hardware NMI. Without NMI, some kernel code paths (interrupts disabled) can't be profiled. Also, since it's a Xen guest, hypervisor time can never be profiled. These two factors mean that there can be missing kernel and hypervisor time in the flame graph, so the true breakdown of time in os::javaTimeMillis may be a little different.

Note that Ubuntu also has a frame to show entry into vDSO (virtual dynamic shared object). This is a user-mode syscall accelerator, and gettimeofday(2) is a classic use case that is cited in the vdso(7) man page. At the time, the Xen pvclock source didn't support vDSO, so you can see the syscall code above the vdso frame. It's the same on CentOS, although it doesn't include a vdso frame in the flame graph (I'd guess due to a perf(1) difference alone).

## 4. Colleagues/Internet

I love using [Linux performance tools]. But I also love solving issues quickly, and sometimes that means just asking colleagues or searching the Internet. I'm including this step as a reminder for anyone following this kind of analysis. Others I asked hadn't hit this issue, and the Internet at the time had nothing using the search terms os::javaTimeMillis, clocksource, tracesys(), Ubuntu, EC2, Xen, etc. (That changed by the end of the year.)

## 5. Experimentation

To further analyze this with observability tools, I could:
  1. Fix the Java stacks to see if there's a difference in how time is used on Ubuntu. Maybe Java is calling it more often for some reason.
  2. Trace the gettimeofday() and related syscall paths, to see if there's some difference there: E.g., errors.

But as I summarized in my [What is Observability] post, the term observability can be a reminder not to get stuck on that one type of analysis. Here are some experimental approaches I could also explore:
  C. Disable tracesys/syscall_trace.
  D. Microbenchmark os::javaTimeMillis() on both systems.
  E. Try changing the kernel clocksource.

As (C) looked like a kernel rebuild, I started with (D) and (E).

## 6. Measuring the speed of time

Is there already a microbenchmark for os::javaTimeMillis()? This would help confirm that these calls really were slower on Ubuntu. I couldn't find such a microbenchmark so I wrote something simple. I'm not going to try to make before and after time calls to time the duration in time (which I guess would work if you factored in the extra time in the timing calls). Instead, I'm just going to call time millions of times in a loop and time how long it takes (sorry, that's too many different usages of the word "time" in one paragraph):
$ cat TimeBench.java
public class TimeBench {
    public static void main(String[] args) {
        for (int i = 0; i < 100 * 1000 * 1000; i++) {
            long t0 = System.currentTimeMillis();
            if (t0 == 87362) {
                System.out.println("Bingo");
            }
        }
    }
}
This does 100 million calls of currentTimeMillis(). I then executed it via the shell time(1) command to give an overall runtime for those 100 million calls. (There's also a test and println() in the loop to, hopefully, convince the compiler not to optimize-out an otherwise empty loop. This will slow this test a little.) Trying it out:
centos$ time java TimeBench
real	0m12.989s
user	0m3.368s
sys	0m18.561s

ubuntu# time java TimeBench
real	1m8.300s
user	0m38.337s
sys	0m29.875s
How long is each time call? Assuming the loop is dominated by the time call, it works out to be about 0.13 us on CentOS and 0.68 us on Ubuntu. Ubuntu is 5x slower. As I'm interested in the relative comparison I can just compare the total runtimes (the "real" time) for the same result. I also rewrote this in C and called gettimeofday(2) directly:
$ cat gettimeofdaybench.c
#include <sys/time.h>

int
main(int argc, char *argv[])
{
	int i, ret;
	struct timeval tv;
	struct timezone tz;

	for (i = 0; i < 100 * 1000 * 1000; i++) {
		ret = gettimeofday(&tv, &tz);
	}

	return (0);
}
I compiled this with -O0 to avoid dropping the loop. Running this on the two systems showed similar results. I love short benchmarks like this as I can disassemble the resulting binary and ensure that the compiled instructions match my expectations, and the compiler hasn't messed with it.

## 7. clocksource Experimentation

My second experiment was to change the clocksource. Checking those available:
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
xen tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
xen
Ok, so it's defaulted to xen, which we saw in the flame graph (the tower ends with pvclock_clocksource_read()). Let's try tsc, which should be the fastest:
# echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$ time java TimeBench
real    0m3.370s
user    0m3.353s
sys     0m0.026s
The change is immediate, and my Java microbenchmark is now running over 20x faster than before! (And nearly 4x faster than on CentOS.) Now that it's reaching 33 ns, the loop instructions are likely inflating this result. If I wanted more accuracy, I'd partially [unroll the loop] so that the loop instructions become negligible.

## 8. Workaround

The time stamp counter (TSC) clocksource is fast as it retrieves time using just an RDTSC instruction, and with vDSO it can do this without the syscall. TSC traditionally was not the default because of concerns about time drift. Software-based clocksources could fix those issues and provide accurate monotonically-increasing time. I happened to be speaking at a technical conference while still debugging this, and mentioned what I was working on to a processor engineer. He said that tsc had been stable for years, and any advice about avoiding it was old. I asked if he knew of a public reference saying so, but he didn't. That chance encounter, coupled with Netflix's fault-tolerant cloud, gave me enough confidence to suggest trying tsc in production as a workaround for the issue. The change was obvious in the production graphs, showing a drop in write latencies:
Once tested more broadly, it showed the write latencies dropped by 43%, delivering slightly better performance than on CentOS. The CPU flame graph for Ubuntu now looked like:
os::javaTimeMillis() was now 1.6% in total. Note that it now enters "[[vdso]]" and nothing more: No kernel calls above it.

## 9. Aftermath

I provided details to AWS and Canonical, and then moved on to the other performance issues as part of the migration. A colleague, Mike Huang, also hit this for a different service at Netflix and enabled tsc. We ended up setting it in the BaseAMI for all cloud services. Later that year (2014), Anthony Liguori from AWS gave a [re:Invent talk] recommending users switch the clocksource to tsc to improve performance. I also shared setting the clocksource in my talks and in my 2015 [Linux tunables] post. Over the years, more and more articles have been published about clocksource in virtual machines, and it's now a well-known issue. Amazon even provides an official [recommendation] (2021):
"For EC2 instances launched on the AWS Xen Hypervisor, it's a best practice to use the tsc clock source. Other EC2 instance types, such as C5 or M5, use the AWS Nitro Hypervisor. The recommended clock source for the AWS Nitro Hypervisor is kvm-clock."
As this indicates, things have changed with [Nitro] where clocksources are much faster (thanks!). In 2019 others and I tested kvm-clock and found it was only about 20% slower than tsc. That's much better than the xen clocksource, but still slow enough to resist switching over absent a reason (such as a reemergence of tsc clock drift issues). I'm not sure if Intel ever published something to clarify tsc stability on newer processors. If you know they did, please drop a comment. The JMH benchmark suite can also now test System.currentTimeMillis(), so it's no longer necessary to roll your own (unless you want to disassemble it, in which case it's easier to have something short and simple).

As for tracesys: I investigated the overhead for other syscalls and found it to be negligible, and before I returned to work on it further the kernel code paths changed and it was no longer present in the stacks. Did that Ubuntu release have a misconfiguration of auditing that was later fixed? I like to get to the rock bottom of issues, so it was a bit unsatisfying that the problem went away before I did. Even if I did figure it out, we'd still have preferred to go with tsc instead of the xen clocksource for the 4x improvement.

## 10. Summary

Reading time itself can become a bottleneck for some clocksources. This was much worse many years ago on Xen virtual machine guests. For Linux I've been recommending the faster tsc clocksource for years, although I'm not a processor vendor so I can't make assurances about tsc issues of clock drift. At least AWS have now included it in their recommendations. Also, while I often post about superpower tracing tools, sometimes some humble hacking is best. In this case it was a couple of ad hoc microbenchmarks, only several lines of code each. Any time you're investigating performance of some small discrete system component, consider finding or rolling your own microbenchmark to get more information on it experimentally. You have two hands: observation and experimentation.

[both hands]: /blog/2021-05-23/what-is-observability.html
[What is Observability]: /blog/2021-05-23/what-is-observability.html
[Ftrace version]: https://github.com/brendangregg/perf-tools
[CPU flame graphs]: /FlameGraphs/cpuflamegraphs.html
[Linux performance tools]: /linuxperf.html
[Linux tunables]: /blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
[60s checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf
[re:Invent talk]: https://youtu.be/ujGx0tiI1L4?t=2160
[recommendation]: https://aws.amazon.com/premiumsupport/knowledge-center/manage-ec2-linux-clock-source/
[Nitro]: /blog/2017-11-29/aws-ec2-virtualization-2017.html
[unroll the loop]: /blog/2014-04-26/the-noploop-cpu-benchmark.html
[unused]: https://github.com/torvalds/linux/commit/96b6352c12711d5c0bb7157f49c92580248e8146#diff-2a6a11447d66a7c2bfeb5a9a8ee78c3800bb4101fa2a870dd79221dddcec0263
[unused2]: https://linux-mips.linux-mips.narkive.com/3hxhAJiq/patch-v3-0-8-two-phase-seccomp-and-x86-tracing-changes

September 26, 2021 12:00 AM

September 14, 2021

Pete Zaitcev: Scalability of a varying degree

Seen at official site of Qumulo:

Scale

Platforms must be able to serve petabytes of data, billions of files, millions of operations, and thousands of users.

Thousands of users...? Isn't it a little too low? Typical Swift clusters in Telcos have tens of millions of users, of which tens or hundreds of thousands are active simultaneously.

Google's Chubby paper has a little section on the scalability problem with talking to a cluster over TCP/IP. Basically at low tens of thousands you're starting to have serious issues with kernel sockets and TIME_WAIT. So maybe that.

September 14, 2021 08:27 PM

Paul E. Mc Kenney: Stupid RCU Tricks: Making Race Conditions More Probable

Given that it is much more comfortable chasing down race conditions reported by rcutorture than those reported from the field, it would be good to make race conditions more probable during rcutorture runs than in production. A number of tricks are used to make this happen, including making rare events (such as CPU-hotplug operations) happen more frequently, testing the in-kernel RCU API directly from within the kernel, and so on.

Another approach is to change timing. Back at Sequent in the 1990s, one way that this was accomplished was by plugging different-speed CPUs into the same system and then testing on that system. It was observed that for certain types of race conditions, the probability of the race occurring increased by the ratio of the CPU speeds. One such race condition is when a timed event on the slow CPU races with a workload-driven event on the fast CPU. If the fast CPU is (say) two times faster than the slow CPU, then the timed event will provide two times greater “collision cross section” than if the same workload was running on CPUs running at the same speed.

Given that modern CPUs can easily adjust their core clock rates at runtime, it is tempting to try this same trick on present-day systems. Unfortunately, everything and its dog is adjusting CPU clock rates for various purposes, plus a number of modern CPUs are quite happy to let you set their core clock rates to a value sufficient to result in physical damage. Throwing rcutorture into this fray might be entertaining, but it is unlikely to be all that productive.

Another approach is to make use of memory latency. The idea is for the rcutorture scripting to place one pair of a given scenario's vCPUs in the hyperthreads of a single core and to place another pair of that same scenario's vCPUs in the hyperthreads of a different single core, and preferably a core on some other socket. The theory is that the different communications latencies and bandwidths within a core on the one hand and between cores (or, better yet, between sockets) on the other should have roughly the same effect as does varying CPU core clock rates.

OK, theory is all well and good, but what happens in practice?

As it turns out, on dual-socket systems, quite a bit.

With this small change to the rcutorture scripting, RCU Tasks Trace suddenly started triggering assertions. These test failures led to no fewer than 12 fixes, perhaps most notably surrounding proper handling of the count of tasks from which quiescent states are needed. This caused me to undertake a full review of RCU Tasks Trace, greatly assisted by Boqun Feng, Frederic Weisbecker, and Neeraj Upadhyay, with Neeraj providing half of the fixes. There is likely to be another fix or three, but then again isn't that always the case?

More puzzling were the 2,199.0-second RCU CPU stall warnings (described in more detail here). These were puzzling for a number of reasons:


  1. The RCU CPU stall warning timeout is set to only 21 seconds.
  2. There was absolutely no console output during the full stall duration.
  3. The stall duration was never 2,199.1 seconds and never 2,198.9 seconds, but always exactly 2,199.0 seconds, give or take a (very) few tens of milliseconds. (Kudos to Willy Tarreau for pointing out offlist that 2,199.02 seconds is almost exactly 2 to the 41st power worth of nanoseconds. Coincidence? You decide!)
  4. The stalled CPU usually took only a handful of scheduling-clock interrupts during the stall, but would sometimes take them at a rate of 100,000 per second, which seemed just a bit excessive for a kernel built with HZ=1000.
  5. At the end of the stall, the kernel happily continued, usually with no other complaints.
These stall warnings appeared most frequently when running rcutorture's TREE04 scenario.
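As a quick check of Willy's observation in item 3 above (my arithmetic, not part of the original report):

2^{41}\,\mathrm{ns} = 2{,}199{,}023{,}255{,}552\,\mathrm{ns} \approx 2{,}199.02\,\mathrm{s}

which matches the observed stall duration to within the reported tens of milliseconds.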

But perhaps this is not a stall, but instead a case of time jumping forward. This might explain the precision of the stall duration, and would definitely explain the lack of intervening console output, the lack of other complaints, and the kernel's being happy to continue at the end of the stall. Not so much the occasional extreme rate of scheduling-clock interrupts, but perhaps that is a separate problem.

However, running large numbers (as in 200) of concurrent shorter one-hour TREE04 runs often resulted in the run terminating (forcibly) in the middle of the stall. Now this might be due to the host's and the guests' clocks all jumping forward at the same time, except that different guests stalled at different times, and even when running TREE04, most guests didn't stall at all. Therefore, the stalls really did stall, and for a very long time.

But then it should be possible to work out what the CPUs were doing in the meantime. One approach would be to use tracing, but previous experience with massive volumes of trace messages (and thus lost trace messages) suggested a more surgical approach. Furthermore, the last console message before the stall was always of the form “kvm-clock: cpu 3, msr d4a80c1, secondary cpu clock” and the first console message after the stall was always of the form “kvm-guest: stealtime: cpu 3, msr 1f597140”. These are widely separated and are often printed from different CPUs, which also suggests a more surgical approach. This situation also implicates CPU hotplug, but this is not at all unusual.

The first attempt at exploratory surgery used the jiffies counter to check for segments of code taking more than 100 seconds to complete. Unfortunately, these checks never triggered, even in runs having stall warnings. So maybe the jiffies counter is not being updated. It is easy enough to switch to ktime_get_mono_fast_ns(), right? Except that this did not trigger, either.
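A minimal sketch of what such a check might look like (hypothetical code for illustration, not the actual diagnostic):

/*
 * Hypothetical 100-second check (illustration only): time a code segment
 * using both the jiffies counter and ktime_get_mono_fast_ns(), and
 * complain if either says that more than 100 seconds elapsed.
 */
#include <linux/jiffies.h>
#include <linux/ktime.h>
#include <linux/printk.h>

static void check_segment(void (*segment)(void))
{
        unsigned long j_start = jiffies;
        u64 ns_start = ktime_get_mono_fast_ns();

        segment();

        if (time_after(jiffies, j_start + 100 * HZ))
                pr_alert("segment exceeded 100 s according to jiffies\n");
        if (ktime_get_mono_fast_ns() - ns_start > 100ULL * NSEC_PER_SEC)
                pr_alert("segment exceeded 100 s according to ktime\n");
}

As described above, neither form of check fired in the failing runs.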

Maybe there is a long-running interrupt handler? Mark Rutland recently posted a patchset to detect exactly that, so I applied it. But it did not trigger.

I switched to ktime_get() in order to do cross-CPU time comparisons, and out of sheer paranoia added checks for time going backwards. And these backwards-time checks really did trigger just before the stall warnings appeared, once again demonstrating the concurrent-programming value of a healthy level of paranoia, and also explaining why many of my earlier checks were not triggering. Time moved forward, and then jumped backwards, making it appear that no time had passed. (Time did jump forward again, but that happened after the last of my debug code had executed.)
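A minimal sketch of the kind of backwards-time check that did fire (again hypothetical code, simplified to a single global checkpoint rather than the cross-CPU comparisons described above):

/*
 * Hypothetical backwards-time check (illustration only): sample ktime_get()
 * at successive checkpoints and complain if the clock appears to have
 * moved backwards.  A real cross-CPU version would need per-CPU state and
 * some care about ordering; that is omitted here for brevity.
 */
#include <linux/ktime.h>
#include <linux/printk.h>

static ktime_t last_checkpoint;

static void time_checkpoint(const char *where)
{
        ktime_t now = ktime_get();

        if (ktime_before(now, last_checkpoint))
                pr_alert("time went backwards at %s: %lld ns < %lld ns\n",
                         where, ktime_to_ns(now), ktime_to_ns(last_checkpoint));
        last_checkpoint = now;
}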

Adding yet more checks showed that the temporal issues were occurring within stop_machine_from_inactive_cpu(). This invocation takes the mtrr_rendezvous_handler() function as an argument, and it really does take 2,199.0 seconds (that is, about 36 minutes) from the time that stop_machine_from_inactive_cpu() is called until the time that mtrr_rendezvous_handler() is called. But only sometimes.

Further testing confirmed that increasing the frequency of CPU-hotplug operations increased the frequency of 2,199.0-second stall warnings.

An extended stint of code inspection suggested further diagnostics, which showed that one of the CPUs would be stuck in the multi_cpu_stop() state machine. The stuck CPU was never CPU 0 and was never the incoming CPU. Further tests showed that the scheduler always thought that all of the CPUs, including the stuck CPU, were in the TASK_RUNNING state. Even more instrumentation showed that the stuck CPU was failing to advance to state 2 (MULTI_STOP_DISABLE_IRQ), meaning that all of the other CPUs were spinning in a reasonably tight loop with interrupts disabled. This could of course explain the lack of console messages, at least from the non-stuck CPUs.

Might qemu and KVM be to blame? A quick check of the code revealed that vCPUs are preserved across CPU-hotplug events, that is, taking a CPU offline does not cause qemu to terminate the corresponding user-level thread. Furthermore, the distribution of stuck CPUs was uniform across the CPUs other than CPU 0. The next step was to find out where CPUs were getting stuck within the multi_cpu_stop() state machine. The answer was “at random places”. Further testing also showed that the identity of the CPU orchestrating the onlining of the incoming CPU had nothing to do with the problem.

Now TREE04 marks all but CPU 0 as nohz_full CPUs, meaning that they disable their scheduling-clock interrupts when running in userspace when only one task is runnable on that CPU. Maybe the CPUs need to manually enable their scheduling-clock interrupt when starting multi_cpu_stop()? This did not fix the problem, but it did manage to shorten some of the stalls, in a few cases to less than ten minutes.

The next trick was to send an IPI to the stalled CPU every 100 seconds during multi_cpu_stop() execution. To my surprise, this IPI was handled by the stuck CPU, although with surprisingly long delays ranging from just a bit less than one millisecond to more than eight milliseconds.
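One way such a periodic probe might look (a hypothetical sketch, not the actual test code):

/*
 * Hypothetical IPI probe (illustration only): fire an IPI at the
 * apparently stuck CPU and report how long the handler took to run after
 * the request was issued.
 */
#include <linux/smp.h>
#include <linux/ktime.h>
#include <linux/printk.h>

static void probe_ipi(void *info)
{
        u64 *sent_ns = info;

        pr_alert("probe IPI ran %llu us after being sent\n",
                 (unsigned long long)(ktime_get_mono_fast_ns() - *sent_ns) / 1000);
}

static void probe_stuck_cpu(int cpu)
{
        u64 sent_ns = ktime_get_mono_fast_ns();

        /* wait=1: return only after the target CPU has run probe_ipi() */
        smp_call_function_single(cpu, probe_ipi, &sent_ns, 1);
}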

This suggests that the stuck CPUs might be suffering from an interrupt storm, so that the IPI had to wait for its turn among a great many other interrupts. Further testing therefore sent an NMI backtrace at 100 seconds into multi_cpu_stop() execution. The resulting stack traces showed that the stuck CPU was always executing within sysvec_apic_timer_interrupt() or some function that it calls. Further checking showed that the stuck CPU was in fact suffering from an interrupt storm, namely an interrupt storm of scheduling-clock interrupts. This spurred another code-inspection session.

Subsequent testing showed that the interrupt duration was about 3.5 microseconds, which corresponded to about one third of the stuck CPU's time. It appears that the other two-thirds is consumed repeatedly entering and exiting the interrupt.
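Taking those numbers at face value, along with the roughly 100,000 interrupts per second from item 4 of the earlier list (my arithmetic, not part of the original analysis):

3.5\,\mu\mathrm{s} \times 100{,}000\,\mathrm{s}^{-1} = 0.35 \approx 1/3

so the handler bodies alone account for about a third of the stuck CPU's time, consistent with the remaining two-thirds being spent entering and exiting the interrupt.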

The retriggering of the scheduling-clock interrupt does have some potential error conditions, including setting times in the past and various overflow possibilities. Unfortunately, further diagnostics showed that none of this was happening. However, they also showed that the code was trying to schedule the next interrupt at time KTIME_MAX, so that an immediate relative-time-zero interrupt is a rather surprising result.

So maybe this confusion occurs only when multi_cpu_stop() preempts some timekeeping activity. Now TREE04 builds its kernels with CONFIG_PREEMPT=n, but maybe there is an unfortunately placed call to schedule() or some such. Except that further code inspection found no such possibility. Furthermore, another test run that dumped the previous task running on each CPU showed nothing suspicious (aside from rcutorture, which some might argue is always suspicious).

And further debugging showed that tick_program_event() thought that it was asking for the scheduling-clock interrupt to be turned off completely. This seemed like a good time to check with the experts, and Frederic Weisbecker, noting that all of the action was happening within multi_cpu_stop() and its called functions, ran the following command to enlist ftrace, while also limiting its output to something that the console might reasonably keep up with:

./kvm.sh --configs "18*TREE04" --allcpus --bootargs "ftrace=function_graph ftrace_graph_filter=multi_cpu_stop" --kconfig "CONFIG_FUNCTION_TRACER=y CONFIG_FUNCTION_GRAPH_TRACER=y"

This showed that there was no hrtimer pending (consistent with KTIME_MAX), and that the timer was nevertheless being set to fire immediately. Frederic then proposed the following small patch:

--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -595,7 +595,8 @@ void irq_enter_rcu(void)
 {
        __irq_enter_raw();
 
-       if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
+       if (tick_nohz_full_cpu(smp_processor_id()) ||
+           (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
                tick_irq_enter();
 
        account_hardirq_enter(current);

This forces the jiffies counter to be recomputed upon interrupt from nohz_full CPUs in addition to idle CPUs, which avoids the timekeeping confusion that caused KTIME_MAX to be interpreted as zero.

And a 20-hour run for each of 200 instances of TREE04 was free of RCU CPU stall warnings! (This represents 4,000 hours of testing consuming 32,000 CPU-hours.)

This was an example of that rare form of deadlock, a temporary deadlock. The stuck CPU was stuck because timekeeping wasn't happening. Timekeeping wasn't happening because all the timekeeping CPUs were spinning in multi_cpu_stop() with interrupts disabled. The other CPUs could not exit their spinloops (and thus could not update timekeeping information) because the stuck CPU did not advance through the multi_cpu_stop() state machine.

So what caused this situation to be temporary? I must confess that I have not dug into it (nor do I intend to), but my guess is that integer overflow resulted in KTIME_MAX once again taking on its proper role, thus ending the stuck CPU's interrupt storm and in turn allowing the multi_cpu_stop() state machine to advance.

Nevertheless, this completely explains the mystery. Assuming integer overflow, the extremely repeatable stall durations make perfect sense. The RCU CPU stall warning did not happen at the expected 21 seconds because all the CPUs were either spinning with interrupts disabled on the one hand or being interrupt stormed on the other. The interrupt-stormed CPU did not report the RCU CPU stall because the jiffies counter wasn't incrementing. A random CPU would report the stall, depending on which took the first scheduling-clock tick after time jumped backwards (again, presumably due to integer overflow) and back forwards. In the relatively rare case where this CPU was the stuck CPU, it reported an amazing number of scheduling clock ticks, otherwise very few. Since everything was stuck, it is only a little surprising that the kernel continued blithely on after the stall ended. TREE04 reproduced the problem best because it had the largest proportion of nohz_full CPUs.

All in all, this experience was a powerful (if sometimes a bit painful) demonstration of the ability of controlled memory latencies to flush out rare race conditions!

September 14, 2021 04:43 AM

September 06, 2021

Brendan Gregg: ZFS Is Mysteriously Eating My CPU

A microservice team asked me for help with a mysterious issue. They claimed that the ZFS file system was consuming 30% of CPU capacity. I summarized this case study at [Kernel Recipes] in 2017; it is an old story that's worth resharing here.

## 1. Problem Statement

The microservice was for metrics ingestion, and the team had recently updated their base OS image (BaseAMI). After doing so, they claimed that ZFS was now eating over 30% of CPU capacity. My first thought was that they were somehow mistaken: I worked on ZFS internals at Sun Microsystems, and unless it is badly misconfigured there's no way it can consume that much CPU. I have been surprised many times by unexpected performance issues, so I thought I should check their instances anyway. At the very least, I could show that I took it seriously enough to check it myself. I should also be able to identify the real CPU consumer.

## 2. Monitoring

I started with the cloud-wide monitoring tool, [Atlas], to check high-level CPU metrics. These included a breakdown of CPU time into percentages for "usr" (user: applications) and "sys" (system: the kernel). I was surprised to find a whopping 38% of CPU time was in sys, which is highly unusual for cloud workloads at Netflix. This supported the claim that ZFS was eating CPU, but how? Surely this is some other kernel activity, and not ZFS.

## 3. Next Steps

I'd usually SSH to instances for deeper analysis, where I could use mpstat(1) to confirm the usr/sys breakdown and perf(1) to begin profiling on-CPU kernel code paths. But since Netflix has tools (previously [Vector], now FlameCommander) that allow us to easily fetch flame graphs from our cloud deployment UI, I thought I'd cut to the chase. Just for illustration, this shows the Vector UI and a typical cloud flame graph:

Note that this sample flame graph is dominated by Java, shown by the green frames.

## 4. Flame Graph

Here's the CPU flame graph from one of the problem instances:

The kernel CPU time is pretty obvious, shown as two orange towers on the left and right. (The other colors are yellow for C++, and red for other user-level code.) Zooming in to the left kernel tower:

This is arc_reclaim_thread! I worked on this code back at Sun. So this really is ZFS; they were right!

The ZFS Adaptive Replacement Cache (ARC) is the main memory cache for the file system. The arc_reclaim_thread runs arc_adjust() to evict memory from the cache to keep it from growing too large, and to maintain a threshold of free memory that applications can quickly use. It does this periodically or when woken up by low-memory conditions. In the past I've seen the arc_reclaim_thread eat too much CPU when a tiny file system record size was used (e.g., 512 bytes), creating millions of tiny buffers. But that was basically a misconfiguration. The default size is 128 Kbytes, and people shouldn't be tuning below 8 Kbytes.

The right kernel tower enters spl_kmem_cache_reap_now(), another ZFS memory-freeing function. I imagine this is related to the left tower (e.g., contending for the same locks). But first: Why is ZFS in use?

## 5. Configuration

There was only one use of ZFS so far at Netflix that I knew of: a new infrastructure project was using ZFS for containers. This led me to a theory: if they were quickly churning through containers, they would also be churning through ZFS file systems, and this might mean that many old pages needed to be cleaned out of the cache. Ahh, it makes sense now. I told them my theory, confident I was on the right path.

But they replied: "We aren't using containers." Ok, then how _are_ you using ZFS? I did not expect their answer:
"We aren't using ZFS."
What!? Yes you are, I can see the arc_reclaim_thread in the flame graph. It doesn't run for fun! It's only on CPU because it's evicting pages from the ZFS ARC. If you aren't using ZFS, there are no pages in the ARC, so it shouldn't run. They were confident that they weren't using ZFS at all. The flame graph defied logic. I needed to prove to them that they were indeed using ZFS somehow, and figure out why.

## 6. cd & ls

I should be able to debug this using nothing more than the cd and ls(1) commands. cd to the file system and ls(1) to see what's there. The file names should be a big clue as to its use. First, finding out where the ZFS file systems are mounted:
df -h
mount
zfs list
This showed nothing! No ZFS file systems were currently mounted. I tried another instance and saw the same thing. Huh? Ah, but containers may have been created previously and since destroyed, hence no remaining file systems now. How can I tell if ZFS has ever been used?

## 7. arcstats

I know, arcstats! The kernel counters that track ZFS statistics, including ARC hits and misses. Viewing them:
# cat /proc/spl/kstat/zfs/arcstats
name                            type data
hits                            4    0
misses                          4    0
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    0
demand_metadata_misses          4    0
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    0
prefetch_metadata_misses        4    0
mru_hits                        4    0
mru_ghost_hits                  4    0
mfu_hits                        4    0
mfu_ghost_hits                  4    0
deleted                         4    0
mutex_miss                      4    0
evict_skip                      4    0
evict_not_enough                4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    0
[...]
Unbelievable! All the counters were zero! ZFS really wasn't in use, ever! But at the same time, it was eating over 30% of CPU capacity! Whaaat??

The customer had been right all along. ZFS was straight up eating CPU, and for no reason. How can a file system _that's not in use at all_ consume 38% CPU? I'd never seen this before. This was a mystery.

## 8. Code Analysis

I took a closer look at the flame graph and the paths involved, and saw that the CPU code paths led to get_random_bytes() and extract_entropy(). These were new to me. Browsing the [source code] and change history I found the culprit.

The ARC contains lists of cached buffers for different memory types. A performance feature ("multilist") had been added that split the ARC lists into one per CPU, to reduce lock contention on multi-CPU systems. Sounds good, as that should improve performance. But what happens when you want to evict some memory? You need to pick one of those CPU lists. Which one? You could go through them in a round-robin fashion, but the developer thought it better to pick one at random. _Cryptographically-secure random._

The kicker was that ZFS wasn't even in use. The ARC was detecting low system memory and then adjusting its size accordingly, at which point it'd discover it was zero size already and didn't need to do anything. But this was after randomly selecting a zero-sized list, using a CPU-expensive random number generator.

I filed this as ZFS [issue #6531]. I believe the first fix was to have the arc_reclaim_thread bail out earlier if ZFS wasn't in use, and not enter list selection. The ARC has since had many changes, and I haven't heard of this issue again.

[Kernel Recipes]: https://youtu.be/UVM3WX8Lq2k?t=133
[Kernel Recipes2]: https://www.slideshare.net/brendangregg/kernel-recipes-2017-using-linux-perf-at-netflix/2
[talks]: /index.html#Videos
[issue #6531]: https://github.com/openzfs/zfs/issues/6531
[source code]: https://github.com/openzfs/zfs/blob/4e33ba4c389f59b74138bf7130e924a4230d64e9/module/zfs/arc.c
[Atlas]: https://netflixtechblog.com/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a
[Vector]: https://netflixtechblog.com/introducing-vector-netflixs-on-host-performance-monitoring-tool-c0d3058c3f6f
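The code-analysis section above turns on how the per-CPU ("multilist") eviction sublist gets chosen. Here is a small sketch of the contrast in kernel-style C; this is illustrative only, not the actual OpenZFS code, and the names are made up:

/*
 * Illustration only -- not the actual OpenZFS code.  Contrast choosing an
 * eviction sublist with the cryptographically secure RNG versus a cheap
 * round-robin counter.
 */
#include <linux/random.h>
#include <linux/atomic.h>

static unsigned int pick_sublist_random(unsigned int num_sublists)
{
        unsigned int idx;

        get_random_bytes(&idx, sizeof(idx));    /* draws from the CSPRNG: expensive */
        return idx % num_sublists;
}

static unsigned int pick_sublist_round_robin(unsigned int num_sublists)
{
        static atomic_t next = ATOMIC_INIT(0);  /* one atomic increment: cheap */

        return (unsigned int)atomic_inc_return(&next) % num_sublists;
}

On these hosts ZFS wasn't in use at all, so the expensive pick was pure overhead: whichever sublist was selected was empty anyway.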

September 06, 2021 12:00 AM

August 30, 2021

Brendan Gregg: Analyzing a High Rate of Paging

These are rough notes. A service team was debugging a performance issue and noticed it coincided with a high rate of paging. I was asked to help, and used a variety of Linux performance tools to solve this, including those that use eBPF and Ftrace. This is a rough post to share this old but good case study of using these tools, and to help justify their further development. No editing, spell checking, or comments. Mostly screenshots.

## 1. Problem Statement

The microservice managed and processed large files, including encrypting them and then storing them on S3. The problem was that large files, such as 100 Gbytes, seemed to take forever to upload. Hours. Smaller files, as large as 40 Gbytes, were relatively quick, only taking minutes. A cloud-wide monitoring tool, Atlas, showed a high rate of paging for the larger file uploads:

The blue is pageins (page ins). Pageins are a type of disk I/O where a page of memory is read in from disk, and are normal for many workloads. You might be able to guess the issue from the problem statement alone.

## 2. iostat

Starting with my [60-second performance checklist], the iostat(1) tool showed a high rate of disk I/O during a large file upload:
# iostat -xz 1
Linux 4.4.0-1072-aws (xxx) 	12/18/2018 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.03    0.00    0.83    1.94    0.02   92.18

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.29    0.21    0.17     6.29     3.09    49.32     0.00   12.74    6.96   19.87   3.96   0.15
xvdb              0.00     0.08   44.39    9.98  5507.39  1110.55   243.43     2.28   41.96   41.75   42.88   1.52   8.25

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.81    0.00    1.08   29.94    0.06   54.11

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00  745.00    0.00 91656.00     0.00   246.06    25.32   33.84   33.84    0.00   1.35 100.40

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.86    0.00    0.89   24.76    0.06   59.43

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00  739.00    0.00 92152.00     0.00   249.40    24.75   33.49   33.49    0.00   1.35 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.95    0.00    0.89   28.75    0.06   55.35

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00  734.00    0.00 91704.00     0.00   249.87    24.93   34.04   34.04    0.00   1.36 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.54    0.00    1.14   29.40    0.06   54.86

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00  750.00    0.00 92104.00     0.00   245.61    25.14   33.37   33.37    0.00   1.33 100.00

^C
I'm looking at the r_await column in particular: the average wait time for reads in milliseconds. Reads usually have apps waiting on them; writes may not (write-back caching). An r_await of 33 ms is kinda high, and likely due to the queueing (avgqu-sz). They are largeish I/O, about 128 Kbytes (divide rkB/s by r/s).

## 3. biolatency

From [bcc], this eBPF tool shows a latency histogram of disk I/O. I'm running it in case the averages are hiding outliers, which could be a device issue:
# /usr/share/bcc/tools/biolatency -m
Tracing block device I/O... Hit Ctrl-C to end.
^C
     msecs               : count     distribution
         0 -> 1          : 83       |                                        |
         2 -> 3          : 20       |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 41       |                                        |
        16 -> 31         : 1620     |*******                                 |
        32 -> 63         : 8139     |****************************************|
        64 -> 127        : 176      |                                        |
       128 -> 255        : 95       |                                        |
       256 -> 511        : 61       |                                        |
       512 -> 1023       : 93       |                                        |
This doesn't look too bad. Most I/O are between 16 and 127 ms. Some outliers reach the 0.5 to 1.0 second range, but again, there's quite a bit of queueing here, as seen in the earlier iostat(1) output. I don't think this is a device issue. I think it's the workload.

## 4. bitesize

As I think it's a workload issue, I want a better look at the I/O sizes in case there's something odd:
# /usr/share/bcc/tools/bitesize 
Tracing... Hit Ctrl-C to end.
^C
Process Name = java
     Kbytes              : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 31       |                                        |
        16 -> 31         : 15       |                                        |
        32 -> 63         : 15       |                                        |
        64 -> 127        : 15       |                                        |
       128 -> 255        : 1682     |****************************************|
The I/O is mostly in the 128 to 255 Kbyte bucket, as expected from the iostat(1) output. Nothing odd here.

## 5. free

Also from the 60-second checklist:
# free -m
              total        used        free      shared  buff/cache   available
Mem:          64414       15421         349           5       48643       48409
Swap:             0           0           0
There's not much memory left, 349 Mbytes, but more interesting is the amount in the buffer/page cache: 48,643 Mbytes (48 Gbytes). This is a 64-Gbyte memory system, and 48 Gbytes is in the page cache (the file system cache). Along with the numbers from the problem statement, this gives me a theory: Do the 100-Gbyte files bust the page cache, whereas 40-Gbyte files fit?

## 6. cachestat

[cachestat] is an experimental tool I developed that uses Ftrace and has since been ported to bcc/eBPF. It shows statistics for the page cache:
# /apps/perf-tools/bin/cachestat
Counting cache functions... Output every 1 seconds.
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
    1811      632        2    74.1%           17      48009
    1630    15132       92     9.7%           17      48033
    1634    23341       63     6.5%           17      48029
    1851    13599       17    12.0%           17      48019
    1941     3689       33    34.5%           17      48007
    1733    23007      154     7.0%           17      48034
    1195     9566       31    11.1%           17      48011
[...]
This shows many cache misses, with a hit ratio varying between 6.5% and 74%. I usually like to see that in the upper 90's. This is "cache busting." The 100 Gbyte file doesn't fit in the 48 Gbytes of page cache, so we have many page cache misses that will cause disk I/O and relatively poor performance. The quickest fix is to move to a larger-memory instance that does fit 100 Gbyte files. The developers can also rework the code with the memory constraint in mind to improve performance (e.g., processing parts of the file, instead of making multiple passes over the entire file).

## 7. Smaller File Test

For further confirmation, I collected the same outputs for a 32 Gbyte file upload. cachestat shows a ~100% cache hit ratio:
# /apps/perf-tools/bin/cachestat
Counting cache functions... Output every 1 seconds.
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
   61831        0      126   100.0%           41      33680
   53408        0       78   100.0%           41      33680
   65056        0      173   100.0%           41      33680
   65158        0       79   100.0%           41      33680
   55052        0      107   100.0%           41      33680
   61227        0      149   100.0%           41      33681
   58669        0       71   100.0%           41      33681
   33424        0       73   100.0%           41      33681
^C
This smaller size allows the service to process the file (however many passes it takes) entirely from memory, without re-reading it from disk. free(1) shows it fitting in the page cache:
# free -m
              total        used        free      shared  buff/cache   available
Mem:          64414       18421       11218           5       34773       45407
Swap:             0           0           0
And iostat(1) shows little disk I/O, as expected:
# iostat -xz 1
Linux 4.4.0-1072-aws (xxx) 	12/19/2018 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.25    0.00    1.24    0.19    0.03   86.29

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.32    0.31    0.19     7.09     4.85    47.59     0.01   12.58    5.44   23.90   3.09   0.15
xvdb              0.00     0.07    0.01   11.13     0.10  1264.35   227.09     0.91   82.16    3.49   82.20   2.80   3.11

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          57.43    0.00    2.95    0.00    0.00   39.62

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          53.50    0.00    2.32    0.00    0.00   44.18

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00    0.00    2.00     0.00    19.50    19.50     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          39.02    0.00    2.14    0.00    0.00   58.84

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
[...]
## 8. Final Notes

cachestat was the killer tool here, but I should stress that it's still experimental. I wrote it for Ftrace with the constraints that it must be low-overhead and use the Ftrace function profiler only. As I mentioned in my LSFMMBPF 2019 keynote in Puerto Rico, where the Linux mm kernel engineers were present, I think the cachestat statistics are so commonly needed that they should be in /proc and not need my experimental tool. They pointed out challenges with providing them properly, and I think any robust solution is going to need their help and expertise. I hope this case study helps show why it is useful and worth the effort. Until the kernel does support page cache statistics (which may be never: they are hot-path, so adding counters isn't free) we can use my cachestat tool, even if it does require frequent maintenance to keep working.

[cachestat]: /blog/2014-12-31/linux-page-cache-hit-ratio.html
[60-second performance checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf
[bcc]: https://github.com/iovisor/bcc

August 30, 2021 12:00 AM

August 27, 2021

Michael Kerrisk (manpages): man-pages-5.13 released

Alex Colomar and I have released man-pages-5.13. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from 40 contributors. The release includes around 200 commits that changed around 120 manual pages.

The most notable of the changes in man-pages-5.13 are the following:

Special thanks again to Alex, who kept track of a lot of patches while I was unavailable.

August 27, 2021 09:07 PM