Thanks for all the suggestions here, on twitter and on mastodon; anyone who noted I could use a single fd and avoid all the pain was correct!
I hacked up an ever-growing ftruncate/madvise memfd and it seemed to work fine. In order to use it for sparse I have to use it for all device memory allocations in lavapipe, which means if I push forward I probably have to prove to myself that it works and scales a bit better. I suspect layering some of the pb bufmgr code on top of an ever-growing fd might work, or maybe just having multiple 2GB buffers might be enough.
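The shape of it is roughly this (a sketch with made-up helper names, not actual lavapipe code; error handling trimmed):

```c
/* Sketch of an ever-growing memfd used as a "device memory" pool.
 * pool_init/pool_grow/pool_map are hypothetical names, not lavapipe code. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

struct mem_pool {
   int fd;          /* memfd backing all device memory allocations */
   uint64_t size;   /* current length of the memfd */
};

static int pool_init(struct mem_pool *p)
{
   p->fd = memfd_create("lvp-devmem", MFD_CLOEXEC);
   p->size = 0;
   return p->fd < 0 ? -1 : 0;
}

/* Grow the file and hand back the offset of the new chunk. */
static int64_t pool_grow(struct mem_pool *p, uint64_t bytes)
{
   uint64_t offset = p->size;
   if (ftruncate(p->fd, p->size + bytes) < 0)
      return -1;
   p->size += bytes;
   return offset;
}

/* A non-sparse binding can then just mmap its slice of the fd. */
static void *pool_map(struct mem_pool *p, uint64_t offset, uint64_t bytes)
{
   return mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
               p->fd, offset);
}

/* Releasing a chunk's backing pages (the madvise part) would look like:
 *    madvise(mapped_ptr, bytes, MADV_REMOVE);
 * which punches the pages out of the shared memfd mapping. */
```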
I'm not sure how best to do shaderResourceResidency. userfaultfd might be somewhat useful, and mapping with PROT_NONE and then using write(2) to get a -EFAULT back is also promising, but I'm not sure how best to avoid segfaults for direct reads/writes to PROT_NONE regions.
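The write(2) trick, for reference: passing an unreadable buffer to a syscall gets you -EFAULT back instead of a SIGSEGV, so residency can be probed without a signal handler. A standalone sketch (not tied to any lavapipe code):

```c
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/* Probe whether the byte at addr is readable. write() into a pipe forces the
 * kernel to copy from addr, so a PROT_NONE page yields -EFAULT rather than a
 * SIGSEGV. (Writing to /dev/null wouldn't work: it never touches the buffer.)
 * pipefd[] comes from a pipe(2) created once at init time. */
static bool page_is_readable(int pipefd[2], const void *addr)
{
   char c;
   ssize_t r = write(pipefd[1], addr, 1);
   if (r == 1)
      read(pipefd[0], &c, 1);   /* drain so the pipe never fills up */
   return !(r < 0 && errno == EFAULT);
}
```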
Once I got that going, though, I ran headfirst into something that should have been obvious to me, but that I hadn't thought through.
llvmpipe allocates all its textures linearly; there is no tiling (even for Vulkan optimal tiling). Sparse textures are incompatible with linear implementations: for sparseImage2D you have to be able to give the sparse tile sizes from just the image format. This typically means you have to work out how large the tile that fits into a hw page is in w/h. Of course for a linear image this would depend on the image stride, not just the format, and you just don't have that information.
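For reference, this is the kind of calculation a tiled implementation gets to do, and that a linear one can't, since it would also need the stride (a sketch only; 64KB is the standard Vulkan sparse block size):

```c
#include <stdint.h>

/* Given only the texel size of a format, work out a WxH tile that fits in a
 * 64KB sparse block. A linear image can't do this: the row stride isn't
 * known from the format alone. */
static void sparse_tile_size(uint32_t bytes_per_texel,
                             uint32_t *tile_w, uint32_t *tile_h)
{
   uint32_t texels = 65536 / bytes_per_texel;  /* e.g. 16384 for RGBA8 */
   uint32_t w = 1, h = 1;

   /* split the texel count into power-of-two width and height */
   while (w * h < texels) {
      if (w <= h)
         w <<= 1;
      else
         h <<= 1;
   }
   *tile_w = w;   /* RGBA8 (4 bytes/texel): 128x128, matching the standard shapes */
   *tile_h = h;
}
```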
I guess it means texture tiling in llvmpipe might have to become a thing; we've thought about it over the years, but I don't think there's ever been a solid positive case for implementing it.
Might have to put sparse support on the back burner for a little while longer.
Mike nerdsniped me into wondering how hard sparse memory support would be in lavapipe.
The answer is unfortunately extremely.
Sparse binding essentially allows creating a vulkan buffer/image of a certain size, then plugging in chunks of memory to back it in page-size multiple chunks.
This works great with GPU APIs where we've designed this, but it's actually hard to pull off on the CPU.
Currently lavapipe allocates memory with an aligned malloc. It allocates objects with no backing and non-sparse bindings connect objects to the malloced memory.
However with sparse objects, the object creation should allocate a chunk of virtual memory space, then sparse binding should bind allocated device memory into the virtual memory space. Except Linux has no interfaces for doing this without using a file descriptor.
You can't mmap a chunk of anonymous memory that you allocated with malloc to another location. So if I malloc backing memory A at 0x1234000, but the virtual memory I've used for the object is at 0x4321000, there's no nice way to get the memory from the malloc to be available at the new location (unless I missed an API).
However you can do it with file descriptors. You can mmap a PROT_NONE area for the sparse object, then allocate the backing memory into file descriptors, then mmap areas from those file descriptors into the correct places.
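Concretely, the mechanism looks something like this (a minimal sketch of the approach, not actual lavapipe code; error handling trimmed):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stddef.h>

/* "Create" a sparse object: reserve virtual address space with no backing. */
static void *sparse_reserve(size_t size)
{
   return mmap(NULL, size, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* "Allocate device memory": an fd-backed allocation we can map anywhere. */
static int devmem_alloc(size_t size)
{
   int fd = memfd_create("devmem", MFD_CLOEXEC);
   if (fd >= 0 && ftruncate(fd, size) < 0) {
      close(fd);
      return -1;
   }
   return fd;
}

/* Sparse bind: splice a page-aligned chunk of the fd into the reservation.
 * MAP_FIXED replaces the PROT_NONE pages at exactly this address. */
static int sparse_bind(void *obj_base, size_t obj_offset,
                       int mem_fd, size_t mem_offset, size_t size)
{
   void *want = (char *)obj_base + obj_offset;
   void *got = mmap(want, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, mem_fd, mem_offset);
   return got == MAP_FAILED ? -1 : 0;
}
```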
But there are limits on file descriptors: by default you get a soft limit of 1024 and a hard limit of 4096, which is woefully low for this. Also *all* device memory allocations would need to be fd backed, not just ones going to be used in sparse allocations.
Vulkan has a limit, maxMemoryAllocationCount, that could be used for this, but setting it to the fd limit is a problem because some fds are being used by the application and just in general by normal operations, so reporting 4096 for it is probably going to explode if you only have 3900 of them left.
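If you did try to derive maxMemoryAllocationCount from the fd limit, you'd have to leave a big margin for descriptors the app itself uses; something like this made-up heuristic:

```c
#include <sys/resource.h>
#include <stdint.h>

/* Derive a (conservative, made-up) maxMemoryAllocationCount from the fd
 * limit, leaving headroom for fds the app and the rest of the stack use. */
static uint32_t guess_max_allocations(void)
{
   struct rlimit rl;
   if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
      return 1024;
   uint64_t budget = rl.rlim_cur > 512 ? rl.rlim_cur - 512 : rl.rlim_cur / 2;
   return budget > UINT32_MAX ? UINT32_MAX : (uint32_t)budget;
}
```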
Also the sparse CTS tests don't respect the maxMemoryAllocationCount anyways :-)
I shall think on this a bit more, please let me know if anyone has any good ideas!
(Edit 2023-05-10: This has now launched for a subset of Twitter users. The code that existed to notify users that device identities had changed does not appear to have been enabled - as a result, in its current form, Twitter can absolutely MITM conversations and read your messages)
Elon Musk appeared on an interview with Tucker Carlson last month, with one of the topics being the fact that Twitter could be legally compelled to hand over users' direct messages to government agencies since they're held on Twitter's servers and aren't encrypted. Elon talked about how they were in the process of implementing proper encryption for DMs that would prevent this - "You could put a gun to my head and I couldn't tell you. That's how it should be."
tl;dr - in the current implementation, while Twitter could subvert the end-to-end nature of the encryption, it could not do so without users being notified. If any user involved in a conversation were to ignore that notification, all messages in that conversation (including ones sent in the past) could then be decrypted. This isn't ideal, but it still seems like an improvement over having no encryption at all. More technical discussion follows.
For context: all information about Twitter's implementation here has been derived from reverse engineering version 9.86.0 of the Android client and 9.56.1 of the iOS client (the current versions at time of writing), and the feature hasn't yet launched. While it's certainly possible that there could be major changes in the protocol between now and launch, Elon has asserted that they plan to launch the feature this week, so it's plausible that this reflects what'll ship.
For it to be impossible for Twitter to read DMs, they need to not only be encrypted, they need to be encrypted with a key that's not available to Twitter. This is what's referred to as "end-to-end encryption", or e2ee - it means that the only components in the communication chain that have access to the unencrypted data are the endpoints. Even if the message passes through other systems (and even if it's stored on other systems), those systems do not have access to the keys that would be needed to decrypt the data.
End-to-end encrypted messengers were initially popularised by Signal, but the Signal protocol has since been incorporated into WhatsApp and is probably much more widely used there. Millions of people per day are sending messages to each other that pass through servers controlled by third parties, but those third parties are completely unable to read the contents of those messages. This is the scenario that Elon described, where there's no degree of compulsion that could cause the people relaying messages to and from people to decrypt those messages afterwards.
But for this to be possible, both ends of the communication need to be able to encrypt messages in a way the other end can decrypt. This is usually performed using AES, a well-studied encryption algorithm with no known significant weaknesses. AES is a form of what's referred to as symmetric encryption, one where encryption and decryption are performed with the same key. This means that both ends need access to that key, which presents us with a bootstrapping problem: until a shared secret is obtained, there's no way to communicate securely, so how do we generate that shared secret?

A common mechanism for this is something called Diffie Hellman key exchange, which makes use of asymmetric encryption. In asymmetric encryption, an encryption key can be split into two components - a public key and a private key. Both devices involved in the communication combine their own private key with the other party's public key to generate a secret that can only be derived with access to one of the private keys. As long as you know the other party's public key, you can now securely generate a shared secret with them. Even a third party with access to all the public keys won't be able to identify this secret. Signal makes use of a variation of Diffie-Hellman called Extended Triple Diffie-Hellman that has some desirable properties, but it's not strictly necessary for the implementation of something that's end-to-end encrypted.
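To make the asymmetric half concrete, here's roughly what an elliptic-curve Diffie-Hellman exchange over P-256 looks like with OpenSSL (a generic sketch of the primitive, not code derived from Twitter's client; error handling omitted for brevity):

```c
#include <openssl/evp.h>
#include <openssl/ec.h>
#include <openssl/obj_mac.h>

/* Generate a P-256 keypair: the public half can be shared freely. */
static EVP_PKEY *p256_keygen(void)
{
   EVP_PKEY *key = NULL;
   EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new_id(EVP_PKEY_EC, NULL);
   EVP_PKEY_keygen_init(ctx);
   EVP_PKEY_CTX_set_ec_paramgen_curve_nid(ctx, NID_X9_62_prime256v1);
   EVP_PKEY_keygen(ctx, &key);
   EVP_PKEY_CTX_free(ctx);
   return key;
}

/* Combine our private key with the peer's public key. Both sides arrive at
 * the same secret; anyone holding only the two public keys cannot. The raw
 * secret would normally be run through a KDF before use as an AES key. */
static size_t ecdh_shared_secret(EVP_PKEY *ours, EVP_PKEY *theirs,
                                 unsigned char *out, size_t out_len)
{
   EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new(ours, NULL);
   EVP_PKEY_derive_init(ctx);
   EVP_PKEY_derive_set_peer(ctx, theirs);
   EVP_PKEY_derive(ctx, out, &out_len);
   EVP_PKEY_CTX_free(ctx);
   return out_len;   /* 32 bytes for P-256 */
}
```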
Although it was rumoured that Twitter would make use of the Signal protocol, and in fact there are vestiges of code in the Twitter client that still reference Signal, recent versions of the app have shipped with an entirely different approach that appears to have been written from scratch. It seems simple enough. Each device generates an asymmetric keypair using the NIST P-256 elliptic curve, along with a device identifier. The device identifier and the public half of the key are uploaded to Twitter using a new API endpoint called /1.1/keyregistry/register. When you want to send an encrypted DM to someone, the app calls /1.1/keyregistry/extract_public_keys with the IDs of the users you want to communicate with, and gets back a list of their public keys. It then looks up the conversation ID (a numeric identifier that corresponds to a given DM exchange - for a 1:1 conversation between two people it doesn't appear that this ever changes, so if you DMed an account 5 years ago and then DM them again now from the same account, the conversation ID will be the same) in a local database to retrieve a conversation key. If that key doesn't exist yet, the sender generates a random one. The message is then encrypted with the conversation key using AES in GCM mode, and the conversation key is then put through Diffie-Hellman with each of the recipients' public device keys. The encrypted message is then sent to Twitter along with the list of encrypted conversation keys. When each of the recipients' devices receives the message it checks whether it already has a copy of the conversation key, and if not performs its half of the Diffie-Hellman negotiation to decrypt the encrypted conversation key. Once it has the conversation key it decrypts the message and shows it to the user.
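And the symmetric half, encrypting one message with a conversation key using AES in GCM mode, again as a generic OpenSSL sketch - the key size, nonce handling and output framing here are my assumptions, not details recovered from the client:

```c
#include <openssl/evp.h>
#include <openssl/rand.h>

/* Encrypt one message with a (hypothetical) 32-byte conversation key.
 * Output layout: 12-byte random nonce || ciphertext || 16-byte GCM tag.
 * Error checking omitted for brevity. */
static int encrypt_message(const unsigned char key[32],
                           const unsigned char *msg, int msg_len,
                           unsigned char *out, int *out_len)
{
   unsigned char *nonce = out;
   unsigned char *ct = out + 12;
   int len = 0, ct_len = 0;

   RAND_bytes(nonce, 12);              /* never reuse a nonce with the same key */

   EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
   EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, nonce);
   EVP_EncryptUpdate(ctx, ct, &len, msg, msg_len);
   ct_len = len;
   EVP_EncryptFinal_ex(ctx, ct + ct_len, &len);
   ct_len += len;
   EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, ct + ct_len);
   EVP_CIPHER_CTX_free(ctx);

   *out_len = 12 + ct_len + 16;
   return 0;
}
```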
What would happen if Twitter changed the registered public key associated with a device to one where they held the private key, or added an entirely new device to a user's account? If the app were to just happily send a message with the conversation key encrypted with that new key, Twitter would be able to decrypt that and obtain the conversation key. Since the conversation key is tied to the conversation, not any given pair of devices, obtaining the conversation key means you can then decrypt every message in that conversation, including ones sent before the key was obtained.
(An aside: Signal and WhatsApp make use of a protocol called Sesame which involves additional secret material that's shared between every device a user owns, hence why you have to do that QR code dance whenever you add a new device to your account. I'm grossly over-simplifying how clever the Signal approach is here, largely because I don't understand the details of it myself. The Signal protocol uses something called the Double Ratchet Algorithm to implement the actual message encryption keys in such a way that even if someone were able to successfully impersonate a device they'd only be able to decrypt messages sent after that point even if they had encrypted copies of every previous message in the conversation)
How's this avoided? Based on the UI that exists in the iOS version of the app, in a fairly straightforward way - each user can only have a single device that supports encrypted messages. If the user (or, in our hypothetical, a malicious Twitter) replaces the device key, the client will generate a notification. If the user pays attention to that notification and verifies with the recipient through some out of band mechanism that the device has actually been replaced, then everything is fine. But, if any participant in the conversation ignores this warning, the holder of the subverted key can obtain the conversation key and decrypt the entire history of the conversation. That's strictly worse than anything based on Signal, where such impersonation would simply not work, but even in the Twitter case it's not possible for someone to silently subvert the security.
So when Elon says Twitter wouldn't be able to decrypt these messages even if someone held a gun to his head, there's a condition applied to that - it's true as long as nobody fucks up. This is clearly better than the messages just not being encrypted at all in the first place, but overall it's a weaker solution than Signal. If you're currently using Twitter DMs, should you turn on encryption? As long as the limitations aren't too limiting, definitely! Should you use this in preference to Signal or WhatsApp? Almost certainly not. This seems like a genuine incremental improvement, but it'd be easy to interpret what Elon says as providing stronger guarantees than actually exist.
After some hiccups with Indico we’ve finally set up a page that lists submitted microconference proposals. Along with seasoned veterans like Containers and Checkpoint/Restore and RISC-V we are glad to see Live Patching microconference returning after a long break and a brand new Linux Kernel Debugging microconference.
The Proposed microconferences page will be updated from time to time until the CFP for microconference proposals closes on June 1.
Be sure not to miss the deadline and submit your microconference!
Much angst (and discussion ink) is wasted in open source over whether pulling in code from one project with a different licence into another is allowable based on the compatibility of the two licences. I call this problem self defeating because it creates sequestered islands of incompatibly licensed but otherwise fully open source code that can never ever meet in combination. Everyone from the most permissive open source person to the most ardent free software one would agree this is a problem that should be solved, but most of the islands would only agree to it being solved on their terms. Practically, we have got around this problem by judicious use of dual licensing but that requires permission from the copyright holders, which can sometimes be hard to achieve; so dual licensing is more a band aid than a solution.
In this blog post, I’m going to walk you through the reasons behind one of the most intractable compatibility disputes in open source: Apache-2 vs GPLv2. However, before we get there, I’m first going to walk through several legal issues in general contract and licensing law and then get on to the law and politics of open source licensing.
The Law of Contracts and Licences
Contracts and Licences come from very similar branches of the law, and concepts that apply to one often apply to the other. For this legal tour we’ll begin with materiality in contracts, followed by materiality in licences, then look at repairable and irreparable legal harms and finally the conditions necessary to take court action.
Materiality in Contracts
This is actually a well studied and taught bit of the law. The essence is that every contract has a “heart”, or core set of clauses, which really represent what the parties want from each other, and often a set of peripheral clauses which don’t really affect the “heart” of the contract if they’re not fulfilled. Failures to fulfil the latter are said to cause non-material breaches of the contract (i.e. breaches which don’t terminate the contract if they happen, although a party may still have an additional legal claim for the breach if it caused some sort of harm). A classic illustration, often used in law schools, is a contract for the electrical wiring of a house that specifies yellow insulation. The contractor can’t find yellow, so wires the house with blue insulation. The contract doesn’t suffer a material breach because the wires are in the wall (where no-one can see), there’s no safety issue with the colour, and the heart of the contract was about wiring the house, not about wire colour.
Materiality in Licensing
This is actually much less often discussed, but it’s still believed that licences are subject to the same materiality constraints as contracts, and for this reason licences often contain “materiality clauses” to describe what the licensor considers to be material to it. So for the licensing example, consider a publisher wishing to publish a book written by a famous author known as the “Red Writer”. A licence to publish for per-copy royalties of 25% of the purchase price of the book is agreed, but the author inserts a clause specifying, by exact pantone number, the red that must be the predominant colour of the binding (it’s why they’re known as the “Red Writer”) and also throws in a clause terminating the copyright licence for breaches. The publisher does the first batch of 10,000 copies, but only after they’ve been produced discovers that the red is actually one pantone shade lighter than that specified in the licence. Since the cost of destroying the batch and reprinting is huge, the publisher offers the copies for sale knowing they’re out of spec. Some time later the “Red Writer” comes to know of the problem, decides the licence is breached and therefore terminated, so the publisher owes statutory damages (yes, they’ve registered their copyright) per copy on 10,000 books (about $300 million maximum). Would the author win?
The answer of course is that no court is going to award the author $300 million. Most courts would take the view that the heart of the contract was about money and if the author got their royalties per book, there was no material breach and the licence continues in force for the publisher. The “Red Writer” may have a separate tort claim for reputational damage if any was caused by the mis-colouring of the book, but that’s it.
Open Source Enforcement and Harm
Looking at the examples above, you can see that most commercial applications of the law eventually boil down to money: you go to court alleging a harm, the court must agree and then assess the monetary compensation for the harm which becomes damages. Long ago in community open source, we agreed that money could never compensate for a continuing licence violation because if it could we’d have set a price for buying yourself out of the terms of the licence (and some Silicon Valley Rich Companies would actually be willing to pay it, since it became the dual licence business model of companies like MySQL)1. The principle that mostly applies in open source enforcement actions is that the harm is to the open source ecosystem and is caused by non-compliance with the licence. Since such harm can only be repaired by compliance that’s the essence of the demand. Most enforcement cases have been about egregious breaches: lack of any source code rather than deficiencies in the offer to provide source code, so there’s actually very little in court records with regard to materiality of licence breaches.
One final thing to note about enforcement cases is there must always be an allegation of material harm to someone or something, because you can’t go into court and argue on abstract legal principles (as we seem to like to do on various community mailing lists); you must show actual consequences as well. In addition to consequences, you must propose a viable remedy for the harm that a court could impose. As I said above, in open source cases it’s often about harms to the open source ecosystem caused by licence breaches, which is often accepted unchallenged by the defence because the case is about something obviously harmful to open source, like failure to provide source code (and the remedy is correspondingly to provide the source code). However, when considering the examples below it’s instructive to think about how an allegation of harm around a combination of incompatible open source licences would play out. Since the source code is available, there would be much more argument over what the actual harm to the ecosystem, if any, was, and even if some theoretical harm could be demonstrated, what would the remedy be?
Applying this to Apache-2 vs GPLv2
The divide between the Apache Software Foundation (ASF) and the Free Software Foundation (FSF) is old and partly rooted in politics. For proof of this notice the FSF says that the two licences (GPLv2 and Apache-2) are legally incompatible and in response the ASF says no-one should use any GPL licences anyway. The purpose of this section is to guide you through the technicalities of the incompatibility and then apply the materiality lessons from above to see if they actually matter.
Why GPLv2 is Incompatible with Apache-2
The argument is that Apache-2 contains two incompatible clauses: the patent termination clause (Section 3), which says that if you launch an action against anyone alleging the licensed code infringes your patent then all your rights to patents in the code under the Apache-2 licence terminate; and the indemnity clause (Section 9), which says that if you want to offer a warranty you must indemnify every contributor against any liability that warranty might incur. By contrast, GPLv2 contains an implied patent licence (Section 7) and a No Warranty clause (Section 11). Licence scholars mostly agree that the patent and indemnity terms in GPLv2 are weaker than those in Apache-2.
The incompatibility now occurs because GPLv2 says in Section 2 that the entire work after the combination must be shipped under GPLv2, which is possible: Apache-2 is mostly permissive except for the stronger patent and indemnity clauses. However, it is arguable that without keeping those stronger clauses on the Apache-2 code you’ve violated the Apache-2 licence, and the GPLv2 no additional restrictions clause (Section 6) prevents you from keeping the stronger patent and indemnity clauses even on the Apache-2 portions of the code. Thus Apache-2 and GPLv2 are incompatible.
Materiality and Incompatibility
It should be obvious from the above that it’s hard to make a materiality argument for dropping the stronger Apache-2 provisions, because someone, somewhere might one day get into a situation where they would have helped. However, we can look at the materiality of the no additional restrictions clause in GPLv2. The FSF has always taken the absolutist position on this, which is why they think practically every other licence is GPLv2 incompatible: when you dig, at least one clause in every other open source licence can be regarded as an additional restriction. We also can’t take the view that the whole clause is not material: there are obviously some restrictions (like “you must pay me for every additional distribution of the code”) that would destroy the open source nature of the licence. This is the whole point of the no additional restrictions clause: to prevent the downstream addition of clauses incompatible with the free software goal of the licence.
I mentioned in the section on Materiality in Licences that some licences have materiality clauses that try to describe what’s important to the licensor. It turns out that GPLv2 actually does have a materiality clause: the preamble. We all tend to skip the preamble when analysing the licence, but there’s no denying it’s 7 paragraphs of justification for why the licence looks like it does and what its goals are.
So, to take the easiest analysis first: does the additional indemnity Apache-2 requires represent a material additional restriction? The preamble actually says “for each author’s protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors’ reputations.” Even on a plain reading, an additional strengthening of that by providing an indemnity to the original authors has to be consistent with the purpose as described, so the indemnity clause can’t be regarded as a material additional restriction (a restriction which would harm the aims of the licence) when read in combination with the preamble.
Now the patent termination clause. The preamble has this to say about patents “Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone’s free use or not licensed at all.” So giving licensees the ability to terminate the patent rights for patent aggressors would appear to be an additional method of fulfilling the last sentence. And, again, the patent termination clause seems to be consistent with the licence purpose and thus must also not be a material additional restriction.
Thus the final conclusion is that while the patent and indemnity clauses of Apache-2 do represent additional restrictions, they’re not material additional restrictions according to the purpose of the licence as outlined by its materiality clause and thus the combination is permitted. This doesn’t mean the combination is free of consequences: the added code still carries the additional restrictions and you must call that out to the downstream via some mechanism like licensing tags, but it can be done.
Proving It
The only way to prove the above argument is to win in court on it. However, herein lies another good reason why combining Apache-2 and GPLv2 is allowed: there’s no real way to demonstrate harm to anything (either to the copyright holder who agreed to GPLv2 or to the Community), and without a theory of actual Harm, no-one would have standing to get to court to test the argument. This may look like a catch-22, but it’s another solid reason why, even in the absence of the materiality arguments, this would ultimately be allowed (if you can’t prevent it, it must be allowable, right …).
Community Problems with the Materiality Approach
The biggest worry about the loosening of the “no additional restrictions” clause of the GPL is opening the door to further abuse of the licence by unscrupulous actors. While I agree that this should be a concern, I think it is adequately addressed by rooting the materiality of the licence in the preamble or in provable harm to the open source community. There is also the flip side of this: licences are first and foremost meant to serve the needs of their development community rather than become inflexible implements for a group of enforcers, so even if there were some putative additional abuse in this approach, I suspect it would be outweighed by the licence compatibility benefit to the development communities in general.
Conclusion
The first thing to note is that Open Source incompatible licence combination isn’t as easy as simply combining the code under a single licence: You have to preserve the essential elements of both licences in the code which is combined (although not necessarily the whole project), so for an Apache-2/GPLv2 combination, you’ll need a note on the files saying they follow the stronger Apache patent termination and indemnity even if they’re otherwise GPLv2. However, as long as you’re careful the combination works for either of two reasons: because the Apache-2 restrictions aren’t material additional restrictions under the GPLv2 preamble or because no-one was actually harmed in the making of the combination (or both).
One can see from the above that similar arguments can be applied to various other supposedly incompatible licence combinations (exercise for the reader: try it with BSD-4-Clause and GPLv2). One final point that should be made is that licences and contracts are also all about what was in the minds of the parties, so for open source licences on community code, the norms and practices of the community matter in addition to what the licence actually says and what courts have made of it. In the final analysis, if the community norm of, say, a GPLv2 project is to accept Apache-2 code allowing for the stronger patent and indemnity clauses, then that will become the understood basis for interpreting the GPLv2 licence in that community.
For completeness, I should point out I’ve used the no harm no foul reasoning before when arguing that CDDL and GPLv2 are compatible.
eBPF has many uses in improving computer security, but just taking eBPF observability tools as-is and using them for security monitoring would be like driving your car into the ocean and expecting it to float.
Observability tools are designed to have the lowest overhead possible so that they are safe to run in production while analyzing an active performance issue. Keeping overhead low can require tradeoffs in other areas: tcpdump(8), for example, will drop packets if the system is overloaded, resulting in incomplete visibility. This creates an obvious security risk for tcpdump(8)-based security monitoring: An attacker could overwhelm the system with mostly innocent packets, hoping that a few malicious packets get dropped and are left undetected. Long ago I encountered systems which met strict security auditing requirements with the following behavior: If the kernel could not log an event, it would immediately **halt**! While this was vulnerable to DoS attacks, it met the system's security auditing non-repudiation requirements, and logs were 100% complete.
There are ways to evade detection in other tools as well, like top(1) (since it samples processes and relies on their comm field) and even ls(1) (putting escape characters in file names). Rootkits do this. These techniques have been known in the industry for decades and haven't been "fixed" because they aren't "broken." They are cars, not boats. Similar methods can be used to evade detection in the eBPF bcc and bpftrace observability tools as well: overwhelming them with events, doing time-of-check-time-of-use (TOCTOU) attacks, escape characters, etc.
When will the eBPF community "fix" these tools? Well, when will Tesla fix my Model 3 so I can drive it under the Oakland bridge instead of over it? (I joke, and I don't drive a Tesla.) What you actually want is a security monitoring tool that meets a different set of requirements. Trying to adapt observability tools into security tools generally increases overhead (e.g., adding extra probes) which negates the main reason I developed these using eBPF in the first place. That would be like taking the wheels off a car to help make it float. There are other issues as well, like decreasing maintainability when moving probes from stable tracepoints to unstable inner workings for TOU tracing. Had I written these as security tools to start with, I would have done them differently: I'd start with LSM hooks, use a plugin model instead of standalone CLI tools, support configurable policies for event drop behavior, optimize event logging (which we still haven't [done](https://github.com/iovisor/bcc/issues/1033)), and lots more.
None of this should be news to experienced security engineers. I'm writing this post because others see the tools and examples I've shared and believe that, with a bit of shell scripting, they could have a good security monitoring product. I get that it looks that way, but in reality there's a bunch of work to do. Ideally I'd link to an example in bcc for security monitoring (we could create a subdirectory for them) but that currently doesn't exist. In the meantime my best advice is: If you are making a security monitoring product, hire a good security engineer (e.g., someone with solid pen-testing experience).
BPF for security monitoring was first explored by myself and a Netflix security engineer, Alex Maestretti, in a [2017 BSides talk] (some slides below). Since then I've worked with other security engineers on the topic (hi Michael, Nabil, Sargun, KP). (I also did security work many years ago, so I'm not completely new to the topic.)
There is potential for an awesome eBPF security product, and it's not just the visibility that's valuable (all those arrows); it's also the low overhead. These slides included our [overhead evaluation] showing bcc/eBPF was far more efficient than auditd or go-audit. (It was pioneering work, but unfortunately the slides are all we have: Alex, I, and others left Netflix before open sourcing it.) There are now other eBPF security products, including open source projects (e.g., [tetragon]), but I don't know enough about them all to have a recommendation.
Note that I'm talking about the observability tools here and not the eBPF kernel runtime itself, which has been designed as a secure sandbox. Nor am I talking about privilege escalation, since to run the tools you already need root access (that car has sailed!).
[2017 BSides talk]: https://www.brendangregg.com/Slides/BSidesSF2017_BPF_security_monitoring
[overhead evaluation]: https://www.brendangregg.com/Slides/BSidesSF2017_BPF_security_monitoring/#17
[tetragon]: https://github.com/cilium/tetragon
The latest stable kernel is Linux 6.3, released by Linus Torvalds on Sunday, April 23rd, 2023.
The latest mainline (development) kernel is 6.3. The Linux 6.4 “merge window” is open.
Linux 6.3
Linus Torvalds announced the release of Linux 6.3, noting, “It’s been a calm release this time around, and the last week was really no different. So here we are, right on schedule”. As usual, the KernelNewbies website has a summary of Linux 6.3, including links to the appropriate LWN (Linux Weekly News) articles with deep dives for each new feature (if you like this podcast and want to support Linux Kernel journalism, please subscribe to Linux Weekly News).
Linux 6.3 includes additional support for the Rust programming language, a new red-black tree data structure for BPF programs, and the removal of a large number of legacy Arm systems.
With the release of Linux 6.3 comes the opening of the “merge window” (period of time during which disruptive changes are allowed to be merged into the kernel source code) for what will be Linux 6.4 in another couple of months. The next podcast release will include a full summary.
Thorsten Leemhuis has been doing his usual excellent work tracking regressions. He posted multiple updates during the Linux 6.3 development cycle as usual, at one point saying that “The list of regressions from the 6.3 cycle I track is still quite short”. Most seemed to relate to build problems that had stalled for fixes. He had been concerned that there “are two regressions from the 6.2 cycle still not fixed”. These included that “Wake-on-lan (WOL) apparently is broken for a huge number of users” and “a huge number of DISCARD request on NVME devices with Btrfs” causing “a performance regression for some users”. With the final release of Linux 6.3, he has “nothing much to report”, with just “two regression from the 6.3 cycle…worth mentioning”.
Sebastian Andrej Siewior announced pre-empt RT (Real Time) patch v6.3-rc5-rt8.
Shuah Khan posted a summary of complaints addressed by the Linux Kernel Code of Conduct Committee from October 1, 2022 through March 31, 2023. During that time, they received reports of “Unacceptable behavior of comments in email” 6 times. Most were resolved with “Clarification on the Code of Conduct related to maintainer rights and responsibility to reject code”. Overall, “The reports were about the decisions made in rejecting code and these actions are not viewed as violations of the Code of Conduct”.
Russia
It cannot have escaped anyone’s attention that there is an active military conflict ongoing in Europe. I try to keep politics out of this podcast. We are, after all, not lacking for other places in which to debate our opinions. Similarly, for the most part, it can be convenient as Open Source developers to attempt to live in an online world devoid of politics and physical boundaries, but the real world very much continues to exist, and in the real world there are consequences (in the form of sanctions) faced by those who invade other sovereign nations. Those consequences can be imposed by governments, but also by fellow developers. The latter was the case over the past month with a patch posted to the Linux “netdev” networking development list.
An engineer from (sanctioned) Russian company Baikal Electronics attempted to post some network patches. His post was greeted by a terse response from one of the maintainers: “We don’t feel comfortable accepting patches from or relating to hardware produced by your organization. Please withhold networking contributions until further notice”. Baikal is known for its connections to the Russian state. The question of official policy was subsequently raised by James Harkonnen, citing a message allegedly from Linus in which he reportedly said “I will not stop any kernel developer I trust from taking patches from Russian sources that they in turn trust, but at the same time I will also not override anybody who goes “I don’t want to have anything to do with this” and doesn’t want to work with Russian companies”. James wanted a clarification as to any official position. As of this date no follow up discussion appears to have taken place, and there does not appear to be an official kernel-wide policy on Russian patches.
Introducing Bugbot
Konstantin Ryabitsev, who is responsible for running kernel.org on behalf of the Linux Foundation, posted “Introducing bugbot”, in which he described a new tool that aims to be “a bridge between bugzilla [as in bugzilla.kernel.org] and public-inbox (the mailing list)”. The tool is “still a very early release” but it is able to “Create bugs from mailing list discussions, with full history”, and “Start mailing list threads from pre-triaged bugzilla bugs”. He closed (presciently) with “bugbot is very young and probably full of bugs, so it will still see a lot of change and will likely explode a couple of times”. True to the prediction, bugbot saw that it was summoned by the announcement of its existence and replied to the thread, which Konstantin used as an example of the “may explode” comment he had made. Generally, feedback to the new tool was positive.
Ongoing Development
Anjali Kulkarni posted version 3 of “Process connector bug fixes & enhancements”, a patch series to improve the performance of monitoring the exit of dependent threads. According to Anjali, “Oracle DB runs on a large scale with 100000s of short lived processes, starting up and exiting quickly. A process monitoring DB daemon which tracks and cleans up after processes that have died without a proper exit needs notifications only when a process died with a non-zero exit code (which should be rare)”. The patches allow a “client [to] register to listen for only exit or fork or a mix of all events. This greatly enhances performance”.
Vlastimil Babka posted “remove SLOB and allow kfree() with kmem_cache_alloc()”. In the patch posted, Vlastimil notes that “The SLOB allocator was deprecated in 6.2 so I think we can start exposing the complete removal in for-next and aim at 6.4 if there are no complaints”.
Thorsten Leemhuis (“the Linux kernel’s regression tracker”) poked an older thread about a 20% UDP performance degradation that Tariq Toukan (NVIDIA) had reported a few months ago. The report observed that a specific CFS (Completely Fair Scheduler, the current default Linux scheduler) patch was the culprit, but that the team discovering it “couldn’t come up with a good explanation how this patch causes this issue”. Thorsten tagged the mail for followup tracking.
Lukas Bulwahn posted “Updating information on lanana.org”. Lanana was set up to be “The Linux Assigned Names and Numbers Authority”, a play on organizations like IANA, the Internet Assigned Numbers Authority, which assigns e.g. IP addresses on the internet. As the patches note, “As described in Documentation/admin-guide/devices.rst, the device number register (or linux device list) is at Documentation/admin-guide/devices.txt and no longer maintained at lanana.org”. Lanana still technically hosts some of the LSB (Linux Standard Base) IDs.
On the Rust front, Asahi Lina posted “rust: add uapi crate” that “introduce[s] a new ‘uapi’ crate that will contain only these [uapi] publicly usable definitions” for use by userspace APIs.
Marcelo Tosatti posted “fold per-CPU vmstats remotely”, a patch that notes a (Red Hat) customer had encountered a system in which 48 out of 52 CPUs were in a “nohz_full” state (i.e. completely idle with the idle “tick” interrupt stopped), where a process on the system was “trapped in throttle_direct_reclaim” (a low memory “reclaim” codepath) but was not making progress because the counters the reclaim code wanted to use were stale (coming from a completely idle CPU) and not updating. The patch series causes the “vmstat_shepherd” kernel thread to “flush the per-CPU counters to the global counters from remote [other] CPUs”.
Reinette Chatre posted “vfio/pci: Support dynamic allocation of MSI-X interrupts”. MSIs are “Message Signaled Interrupts”, typically used by modern buses, such as PCIe, in which an interrupt is not signaled using a traditional wiggling of a wire, but instead by a memory write to a special magic address that subsequently causes an actual hard-wired interrupt to be asserted. In the patch posting, Reinette noted that “Qemu allocates interrupts incrementally at the time the guest unmasks an interrupt, for example each time a Linux guest runs request_irq(). Dynamic allocation of MSI-X interrupts was not possible until v6.2. This prompted Qemu to, when allocating a new interrupt, first release a previously allocated interrupts (including disable of MSI-X) followed by re-allocation of all interrupts that includes the new interrupt”. This of course may not be possible while a device or accelerator is running. The patches are marked as RFC (Request For Comments) because “vfio support for dynamic MSI-X needs to work with existing user space as well as upcoming user space that takes advantage of this feature”. Reinette adds, “I would appreciate guidance on the expectations and requirements surrounding error handling when considering existing user space”. She provides several scenarios to consider.
Tejun Heo posted version 3 of “sched: Implement BPF extensible scheduler class”, which “proposed a new scheduler class called ‘ext_sched_class’, or sched_ext, which allows scheduling policies to be implemented as BPF programs”. BPF (Berkeley Packet Filter) programs are small specially processed “bytecode” programs that can be loaded into the kernel and run within a special form of sandbox. They are commonly used to implement certain tracing logic and come with restrictions (for obvious reasons) on the nature of the modifications they can make to a running kernel. Due to their complexity, and potential intrusiveness of allowing scheduling algorithms to be implemented in BPF programs, the patches come with a (lengthy) “Motivation” section, describing the “Ease of experimentation and exploration”, among other reasons for allowing BPF extension of the scheduler instead of requiring traditional patches. An example provided includes that of implementing an L1TF (L1 Terminal Fault, a speculation execution security side-channel bug in certain x86 CPUs) aware scheduler that performs co-scheduling of (safe to pair) peer threads using sibling hyperthreads using BPF.
Joel Fernandes sent a patch adding himself as a maintainer for RCU, noting “I have spent years learning / contributing to RCU with several features, talks and presentations, with my most recent work being on Lazy-RCU. Please consider me for M[aintainer], so I can tell my wife why I spend a lot of my weekends and evenings on this complicated and mysterious thing — which is mostly in the hopes of preventing the world from burning down because everything runs on this one way or another”. RCU (Read-Copy-Update) is a notoriously difficult subsystem to understand yet it is a feature of certain modern Operating Systems that allows them to gain significant performance enhancements from the fundamental notion of having different views into the same data, based upon point-in-time producers and consumers that come and go. Joel later followed up with “Core RCU patches for 6.4”, including the shiny new MAINTAINERS change and several other fixes.
Separately, Paul McKenney (the original RCU author, and co-inventor) posted assorted updates to sleepable RCU (SRCU) reducing cache footprint and marking it non-optional in Kconfig (kernel build configuration), “courtesy of new-age printk() requirements”.
Mike Kravetz raised a concern about THP (Transparent Huge Page) “backed thread stacks”. In his mail, he cited a “product team” that had “recently experienced ‘memory bloat’ in their environment” due to the alignment of the allocations they had used for thread local stacks within the Java Virtual Machine (JVM) runtime. Mike questioned whether stacks should always be THP given that “Stacks by their very nature grow in somewhat unpredictable ways over time”. Most replies were along the lines that the JVM should alter how it does allocations to use the MADV_NOHUGEPAGE parameter to madvise when allocating space for thread stacks.
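The suggested fix amounts to a single hint on each thread-stack mapping, along these lines (illustrative, not the JVM's actual code):

```c
#include <sys/mman.h>
#include <stddef.h>

/* Hint that a (page-aligned) thread-stack mapping should not be backed by
 * transparent huge pages, avoiding the "memory bloat" described above. */
static int stack_disable_thp(void *stack_base, size_t stack_size)
{
   return madvise(stack_base, stack_size, MADV_NOHUGEPAGE);
}
```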
Carlos Llamas posted “Using page-fault handler in binder” about “trying to remove the current page handling in [Android’s userspace IPC] binder and switch to using ->fault() and other mm/ infrastructure”. He was seeking pointers and input on the direction from other developers.
Mike Rapoport posted a patch series that “move[s] core MM initialization to mm/mm_init.c”.
Randy Dunlap noted that uclinux.org was dead and requested references to it be removed from the Linux kernel MAINTAINERS file.
Jonathan Corbet (of LWN) posted various cleanups to the kernel documentation (which he maintains), including an “arch reorg” to clean up architecture specific docs.
Architectures
Arm
Lukasz Luba posted “Introduce runtime modifiable Energy Model”, a patch set that “adds a new feature which allows to modify Energy Model (EM) power values at runtime. It will allow to better reflect power model of a recent SoCs and silicon. Different characteristics of the power usage can be leverages and thus better decisions made during task placement”. Thus, the kernel’s (CFS) scheduler can (with this patch) make a decision about where to schedule (place, or migrate) a running process (known as a task within the kernel) according to the power usage that the silicon knows will vary according to nature of the workload, and its use of hardware. For example, heavy GPU use will cause a GPU to heat up and alter a chip’s (SoC’s) thermal properties in a manner that may make it better to migrate other tasks to a different core.
Itanium
Reports of Itanium’s demise may not have been greatly exaggerated, but when it comes to the kernel they may have been a little premature by a month or two. Florian Weimer followed up to “Retire IA64/Itanium support” with a question, “Is this still going ahead? In userspace, ia64 is of course full of special cases, too, so many of us really want to see it gone, but we can’t really start the removal process while there is still kernel support”.
LoongArch
Tianrui Zhao posted version 5 of “Add KVM LoongArch support”.
Huacai Chen posted a patch, “LoongArch: Make WriteCombine configurable for ioremap()” that aims to work around a PCIe protocol violation in the implementation of the LS7A chipset.
Separately, Huacai also posted a patch enabling the kernel itself to use FPU (Floating Point Unit) functions. Quoting the patch, “They can be used by some other kernel components, e.g. the AMDGPU graphic driver for DCN”.
WANG Xuerui posted “LoongArch: Make bounds-checking instructions useful”, referring to “BCE” (Bounds Checking Error) instructions, similar to those of other architectures, such as x86_64.
POWER
Laurent Dufour posted “Online new threads according to the current SMT level”, which aims to balance a hotplugged CPU’s SMT level against the current one used by the overall system. For example, a system capable of SMT8 but booted in SMT4 will currently nonetheless online all 8 SMT threads of a subsequently added CPU, rather than only 4 (to match the system).
RISC-V
Evan Green posted the fourth version of “RISC-V Hardware Probing User Interface”, which aims to handle the number of (potentially incompatible) ISA extensions present in implementations of the RISC-V architecture. The basic idea is to provide a vDSO (virtual Dynamic Shared Object – a kind of library that appears in userspace and is fast to link against, but is owned by the kernel) and backing syscall (for fallback use by the vDSO in certain cases) that can quickly hand an application key/value pairs representative of potential ISA features present on a system. The previous attempts had experienced pushback, so this time Evan came with performance numbers showing the (many) orders of magnitude differences in performance between using a vDSO/syscall approach vs. the sysfs file interface originally counter proposed by Greg KH (Greg Kroah-Hartman). Greg had preferred an application perform many open calls to parse sysfs files in order to determine the capabilities of a system, but this would be expensive for every binary. This patch series was later merged by Palmer Dabbelt (the RISC-V kernel maintainer) and should therefore make its way into the Linux 6.4 kernel series in the next couple of months.
Sia Jee Heng posted version 5 of a patch series implementing hibernation support for RISC-V. According to the posting, “This series adds RISC-V Hibernation/suspend to disk support. Low level Arch functions were created to support hibernation”. The cover letter explains how e.g. swsusp_arch_resume “creates a temporary page table that [covering only] the linear map. It copies the restore code to a ‘safe’ page, then [start] restore the memory image”.
Heiko Stuebner posted “RISC-V: support some cryptography accelerations”. These rely on version 14 of a previous patch series adding experimental support for the “v” (vector) extension, which has not been ratified (made official) by the RISC-V International organization yet. And speaking of this, a recent discussion of the non-standard implementation of the RISC-V vector extension in the “T-Head C9xx” cores suggests describing those as an “errata” implementation.
The PINE64 project recently began shipping a RISC-V development board known as “Star64”. This board uses the StarFive JH7110 SoC for which Samin Guo recently posted an updated ethernet driver, apparently based on the DesignWare MAC from Synopsys. Separately, Walker Chen posted a DMA driver for the same SoC, and Mason Huo posted cpufreq support (which included enabling “the axp15060 pmic for the cpu power source”). Seems an effort is underway to upstream support for this low-cost “Raspberry Pi”-like alternative in the RISC-V ecosystem.
Greg Ungerer posted “riscv: support ELF format binaries in nommu mode” which does what it says on the tin: “add the ability to run ELF format binaries when running RISC-V in nommu mode. That support is actually part of the ELF-FDPIC loader, so these changes are all about making that work on RISC-V”. Greg notes, “These changes have not been used to run actual ELF-FDPIC binaries. It is used to load and run normal ELF – compiled -pie format. Though the underlying changes are expected to work with full ELF-FDPIC binaries if or when that is supported on RISC-V in gcc”.
Anup Patel posted version 18 of “RISC-V IPI Improvements” which aims to teach RISC-V (on suitable hardware) how to use “normal per-CPU interrupts” to send IPIs (Inter-Processor Interrupts), as well as remote TLB (Translation Lookaside Buffer) flushes and cache maintenance operations without having to resort to calls into “M” mode firmware.
x86 (x86_64)
Rick Edgecombe posted version 8 of “Shadow stacks for userspace”, to which Borislav Petkov replied “Yes, finally! That was loooong in the making. Thanks for the persistence and patience”. He signed off as having reviewed the patches.
Ian Rogers posted “Event updates for GNR, MTL and SKL”. Apparently these perf events are generated automatically using a script on Intel’s github (that’s pretty sweet).
Usama Arif posted version 15 of “Parallel CPU bringup for x86_64”. This is about doing parallel calls to INIT/SIPI/SIPI (the initialization sequences used by x86 CPUs to bring them up) rather than the single threaded process that previously was used by the Linux kernel.
Tony Luck posted version 2 of “Handle corrected machine check interrupt storms”, which includes additional patches from Smita Koralahalli that “Extend the logic of handling Intel’s corrected machine check interrupt storms to AMD’s threshold interrupts”.
Yi Liu posted “iommu: Add nested domain support”, which “Introduce[s] a new domain type for a user space I/O address, which is nested on top of another address space address represented by a UNMANAGED domain”.
Kirill A. Shutemov posted version 16 of “Linear Address Masking enabling”. As he noted, “(LAM) modifies the checking that is applied to 64-bit linear addresses, allowing software to use of the untranslated address bits for metadata. The capability can be used for efficient address sanitizers (ASAN) implementation and for optimizations in JITs and virtual machines”. It’s also been present in architectures such as Arm for many, many years as TBI (Top Byte Ignore), etc.
Kuppuswamy Sathyanarayanan posted “TDX Guest Quote generation support”, which enables “TDX” (Trust Domain Extensions – aka Confidential Compute) guests to attest to their “trustworthiness to other entities before provisioning secrets to the guest”. The patch describes a two-step process: “TDREPORT generation” followed by “Quote generation”. The TDREPORT captures measurements, and is then sent to a “Quoting Enclave” (QE) that generates a “remotely verifiable Quote”. A special conduit is provided for guests to send these quotes.
Shan Kang posted some benchmark results from KVM for Intel’s “FRED” (Flexible Return and Event Delivery) new syscall/sysenter enhanced architecture.
Mario Limonciello posted “Add vendor agnostic mechanism to report hardware sleep”, noting that “An important part of validating that S0ix [an SoC-level idle power state] worked properly is to check how much of a cycle was spent in a hardware sleep state”.
Linux Plumbers Conference 2023 is pleased to host the eBPF & Networking Track!
For the fourth year in a row, the eBPF & Networking Track is going to bring together developers, maintainers, and other contributors from all around the globe to discuss improvements to the Linux kernel’s networking stack as well as the BPF subsystem and their surrounding user space ecosystems, such as libraries, loaders, compiler backends, and other related system tooling.
The gathering is designed to foster collaboration and face to face discussion of ongoing development topics as well as to encourage bringing new ideas into the development community for the advancement of both subsystems.
Proposals can cover a wide range of topics related to Linux networking and BPF covering improvements in areas such as (but not limited to) core networking, protocols, routing, performance, tunneling, drivers, BPF infrastructure and its use in tracing, security, networking, scheduling and beyond, as well as non-kernel components like libraries, compilers, testing infra and tools.
Please come and join us in the discussion. We hope to see you there!
F38 just released and I'm seeing a bunch of people complain that TF2 dies on AMD or other platforms when lavapipe is installed. Who's at fault? I've no real idea. How to fix it? I've no real idea.
What's happening?
AMD OpenGL drivers use LLVM as the backend compiler. Fedora 38 updated to LLVM 16. LLVM 16 is built with C++17 by default. C++17 introduces new "operator new/delete" interfaces[1] (the aligned-allocation overloads taking std::align_val_t).
TF2 ships with its own libtcmalloc_minimal.so implementation; tcmalloc expects to replace all the new/delete interfaces, but the version in TF2 must not support, or has incorrect support for, the new aligned interfaces.
What happens is that when TF2 probes OpenGL and LLVM is loaded, and DenseMap initializes, one "new" path fails to go into tcmalloc but the "delete" path does, and this causes tcmalloc to explode with
"src/tcmalloc.cc:278] Attempt to free invalid pointer"
Fixing it?
I'll talk to Valve and see if we can work something out; LLVM 16 doesn't seem to support building with C++14 anymore. I'm not sure whether statically linking libstdc++ into LLVM would avoid the tcmalloc overrides, and it might not be acceptable to the wider Fedora community anyway.
There are plans for nouveau to support using the NVIDIA-supplied GSP firmware in order to support new hardware going forward.
The nouveau project doesn't have any input or control over the firmware. NVIDIA have made no promises around stable ABI or firmware versioning. The current status quo is that NVIDIA will release versioned signed gsp firmwares as part of their driver distribution packages that are version locked to their proprietary drivers (open source and binary). They are working towards allowing these firmwares to be redistributed in linux-firmware.
The NVIDIA firmwares are quite large. The nouveau project will control the selection of what versions of the released firmwares are to be supported by the driver, it's likely a newer firmware will only be pulled into linux-firmware for:
New hardware support (new GPU family or GPU support)
Security fix in the firmware
New features that are required to be supported
This should at least limit the number of firmwares in the linux-firmware project.
However, a secondary effect of the size of the firmwares is that adding more and more MODULE_FIRMWARE lines to the nouveau kernel module for each iteration will mean initramfs sizes get steadily larger, and after a while the initramfs will contain several GSP firmwares that the driver doesn't even need to run.
To combat this I've looked into adding some sort of module grouping which dracut can pick one out of.
The grouping ends up in the module info section, and dracut will only pick one firmware from the group to install into the initramfs. Due to how the module info section is constructed this will end up picking the last entry in the group first.
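As a point of reference (this isn't part of the proposal itself, just the existing mechanism it builds on), the firmware files a module currently declares can be listed with modinfo, which is roughly what dracut consults when deciding which firmware files to copy into the initramfs:
modinfo -F firmware nouveau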
Here's an article from a French anarchist describing how his (encrypted) laptop was seized after he was arrested, and material from the encrypted partition has since been entered as evidence against him. His encryption password was supposedly greater than 20 characters and included a mixture of cases, numbers, and punctuation, so in the absence of any sort of opsec failures this implies that even relatively complex passwords can now be brute forced, and we should be transitioning to even more secure passphrases.
Or does it? Let's go into what LUKS is doing in the first place. The actual data is typically encrypted with AES, an extremely popular and well-tested encryption algorithm. AES has no known major weaknesses and is not considered to be practically brute-forceable - at least, assuming you have a random key. Unfortunately it's not really practical to ask a user to type in 128 bits of binary every time they want to unlock their drive, so another approach has to be taken.
This is handled using something called a "key derivation function", or KDF. A KDF is a function that takes some input (in this case the user's password) and generates a key. As an extremely simple example, think of MD5 - it takes an input and generates a 128-bit output, so we could simply MD5 the user's password and use the output as an AES key. While this could technically be considered a KDF, it would be an extremely bad one! MD5s can be calculated extremely quickly, so someone attempting to brute-force a disk encryption key could simply generate the MD5 of every plausible password (probably on a lot of machines in parallel, likely using GPUs) and test each of them to see whether it decrypts the drive.
(things are actually slightly more complicated than this - your password is used to generate a key that is then used to encrypt and decrypt the actual encryption key. This is necessary in order to allow you to change your password without having to re-encrypt the entire drive - instead you simply re-encrypt the encryption key with the new password-derived key. This also allows you to have multiple passwords or unlock mechanisms per drive)
Good KDFs reduce this risk by being what's technically referred to as "expensive". Rather than performing one simple calculation to turn a password into a key, they perform a lot of calculations. The number of calculations performed is generally configurable, in order to let you trade off between the amount of security (the number of calculations you'll force an attacker to perform when attempting to generate a key from a potential password) and performance (the amount of time you're willing to wait for your laptop to generate the key after you type in your password so it can actually boot). But, obviously, this tradeoff changes over time - defaults that made sense 10 years ago are not necessarily good defaults now. If you set up your encrypted partition some time ago, the number of calculations required may no longer be considered up to scratch.
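If you're curious what this tradeoff looks like on your own hardware, recent versions of cryptsetup can benchmark the available KDFs and ciphers for you (the output obviously varies by machine and cryptsetup version):
cryptsetup benchmark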
And, well, some of these assumptions are kind of bad in the first place! Just making things computationally expensive doesn't help a lot if your adversary has the ability to test a large number of passwords in parallel. GPUs are extremely good at performing the sort of calculations that KDFs generally use, so an attacker can "just" get a whole pile of GPUs and throw them at the problem. KDFs that are computationally expensive don't do a great deal to protect against this. However, there's another axis of expense that can be considered - memory. If the KDF algorithm requires a significant amount of RAM, the degree to which it can be performed in parallel on a GPU is massively reduced. A Geforce 4090 may have 16,384 execution units, but if each password attempt requires 1GB of RAM and the card only has 24GB on board, the attacker is restricted to running 24 attempts in parallel.
So, in these days of attackers with access to a pile of GPUs, a purely computationally expensive KDF is just not a good choice. And, unfortunately, the subject of this story was almost certainly using one of those. Ubuntu 18.04 used the LUKS1 header format, and the only KDF supported in this format is PBKDF2. This is not a memory expensive KDF, and so is vulnerable to GPU-based attacks. But even so, systems using the LUKS2 header format used to default to argon2i, which is memory strong but not designed to be resistant to GPU attack (thanks to the comments pointing out my misunderstanding here). New versions default to argon2id, which is. You want to be using argon2id.
What makes this worse is that distributions generally don't update this in any way. If you installed your system and it gave you pbkdf2 as your KDF, you're probably still using pbkdf2 even if you've upgraded to a system that would use argon2id on a fresh install. Thankfully, this can all be fixed-up in place. But note that if anything goes wrong here you could lose access to all your encrypted data, so before doing anything make sure it's all backed up (and figure out how to keep said backup secure so you don't just have your data seized that way).
First, make sure you're running as up-to-date a version of your distribution as possible. Having tools that support the LUKS2 format doesn't mean that your distribution has all of that integrated, and old distribution versions may allow you to update your LUKS setup without actually supporting booting from it. Also, if you're using an encrypted /boot, stop now - very recent versions of grub2 support LUKS2, but they don't support argon2id, and this will render your system unbootable.
Next, figure out which device under /dev corresponds to your encrypted partition. Run
lsblk
and look for entries that have a type of "crypt". The device above that in the tree is the actual encrypted device. Record that name, and run
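something along the lines of the following to back up the LUKS header first (the backup file path here is only an example; keep the backup somewhere safe and offline):
sudo cryptsetup luksHeaderBackup /dev/whatever --header-backup-file /root/luks-header-backup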
(Edit to add: Once everything is working, delete this backup! It contains the old weak key, and someone with it can potentially use that to brute force your disk encryption key using the old KDF even if you've updated the on-disk KDF.)
Next, run
sudo cryptsetup luksDump /dev/whatever
and look for the Version: line. If it's version 1, you need to update the header to LUKS2. Run
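something like the following (using the same device path as above; the conversion rewrites the header in place):
sudo cryptsetup convert /dev/whatever --type luks2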
and follow the prompts. Make sure your system still boots, and if not go back and restore the backup of your header. Assuming everything is ok at this point, run
sudo cryptsetup luksDump /dev/whatever
again and look for the PBKDF: line in each keyslot (pay attention only to the keyslots, ignore any references to pbkdf2 that come after the Digests: line). If the PBKDF is either "pbkdf2" or "argon2i" you should convert to argon2id. Run the following:
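(a command along these lines, again using the same device path; cryptsetup will prompt for the existing passphrase)
sudo cryptsetup luksConvertKey /dev/whatever --pbkdf argon2id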
and follow the prompts. If you have multiple passwords associated with your drive you'll have multiple keyslots, and you'll need to repeat this for each password.
Distributions! You should really be handling this sort of thing on upgrade. People who installed their systems with your encryption defaults several years ago are now much less secure than people who perform a fresh install today. Please please please do something about this.
CPUs can't do anything without being told what to do, which leaves the obvious problem of how do you tell a CPU to do something in the first place. On many CPUs this is handled in the form of a reset vector - an address the CPU is hardcoded to start reading instructions from when power is applied. The address the reset vector points to will typically be some form of ROM or flash that can be read by the CPU even if no other hardware has been configured yet. This allows the system vendor to ship code that will be executed immediately after poweron, configuring the rest of the hardware and eventually getting the system into a state where it can run user-supplied code.
The specific nature of the reset vector on x86 systems has varied over time, but it's effectively always been 16 bytes below the top of the address space - so, 0xffff0 on the 20-bit 8086, 0xfffff0 on the 24-bit 80286, and 0xfffffff0 on the 32-bit 80386. Convention on x86 systems is to have RAM starting at address 0, so the top of address space could be used to house the reset vector with as low a probability of conflicting with RAM as possible.
The most notable thing about x86 here, though, is that when it starts running code from the reset vector, it's still in real mode. x86 real mode is a holdover from a much earlier era of computing. Rather than addresses being absolute (ie, if you refer to a 32-bit address, you store the entire address in a 32-bit or larger register), they are 16-bit offsets that are added to the value stored in a "segment register". Different segment registers existed for code, data, and stack, so a 16-bit address could refer to different actual addresses depending on how it was being interpreted - jumping to a 16 bit address would result in that address being added to the code segment register, while reading from a 16 bit address would result in that address being added to the data segment register, and so on. This is all in order to retain compatibility with older chips, to the extent that even 64-bit x86 starts in real mode with segments and everything (and, also, still starts executing at 0xfffffff0 rather than 0xfffffffffffffff0 - 64-bit mode doesn't support real mode, so there's no way to express a 64-bit physical address using the segment registers, so we still start just below 4GB even though we have massively more address space available).
Anyway. Everyone knows all this. For modern UEFI systems, the firmware that's launched from the reset vector then reprograms the CPU into a sensible mode (ie, one without all this segmentation bullshit), does things like configure the memory controller so you can actually access RAM (a process which involves using CPU cache as RAM, because programming a memory controller is sufficiently hard that you need to store more state than you can fit in registers alone, which means you need RAM, but you don't have RAM until the memory controller is working, but thankfully the CPU comes with several megabytes of RAM on its own in the form of cache, so phew). It's kind of ugly, but that's a consequence of a bunch of well-understood legacy decisions.
Except. This is not how modern Intel x86 boots. It's far stranger than that. Oh, yes, this is what it looks like is happening, but there's a bunch of stuff going on behind the scenes. Let's talk about boot security. The idea of any form of verified boot (such as UEFI Secure Boot) is that a signature on the next component of the boot chain is validated before that component is executed. But what verifies the first component in the boot chain? You can't simply ask the BIOS to verify itself - if an attacker can replace the BIOS, they can replace it with one that simply lies about having done so. Intel's solution to this is called Boot Guard.
But before we get to Boot Guard, we need to ensure the CPU is running in as bug-free a state as possible. So, when the CPU starts up, it examines the system flash and looks for a header that points at CPU microcode updates. Intel CPUs ship with built-in microcode, but it's frequently old and buggy and it's up to the system firmware to include a copy that's new enough that it's actually expected to work reliably. The microcode image is pulled out of flash, a signature is verified, and the new microcode starts running. This is true in both the Boot Guard and the non-Boot Guard scenarios. But for Boot Guard, before jumping to the reset vector, the microcode on the CPU reads an Authenticated Code Module (ACM) out of flash and verifies its signature against a hardcoded Intel key. If that checks out, it starts executing the ACM. Now, bear in mind that the CPU can't just verify the ACM and then execute it directly from flash - if it did, the flash could detect this, hand over a legitimate ACM for the verification, and then feed the CPU different instructions when it reads them again to execute them (a Time of Check vs Time of Use, or TOCTOU, vulnerability). So the ACM has to be copied onto the CPU before it's verified and executed, which means we need RAM, which means the CPU already needs to know how to configure its cache to be used as RAM.
Anyway. We now have an ACM loaded and verified, and it can safely be executed. The ACM does various things, but the most important from the Boot Guard perspective is that it reads a set of write-once fuses in the motherboard chipset that represent the SHA256 of a public key. It then reads the initial block of the firmware (the Initial Boot Block, or IBB) into RAM (or, well, cache, as previously described) and parses it. There's a block that contains a public key - it hashes that key and verifies that it matches the SHA256 from the fuses. It then uses that key to validate a signature on the IBB. If it all checks out, it executes the IBB and everything starts looking like the nice simple model we had before.
Except, well, doesn't this seem like an awfully complicated bunch of code to implement in real mode? And yes, doing all of this modern crypto with only 16-bit registers does sound like a pain. So, it doesn't. All of this is happening in a perfectly sensible 32 bit mode, and the CPU actually switches back to the awful segmented configuration afterwards so it's still compatible with an 80386 from 1986. The "good" news is that at least firmware can detect that the CPU has already configured the cache as RAM and can skip doing that itself.
I'm skipping over some steps here - the ACM actually does other stuff around measuring the firmware into the TPM and doing various bits of TXT setup for people who want DRTM in their lives, but the short version is that the CPU bootstraps itself into a state where it works like a modern CPU and then deliberately turns a bunch of the sensible functionality off again before it starts executing firmware. I'm also missing out the fact that this entire process only kicks off after the Management Engine says it can, which means we're waiting for an entirely independent x86 to boot an entire OS before our CPU even starts pretending to execute the system firmware.
Of course, as mentioned before, on modern systems the firmware will then reprogram the CPU into something actually sensible so OS developers no longer need to care about this[1][2], which means we've bounced between multiple states for no reason other than the possibility that someone wants to run legacy BIOS and then boot DOS on a CPU with like 5 orders of magnitude more transistors than the 8086.
tl;dr why can't my x86 wake up with the gin protected mode already inside it
[1] Ha uh except that on ACPI resume we're going to skip most of the firmware setup code so we still need to handle the CPU being in fucking 16-bit mode because suspend/resume is basically an extremely long reboot cycle
[2] Oh yeah also you probably have multiple cores on your CPU and well bad news about the state most of the cores are in when the OS boots because the firmware never started them up so they're going to come up in 16-bit real mode even if your boot CPU is already in 64-bit protected mode, unless you were using TXT in which case you have a different sort of nightmare that if we're going to try to map it onto real world nightmare concepts is one that involves a lot of teeth. Or, well, that used to be the case, but ACPI 6.4 (released in 2021) provides a mechanism for the OS to ask the firmware to wake the CPU up for it so this is invisible to the OS, but you're still relying on the firmware to actually do the heavy lifting here
We are pleased to announce the Call for Papers (CFP) for Microconferences at the Linux Plumbers Conference (LPC) 2023.
LPC 2023 is currently planned to take place in Richmond, VA, USA from 13 November to 15 November. For details about the location, co-location with other events see our website and social media for updates.
Like in 2022, Linux Plumbers Conference will be a hybrid event, but ideally microconference runners should still be willing and able to attend in person.
As the name suggests, LPC is about Linux plumbing encompassing topics from kernel and userspace. A microconference is a set of sessions organized around a particular topic. The topic can be a kernel subsystem or a specific problem area in either kernel or userspace.
A microconference is supposed to be research and development in action and an abstract for a Microconference should be thought of as a set of research questions and problem statements.
The sessions in each microconference are expected to address specific problems and should generate new ideas, solutions, and patches. Sessions should be focused on discussion. Presentations should always aim to aid or kick off a discussion. If your presentation feels like a talk we would recommend to consider submitting to the LPC refereed track.
In the past years microconferences were organized around topics such as security, scalability, energy efficiency, toolchains, containers, printing, system boot, Android, scheduling, filesystems, tracing, or real-time. The LPC microconference track is open to a wide variety of topics as long as it is focused, concerned with interesting problems, and is related to open source and the wider Linux ecosystem. We are happy about a wide range of topics!
A microconference submission should outline the overall topic and list key people and problems which can be discussed. The list of problems and specific topics in a microconference can be continuously updated until fairly late. This will allow microconferences to cover topics that pop up after submission and to address new developments or problems.
Microconferences that have been at previous LPCs should list results and accomplishments in the submission and should make sure to cover follow-up work and new topics.
Submissions are due on or before 11:59PM UTC on Sunday, June 1, 2023.
The Call for Refereed Presentation Proposals for the 2023 edition of the Linux Plumbers Conference (LPC) is now open. We plan to hold LPC in Richmond Virginia on November 13-15, 2023.
Submitters should ideally be able to give their presentation in person, although presenting remotely will be possible if necessary. The expectation is that the presentation will be done live in either case, to maximize audience interaction. Please see our website or social media for regular updates.
Refereed Presentations are 45 minutes in length and should focus on a specific aspect of the “plumbing” in a Linux system. Examples of Linux plumbing include core kernel subsystems, init systems, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problem statements, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.
The Refereed Presentations track will be running throughout all three days of the conference.
Linux Plumbers Conference Program Committee members will be reviewing all submitted proposals. High-quality submissions that cannot be accepted due to the limited number of slots will be forwarded to organizers of suitable Linux Plumbers Microconferences for further consideration.
Submissions are due on or before 11:59PM UTC on Sunday, August 6, 2023.
Engines in OpenSSL have a long history of providing new algorithms (Russian GOST hash/signature etc) but they can also be used to interface external crypto tokens (pkcs#11) or even key managers like my own TPM engine. I’ve actually been using my TPM2 engine for nearly a decade so that I no longer have to have unprotected private keys anywhere on my laptops (including for ssh). The purpose of this post is to look at the differences between Providers and Engines and give advice on the minimum necessary Provider implementation to give back all the Engine functionality. So this post is aimed at Engine developers who wish to convert to Providers rather than giving user advice for either.
TPMs and Engines
TPM2 actually has a remarkable number of algorithms: hashing, symmetric encryption, asymmetric signatures, key derivation, etc. However, most TPMs are connected to the host over very slow busses (usually serial), which means that no-one in their right mind would use a TPM for bulk data operations (like hashing or symmetric encryption) since it will take orders of magnitude longer than if the native CPU did it. Thus from an Engine point of view, the TPM is really only good for guarding private asymmetric keys and doing sign or decrypt operations on them, which are the only capabilities the TPM engine has.
Hashes and Signatures
Although I said above we don’t use the TPM for doing hashes, the TPM2_Sign() routines insist on knowing which hash they’re signing. For ECDSA signatures, this is irrelevant, since the hash type plays no part in the signature (it’s always truncated to key length and converted to a bignum) but for RSA the ASN.1 form of the hash description is part of the toBeSigned data. The problem now is that early TPM2’s only had two hash algorithms (sha1 and sha256) and the engine wanted to be able to use larger hash sizes. The solution was actually easy: lie about the hash size for ECDSA, so always give the hash that’s the width of the key (sha256 for NIST P-256 and sha384 for NIST P-384) and left truncate the passed in hash if larger or left zero pad if smaller.
For RSA, the problem is more acute, since TPM2_Sign() actually takes a raw digest and adds the hash description but the engine code sends down the fully described hash which merely needs to be padded if PKCS1 (PSS data is fully padded when sent down) and encrypted with the private key. The solution to this taken years ago was not to bother with TPM2_Sign() at all for RSA keys but instead to do a Decrypt operation. This also means that TPM RSA engine keys are marked as decryption keys, not signing keys.
The Engine Itself
Given that the TPM is really only guarding the private keys, it only makes sense to substitute engine functions for the private key operations. Although the TPM can do public key operations, the core OpenSSL routines do them much faster and no information is leaked about the private key by doing them through OpenSSL, so Engine keys were constructed from standard OpenSSL keys by substituting a couple of private key methods from the underlying key types. One thing Engines were really bad at was passing additional parameters at key creation time and doing key wrapping. The result is that most Engines already have a separate tool to create engine keys (create_tpm2_key for the TPM2 engine) because complex arguments are needed for TPM specific things like key policy.
TPM keys are really both public and private keys combined and the public part of the key can be accessed without a password (unlike OpenSSL keys) or even access to the TPM that created the key. However, the engine code doesn’t usually know when only the public part of the key will be required and password prompting is done in OpenSSL at key loading (the TPM doesn’t need a password until key use), so usually after a TPM key is created, the public key is also separately derived using a pkey operation and used as a normal public key.
The final, and most problematic Engine feature, is key loading. Engine keys must be loaded using a special API (ENGINE_load_private_key). OpenSSL built in applications require you to specify the key type (-keyform option) but most well written OpenSSL applications simply try loading the PEM key first, then the DER key then the Engine key (since they all have different APIs), but frequently the Engine key is forgotten leading to the application having to be patched if you want to use them with any engine.
Converting Engines to Providers
The provider API has several pieces which apply to asymmetric key handling: Store, Encode/Decode, Key Management, Signing and Decryption (plus many more if you provide hashes or symmetric algorithms). One thing to remember about the store API is that if you only have file based keys, you should use the generic file store instead. Implementing your own store is only necessary if you also have a URI based input (like PKCS#11). In fact the TPM Engine has a URI for persistent keys, so the TPM store implementation will be dealt with later.
Provider Basics
If a provider is specified on the OpenSSL command line, it will become the sole provider of every algorithm. That means that providers like the TPM2 one, which only fill in a subset of functions, cannot operate on their own and must always be used with another provider (usually the default one). After initialization (see below) all provider actions are governed by algorithm tables. One of the key questions for any provider is what to do about algorithm names and properties. Because the TPM2 provider relies on external providers for other algorithms, it must use consistent key names (so “EC” for Elliptic Curve and “RSA” for RSA), even though it has only a single key type. There are also elements of the provider key management, like the way Elliptic Curve keys change name to “ECDSA” for signing and “ECDH” for derivation, which are driven by the key management query operation function. As far as I can tell, this provides no benefit and merely serves to add complexity to the provider, so my provider doesn’t implement these functions and uses the same key names throughout.
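As a hedged illustration of what that looks like in practice (the key file name here is made up, and -provider is the generic OpenSSL 3.x option for loading providers), extracting the public half of a TPM2 key might look something like:
openssl pkey -provider default -provider tpm2 -in tpm.key.pem -pubout -out tpm.pub.pem
with the default provider supplying the public key operations that the TPM2 provider deliberately leaves out.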
The most mysterious string of all is the algorithm property one. The manual gives very little clue as to what should be in it besides “provider=<provider name>”. Empirically it seems to have input, output and structure elements, which are primarily used by encoders and decoders: input can be either der or pem and structure must be the same as the OSSL_OBJECT_PARAM_DATA_STRUCTURE string produced by the der decoder (although you are free to choose any name for this). output is even more varied and the best current list is provided by the source; however the only encoder the TPM2 provider actually provides is the text one.
One of the really nice things about providers is that when OpenSSL is presented with a key to load, every provider will be tried (usually in the order they’re specified on the command line) to decode and load the key. This completely fixes the problem with missing ENGINE_load_private_key() functions in applications because now all applications can use any provider key. This benefit alone is enough to outweigh all the problems of doing the actual conversion to a provider.
Replacing Engine Controls
Engine controls were key/value pairs passed into engines. The TPM2 engine has two: “PIN” for the parent authority and “NVPREFIX” for the prefix which identifies a non-volatile key. Although these can be passed in with the ENGINE_ctrl() functions, they were mostly set in the configuration file. This latter mechanism can be replaced with the provider base callback core_get_params(). Most engine controls actually set global variables and with the provider, they could be placed into the provider context. However, for code sharing it’s easier simply to keep the current globals mechanism.
Initialization and Contexts
Every provider has to have an OSSL_provider_init() routine which fills in a dispatch table and allocates a core context, which is passed in to every other context routine. For a provider, there’s really only one instance, so storing variables in the provider context is really no different (except error handling and actually getting destructors) from using static variables and since the engine used static variables, that’s what we’ll stick with. However, pretty much every routine will need an allocated library context, so it’s easiest to allocate at provider init time and pass it through as the provider context. The dispatch routine must contain a query_operation function, and probably needs a teardown function if you need to use a destructor, but nothing else.
All provider function groups require a newctx() and freectx() call. This is not optional because the current OpenSSL code calls them without checking so they cannot be NULL. Thus for function groups (like encoders and key management) where new contexts aren’t really required it makes sense to use pass through context functions that simply pass through the provider context for newctx() and do nothing for freectx().
The man page implies it is necessary to pick a load of functions from the in argument, but it seems unnecessary for those which the OpenSSL library already provides. I assume it’s something to do with a provider not requiring OpenSSL symbols, but it’s impossible to implement a provider today without relying on other OpenSSL functions than those which can be picked out of the in argument.
Decoders
Decoders are used to convert the file that has been read from PEM to DER (this is essentially the same conversion for every provider, so it is strange that you have to do it rather than it being done in the core routines) and then from DER to an internal key structure. The remaining decoders take DER in and output a labelled key structure (which is used as a component of the EVP_PKEY); if you do both RSA and EC keys, you need one for each key type and, unfortunately, both must be provided and may not cross-decode (the RSA decoder must reject EC keys and vice versa). This is required so the OpenSSL core can tell what type of key it has, but it is a royal pain for things like the TPM, where the key DER is identical regardless of key type.
The decode_pem_fns can be cut and pasted from any provider with the sole exception that you probably have a different PEM guard string that you need to check for.
The main job of the DECODER_DECODE function is to take the DER form of the key and convert it to an internal PKEY and send that PKEY up by reference so it can be consumed by a key management load.
Encoders
By and large, engines all come with creation tools for key files, which means that while you could now use the encoder routines to create key files, it’s probably better to stick with what you have (especially for things like the TPM, where keys can have complex policy statements attached), so you can omit providing any encoder functions at all. The only possible exception is if you want the keys pretty printed, in which case you might consider a text output encoder.
Note: there are many more encode/decode function types you could supply, but the above are the essential ones.
Key Management
Nothing in the key management functions requires the underlying key object to be reference counted since it belongs to an already reference counted EVP_PKEY structure in the OpenSSL generic routines. However, the signature operations can’t be implemented without context duplication and the signature context must contain a reference to the provider key so, depending on how the engine implements keys, duplicating via reference might be easier than duplicating via copy. The minimum functionality to implement is LOAD, FREE and HAS. If you are doing Elliptic Curve derive or reference counting your engine keys, you will also need NEW. You also have to provide both GET_PARAMS and GETTABLE_PARAMS (many key management functions have to implement pairs like this) for at least the BITS, SECURITY_BITS and SIZE properties.
You must also implement the EXPORT (and EXPORT_TYPES, which must be provided but has no callers) so that you can convert your engine key to an external public key. Note the EXPORT function must fail if asked to export the private key otherwise the default provider will try to do the private key operations via the exported key as well.
If you need to do Elliptic Curve key derivation you must also implement IMPORT (and IMPORT_TYPES) because the creation of the peer key (even though it’s a public one) will necessarily go through your provider key management functions.
The HAS function can be problematic because OpenSSL doesn’t assume the interchangeability of public and private keys, even if it is true of the engine. Thus the engine must remember in the decode routines what key selector was used (public, private or both) and make sure to condition HAS on that value.
Signatures
This is one of the most confusing areas for simple signing devices (which don’t do hashing) because you’d assume you can implement NEWCTX, FREECTX, SIGN_INIT and SIGN and be done. Unfortunately, in spite of the fact that all the DIGEST_SIGN_… functions can be implemented in terms of the previous functions and generic hashing, they aren’t, so all providers are required to duplicate hashing and signing functions including constructing the binary ASN.1 for the certificate signature function (via GET_CTX_PARAMS and its pair GETTABLE_CTX_PARAMS). Another issue a sign only token will get into is padding: OpenSSL supports a variety of padding schemes (for RSA) but is deprecating their export, so if your token doesn’t do an expected form of padding, you’ll need to implement that in your provider as well. Recalling that the TPM2 provider uses RSA Decryption for signatures means that the TPM2 provider implementation is entirely responsible for padding all signatures. In order to try to come up with a common solution, I added an opensslmissing directory to my provider under the MIT licence that anyone is free to incorporate into their provider if they end up having the same digest and padding problems I did.
Decryption and Derivation
The final thing a private key provider needs to do is decryption. This is a very different operation for Elliptic Curve and RSA keys, so you need a separate operation for each (OSSL_OP_ASYM_CIPHER for RSA and OSSL_OP_KEYEXCH for EC). Each ends up being a slightly special snowflake: RSA because it may need OAEP padding (which the TPM does), but with the most usual cipher being md5 (so OAEP padding with arbitrary mask and hash function is also in opensslmissing), which the TPM doesn’t do; and EC because it requires derivation from another public key. The problem with this latter operation is that, because of the way OpenSSL works, the public key must be imported into the provider before it can be used, so you must provide NEW, IMPORT and IMPORT_TYPES routines for key management for this to happen.
Store
The store functions only need to be used if you have to load keys that aren’t file based (for file based keys the default provider file store will load them). For the TPM there is a set of NV keys with the 0x81 MSO prefix that aren’t file based. We load these in the engine with //nvkey:<hex> as the designator (and the //nvkey: prefix is overridable in the config file). Getting this to work in the Provider is slightly problematic because the scheme (the //nvkey: prefix) must be specified as the provider algorithm_name, which is usually a constant in a static array. This means the stores actually can’t be static and must have the configuration-defined name poked into them before the store is used, but this is relatively easy to arrange in the OSSL_provider_init() function. Once this is done, it’s relatively easy to create a store. The only really problematic function is the STORE_EOF one, which is designed around files and means you have to keep an eof indicator in the context and update it to 1 once the load function has completed.
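As a hedged sketch (the handle here is made up, and exactly which openssl subcommands will accept a store URI via -in can vary by version), pulling the public half out of a persistent NV key might then look like:
openssl pkey -provider default -provider tpm2 -in //nvkey:81000001 -pubout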
The Provider Recursion Problem
This doesn’t seem to be discussed anywhere else, but it can become a huge issue if your provider depends on another library which also uses OpenSSL. The TPM2 provider depends on either the Intel or IBM TSS libraries and both of those use OpenSSL for cryptographic operations around TPM transport security since both of them use ECDH to derive a seed for session encryption and HMAC. The problem is that ordinarily the providers are called in the order they’re listed, so you always have to specify --provider default --provider tpm2 to make up for the missing public key operations in the TPM2 provider. However, the OpenSSL core operates a cache for the provider operations it has previously found and searches the cache first before doing any other lookups, so if the EC key management routines are cached (as they are if you input a TPM format key) and the default ones aren’t (because inputting TPM format keys requires no public key operations), the next attempt to generate an ephemeral EC key for the ECDH security derivation will find the TPM2 provider first. So say you are doing a signature which requires HMAC security to guard against interposer tampering. The use of ECDH in the HMAC seed derivation will then call back into the provider to do an ECDH operation which also requires session security and will thus call back again into the provider ad infinitum (or at least until stack overflow). The only way to break out of this infinite recursion is to try to prime the cache with the default provider as well as the TPM2 provider, so the tss library functions can find the default provider first. The (absolutely dirty) hack I use to do this lives inside the pkey decode function.
This currently works to break the recursion loop. However, it is an unreliable hack because internally the OpenSSL hash bucket implementation orders the method cache by provider address, and since the TPM2 provider is dynamically loaded it has a higher address than the OpenSSL default one. This will not survive security techniques like Address Space Layout Randomization.
Conclusions
Hopefully I’ve given a rapid (and possibly useful) overview of converting an engine to a provider which will give some pointers about provider conversion to all the engine token implementations out there. Please feel free to repurpose my opensslmissing routines under the MIT licence without any obligations to get them back upstream (although I would be interested in hearing about bugs and feature enhancements). In the end, it was only 1152 lines of C to implement the TPM2 provider (additive on top of the common shared code base with the existing Engine) and 681 lines in opensslmissing, showing firstly that there is still a need for OpenSSL itself to do the missing routines as a provider export and secondly that it really takes a fairly small amount of provider code to wrap an existing engine implementation provided you’re discriminating about what functions you actually provide. As a final remark I should note that the openssl_tpm2_engine has a fairly extensive test suite, all of which now passes with the provider implementation as well.
Github accidentally committed their SSH RSA private key to a repository, and now a bunch of people's infrastructure is broken because it needs to be updated to trust the new key. This is obviously bad, but what's frustrating is that there's no inherent need for it to be - almost all the technological components needed to both reduce the initial risk and to make the transition seamless already exist.
But first, let's talk about what actually happened here. You're probably used to the idea of TLS certificates from using browsers. Every website that supports TLS has an asymmetric pair of keys divided into a public key and a private key. When you contact the website, it gives you a certificate that contains the public key, and your browser then performs a series of cryptographic operations against it to (a) verify that the remote site possesses the private key (which prevents someone just copying the certificate to another system and pretending to be the legitimate site), and (b) generate an ephemeral encryption key that's used to actually encrypt the traffic between your browser and the site. But what stops an attacker from simply giving you a fake certificate that contains their public key? The certificate is itself signed by a certificate authority (CA), and your browser is configured to trust a preconfigured set of CAs. CAs will not give someone a signed certificate unless they prove they have legitimate ownership of the site in question, so (in theory) an attacker will never be able to obtain a fake certificate for a legitimate site.
This infrastructure is used for pretty much every protocol that can use TLS, including things like SMTP and IMAP. But SSH doesn't use TLS, and doesn't participate in any of this infrastructure. Instead, SSH tends to take a "Trust on First Use" (TOFU) model - the first time you ssh into a server, you receive a prompt asking you whether you trust its public key, and then you probably hit the "Yes" button and get on with your life. This works fine up until the point where the key changes, and SSH suddenly starts complaining that there's a mismatch and something awful could be happening (like someone intercepting your traffic and directing it to their own server with their own keys). Users are then supposed to verify whether this change is legitimate, and if so remove the old keys and add the new ones. This is tedious and risks users just saying "Yes" again, and if it happens too often an attacker can simply redirect target users to their own server and through sheer fatigue at dealing with this crap the user will probably trust the malicious server.
Why not certificates? OpenSSH actually does support certificates, but not in the way you might expect. There's a custom format that's significantly less complicated than the X509 certificate format used in TLS. Basically, an SSH certificate just contains a public key, a list of hostnames it's good for, and a signature from a CA. There's no pre-existing set of trusted CAs, so anyone could generate a certificate that claims it's valid for, say, github.com. This isn't really a problem, though, because right now nothing pays attention to SSH host certificates unless there's some manual configuration.
(It's actually possible to glue the general PKI infrastructure into SSH certificates. Please do not do this)
So let's look at what happened in the Github case. The first question is "How could the private key have been somewhere that could be committed to a repository in the first place?". I have no unique insight into what happened at Github, so this is conjecture, but I'm reasonably confident in it. Github deals with a large number of transactions per second. Github.com is not a single computer - it's a large number of machines. All of those need to have access to the same private key, because otherwise git would complain that the private key had changed whenever it connected to a machine with a different private key (the alternative would be to use a different IP address for every frontend server, but that would instead force users to repeatedly accept additional keys every time they connect to a new IP address). Something needs to be responsible for deploying that private key to new systems as they're brought up, which means there's ample opportunity for it to accidentally end up in the wrong place.
Now, best practices suggest that this should be avoided by simply placing the private key in a hardware module that performs the cryptographic operations, ensuring that nobody can ever get at the private key. The problem faced here is that HSMs typically aren't going to be fast enough to handle the number of requests per second that Github deals with. This can be avoided by using something like a Nitro Enclave, but you're still going to need a bunch of these in different geographic locales because otherwise your front ends are still going to be limited by the need to talk to an enclave on the other side of the planet, and now you're still having to deal with distributing the private key to a bunch of systems.
What if we could have the best of both worlds - the performance of private keys that just happily live on the servers, and the security of private keys that live in HSMs? Unsurprisingly, we can! The SSH private key could be deployed to every front end server, but every minute it could call out to an HSM-backed service and request a new SSH host certificate signed by a private key in the HSM. If clients are configured to trust the key that's signing the certificates, then it doesn't matter what the private key on the servers is - the client will see that there's a valid certificate and will trust the key, even if it changes. Restricting the validity of the certificate to a small window of time means that if a key is compromised an attacker can't do much with it - the moment you become aware of that you stop signing new certificates, and once all the existing ones expire the old private key becomes useless. You roll out a new private key with new certificates signed by the same CA and clients just carry on trusting it without any manual involvement.
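A rough sketch of that signing step using stock OpenSSH tooling follows (the file names and the five-minute validity window are only illustrative, and in the scheme described the CA private key would sit behind an HSM-backed signing service rather than in a local file):
ssh-keygen -s host_ca -I github-frontend -h -n github.com -V +5m /etc/ssh/ssh_host_rsa_key.pub
This writes /etc/ssh/ssh_host_rsa_key-cert.pub, which the server then presents via the HostCertificate option in sshd_config.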
Why don't we have this already? The main problem is that client tooling just doesn't handle this well. OpenSSH has no way to do TOFU for CAs, just the keys themselves. This means there's no way to do a git clone ssh://git@github.com/whatever and get a prompt asking you to trust Github's CA. Instead, you need to add a @cert-authority github.com (key) line to your known_hosts file by hand, and since approximately nobody's going to do that there's only marginal benefit in going to the effort to implement this infrastructure. The most important thing we can do to improve the security of the SSH ecosystem is to make it easier to use certificates, and that means improving the behaviour of the clients.
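For reference, the client-side trust anchor is just a line in known_hosts along these lines (the key material is elided and the key type is only an example):
@cert-authority github.com ssh-ed25519 AAAA...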
It should be noted that certificates aren't the only approach to handling key migration. OpenSSH supports a protocol for key rotation, basically by allowing the server to provide a set of multiple trusted keys that the client can cache, and then invalidating old ones. Unfortunately this still requires that the "new" private keys be deployed in the same way as the old ones, so any screwup that results in one private key being leaked may well also result in the additional keys being leaked. I prefer the certificate approach.
Finally, I've seen a couple of people imply that the blame here should be attached to whoever or whatever caused the private key to be committed to a repository in the first place. This is a terrible take. Humans will make mistakes, and your systems should be resilient against that. There's no individual at fault here - there's a series of design decisions that made it possible for a bad outcome to occur, and in a better universe they wouldn't have been necessary. Let's work on building that better universe.
While going over the AV1 work, a few people commented on the lack of VP9 and a few said it would be an easier place to start.
Daniel Almeida at Collabora took a first pass at writing the spec up, and I decided to go ahead and take it to a working demo level.
Lynne was busy, and they'd already said it should take an afternoon, so I decided to have a go at writing the ffmpeg side for it as well as finish off Daniel's radv code.
About 2 mins before I finished for the weekend on Friday, I got a single frame to decode, and this morning I finished off the rest to get at least 2 test videos I downloaded to work.
Branches are at [1] and [2]. There is only 8-bit support so far and I suspect some cleaning up is required.
The latest stable kernel is Linux 6.2.2, released by Greg Kroah-Hartman on March 3rd 2023. The latest mainline (development) kernel is 6.3-rc1, released by Linus on March 5th 2023.
Mathieu Desnoyers has announced Userspace RCU release 0.14.0 which adopts a baseline requirement of C99 and C++11, and introduces new APIs for C++.
Alejandro Colomar announced man-pages-6.03 is released. Among the “most notable changes” is “We now have a hyperlinked PDF book of the Linux man-pages”.
Junio C Hamano announced that “A release candidate Git v2.40.0-rc2 is now available”.
Takashi Sakamoto has stepped up to become the owner of the FireWire subsystem.
Linux 6.2 released
Linux 6.2 was released “right on (the extended) schedule” on February 19th following an extra RC (Release Candidate) motivated by the end of year holidays. Linus noted in his release announcement that “Nothing unexpected happened” toward the end of the cycle but there were a “couple of small things” on the regression side that Thorsten Leemhuis is tracking. Since “they weren’t actively pushed by maintainers…they will have to show up for stable [kernel releases]”.
Thorsten diligently followed up with his summary of regressions, noting that “There are still quite a few known issues from this cycle mentioned below. Afaics none of them affect a lot of people”. He also recently posted “docs: describe how to quickly build a trimmed kernel” as “that’s something users will often have to do when they want to report an issue or test proposed fixes”.
Among the fixes that will come in for stable is a build fix for those running Linux 6.2 on a Talos II (IBM POWER9) machine, who may notice an “undefined reference to ‘hash__tlb_flush’” during kernel compilation. A fix is being tracked for backport to stable.
Speaking of regressions, Nick Bowler identified an older regression beginning in Linux 6.1 that caused “random crashes” on his SPARC machine. Peter Xu responded that it was likely a THP (Transparent Huge Page) problem, perhaps showing up because THP was disabled (which it was in Nick’s configuration). Nick tested a fix from Peter that seemed to address the issue.
As you’ll see below, ongoing discussions are taking place about the removal of various legacy architectures from the kernel. Another proposal recently made (by Christoph Hellwig) is to “orphan JFS” (the “Journalling File System”). Stefan Tibus was among those who stood up and claimed to still be “a happy user of JFS from quite early on all my Linux installations”.
Linux 6.3-rc1
Linus announced the closure of the merge window (the period of time during which disruptive changes are allowed to be merged into the kernel) with the release of Linux 6.3-rc1, noting, “So after several releases where the merge windows had something odd going on, we finally had just a regular “two weeks of just merge window”. It was quite nice. In fact, it was quite nice in a couple of ways: not only didn’t I have a huge compressed merge window where I felt I had to cram as much as possible into the first few days, but the fact that we _have_ had a couple of merge windows where I really asked for people to have everything ready when the merge window opened seems to have set a pattern: the bulk of everything really did come in early”.
As usual, Linux Weekly News has an excellent summary of part 1 and part 2 of the merge window (across two weeks). I encourage you to subscribe and read it for a full breakdown.
Ongoing Development
Linux 6.2 brought with it initial support for the Rust programming language. Development continues apace upstream, with proposed patches extending the support to include new features. Miguel Ojeda (the Rust for Linux maintainer) posted a pull request for Linux 6.3, including support for various new types. Daniel Almeida recently posted “rust: virtio: add virtio support”, which “adds virtIO support to the rust crate. This includes the capability to create a virtIO driver (through the module_virtio_driver macro and the respective Driver trait)”.
And the work extends to the architectural level also, with Conor Dooley recently posting “RISC-V: enable rust”, which he notes is a “somewhat blind (and maybe foolish) attempt at enabling Rust for RISC-V. I’ve tested this on Icicle [a prominent board], and the modules seem to work. I’d like to play around with Rust on RISC-V, but I’m not interested in using downstream kernels, so figured I should try and see what’s missing…”.
But probably the most interesting development in Rust language land has nothing to do with Rust as a language at all. Instead, it is a patch series titled “Rust DRM subsystem abstractions (& preview AGX driver)” from Asahi Lina. In the patch, Lina notes “This is my first take on the Rust abstractions from the DRM [graphics] subsystem. It includes the abstractions themselves, some minor prerequisite changes to the C side, as well as drm-asahi GPU driver (for reference on how the abstractions are used, but not necessarily intended to land together)”. It’s that last part, patch 18, the one titled “drm/asahi: Add the Asahi driver for Apple AGX GPUs”, which we refer to here. In it, Lina implements support for the GPUs used by the Apple M1, M1 Pro, M1 Max, M1 Ultra, and the Apple M2 silicon. This is not a small driver, and it is an interesting demonstration of the level of capability already being reached in terms of upstream Rust language support in Linux.
Lokesh Gidra posted an “RFC for new feature to move pages from one vma to another without split”, which allows an “anonymous” (not file backed) page (the fundamental granule by which memory is managed and accounted) to be moved from one part of a runtime heap (VMA) to another without otherwise impacting the state of the overall heap. The intended benefit is to managed runtimes with garbage collection, allowing for simplified “coarse-grained page-level compaction” garbage collection algorithms “wherein pages containing live objects are slid next to each other without touching them, while reclaiming in-between pages which contain only garbage”. The patch posting includes a lengthy writeup explaining the details.
Alison Schofield posted patches titled “CXL Poison List Retrieval & Tracing” targeting the CXL 3.0 specification, which allows OS management software to obtain a list of memory locations that have been poisoned (corrupted due to a RAS event, such as an ECC failure), for example in a “CXL.mem” DDR memory device attached to a system using the serial CXL interconnect.
Dexuan Cui noted that “earlyprintk=ttyS0” was broken on AMD SNP (Confidential Compute) guests running under KVM. This turned out to be due to a particular code branch taken during initialization that varied based upon whether a kernel was entered in 64-bit mode via EFI or through a direct (e.g. kexec/qemu KVM device modeling userspace) type of load.
Zhangjin Wu posted “Add dead syscalls elimination support” intended to remove support from the kernel for “dead” syscalls “which are not used in target system”. Presumably this is to benefit deeply embedded architectures where any excess memory used by the kernel is precious.
Nick Alcock posted “MODULE_LICENSE removals, first tranche” intended to “remove the MODULE_LICENSE usage from files/objects that are not tristate” [meaning that they are not actually set up to be used as modules to begin with].
Bobby Eshleman posted “vsock: add support for sockmap”. Bytedance are apparently “testing usage of vsock as a way to redirect guest-local UDS [Unix Domain Socket] requests to the host and this patch series greatly improves the performance of such a setup”, reportedly by 121% in throughput.
Chih-En Lin posted version 4 of a patch series “Introduce Copy-On-Write to Page Table” which aims to add support for COW to the other half of the equation. Copy-on-Write is commonly used as an optimization whereby a cloned process (for example, during a fork used to exec a new program) doesn’t actually get a copy of the entire memory used by the original process. Instead, the tracking structures (page tables) are modified to mark all the pages in the new process as read only. Only when it attempts to write to the memory are the actual pages copied. The COW page table patches aim to do the same for the page tables themselves, so that full copies are not needed until the new address space is modified. Pulling off this trick requires that some of the tables are copied, but not the leaf (PTE) entries themselves, which are shared between the two processes. David Hildenbrand thanked Chih-En for the work, and the measurements, but expressed concern about “how intrusive even this basic deduplication approach already is”.
On the subject of page tables, Matthew (Willy) Wilcox posted version 3 of “New page table range API” that allows for setting up multiple page table entries at once, noting “The point of all this is better performance, and Fenwei Yin has measured improvement on x86”.
Architectures
Arm
Kristina Martsenko posted “arm64: support Armv8.8 memcpy instructions in userspace”, which adds support for (you guessed it) the memcpy instructions that were added in Armv8.8. These are described by the FEAT_MOPS documentation in the Arm ARM. As Kristina puts it, “The aim is to avoid having many different performance-optimal memcpy implementations in software (tailored to CPU model and copy size) and the overhead of selecting between them. The new instructions are intended to be at least as fast as any alternative instruction sequence”.
Various Apple Silicon patches have been posted. As the Asahi Linux project noted recently in “an update and reality check”, Linux “6.2 notably adds device trees and basic support for M1 Pro/Max/Ultra machines. However, there is still a long road before upstream kernels are usable on laptops”. Nonetheless, patches continue to fly, with the latest including “Apple M2 PMU support” from Janne Grunau, which notes that “The PMU itself appears to work in the same way as o[n] M1”, and support for the Broadcom BCM4387 WiFi chip used by Apple M1 platforms from Hector Martin. Hector also posted “Apple T2 platform support” patches.
Itanium
The Intel Itanium architecture, also known as “IA-64”, was originally announced on October 4th 1999. It was intended as the successor to another legacy architecture that Intel had previously introduced back in 1978. That legacy architecture (known as “x86”) had a number of design challenges that could limit its future scalability, but it was also quite popular, and there were a relatively large number of systems deployed. Nonetheless, Intel was determined to replace x86 with a modern architecture designed with the future in mind. Itanium was co-designed with Hewlett-Packard, who created the original ISA specification. It featured 128 64-bit general purpose registers, 128 floating point registers, 64 (single-bit) predicate registers, and more besides.
Itanium was a VLIW (Very Long Instruction Word) machine that leveraged fixed-width 128-bit “bundles”, each containing three 41-bit instructions plus a 5-bit template describing which types of instructions are present in the bundle. The Itanium implementation of VLIW is referred to as “EPIC” (Explicitly Parallel Instruction Computing) – which one must be careful not to confuse with the highly successful x86 architecture implementation from AMD known as “EPYC”. In Itanium, modern high performance microprocessor innovations such as hardware speculation and Out-of-Order execution take a back seat to software-managed speculation, requiring an extremely complicated compiler toolchain that took many years to develop. Even then, it was clear early on that software management of dependencies and speculation could not compete with a hardware implementation, such as that used by contemporary x86 and RISC CPUs.
Intel Itanium processors were officially discontinued in January of 2020. As Ard Biesheuvel noted across several patch postings attempting to remove or mark IA-64 as broken, various support for Itanium has already been removed from dependent projects (such as upstream Tianocore – the EFI implementation needed to boot such systems, from which Intel itself removed such support in 2018), “QEMU no longer implements support for it”, and given the lack of systems and ongoing firmware maintenance, “there is zero test coverage using actual hardware” (“beyond a couple of machines used by distros to churn out packages”). Even this author has long since decommissioned his Itanium system (named “Hamartia” after the tragic flaw, and acquired during the upstreaming of PCI support for Arm: both of Itanium’s users had expressed concern that Arm support for PCI might break Itanium, and it thus seemed important to be able to test that this mission-critical architecture was not broken in the process).
As of this writing support for Itanium has not (yet) been removed from the kernel.
LoongArch
A lot of work is going into the LoongArch architecture [aside: could someone please let me know how to pronounce it properly?]. Recent patches include one from Youling Tang (“Add support for kernel relocation”) that “allows to compile [the] kernel as PIE and to relocated it at any virtual address at runtime” (to “pave the way to KASLR”, added in a later patch). Another patch, “Add hardware breakpoints/watchpoints support”, does what it says on the tin. Finally, Tianrui Zhao posted “Add KVM LoongArch support”, which adds KVM support, noting that the Loongson (the company behind the architecture) “3A5000” chip “supports hardware assisted virtualization”.
RISC-V
Evan Green posted “RISC-V: Add a syscall for HW probing” which started an extremely long discussion about the right (and wrong) ways to handle the myriad (sometimes mutually incompatible) extensions supported by the RISC-V community. Traditionally, architectures were quite standardized with a central authority providing curation. But while RISC-V does have the RISC-V International organization, and the concept of ratification for extensions with a standard set of extensions defined in various profiles, the practical reality is somewhat less rigid than folks may be used to. As a result, there are in fact a very wide range of implementations, and the kernel needs to somehow be able to handle all of the hundreds of permutations.
Most architectures handle minor variation between implementations using the “HWCAP” infrastructure and the “auxiliary vector”, a set of key/value pairs that the kernel passes to every running process at startup alongside its environment and arguments. This allows (e.g.) userspace software to quickly determine whether a particular feature is supported or not; for example, the feature might be some novel atomic or vector support that isn’t present in older processors (see the sketch below). But when it comes to RISC-V this approach isn’t as easy. As Evan said in his posting, “We don’t have enough space for these all in ELF_HWCAP and there’s no system call that quite does this, so let’s just provide an arch-specific one to probe for hardware capabilities. This currently just provides m{arch,imp,vendor}id, but with the key-value pairs we can pass more in the future”.
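For comparison, here is a minimal sketch (my own, assuming a glibc-based userspace) of the existing HWCAP/auxiliary-vector query that the proposed syscall is meant to supplement; the architecture-specific meaning of each bit lives in headers such as <asm/hwcap.h>:

```c
/* Read the HWCAP words the kernel exposes via the auxiliary vector.
 * Which bits mean what is architecture-specific; this just dumps the
 * raw values. (Illustration only.) */
#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
	unsigned long hwcap  = getauxval(AT_HWCAP);
	unsigned long hwcap2 = getauxval(AT_HWCAP2);

	printf("AT_HWCAP  = 0x%lx\n", hwcap);
	printf("AT_HWCAP2 = 0x%lx\n", hwcap2);
	return 0;
}
```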
The response was swift, and negative, with Greg Kroah-Hartman responding, “Ick, this is exactly what sysfs is designed to export in a sane way. Why not just use that instead? The “key” would be the filename, and the value the value read from the filename”. The response was that this would slow down future RISC-V systems because of the large number of file operations that every process would need to perform on startup in order for the standard libraries to figure out what features were supported or not. Worse, some of the infrastructure for file operations might not be available at the time when it would be needed. This situation is a good reminder of the importance of standardization and the value that it can bring to any modern architecture.
Speaking of standardization, several rounds of patches were posted titled “Add basic ACPI support for RISC-V” which “enables the basic ACPI infrastructure for RISC-V”. According to Sunil V L, who posted the patch series, “Supporting external interrupt controllers is in progress and hence it is tested using poll based HVC SBI console and RAM disk”.
Other patches recently posted for RISC-V include “Introduce virtual kernel mapping KASLR”. The patches note that “The seed needed to virtually move the kernel is taken from the device tree, so we rely on the bootloader to provide the correct seed”. Later patches may add support for the RISC-V “Zkr” random extension so that this can be provided by hardware instead. As a dependent patch, Alexandre Ghiti posted “Introduce 64b relocatable kernel”.
Deepak Gupta posted “riscv control-flow integrity for U mode” in which he notes he has “been working on linux support for shadow stack and landing pad instruction on riscv for a while. These are still RFC quality. But at least they’re in a shape which can start a discussion”. The RISC-V extension adding support for control flow integrity is called Zisslpcfi, which rolls off the tongue just as easily as all of the other extension names, chosen by cats falling on keyboards.
Jesse Taube posted “Add RISC-V 32 NOMMU support”, noting, “This patch-set aims to add NOMMU support to RV32. Many people want to build simple emulators or HDL models of RISC-V. [T]his patch makes it possible to run linux on them”.
Returning to the topic of incompatible vendor extensions, Heiko Stuebner posted “RISC-V: T-Head vector handling”, which notes “As is widely known, the T-Head C9xx cores used for example in the Allwinner D1 implement an older non-ratifed variant of the vector spec. While userspace will probably have a lot more problems implementing support for both, on the kernel side the needed changes are actually somewhat small’ish and can be handled via alternatives somewhat nicely. With this patchset I could run the same userspace program (picked from some riscv-vector-test repository) that does some vector additions on both qemu and a d1-nezha board. On both platforms it ran successfully and even produced the same results”.
Super-H
Returning to the subject of dying architectures once again, an attempt was made by Christoph Hellwig to “Drop arch/sh and everything that depends on it” since “all of the support has been barely maintained for almost 10 years, and not at all for more than 1 year”. Geert Uytterhoeven noted that “The main issue is not the lack of people sending patches and fixes, but those patches never being applied by the maintainers. Perhaps someone is willing to stand up to take over maintainership?” This caused John Paul Adrian Glaubitz to raise his hand and say he “actually would be willing to do it but I’m a bit hesitant as I’m not 100% sure my skills are sufficient”. Rob Landley offered to help out too. It seems sh might survive this round.
x86
Mathieu Desnoyers was interested in formal documentation from Intel concerning concurrent modification of code while it is executing (specifically, updating instructions to patch them as calling a debug handler via “INT3”). He wrote to Peter Anvin saying “I have emails from you dating from a few years back unofficially stating that it’s OK to update the first byte of an instruction with a single-byte int3 concurrently…Olivier Dion is working on the libpatch project aiming to use this property for low-latency/low-overhead live code patching in user-space as well, but we cannot find an official statement from Intel that guarantees this breakpoint-bypass technique is indeed OK without stopping the world while patching”. Steven Rostedt was among those who noted “The fact that we have been using it for over 10 years without issue should be a good guarantee”. Mathieu was able to find comprehensive documentation in the AMD manual that allows it, but noted again “I cannot find anything with respect to asynchronous cross-modification of code stated as clearly in Intel’s documentation”. Anyone want to help him?
Development continues toward implementing support for “Flexible Return and Event Delivery” aka “FRED” on Intel architecture. Among the latest patches, Ammar Faizi includes a fix to the “sysret_rip” selftest that handles the fact that under FRED the “syscall” instruction (to enter the kernel from userspace) no longer clobbers (overwrites) the x86 “rcx” and “r11” registers. On the subject of tests, Mingwei Zhang posted patches updating the “amx_test” suite to add support for several of the “new entities” that are present in Intel’s AMX (matrix extension) architecture.
Sean Christopherson posted “KVM: x86: Add “governed” X86_FEATURE framework”, which is intended to “manage and cache KVM-governed features, i.e. CPUID based features that require explicit KVM enabling and/or need to be queried semi-frequently by KVM”. According to Sean, “The idea originally came up in the context of the architectural LBRs [Last Branch Records, a profiling mechanism that records precisely the last N branches taken] series as a way to avoid querying guest CPUID in hot paths without needing a dedicated flag, but as evidenced by the shortlog, the most common usage is to handle the ever-growing list of SVM [AMD’s Secure Virtual Machine virtualization extension] features that are exposed to L1”. Reducing calls to CPUID is generally a good thing since it results in a (possibly lengthy) trap into microcode, and it is also a context-serializing instruction.
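As a reminder of what “querying CPUID” looks like from software, here is a small userspace sketch (my own, using the compiler’s <cpuid.h> helper, not KVM code); each executed CPUID is serializing and, inside a guest, typically causes an exit that the hypervisor has to emulate:

```c
/* Query CPUID leaf 1 (basic feature flags) via GCC/Clang's helper.
 * Inside a VM each CPUID typically traps to the hypervisor, which is
 * why KVM wants to cache the results. (Illustration only.) */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		printf("leaf 1: ecx=0x%08x edx=0x%08x\n", ecx, edx);
	return 0;
}
```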
Paolo Bonzini posted “Cross-Thread Return Address Predictions vulnerability”, noting that “Certain AMD processors are vulnerable to a cross-thread return address predictions bug. When running in SMT [Simultaneous Multi-Threading] mode and one of the sibling threads transitions out of C0 state, the other thread gets access to twice as many entries in the RSB [Return Stack Buffer], but unfortunately the predictions of the now-halted logical processor are not purged”. Paolo is referring to the fact that x86 processors include two logical “threads” (which Intel calls “Hyperthreads” – a trademarked name – and which are more generally known as SMT or Simultaneous Multi-Threading). Most modern x86 processors include an optimization whereby, when one logical thread transitions into what software sees as a “low power” state, its share of the partitioned resources is given to the other thread, which consequently sees a boost in performance as it no longer contends on the back end for execution units, and now has double the store buffer and predictor entries.
But in this case, the RSB [Return Stack Buffer] entries are not zeroed out in the process, meaning that it is possible for a malicious thread to “train” the RSB predictor later used by the peer thread to guess that certain function call return paths will be used. This opens up an opportunity to cause a sibling thread to speculatively execute down a wrong path that leaves cache breadcrumbs which can be measured in order to potentially leak certain information. Paolo addresses this by adding a KVM (hypervisor) parameter that, “if set, will prevent the user from disabling the HLT, MWAIT, and CSTATE exits”, ensuring the hypervisor always gets the chance to intercept an idling vCPU and stuff the RSB with safe dummy values when the sibling thread goes to sleep.
Dionna Glaze posted “Add throttling detection to sev-guest”, noting that “The guest request synchronous API from SEV-SNP [AMD’s Confidential Computing feature] to the host’s security processor consumes a global resource. For this reason, AMD’s docs recommend that the host implements a throttling mechanism. In order for the guest to know it’s been throttled and should try its request again, we need some good-faith communication from the host that the request has been throttled. These patches work with the existing dev/sev-guest ABI”.
On the subject of Confidential Compute, Kai Huang posted version 9 of a patch series “TDX host kernel support” aiming to add support for Intel’s TDX Confidential Compute extensions, while Jeremi Piotrowski posted “Support nested SNP KVM guests on Hyper-V” intending to add support for nested (hypervisor inside hypervisor) support for AMD’s Confidential Compute to the Hyper-V hypervisor as used by Microsoft Azure. Nested Confidential Compute sounds fun.
Rick Edgecombe posted version 6 of “Shadow stacks for userspace”, a series that “implements Shadow Stacks for userspace using x86’s Control-flow Enforcement Technology (CET)”. As he reminds us, CET supports both shadow stacks and indirect branch tracking (landing pads), but these patches “implements just the shadow stack part of this feature, and just for userspace”.
Michael S. Tsirkin posted “revert RNG seed mess” noting “All attempts to fix up passing RNG [random entropy] seed via setup_data entry failed. Let’s just rip out all of it. We’ll start over”.
Arnd Bergmann posted “x86: make 64-bit defconfig the default” noting that 32-bit kernel builds were “rarely what anyone wants these days”. The patch changes “the default so that the 64-bit config gets used unless the user asked for i686_defconfig, uses ARCH=i386 or runs on a system that “uname -m” identifies as i386/i486/i586/i686”.
At USENIX SREcon22 APAC I gave the opening keynote on the future of computer performance, rounding up the latest developments and making predictions of where I see things heading. This talk originated from my updates to [Systems Performance 2nd Edition], and this was the first time I've given this talk in person!
The video is now on [YouTube].
In Q&A I was asked about CXL (compute express link) which was fortunate as I had planned to cover it and then forgot, so the question let me talk about it (although Q&A is missing from the video). CXL in a way allows a custom memory controller to be added to a system, to increase memory capacity, bandwidth, and overall performance. My personal opinion is that I don't see a widespread need for more capacity given horizontal scaling and servers that can already exceed 1 Tbyte of DRAM; bandwidth is also helpful, but I'd be concerned about the increased latency for adding a hop to more memory. So it's interesting, but I don't think they have the killer use case for it yet.
## Realizing and exceeding a lifelong dream
I began my tech career as a junior Unix sysadmin in Newcastle, NSW, Australia, in 1999, with no connection to the exciting world of tech in Silicon Valley, New York, or even nearby Sydney. As I was determined to become great at my new occupation regardless of my location, I read every sysadmin book, article, and magazine I could find on the shelf. This included SysAdmin magazine, which contained articles from various experts including Amy Rich, and a couple of advertisements: One was to submit your own articles to the magazine for publication (by writing to the editor, Rikki Endsley) and another was to attend USENIX conferences in the US and learn directly from the experts! I made both of these my goals, even though I'd never been published before and I'd never been to the US. Or even on a plane.
I didn't end up getting published in SysAdmin directly, but my performance work did make it as a feature article (thanks Matty). As for attending USENIX conferences: I finally started attending and speaking at them in 2010 when a community manager encouraged me to (thanks Deirdre Straughan), and since then I've met many friends and connections, including Amy who is now USENIX President, and Rikki with whom I co-chaired the USENIX LISA18 conference. USENIX has been a great help to my career and my employers, and I hope it is just as helpful for you. It's an important vendor-neutral space to share the latest in technology.
And now, helping bring USENIX conferences to Australia by giving the first keynote: I could not have scripted or expected it. It was a great privilege.
## SREcon 2023 CFP
Tech moves fast, however, and I have little time to reflect on 2022 when there's 2023 to plan! I'm now program co-chair for SREcon 2023 APAC, and our 2023 conference is June 14-16 in Singapore. The call for participation ends on March 2nd 23:59 SGT! That's about 24 hours from now!
## References
I've reproduced the references from my SREcon22 keynote below, so you can click on links:
- [Gregg 08] Brendan Gregg, “ZFS L2ARC,” http://www.brendangregg.com/blog/2008-07-22/zfs-l2arc.html, Jul 2008
- [Gregg 10] Brendan Gregg, “Visualizations for Performance Analysis (and More),” https://www.usenix.org/conference/lisa10/visualizations-performance-analysis-and-more, 2010
- [Greenberg 11] Marc Greenberg, “DDR4: Double the speed, double the latency? Make sure your system can handle next-generation DRAM,” https://www.chipestimate.com/DDR4-Double-the-speed-double-the-latencyMake-sure-your-system-can-handle-next-generation-DRAM/Cadence/Technical-Article/2011/11/22, Nov 2011
- [Hruska 12] Joel Hruska, “The future of CPU scaling: Exploring options on the cutting edge,” https://www.extremetech.com/computing/184946-14nm-7nm-5nm-how-low-can-cmos-go-it-depends-if-you-ask-the-engineers-or-the-economists, Feb 2012
- [Gregg 13] Brendan Gregg, “Blazing Performance with Flame Graphs,” https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg, 2013
- [Shimpi 13] Anand Lal Shimpi, “Seagate to Ship 5TB HDD in 2014 using Shingled Magnetic Recording,” https://www.anandtech.com/show/7290/seagate-to-ship-5tb-hdd-in-2014-using-shingled-magnetic-recording, Sep 2013
- [Borkmann 14] Daniel Borkmann, “net: tcp: add DCTCP congestion control algorithm,” https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e3118e8359bb7c59555aca60c725106e6d78c5ce, 2014
- [Macri 15] Joe Macri, “Introducing HBM,” https://www.amd.com/en/technologies/hbm, Jul 2015
- [Cardwell 16] Neal Cardwell, et al., “BBR: Congestion-Based Congestion Control,” https://queue.acm.org/detail.cfm?id=3022184, 2016
- [Gregg 16] Brendan Gregg, “Unikernel Profiling: Flame Graphs from dom0,” http://www.brendangregg.com/blog/2016-01-27/unikernel-profiling-from-dom0.html, Jan 2016
- [Gregg 16b] Brendan Gregg, “Linux BPF Superpowers,” https://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html, 2016
- [Alcorn 17] Paul Alcorn, “Seagate To Double HDD Speed With Multi-Actuator Technology,” https://www.tomshardware.com/news/hdd-multi-actuator-heads-seagate,36132.html, 2017
- [Alcorn 17b] Paul Alcorn, “Hot Chips 2017: Intel Deep Dives Into EMIB,” https://www.tomshardware.com/news/intel-emib-interconnect-fpga-chiplet,35316.html#xenforo-comments-3112212, 2017
- [Corbet 17] Jonathan Corbet, “Two new block I/O schedulers for 4.12,” https://lwn.net/Articles/720675, Apr 2017
- [Gregg 17] Brendan Gregg, “AWS EC2 Virtualization 2017: Introducing Nitro,” http://www.brendangregg.com/blog/2017-11-29/aws-ec2-virtualization-2017.html, Nov 2017
- [Russinovich 17] Mark Russinovich, “Inside the Microsoft FPGA-based configurable cloud,” https://www.microsoft.com/en-us/research/video/inside-microsoft-fpga-based-configurable-cloud, 2017
- [Gregg 18] Brendan Gregg, “Linux Performance 2018,” http://www.brendangregg.com/Slides/Percona2018_Linux_Performance.pdf, 2018
- [Hady 18] Frank Hady, “Achieve Consistent Low Latency for Your Storage-Intensive Workloads,” https://www.intel.com/content/www/us/en/architecture-and-technology/optane-technology/low-latency-for-storage-intensive-workloads-article-brief.html, 2018
- [Joshi 18] Amit Joshi, et al., “Titus, the Netflix container management platform, is now open source,” https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436, Apr 2018
- [Cutress 19] Dr. Ian Cutress, “Xilinx Announces World Largest FPGA: Virtex Ultrascale+ VU19P with 9m Cells,” https://www.anandtech.com/show/14798/xilinx-announces-world-largest-fpga-virtex-ultrascale-vu19p-with-9m-cells, Aug 2019
- [Gallatin 19] Drew Gallatin, “Kernel TLS and hardware TLS offload in FreeBSD 13,” https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf, 2019
- [Bearman 20] Ian Bearman, “Exploring Profile Guided Optimization of the Linux Kernel,” https://linuxplumbersconf.org/event/7/contributions/771, 2020
- [Burnes 20] Andrew Burnes, “GeForce RTX 30 Series Graphics Cards: The Ultimate Play,” https://www.nvidia.com/en-us/geforce/news/introducing-rtx-30-series-graphics-cards, Sep 2020
- [Charlene 20] Charlene, “800G Is Coming: Set Pace to More Higher Speed Applications,” https://community.fs.com/blog/800-gigabit-ethernet-and-optics.html, May 2020
- [Cutress 20] Dr. Ian Cutress, “Insights into DDR5 Sub-timings and Latencies,” https://www.anandtech.com/show/16143/insights-into-ddr5-subtimings-and-latencies, Oct 2020
- [Ford 20] A. Ford, et al., “TCP Extensions for Multipath Operation with Multiple Addresses,” https://datatracker.ietf.org/doc/html/rfc8684, Mar 2020
- [Gregg 20] Brendan Gregg, “Systems Performance: Enterprise and the Cloud, Second Edition,” Addison-Wesley, 2020
- [Hruska 20] Joel Hruska, “Intel Demos PCIe 5.0 on Upcoming Sapphire Rapids CPUs,” https://www.extremetech.com/computing/316257-intel-demos-pcie-5-0-on-upcoming-sapphire-rapids-cpus, Oct 2020
- [Liu 20] Linda Liu, “Samsung QVO vs EVO vs PRO: What’s the Difference? [Clone Disk],” https://www.partitionwizard.com/clone-disk/samsung-qvo-vs-evo.html, 2020
- [Moore 20] Samuel K. Moore, “A Better Way to Measure Progress in Semiconductors,” https://spectrum.ieee.org/semiconductors/devices/a-better-way-to-measure-progress-in-semiconductors, Jul 2020
- [Peterson 20] Zachariah Peterson, “DDR5 vs. DDR6: Here's What to Expect in RAM Modules,” https://resources.altium.com/p/ddr5-vs-ddr6-heres-what-expect-ram-modules, Nov 2020
- [Salter 20] Jim Salter, “Western Digital releases new 18TB, 20TB EAMR drives,” https://arstechnica.com/gadgets/2020/07/western-digital-releases-new-18tb-20tb-eamr-drives, Jul 2020
- [Spier 20] Martin Spier, Brendan Gregg, et al., “FlameScope,” https://github.com/Netflix/flamescope, 2020
- [Tolvanen 20] Sami Tolvanen, Bill Wendling, and Nick Desaulniers, “LTO, PGO, and AutoFDO in the Kernel,” Linux Plumber’s Conference, https://linuxplumbersconf.org/event/7/contributions/798, 2020
- [Vega 20] Juan Camilo Vega, Marco Antonio Merlini, Paul Chow, “FFShark: A 100G FPGA Implementation of BPF Filtering for Wireshark,” IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020
- [Warren 20] Tom Warren, “Microsoft reportedly designing its own ARM-based chips for servers and Surface PCs,” https://www.theverge.com/2020/12/18/22189450/microsoft-arm-processors-chips-servers-surface-report, Dec 2020
- [Alcorn 21] Paul Alcorn, “Intel Shares Alder Lake Pricing, Specs and Gaming Performance: $589 for 16 Cores,” https://www.tomshardware.com/features/intel-shares-alder-lake-pricing-specs-and-gaming-performance, Oct 2021
- [Cutress 21] Ian Cutress, “AMD Demonstrates Stacked 3D V-Cache Technology: 192 MB at 2 TB/sec,” https://www.anandtech.com/show/16725/amd-demonstrates-stacked-vcache-technology-2-tbsec-for-15-gaming, May 2021
- [Google 21] Google, “Cloud TPU,” https://cloud.google.com/tpu, 2021
- [Haken 21] Michael Haken, et al., “Delta Lake 1S Server Design Specification 1v05,” https://www.opencompute.org/documents/delta-lake-1s-server-design-specification-1v05-pdf, 2021
- [Intel 21] Intel Corporation, “Intel® Optane™ Technology,” https://www.intel.com/content/www/us/en/products/docs/storage/optane-technology-brief.html, 2021
- [Kostovic 21] Aleksandar Kostovic, “Esperanto Delivers Kilocore Processor in its Supercomputer-on-a-Chip Design,” https://www.tomshardware.com/news/esperanto-kilocore-processor, Aug 2021
- [Kummrow 21] Patricia Kummrow, “The IPU: A New, Strategic Resource for Cloud Service Providers,” https://itpeernetwork.intel.com/ipu-cloud/#gs.g5pkub, Aug 2021
- [Quach 21a] Katyanna Quach, “Global chip shortage probably won't let up until 2023, warns TSMC: CEO 'still expects capacity to tighten more',” https://www.theregister.com/2021/04/16/tsmc_chip_forecast, Apr 2021
- [Quach 21b] Katyanna Quach, “IBM says it's built the world's first 2nm semiconductor chips,” https://www.theregister.com/2021/05/06/ibm_2nm_semiconductor_chips, May 2021
- [Ridley 21] Jacob Ridley, “IBM agrees with Intel and TSMC: this chip shortage isn't going to end anytime soon,” https://www.pcgamer.com/ibm-agrees-with-intel-and-tsmc-this-chip-shortage-isnt-going-to-end-anytime-soon, May 2021
- [Shilov 21] Anton Shilov, “Samsung Develops 512GB DDR5 Module with HKMG DDR5 Chips,” https://www.tomshardware.com/news/samsung-512gb-ddr5-memory-module, Mar 2021
- [Shilov 21b] Anton Shilov, “Seagate Ships 20TB HAMR HDDs Commercially, Increases Shipments of Mach.2 Drives,” https://www.tomshardware.com/news/seagate-ships-hamr-hdds-increases-dual-actuator-shipments, 2021
- [Shilov 21c] Anton Shilov, “SK Hynix Envisions 600-Layer 3D NAND & EUV-Based DRAM,” https://www.tomshardware.com/news/sk-hynix-600-layer-3d-nand-euv-dram, Mar 2021
- [SuperMicro 21] SuperMicro, “B12SPE-CPU-25G (For SuperServer Only),” https://www.supermicro.com/en/products/motherboard/B12SPE-CPU-25G, 2021
- [Thaler 21] Dave Thaler, Poorna Gaddehosur, “Making eBPF work on Windows,” https://cloudblogs.microsoft.com/opensource/2021/05/10/making-ebpf-work-on-windows, May 2021
- [TornadoVM 21] TornadoVM, “TornadoVM Run your software faster and simpler!” https://www.tornadovm.org, 2021
- [Trader 21] Tiffany Trader, “Cerebras Second-Gen 7nm Wafer Scale Engine Doubles AI Performance Over First-Gen Chip ,” https://www.enterpriseai.news/2021/04/21/latest-cerebras-second-gen-7nm-wafer-scale-engine-doubles-ai-performance-over-first-gen-chip, Apr 2021
- [Ghigoff 21] Yoann Ghigoff, et al., "BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing," Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation, https://www.usenix.org/system/files/nsdi21-ghigoff.pdf, Apr 2021
- [Tyson 21] Mark Tyson, "Intel Sapphire Rapids utillises tiled, modular SoC architecture," https://hexus.net/tech/news/cpu/148266-intel-sapphire-rapids-utillises-tiled-modular-soc-architecture/, Aug 2021
- [Vahdat 21] Amin Vahdat, “The past, present and future of custom compute at Google,” https://cloud.google.com/blog/topics/systems/the-past-present-and-future-of-custom-compute-at-google, Mar 2021
- [Wikipedia 21] “Semiconductor device fabrication,” https://en.wikipedia.org/wiki/Semiconductor_device_fabrication, 2021
- [Wikipedia 21b] “Silicon,” https://en.wikipedia.org/wiki/Silicon, 2021
- [ZonedStorage 21] Zoned Storage, “Zoned Namespaces (ZNS) SSDs,” https://zonedstorage.io/introduction/zns, 2021
- [Cutress 21b] Dr. Ian Cutress, Andrei Frumusanu, "The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity," https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity, Nov 2021
- [Nash 22] Paul Nash, "Now in preview: Azure Virtual Machines with Ampere Altra Arm-based processors," https://azure.microsoft.com/en-us/blog/now-in-preview-azure-virtual-machines-with-ampere-altra-armbased-processors/, Apr 2022
- [Bonshor 22] Gavin Bonshor, "AMD Releases Milan-X CPUs With 3D V-Cache," https://www.anandtech.com/show/17323/amd-releases-milan-x-cpus-with-3d-vcache-epyc-7003, Mar 2022
- [Mann 22] Tobias Mann, "Why Intel killed its Optane memory business," https://www.theregister.com/2022/07/29/intel_optane_memory_dead/, Jul 2022
- [Torvalds 22] Linus Torvalds, “Linux 6.0-rc1,” https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=568035b01cfb107af8d2e4bd2fb9aea22cf5b868, Aug 2022
- [Whalen 22] Jeanne Whalen, "Biden’s visit shows high stakes of $20 billion Ohio chip factory," https://www.washingtonpost.com/us-policy/2022/09/09/biden-intel-ohio-chip-factory/, Sep 2022
- [Robinson 22] Dan Robinson, "Intel has a secret club in the cloud for devs to try out new chips – and you ain't in it," https://www.theregister.com/2022/09/28/intel_developer_cloud/, Sep 2022
- [CloudHypervisor 22] Cloud Hypervisor Project (Linux Foundation), "Run Cloud Virtual Machines Securely and Efficiently," https://www.cloudhypervisor.org, accessed 2022
- [Cerebras 22] Cerebras, "Cerebras Wafer-Scale Cluster," https://www.cerebras.net/product-cluster, accessed 2022
- [GrafanaLabs 22] Grafana Labs, "Flame graph," https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/flame-graph/, accessed 2022
- [Pirzada 22] Usman Pirzada, "Intel Announces The Worlds First x86 CPU With HBM Memory: Xeon Max ‘Sapphire Rapids’ Data Center CPU," https://wccftech.com/intel-announces-the-worlds-first-x86-cpu-with-hbm-memory-xeon-max-sapphire-rapids-data-center-cpu/, Nov 2022
- [Smith 22] Lyle Smith, "Samsung PM1743 PCIe Gen5 SSD First Take Review," https://www.storagereview.com/review/samsung-pm1743-pcie-gen5-ssd-first-take-review, Jan 2022
- [Barr 22] Jeff Barr, "New Amazon EC2 Instance Types In the Works – C7gn, R7iz, and Hpc7g," https://aws.amazon.com/blogs/aws/new-amazon-ec2-instance-types-in-the-works-c7gn-r7iz-and-hpc7g, Nov 2022
- [Gooding 22] Matthew Gooding, "TSMC’s US fab will make 4nm chips for Apple, AMD and Nvidia," https://techmonitor.ai/technology/silicon/tsmcs-arizona-apple-amd-nvidia, Dec 2022
- [Liu 22] Zhiye Liu, "Smuggler Hid Over 200 Alder Lake CPUs in Fake Silicone Belly," https://www.tomshardware.com/news/smuggler-hid-over-200-alder-lake-cpus-in-fake-silicone-belly, Dec 2022
- [Seagate 22] Seagate, "Exos X Series," https://www.seagate.com/au/en/products/enterprise-drives/exos-x/, accessed 2022
- [Mann 22b] Tobias Mann, "Nvidia not cutting it? Google and Amazon’s latest AI chips have arrived," https://www.theregister.com/2022/10/11/google_amazon_ai_chips_nvidia/, Oct 2022
- [Intel 22] Intel, "Intel® Developer Cloud," https://www.intel.com/content/www/us/en/developer/tools/devcloud/overview.html, accessed Dec 2022
I've taken care to cite the author names along with the talk titles and dates, including for Internet sources, instead of the common practice of just listing URLs. I followed that practice when writing some earlier books, and it has since struck me as unfair that some references had author names and some didn't. Nowadays I always include full names when known.
[YouTube]: https://www.youtube.com/watch?v=zGSQdN2X_k0
[PDF]: /Slides/SREcon2022_ComputingPerformance.pdf
[online]: /Slides/SREcon2022_ComputingPerformance/
[Systems Performance 2nd Edition]: /systems-performance-2nd-edition-book.html
USENIX's SREcon conference is the best venue for learning the latest in systems engineering (not just site reliability engineering) and if you have useful production stories and takeaways to share -- especially if you are in the Asia/Pacific region -- please consider submitting a talk proposal to [SREcon APAC 2023]. The [call for participation] ends on March 2nd, only two weeks away. It is held this year in Singapore, June 14-16, and I'm excited to be program co-chair with fellow Aussie [Jamie Wilkinson]. To quote from our CFP:
You build computer platforms, debug them, and support them, and you have learned something useful to share: You are invited to submit proposals to give talks at SREcon23 Asia/Pacific, which welcomes speakers from a variety of backgrounds, not just SRE, and from a variety of different-sized companies, not just those that are technology-focused. Your insights will help create a relevant, diverse, and inclusive program. Conversations are never complete when they focus just on successes; we encourage talks that focus on lessons learned from failures or hard problems.
At the seventh SREcon Asia/Pacific, we are especially seeking the deepest engineering talks: Those that cover gritty technical internals, advanced tools and techniques, and complex problems that may matter to others, whether your solutions were elegant, ugly, or unsuccessful.
We look forward to learning from speakers across the SRE and systems engineering space. This year we particularly welcome new speakers; many of our best talks have come from people with new perspectives to share and the last few years most certainly has given us all new experiences and stories we can share and from which we can learn.
At every SREcon globally, we welcome and encourage participation from all individuals in any country, including people that are underrepresented in, or excluded from, technology, including but not limited to: people of all colours, women, LGBTQ people, people with disabilities, neurodiverse participants, students, veterans, and others with unique characteristics.
Similarly, we welcome participants from diverse professional roles: QA testers, performance engineers, security teams, OS engineers, DBAs, network administrators, compliance experts, UX designers, government employees, and data scientists. Regardless of who you are or the job title you hold, if you are a technologist who faces unique challenges and shares our areas of interest, we encourage you to be a part of SREcon23 Asia/Pacific.
More details are on the [CFP page].
We're especially looking to highlight local content from the Asia/Pacific region. SREcon usually has good representation of the FAANGs (Facebook, Amazon, Apple, Netflix, Google, etc.) but I think every company has interesting stories to share. I'd love to see deeply technical talks about production incidents, where the approaches and tools used can be widely applied by others. And I'd love to hear about the ones that got away, where current tooling and approaches are insufficient.
For many years I've helped with the USENIX LISA conference, which was the premier event for production systems engineering. In recent years attendees have moved to SREcon, which covers similar topics, and [LISA 2021] may have been the last time LISA runs as a stand-alone conference (see [LISA made LISA obsolete]). I think LISA still exists as the people of the LISA community, many of whom are now involved with SREcon where our engineering principles live on and develop further.
If you've never been to an SREcon, check out the [SREcon APAC 2022] conference page, which links to slides and videos of the talks.
[SREcon APAC 2022]: https://www.usenix.org/conference/srecon22apac/program
[call for participation]: https://www.usenix.net/conference/srecon23apac/call-for-participation
[CFP page]: https://www.usenix.net/conference/srecon23apac/call-for-participation
[SREcon APAC 2023]: https://www.usenix.net/conference/srecon23apac
[LISA 2021]: https://www.usenix.org/conference/lisa21
[LISA made LISA obsolete]: https://www.usenix.org/publications/loginonline/lisa-made-lisa-obsolete-thats-compliment
[Jamie Wilkinson]: https://www.linkedin.com/in/jamie-wilkinson-8043553/
The latest stable kernel is Linux 6.1.11, released by Greg K-H on February 9th 2023.
The latest mainline (development) kernel is 6.2-rc7, released on February 5th 2023.
Linux 6.2 progress
A typical kernel development cycle begins with the “merge window” (period of time during which disruptive changes are allowed to be merged into the kernel) followed by a series of (weekly) Release Candidate (RC) kernels, and then the final release. In most cases, RC7 is the final RC, but it is not all that unusual to have an extra week, as is likely the case this time around. Linus said a few weeks ago, “I am expecting to do an rc8 this release regardless, just because we effectively had a lost week or two in the early rc’s”, and indeed fixes for RC8 were still coming in as recently as today. We should at this rate see RC8 tomorrow (Sunday is the normal release day), and the 6.3 merge window in another week, meaning we’ll cover the 6.3 merge window in the next edition of this podcast. In the meantime, I encourage listeners to consider subscribing and supporting LWN (Linux Weekly News), who always have a great merge window summary.
Confidential Compute (aka “CoCo”)
If there were a “theme of the moment” for the industry (other than layoffs), it would probably be Confidential Compute. It seems one can’t go more than 10 minutes without seeing a patch for some new confidential compute feature in one of the major architectures, or the system IP that goes along with it. Examples in just the past few weeks (and which we’ll cover in a bit) include patches from both Intel (TDX) and AMD (SEV-SNP) for their Confidential Compute solutions, as well as PCI pass-through support in Hyper-V for Confidential VMs. At the same time, thought is going into revising the kernel’s “threat model” to update it for a world of Confidential Compute.
A fundamental tenet of Confidential Compute is that guests no longer necessarily have to trust the hypervisor on which they are running, and quite possibly also don’t trust the operator of the system either (whether a cloud, edge network, OEM, etc.). The theory goes that you might even have a server sitting in some (less than friendly) geographical location but still hold out a certain amount of trust for your “confidential” workloads based on properties provided by the silicon (and attested by introspecting the other physical and emulated devices provided by the system). In this model, you necessarily have to trust the silicon vendor, but maybe not much beyond that.
Elena Reshetova (Intel) posted “Linux guest kernel threat model for Confidential Computing”, in which she addressed Greg Kroah-Hartman (“Greg K-H”), who had apparently previously requested “that we ought to start discussing the updated threat model for kernel”. She had links to quite detailed writeups on Intel’s github. Greg replied to a point about not trusting the hypervisor with “That is, frankly, a very funny threat model. How realistic is it really given all of the other ways that a hypervisor can mess with a guest?”. And indeed that used to be a good point. Some of the earlier attempts at Confidential Compute included architectural designs in which guest registers were not protected against single-step debug (and introspection) from a hypervisor, for example. And so one can be forgiven for thinking that there are some fundamental gaps, but a lot has changed over the past few years, and the architectures have advanced quite a bit since.
Greg also noted that he “hate[s] the term “hardening”” as applied to “hardening” device drivers against malicious hardware implementations (as opposed to just potentially buggy ones). He added, “Please just say it for what it really is, “fixing bugs to handle broken hardware”. We’ve done that for years when dealing with PCI and USB and even CPUs doing things that they shouldn’t be doing. How is this any different in the end? So what you also are saying here now is “we do not trust any PCI devices”, so please just say that (why do you trust USB devices?) If that is something that you all think that Linux should support, then let’s go from there.” David Alan Gilbert piled on with some context around Intel’s and AMD’s implementations, and in particular that more than mere memory encryption is used; register state, guest VMSA (control) state, and much more besides are carefully managed under the new world order.
Daniel Berrange further clarified, in response to a discussion about deliberately malicious implementations of PCI and USB controllers, that, “As a baseline requirement, in the context of confidential computing the guest would not trust the hypervisor with data that needs to remain confidential, but would generally still expect it to provide a faithful implementation of a given device.” A lot of further back and forth took place with others piling on comments indicating a few folks weren’t aware of the different technical pieces involved (e.g. PCI IDE, CMA, DOE, SPDM and other acronyms) for device attestation prior to trusting it from within a guest, or that this was even possible. The thread was more informative for revealing that general knowledge of technology involved in Confidential Compute is not broadly pervasive. Perhaps there is an opportunity there for sessions at the newly revived in-person conferences taking place in ‘23.
Ongoing Development
Miguel Ojeda posted a patch introducing a new “Rust fixes” branch, noting, “While it may be a bit early to have a “fixes” branch, I guessed it would not hurt to start practicing how to do things for the future when we may get actual users. And since the opportunity presented itself, I wanted to also use this PR to bring up a “policy” topic and ideally get kernel maintainers to think about it.” He went on to describe the PR as containing a fix for a “soundness issue” related to UB (Undefined Behavior) in which “safe” rust code can nonetheless trigger UB in C code. He wanted to understand whether such fixes were truly considered fixes suitable for backport to stable and was mostly interested in addressing the policy aspect of the development process. Linus took the pull request without discussion, so presumably it wasn’t a big deal for him.
Saurabh Sengar posted “Device tree support for Hyper-V VMBus driver”, which “expands the VMBus driver to include device tree support. This feature allows for a kernel boot without the use of ACPI tables, resulting in a smaller memory footprint and potentially faster boot times. This is tested by enabling CONFIG_FLAT and OF_EARLY_FLATTREE for x86.” It isn’t articulated in the patch series, but this smells like an effort to support a special case minimal kernel – like the kind used by Amazon’s “Firecracker” for fast spinup VMs used to back ephemeral services like “functions”. It will be interesting to see what happens with this.
Elliot Berman (QUIC, part of Qualcomm) posted version 9 of a patch series “Drivers for gunyah hypervisor”. Gunyah is, “a Type-1 hypervisor independent of any high-level OS kernel, and runs in a higher CPU privilege level. It does not depend on any lower-privileged OS kernel/code for its core functionality. This increases its security and can support a much smaller trusted computing base than a Type-2 hypervisor.” The Gunyah source is available on github.
Breno Leitao posted “netpoll: Remove 4s sleep during carrier detection” noting that “Modern NICs do not seem to have this bouncing problem anymore, and this sleep slows down the machine boot unnecessarily”. What he meant is that traditionally the carrier on a link might be reported as “up” while autonegotiation was still underway. As Jakub Kicinski noted, especially on servers the “BMC [is often] communicating over NC-SI via the same NIC as gets used for netconsole. BMC will keep the PHY up, hence the carrier appearing instantly.”
Robin Murphy (Arm) posted a patch series aiming to “retire” the “iommu_ops” per bus IOMMU operations and reconcile around a common kernel implementation.
SeongJae Park continues to organize periodic “Beer/Coffee/Tea” chat series virtual sessions for those interested in DAMON. The agenda and info are in a shared Google doc.
Architectures
Arm
Suzuki K Poulose posted several sets of (large) related RFC (Request For Comment) patches beginning with, “Support for Arm CCA VMs on Linux”. Arm CCA is a new feature introduced as part of the Armv9 architecture, including both the “Realm Management Extension” (RME) and associated system level IP changes required to build machines that support Confidential Compute “Realms”. In the CCA world, there are additional security states beyond the traditional Secure/Non-Secure. There is now a Realm state in which e.g. a Confidential Compute VM communicates with a new “RMM” (Realm Management Monitor) over an “RSI” (Realm Service Interface) to obtain special services on the Realm’s behalf. The RMM is separated from the traditional hypervisor and “provides standard interfaces – Realm Management Interface (RMI) – to the Normal world hypervisor to manage the VMs running in the Realm world (also called Realms in short).” The idea is that the RMM is well known (e.g. Open Source) code that can be attested and trusted by a Realm to provide it with services on behalf of an untrusted hypervisor.
Arm include links to an updated “FVP” (Fixed Virtual Platform) modeling an RME-enabled v9 platform, alongside patched TF-A (Trusted Firmware), an RMM (Realm Management Monitor), kernel patches, an updated kvmtool (a lightweight alternative to qemu for starting VMs), and updated kvm-unit-tests. Suzuki notes that what they are seeking is feedback on:
- KVM integration of the Arm CCA
- KVM UABI for managing the Realms, seeking to generalise the operations wherever possible with other Confidential Compute solutions.
- Linux Guest support for Realms
kvx
Yann Sionneau (Kalray) posted version 2 of a patch series, “Upstream kvx Linux port”, which adds support for yet another architecture, as used in the “Coolidge (aka MPPA3-80)” SoC. The architecture is a little-endian VLIW (Very Long Instruction Word) design with 32 and 64-bit execution modes, 64 GPRs, SIMD instructions, and (but of course) a “deep learning co-processor”. The architecture appears to borrow nomenclature from elsewhere, having both an “APIC” and a “GIC” as part of its interrupt controller story; presumably these mean something quite different here. In the mail, Yann notes that this is only an RFC at this stage, “since kvx support is not yet upstreamed into gcc/binutils”. The most infamous example of a VLIW architecture is, of course, Intel’s Itanium. It is slowly being removed from the kernel in a process that began in 2019 with the shipping of the final Itanium systems and deprecation of GCC and GLIBC support for it. If things go well, perhaps this new VLIW architecture can take Itanium’s place as the only one.
RISC-V
Anup Patel (Ventana Micro) is heavily involved in various RISC-V architecture enablement, including for the new “AIA” (Advanced Interrupt Architecture) specification, replacing the de facto use of SiFive’s “PLIC” interrupt controller. The spec has now been frozen (Anup provided a link to the frozen AIA specification) and initial patches are posted by Anup enabling support for guests to see a virtualized set of CSRs (Configuration and Status Registers). AIA is designed to be fully virtualizable, although as this author has noted from reading the spec, it does require an interaction with the IOMMU to interdict messages in order to allow for device live migration.
Sunil V L (Ventana Micro) posted patches to “Add basic ACPI support for RISC-V”. The patches come alongside others for EDK2 (UEFI, aka “Tianocore”), and Qemu (to run that firmware and boot RISC-V kernels enabled with ACPI support). This is an encouraging first step toward an embrace of the kinds of technologies required for viability in the mainstream. This author recalls the uphill battle that was getting ACPI support enabled for Arm. Perhaps the community has more experience to draw upon at this point, and a greater understanding of the importance of such standards to broader ecosystems. In any case, there were no objections this time around.
x86-64
Early last year (2022), David Woodhouse (Amazon) posted the 4th version of a patch series he had been working on titled “Parallel CPU bringup for x86_64” which aims to speed up the boot process for large SMP x86 systems. Traditionally, x86 systems would enter the Linux kernel in a single threaded mode with a “bootcpu” being the first core that happened to start Linux (not necessarily “cpu0”). Once early initialization was complete, this CPU would use a SIPI (Startup IPI or “Inter Processor Interrupt”) to signal to the “secondary” cores that they should start booting. The entire process could eventually take quite some time, and it would therefore be better if these “secondary” cores could start their initialization earlier – while the first core was getting things setup – and then rendezvous waiting for a signal to proceed.
Usama Arif (Bytedance) noted that these older patches “brought down the smpboot time from ~700ms to 100ms”. That’s a decent savings, especially when using kexec as Usama is doing (perhaps in a “Linuxboot” type of configuration with Linux as a bootloader), and at the scale of a large number of systems. Usama was interested to know whether these patches could be merged. David replied that the last time around there had been some AMD systems that broke with the patches, “We don’t *think* there are any remaining software issues; we think it’s hardware. Either an actual hardware race in CPU or chipset, or perhaps even something as simple as a voltage regulator which can’t cope with an increase in power draw from *all* the CPUs at the same time. We have prodded AMD a few times to investigate, but so far to no avail. Last time I actually spoke to Thomas [Gleixner – one of the core x86 maintainers] in person, I think he agreed that we should just merge it and disable the parallel mode for the affected AMD CPUs.”. The suggestion was to proceed to merge but to disable this feature on all AMD CPUs for the moment out of an abundance of caution.
Nikunj A Dadhania (AMD) posted patches enabling support for a “Secure TSC” for SNP (Secure Nested Paging) guests. SNP is part of AMD’s Confidential Compute strategy and securing the TSC (Time Stamp Counter) is a necessary part of enabling confidential guests to not have to trust the host hypervisor. Prior to these patches, a hypervisor could interdict the TSC, providing a different view of the passage of CPU time to the guest than reality. With the patches, “Secure TSC allows guest to securely use RDTSC/RDTSCP instructions as the parameters being used cannot be changed by hypervisor once the guest is launched. More details in the AMD64 APM Vol 2, Section “Secure TSC”. According to Nikunj, “During the boot-up of the secondary cpus, SecureTSC enabled guests need to query TSC info from Security processor (PSP). This communication channel is encrypted between the security processor and the guest, hypervisor is just the conduit to deliver the guest messages to the security processor. Each message is protected with an AEAD (AES-256 GCM).”
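For context, the counter in question is the one userspace (or a guest kernel) reads with RDTSC/RDTSCP; a minimal sketch using the compiler intrinsics follows (my own illustration, unrelated to the SNP patches themselves):

```c
/* Read the TSC from userspace. With Secure TSC, an SNP guest can trust
 * that the hypervisor has not skewed what these instructions return.
 * (Illustration only.) */
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
	unsigned int aux;
	unsigned long long t1 = __rdtsc();	/* RDTSC */
	unsigned long long t2 = __rdtscp(&aux);	/* RDTSCP, also reads IA32_TSC_AUX */

	printf("rdtsc=%llu rdtscp=%llu aux=%u\n", t1, t2, aux);
	return 0;
}
```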
Rick Edgecombe (Intel) posted an updated patch series titled “Shadow stacks for userspace” that “implements Shadow Stacks for userspace using x86’s Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace.” As Rick notes, “The main use case for shadow stack is providing protection against return oriented programming attacks”. ROP attacks aim to string together pre-existing “gadgets” (existing pieces of code, not necessarily actual well-defined functions in themselves) by finding a vulnerability in existing code that can cause a function to jump (return) into a gadget sequence. Shadow stacks mitigate this by adding an additional, separate hardware structure that tracks all function entry/exit sequences and ensures returns only come from real function calls (or from special-cased longjmp-like sequences that usually require special handling).
Jarkko Sakkinen posted some fixes for AMD’s SEV-SNP hypervisor that had been discovered by the Enarx developers. I’m mentioning it because this patch series may have been the final one to go out by the startup “Profian”, which had been seeking to commercialize support for Enarx. Profian closed its doors in the past few weeks due to the macro-economic environment. Some great developers are available on the market and looking for new opportunities. If you are hiring, or know folks who are, you can see posts from the Profian engineers on LinkedIn.
Final words
The Open Source Summit North America returns in person (and virtual) this year, from May 10th in Vancouver, British Columbia, Canada. There are several other events planned to be colocated alongside the Open Source Summit. These include the (invite only) Linux Storage, Filesystem, Memory Management, and BPF (LSF/MM/BPF) Summit for which a CFP is open. Another colocated event is the Linux Security Summit North America, the CfP of which was announced by James Morris with a link for submitting proposals.
Cyril Hrubis (SUSE) posted an announcement that the Linux Test Project (LTP) release for January 2023 was out. It includes a number of new tests, among them “dirtyc0w_shmem aka CVE-2022-2590”. They have also updated the minimum C requirement to -std=gnu99. Linux itself moved to a C11 (gnu11) baseline, from the much older gnu89 dialect, as of Linux 5.18.
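As a small illustration of what the newer baseline buys (my own example, not from LTP or the kernel): declaring the loop iterator inside the for statement, rejected by the old gnu89 dialect, is fine under C99/C11:

```c
/* Loop-scoped iterator declarations: an error with -std=gnu89,
 * accepted with -std=gnu99 or -std=gnu11. (Illustration only.) */
#include <stdio.h>

int main(void)
{
	int sum = 0;

	for (int i = 0; i < 10; i++)	/* 'int i' here needs C99 or later */
		sum += i;

	printf("sum = %d\n", sum);	/* 45 */
	return 0;
}
```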
radv H264/H265 decode:
The radv h264/h265 support has been merged to mesa main branch. It is still behind the RADV_PERFTEST=video_decode flag, and should work for the basics from VI/GFX8+. It still has not passed all the CTS tests.
anv H264 decode:
The anv h264 decode support has been merged to mesa main branch. It has been tested from Skylake up to DG2. It has no enable flag, just make sure to build with h264dec video-codec support. It passes all current CTS tests.
hasvk H264 decode:
I ported the anv h264 decoder to hasvk, the vulkan driver for Ivybridge/Haswell. This is in a draft MR (HASVK H264). I haven't given it much testing yet, but it has worked in the past. I'll get to testing it before trying to get it merged.
radv AV1 decode:
I created an MR for spec discussion (radv av1). I've also cleaned up the radv AV1 decode code.
anv AV1 decode:
I've started on anv AV1 decode support for DG2. I've gotten one very simple frame to decode. I will attempt to do more. I think filmgrain is not going to be supported in the short term. I'll fill in more details on this when it's working better. I think there are a few things that might need to be changed in the AV1 decoder provisional spec for Intel, there are some derived values that ffmpeg knows that it would be nice to not derive again, and there are also some hw limits around tiles and command buffers that will need to be figured out.
The 2023 LPC PC is pleased to announce that we’ve begun exclusive negotiations with the Omni Hotel in Richmond, VA to host Plumbers 2023 from 13-15 November. Note: These dates are not yet final (nor is the location; we have had one failure at this stage of negotiations from all the Plumbers venues we’ve chosen). We will let you know when this preliminary location gets finalized (please don’t book irrevocable travel until then).
The November dates were the only ones that currently work for the venue, but Richmond is on the same latitude as Seville in Spain, so it should still be nice and warm.
In one week from now, Twitter will block free API access. This prevents anyone who has written interesting bot accounts, integrations, or tooling from accessing Twitter without paying for it. A whole number of fascinating accounts will cease functioning, people will no longer be able to use tools that interact with Twitter, and anyone using a free service to do things like find Twitter mutuals who have moved to Mastodon or to cross-post between Twitter and other services will be blocked.
There's a cynical interpretation to this, which is that despite firing 75% of the workforce Twitter is still not profitable and Elon is desperate to not have Twitter go bust and also not to have to tank even more of his Tesla stock to achieve that. But let's go with the less cynical interpretation, which is that API access to Twitter is something that enables bot accounts that make things worse for everyone. Except, well, why would a hostile bot account bother with the official API at all?
To interact with an API you generally need to present some sort of authentication token to the API to prove that you're allowed to access it. It's easy enough to restrict issuance of those tokens to people who pay for the service. But, uh, how do the apps work? They need to be able to communicate with the service to tell it to post tweets, retrieve them, and so on. And the simple answer to that is that they use some hardcoded authentication tokens. And while registering for an API token yourself identifies that you're not using an official client, using the tokens embedded in the clients makes it look like you are. If you want to make it look like you're a human, you're already using tokens ripped out of the official clients.
The Twitter client API keys are widely known. Anyone who's pretending to be a human is using those already and will be unaffected by the shutdown of the free API tier. Services like movetodon.org do get blocked. This isn't an anti-abuse choice. It's one that makes it harder to move to other services. It's one that blocks a bunch of the integrations and accounts that bring value to the platform. It's one that hurts people who follow the rules, without hurting the ones who don't. This isn't an anti-abuse choice, it's about trying to consolidate control of the platform.
After my previous efforts, I wrote up a PKCS#11 module of my own that had no odd restrictions about using non-RSA keys and I tested it. And things looked much better - ssh successfully obtained the key, negotiated with the server to determine that it was present in authorized_keys, and then went to actually do the key verification step. At which point things went wrong - the Sign() method in my PKCS#11 module was never called, and a strange "debug1: identity_sign: sshkey_sign: error in libcrypto sign_and_send_pubkey: signing failed for ECDSA "testkey": error in libcrypto" error appeared in the ssh output. Odd. libcrypto was originally part of OpenSSL, but Apple ship the LibreSSL fork. Apple don't include the LibreSSL source in their public source repo, but do include OpenSSH. I grabbed the OpenSSH source and jumped through a whole bunch of hoops to make it build (it uses the macosx.internal SDK, which isn't publicly available, so I had to cobble together a bunch of headers from various places), and also installed upstream LibreSSL with a version number matching what Apple shipped. And everything worked - I logged into the server using a hardware-backed key.
Was the difference in OpenSSH or in LibreSSL? Telling my OpenSSH to use the system libcrypto resulted in the same failure, so it seemed pretty clear this was an issue with the Apple version of the library. The way all this works is that when OpenSSH has a challenge to sign, it calls ECDSA_do_sign(). This then calls ECDSA_do_sign_ex(), which in turn follows a function pointer to the actual signature method. By default this is a software implementation that expects to have the private key available, but you can also register your own callback that will be used instead. The OpenSSH PKCS#11 code does this by calling EC_KEY_set_method(), and as a result calling ECDSA_do_sign() ends up calling back into the PKCS#11 code that then calls into the module that communicates with the hardware and everything works.
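To make that mechanism concrete, here is a rough sketch of the registration step using the OpenSSL/LibreSSL EC_KEY_METHOD API; the pkcs11_sign_sig() callback and wrap_key() helper are hypothetical stand-ins for what the OpenSSH PKCS#11 code actually does:
#include <openssl/ec.h>
#include <openssl/ecdsa.h>

/* Hypothetical callback: the real code would hand the digest to the
 * PKCS#11 module's C_Sign() and wrap the result in an ECDSA_SIG. */
static ECDSA_SIG *pkcs11_sign_sig(const unsigned char *dgst, int dgst_len,
                                  const BIGNUM *in_kinv, const BIGNUM *in_r,
                                  EC_KEY *eckey)
{
    /* ... talk to the hardware here ... */
    return NULL; /* placeholder */
}

static int wrap_key(EC_KEY *eckey)
{
    /* Copy the default method, then override the ECDSA_do_sign() hook.
     * (OpenSSH also preserves the method's original sign/sign_setup
     * callbacks rather than passing NULL as done here.) */
    EC_KEY_METHOD *meth = EC_KEY_METHOD_new(EC_KEY_OpenSSL());

    if (meth == NULL)
        return 0;
    EC_KEY_METHOD_set_sign(meth, NULL, NULL, pkcs11_sign_sig);
    return EC_KEY_set_method(eckey, meth);
}
After wrap_key(), an ECDSA_do_sign() call on that EC_KEY lands in pkcs11_sign_sig() rather than in the software implementation.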
Except it doesn't under macOS. Running under a debugger and setting a breakpoint on ECDSA_do_sign(), I saw that we went down a code path with a function called ECDSA_do_sign_new(). This doesn't appear in any of the public source code, so it seems to be an Apple-specific patch. I pushed Apple's libcrypto into Ghidra and looked at ECDSA_do_sign() and found something that approximates this:
nid = EC_GROUP_get_curve_name(curve);
if (nid == NID_X9_62_prime256v1) {
    return ECDSA_do_sign_new(dgst, dgst_len, eckey);
}
return ECDSA_do_sign_ex(dgst, dgst_len, NULL, NULL, eckey);
What this means is that if you ask ECDSA_do_sign() to sign something on a Mac, and if the key in question corresponds to the NIST P256 elliptic curve type, it goes down the ECDSA_do_sign_new() path and never calls the registered callback. This is the only key type supported by the Apple Secure Enclave, so I assume it's special-cased to do something with that. Unfortunately the consequence is that it's impossible to use a PKCS#11 module that uses Secure Enclave keys with the shipped version of OpenSSH under macOS. For now I'm working around this with an SSH agent built using Go's agent module, forwarding most requests through to the default session agent but appending hardware-backed keys and implementing signing with them, which is probably what I should have done in the first place.
A correspondent closed out 2022 by sending me an off-list email asking whether or not a pair of Rust crates (rcu_clean and left_right) were really implementations of read-copy update (RCU), with an LWN commenter throwing in crossbeam's epoch crate for good measure. At first glance, this is a pair of simple yes/no questions that one should be able to answer off the cuff.
What Is An RCU?
Except that there is quite a variety of RCU implementations in the wild. Even if we remain within the cozy confines of the Linux kernel, we have: (1) The original "vanilla" RCU, (2) Sleepable RCU (SRCU), (3) Tasks RCU, (4) Tasks Rude RCU, and (5) Tasks Trace RCU. These differ in more than just performance characteristics; in fact, it is not in general possible to mechanically convert (say) SRCU to RCU. The key attributes of RCU implementations are, on the one hand, the marking of read-side code regions and data accesses, and on the other, some means of waiting on all pre-existing readers. For more detail, see the 2019 LWN article, and for more background, see the Linux Foundation RCU presentations here and here.
The next sections provide an overview of the Linux-kernel RCU implementations' functional properties, with performance and scalability characteristics left as an exercise for the interested reader.
Vanilla RCU
Vanilla RCU has quite a variety of bells and whistles:
Explicit nesting read-side markers, rcu_read_lock(), rcu_read_unlock(), rcu_dereference(), and friends.
Pointer-update function, rcu_assign_pointer().
Synchronous grace-period-wait primitives, synchronize_rcu() and synchronize_rcu_expedited().
An asynchronous grace-period wait primitive, call_rcu(). And additionally a synchronous callback wait primitive, rcu_barrier().
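Pulling those vanilla primitives together, a minimal kernel-style sketch might look like the following (struct foo, gp, and the update-side locking assumption are all invented for illustration):
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
        int a;
        struct rcu_head rh;             /* for the call_rcu() variant */
};

static struct foo __rcu *gp;            /* RCU-protected global pointer */

static int reader(void)
{
        struct foo *p;
        int val = -1;

        rcu_read_lock();                /* mark the read-side critical section */
        p = rcu_dereference(gp);        /* marked data access */
        if (p)
                val = p->a;
        rcu_read_unlock();
        return val;
}

static void updater(struct foo *newp)   /* caller serializes updaters */
{
        struct foo *oldp = rcu_dereference_protected(gp, 1);

        rcu_assign_pointer(gp, newp);   /* publish the new version */
        synchronize_rcu();              /* wait for pre-existing readers */
        kfree(oldp);                    /* or defer: call_rcu(&oldp->rh, cb) */
}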
SRCU
SRCU has a similar variety of bells and whistles, but with some important differences. The most important difference is that SRCU supports multiple domains, each represented by an srcu_struct structure. A reader in one domain does not block a grace period in another domain. In contrast, RCU is global in nature, with exactly one domain. On the other hand, the price SRCU pays for this flexibility is reduced amortization of grace-period overhead.
Explicit read-side markers, srcu_read_lock(), srcu_read_unlock(), srcu_dereference(), and friends. Except that, unlike rcu_read_lock() and rcu_read_unlock(), srcu_read_lock() and srcu_read_unlock() do not nest. Instead, the return value from srcu_read_lock() must be passed to the corresponding srcu_read_unlock(). This means that SRCU (but not RCU!) can represent non-nested partially overlapping read-side critical sections. Not that this was considered a good thing in its own right; rather, it was a way of avoiding the need for T*S storage, where T is the number of tasks and S is the number of srcu_struct structures.
Synchronous grace-period-wait primitives, synchronize_srcu() and synchronize_srcu_expedited().
An asynchronous grace-period wait primitive, call_srcu(). And additionally a synchronous callback wait primitive, srcu_barrier().
Polled grace-period wait primitives, although less variety than RCU enjoys. (Does this enjoyment extend to RCU's users? You decide.)
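The same sketch adapted to SRCU (reusing the made-up struct foo from above) shows both the per-domain srcu_struct and the index that srcu_read_lock() hands to srcu_read_unlock():
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/srcu.h>

DEFINE_STATIC_SRCU(my_domain);                  /* one grace-period domain */

static struct foo __rcu *gp2;                   /* protected by my_domain */

static int srcu_reader(void)
{
        struct foo *p;
        int idx, val = -1;

        idx = srcu_read_lock(&my_domain);       /* returns an index... */
        p = srcu_dereference(gp2, &my_domain);
        if (p)
                val = p->a;
        srcu_read_unlock(&my_domain, idx);      /* ...that must be passed back */
        return val;
}

static void srcu_updater(struct foo *newp)      /* caller serializes updaters */
{
        struct foo *oldp = rcu_dereference_protected(gp2, 1);

        rcu_assign_pointer(gp2, newp);
        synchronize_srcu(&my_domain);   /* waits only on this domain's readers */
        kfree(oldp);
}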
Tasks RCU
Tasks RCU was designed specially to handle the trampolines used in Linux-kernel tracing.
It has no explicit read-side markers. Instead, voluntary context switches separate successive Tasks RCU read-side critical sections.
A synchronous grace-period-wait primitive, synchronize_rcu_tasks().
An asynchronous grace-period wait primitive, call_rcu_tasks(). And additionally a synchronous callback wait primitive, rcu_barrier_tasks().
No polled grace-period wait primitives.
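As a hedged sketch of how that looks from the update side (struct my_tramp and its callers are invented; the real users live in the ftrace and BPF trampoline code):
#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_tramp {
        struct rcu_head rh;
        /* ... trampoline bookkeeping ... */
};

static void my_tramp_free_cb(struct rcu_head *rhp)
{
        kfree(container_of(rhp, struct my_tramp, rh));
}

static void my_tramp_retire(struct my_tramp *tp)
{
        /* Trampoline already unhooked: wait until every task has passed
         * through a voluntary context switch, then it is safe to free. */
        synchronize_rcu_tasks();
        kfree(tp);
}

static void my_tramp_retire_async(struct my_tramp *tp)
{
        /* Or defer the free until a Tasks RCU grace period has elapsed. */
        call_rcu_tasks(&tp->rh, my_tramp_free_cb);
}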
Tasks Rude RCU
By design, Tasks RCU does not wait for idle tasks. Something about them never doing any voluntary context switches on CPUs that remain idle for long periods of time. So trampolines that might be involved in tracing code within the idle loop need something else, and that something is Tasks Rude RCU.
It has no explicit read-side markers. Instead, any preemption-disabled region of code is a Tasks Rude RCU reader.
A synchronous grace-period-wait primitive, synchronize_rcu_tasks_rude().
An asynchronous grace-period wait primitive, call_rcu_tasks_rude(). And additionally a synchronous callback wait primitive, rcu_barrier_tasks_rude().
No polled grace-period wait primitives.
Tasks Trace RCU
Both Tasks RCU and Tasks Rude RCU disallow sleeping while executing in a given trampoline. Some BPF programs need to sleep, hence Tasks Trace RCU.
Explicit nesting read-side markers, rcu_read_lock_trace() and rcu_read_unlock_trace().
A synchronous grace-period-wait primitive, synchronize_rcu_tasks_trace().
An asynchronous grace-period wait primitive, call_rcu_tasks_trace(). And additionally a synchronous callback wait primitive, rcu_barrier_tasks_trace().
No polled grace-period wait primitives.
DYNIX/ptx rclock
The various Linux examples are taken from a code base in which RCU has been under active development for more than 20 years, which might yield an overly stringent set of criteria. In contrast, the 1990s DYNIX/ptx implementation of RCU (called "rclock" for "read-copy lock") was only under active development for about five years. The implementation was correspondingly minimal, as can be seen from this February 2001 patch (hat tip to Greg Lehey):
Explicit nesting read-side markers, RC_RDPROTECT() and RC_RDUNPROTECT(). The lack of anything resembling rcu_dereference() shows just how small DYNIX/ptx's installed base was.
Pointer-update barrier, RC_MEMSYNC(). This is the counterpart of smp_wmb() in early Linux-kernel RCU use cases.
No synchronous grace-period-wait primitive.
An asynchronous grace-period wait primitive, rc_callback(). However, there was no synchronous callback wait primitive, perhaps because DYNIX/ptx did not have modules, let alone unloadable ones.
No polled grace-period wait primitives.
Perhaps this can form the basis of an RCU classification system, though some translation will no doubt be required to bridge from C to Rust. There is ownership, if nothing else!
RCU Classification and Rust RCU Crates
Except that the first RCU crate, rcu_clean, throws a monkey wrench into the works. It does not have any grace-period primitives, but instead provides a clean() function that takes a reference to an RCU-protected data item. The user invokes this at some point in the code where it is known that there are no readers, either within this thread or anywhere else. In true Rust fashion, in some cases the compiler is able to prove the presence or absence of readers and issue a diagnostic when needed. The documentation notes that the addition of grace periods (also known as "epochs") would allow greater accuracy.
This sort of thing is not unprecedented. The userspace RCU library has long had an rcu_quiescent_state() function that can be invoked from a given thread when that particular thread is in a quiescent state, and thus cannot have references to any RCU-protected object. However, rcu_clean takes this a step further by having no RCU grace-period mechanism at all.
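For reference, here is roughly what that looks like with liburcu's QSBR ("quiescent-state-based reclamation") flavor; this is a sketch against the classic unprefixed API, and the worker loop is invented for illustration:
#define _LGPL_SOURCE
#include <stdbool.h>
#include <urcu-qsbr.h>          /* QSBR flavor of userspace RCU */

static volatile bool stop;

static void *worker(void *arg)
{
        (void)arg;
        rcu_register_thread();          /* every reader thread registers */

        while (!stop) {
                rcu_read_lock();        /* effectively free in QSBR */
                /* ... access RCU-protected data via rcu_dereference() ... */
                rcu_read_unlock();

                /* Announce "no RCU references held": the per-thread
                 * quiescent state that lets other threads'
                 * synchronize_rcu() eventually return. */
                rcu_quiescent_state();
        }

        rcu_unregister_thread();
        return NULL;
}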
Nevertheless, rcu_clean could be used to implement the add-only list RCU use case, so it is difficult to argue that it is not an RCU implementation. But it is clearly a very primitive implementation. That said, primitive implementations do have their place, for example:
Languages with garbage collectors have built-in RCU updaters.
Programs with short runtimes can just leak memory, cleaning up when restarted.
Other synchronization primitives can be used to protect and exclude readers.
In addition, an RCU implementation even more primitive than rcu_clean would omit the clean() function, instead leaking memory that had been removed from an RCU-protected structure.
The left_right crate definitely uses RCU in the guise of epochs, and it can be used for at least some of the things that RCU can be used for. It does have a single-writer restriction, though as the documentation says, you could use a Mutex to serialize at least some multi-writer use cases. In addition, it has long been known that RCU use cases involving only a single writer thread permit wait-free updaters as well as wait-free readers.
One might argue that the fact that the left_right crate uses RCU means that it cannot possibly be itself an implementation of RCU. Except that in the Linux kernel, RCU Tasks uses vanilla RCU, RCU Tasks Trace uses SRCU, and previous versions of SRCU used vanilla RCU. So let's give the left_right crate the benefit of the doubt, at least for the time being, but with the understanding that it might eventually instead be classified as an RCU use case rather than an RCU implementation.
The crossbeam epoch crate again uses the guise of epochs. It has explicit read-side markers in RAII guard form using the pin function and its Atomic pointers. Grace periods are computed automatically, and the defer method provides an asynchronous grace-period-wait function. As with DYNIX/ptx, the crossbeam epoch crate lacks any other means of waiting for grace periods, and it also lacks a callback-wait API. However, to its credit, and unlike DYNIX/ptx, this crate does provide safe means for handling pointers to RCU-protected data.
Here is a prototype classification system, again, leaving performance and scalability aside:
Are there explicit RCU read-side markers? Of the Linux-kernel RCU implementations, RCU Tasks and RCU Tasks Rude lack such markers. Given the Rust borrow checker, it is hard to imagine an implementation without such markers, but feel free to prove me wrong.
Are grace periods computed automatically? (If not, as in rcu_clean, none of the remaining questions apply.)
Are there synchronous grace-period-wait APIs? All of the Linux-kernel implementations do, and left_right also looks to.
Are there asynchronous grace-period-wait APIs? If so, are there callback-wait APIs? All of the Linux-kernel implementations do, but left_right does not appear to. Providing them seems doable, but might result in more than two copies of recently-updated data structures. The crossbeam epoch crate provides an asynchronous grace-period-wait function in the form of defer, but lacks a callback-wait API.
Are there polled grace-period-wait APIs? The Linux-kernel RCU and SRCU implementations do.
Are there multiple grace-period domains? The Linux-kernel SRCU implementation does.
But does this classification scheme work for your favorite RCU implementation? What about your favorite RCU use case?
History
January 25, 2023: Initial version.
January 26, 2023: Add DYNIX/ptx RCU equivalent, note that left_right might be a use of RCU rather than an implementation, and call out the fact that some of the Linux-kernel RCU implementations are based on others.
January 30, 2023: Respond to LWN request for the crossbeam crate. Expand the section summarizing RCU.
Working in information security means building controls, developing technologies that ensure that sensitive material can only be accessed by people that you trust. It also means categorising people into "trustworthy" and "untrustworthy", and trying to come up with a reasonable way to apply that such that people can do their jobs without all your secrets being available to just anyone in the company who wants to sell them to a competitor. It means ensuring that accounts who you consider to be threats shouldn't be able to do any damage, because if someone compromises an internal account you need to be able to shut them down quickly.
And like pretty much any security control, this can be used for both good and bad. The technologies you develop to monitor users to identify compromised accounts can also be used to compromise legitimate users who management don't like. The infrastructure you build to push updates to users can also be used to push browser extensions that interfere with labour organisation efforts. In many cases there's no technical barrier between something you've developed to flag compromised accounts and the same technology being used to flag users who are unhappy with certain aspects of management.
If you're asked to build technology that lets you make this sort of decision, think about whether that's what you want to be doing. Think about who can compel you to use it in ways other than how it was intended. Consider whether that's something you want on your conscience. And then think about whether you can meet those requirements in a different way. If they can simply compel one junior engineer to alter configuration, that's very different to an implementation that requires sign-offs from multiple senior developers. Make sure that all such policy changes have to be clearly documented, including not just who signed off on it but who asked them to. Build infrastructure that creates a record of who decided to fuck over your coworkers, rather than just blaming whoever committed the config update. The blame trail should never terminate in the person who was told to do something or get fired - the blame trail should clearly indicate who ordered them to do that.
But most importantly: build security features as if they'll be used against you.
This is the pilot episode for what will become season 2 of the Linux Kernel Podcast. Back in 2008-2009 I recorded a daily “kernel podcast” that summarized the happenings of the Linux Kernel Mailing List (LKML). Eventually, daily became a little too much, and the podcast went weekly, followed by…not. This time around, I’m not committing to any specific cadence – let’s call it “periodic” (every few weeks). In each episode, I will aim to broadly summarize the latest happenings in the “plumbing” of the Linux kernel, and occasionally related bits of userspace “plumbing” (glibc, systemd, etc.), as well as impactful toolchain changes that enable new features or rebaseline requirements. I welcome your feedback. Please let me know what you think about the format, as well as what you would like to see covered in future episodes. I’m going to play with some ideas over time. These may include “deep diving” into topics of interest to a broader audience. Keep in mind that this podcast is not intended to editorialize, but only to report on what is happening. Both this author, and others, have their own personal opinions, but this podcast aims to focus only on the facts, regardless of who is involved, or their motives.
On with the show.
For the week ending January 21st 2023, I’m Jon Masters and this is the Linux Kernel Podcast.
Summary
The latest stable kernel is Linux 6.1.7, released by Greg K-H on January 18th 2023.
The latest mainline (development) kernel is 6.2-rc4, released on January 15th 2023.
Long Term Stable 6.1?
The “stable” kernel series is maintained by Greg K-H (Kroah-Hartman), who posts hundreds of patches with fixes to each Linus kernel. This is where the “.7” comes in on top of Linux 6.1. Such stable patches are maintained between kernel releases, so when 6.2 is released, it will become the next “stable” kernel. Once every year or so, Greg will choose a kernel to be the next “Long Term Stable” (LTS) kernel that will receive even more patches, potentially for many years at a time. Back in October, Kaiwan N Billimoria (author of a book titled “Linux Kernel Programming”), seeking a baseline for the next edition, asked if 6.1 would become the next LTS kernel. A great amount of discussion has followed, with Greg responding to a recent ping by saying, “You tell me please. How has your testing gone for 6.1 so far? Does it work properly for you? Are you and/or your company willing to test out the -rc releases and provide feedback if it works or not for your systems?” and so on. This motivated various others to pile on with comments about their level of testing, though I haven’t seen an official 6.1 LTS as of yet.
Linux 6.2 progress
Linus noted in his 6.2-rc4 announcement mail that this came “with pretty much everybody back from winter holidays, and so things should be back to normal. And you can see that in the size, this is pretty much bang in the middle of a regular rc size for this time in the merge window.” The “merge window” is the period of time during which disruptive changes are allowed to be merged (typically the first two weeks of a kernel cycle, prior to the first “RC”), so Linus presumably means “cycle” rather than “merge window” in his announcement.
Speaking of Linux 6.2, it counts additional Rust support among its new features. Linux 6.1 had added initial Rust patches capable of supporting a “hello world” kernel module (but not much more). 6.2 adds support for accessing certain kernel data structures (such as “task_struct”, the per-task/process structure) and handles converting C-style structure “objects” with collections of (possibly null) pointers into the “memory safe” structures understood by Rust. As usual, Linux Weekly News (LWN) has a great article going into much more detail.
Ongoing Development
Richard Guy Briggs posted the 6th version of a patch series titled “fanotify: Allow user space to pass back additional audit info”, which “defines a new flag (FAN_INFO) and new extensions that define additional information which are appended after the response structure returned from user space on a permission event”. This allows audit logging to much more usefully capture why a policy allowed (or disallowed) certain access. The idea is to “enable the creation of tools that can suggest changes to the policy similar to how audit2allow can help refine labeled security”.
Maximilian Luz posted a patch series titled “firmware: Add support for Qualcomm UEFI Secure Application” that allows accessing EFI variables via proxy calls to the “UEFI Secure Application” (uefisecapp) running in Qualcomm’s “secure world” implementation of Arm TrustZone. He has tested this on a variety of tablets, including a Surface Pro X. The application interface was reverse engineered from the Windows QcTrEE8180.sys driver.
Kees Cook requested a stable kernel backport of support for “oops_limit”, a new kernel feature that seeks to limit the number of “oopses” allowed before a kernel will “panic”. An “oops” is what happens when the kernel (for example) dereferences a null pointer. Normal application software will crash (with a “segmentation fault”) when this happens. Inside the kernel, the access is caught (provided it happened while in process context), and the associated (but perhaps unrelated) userspace task (process) is killed in the process of generating an “oops” with a backtrace. The kernel may at that moment leak critical resources associated with the process, such as file handles, memory areas, or locks. These aren’t cleaned up. Consequently, it is possible that repeated oopses can be generated by an attacker and used for privilege escalation. The “oops_limit” patches mitigate this by limiting the number of such oopses allowed before the kernel will give up and “panic” (properly crash, and reboot, depending on config).
Vegard Nossum posted version 3 of a patch series titled “kmod: harden user namespaces with new kernel.ns_modules_allowed syscall”, which seeks to “reduce the attack surface and block exploits by ensuring that user namespaces cannot trigger module (auto-)loading”.
Arseniy Lesin reposted an RFC (Request For Comments) of a “SIGOOM Proposal” that would seek to enable the kernel to send a signal whenever a task (process) was in danger of being killed by the “OOM” (Out Of Memory) killer due to consuming too much anonymous (regular) memory. Willy Tarreau and Ted Ts’o noted that we were essentially out of space for new signals, and so rather than declaring a new “SIGOOM”, it would be better to allow a process to select which of the existing signals should be used for this purpose when it registers to receive such notifications. Arseniy said they would follow up with patches that followed this approach.
Architectures
On the architecture front, Mark Brown posted the 4th version of a patch series enabling support for Arm’s SME (Scalable Matrix Extension) version 2 and 2.1. Huang Ying posted patches enabling “migrate_pages()” (which moves memory between NUMA nodes – memory chips specific to e.g. a certain socket in a server) to support batching of the new(er) memory “folios”, rather than doing them one at a time. Batching allows associated TLB invalidation (tearing down the MMU’s understanding of active virtual to physical addresses) to be batched, which is important on Intel systems using IPIs (Inter-Processor-Interrupts), which are reduced by 99.1% during the associated testing, increasing pages migrated per second on a 2P server by 291.7%.
Xin Li posted version 6 of a patch series titled “x86: Enable LKGS instruction”. The “LKGS instruction is introduced with Intel FRED (flexible return and event delivery) specification. As LKGS is independent of FRED, we enable it as a standalone feature”. LKGS (which is an abbreviation of “load into IA32_KERNEL_GS_BASE”) “behaves like the MOV to GS instruction except that it loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor cache.” This means that an Operating System can perform the necessary work to context switch a user-level thread by updating IA32_KERNEL_GS_BASE and avoiding an explicit set of balanced calls to SWAPGS. This is part of the broader “FRED” architecture defined by Intel in the Flexible Return and Event Delivery (FRED) Specification.
David E. Box posted version 2 of a patch series titled “Extend Intel On Demand (SDSi) support”, noting that “Intel Software Defined Silicon (SDSi) is now known as Intel On Demand”. These patches enable support for the Intel feature intended to allow users to load signed payloads into their CPUs to turn on certain features after purchasing a system. This might include (for example) certain accelerators present in future chips that could be enabled as needed, similar to how certain automobiles now include subscription-locked heated seats and other features.
Meanwhile, Anup Patel posted patches titled “RISC-V KVM virtualize AIA CSRs” that enable support for the new AIA (Advanced Interrupt Architecture), which replaces the legacy “PLIC”, and Sia Jee Heng posted patches that enable “RISC-V Hibernation Support”.
After hacking the Intel media-driver and ffmpeg I managed to work out how the anv hardware mostly works now for h264 decoding.
I've pushed a branch [1] and an MR [2] to mesa. The basics of h264 decoding are working great on gen9 and compatible hardware. I've tested it on my one Lenovo WhiskeyLake laptop.
I have ported the code to hasvk as well, and once we get moving on this I'll polish that up and check we can h264 decode on IVB/HSW devices.
The one feature I know is missing is status reporting; radv can't support that due to firmware limitations, from what I can work out, but anv should be able to, so I might dig into that a bit.
There's a bunch of ways you can store cryptographic keys. The most obvious is to just stick them on disk, but that has the downside that anyone with access to the system could just steal them and do whatever they wanted with them. At the far end of the scale you have Hardware Security Modules (HSMs), hardware devices that are specially designed to self destruct if you try to take them apart and extract the keys, and which will generate an audit trail of every key operation. In between you have things like smartcards, TPMs, Yubikeys, and other platform secure enclaves - devices that don't allow arbitrary access to keys, but which don't offer the same level of assurance as an actual HSM (and are, as a result, orders of magnitude cheaper).
The problem with all of these hardware approaches is that they have entirely different communication mechanisms. The industry realised this wasn't ideal, and in 1994 RSA released version 1 of the PKCS#11 specification. This defines a C interface with a single entry point - C_GetFunctionList. Applications call this and are given a structure containing function pointers, with each entry corresponding to a PKCS#11 function. The application can then simply call the appropriate function pointer to trigger the desired functionality, such as "Tell me how many keys you have" and "Sign this, please". This is both an example of C not just being a programming language and also of you having to shove a bunch of vendor-supplied code into your security critical tooling, but what could possibly go wrong.
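As a sketch of what that looks like from the application side (assuming a pkcs11.h header such as the one p11-kit ships, and with error handling mostly omitted):
#include <dlfcn.h>
#include <stdio.h>
#include <p11-kit/pkcs11.h>

int main(int argc, char **argv)
{
    CK_RV (*get_list)(CK_FUNCTION_LIST_PTR_PTR);
    CK_FUNCTION_LIST_PTR fl;
    CK_ULONG count = 0;
    void *mod;

    if (argc < 2)
        return 1;

    /* Load the vendor module and find its single entry point. */
    mod = dlopen(argv[1], RTLD_NOW);
    get_list = (CK_RV (*)(CK_FUNCTION_LIST_PTR_PTR))
                   dlsym(mod, "C_GetFunctionList");
    get_list(&fl);

    /* Everything else goes through the returned function pointers. */
    fl->C_Initialize(NULL);
    fl->C_GetSlotList(CK_TRUE, NULL, &count);   /* "how many slots do you have?" */
    printf("%lu slots with a token present\n", (unsigned long)count);
    fl->C_Finalize(NULL);
    return 0;
}
OpenSSH's PKCS#11 support does essentially this dance before asking the module to sign anything.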
(Linux distros work around this problem by using p11-kit, which is a daemon that speaks d-bus and loads PKCS#11 modules for you. You can either speak to it directly over d-bus, or for apps that only speak PKCS#11 you can load a module that just transports the PKCS#11 commands over d-bus. This moves the weird vendor C code out of process, and also means you can deal with these modules without having to speak the C ABI, so everyone wins)
One of my work tasks at the moment is helping secure SSH keys, ensuring that they're only issued to appropriate machines and can't be stolen afterwards. For Windows and Linux machines we can stick them in the TPM, but Macs don't have a TPM as such. Instead, there's the Secure Enclave - part of the T2 security chip on x86 Macs, and directly integrated into the M-series SoCs. It doesn't have anywhere near as many features as a TPM, let alone an HSM, but it can generate NIST curve elliptic curve keys and sign things with them and that's good enough. Things are made more complicated by Apple only allowing keys to be used by the app that generated them, so it's hard for applications to generate keys on behalf of each other. This can be mitigated by using CryptoTokenKit, an interface that allows apps to present tokens to the systemwide keychain. Although this is intended for allowing a generic interface for access to such tokens (kind of like PKCS#11), an app can generate its own keys in the Secure Enclave and then expose them to other apps via the keychain through CryptoTokenKit.
Of course, applications then need to know how to communicate with the keychain. Browsers mostly do so, and Apple's version of SSH can to an extent. Unfortunately, that extent is "Retrieve passwords to unlock on-disk keys", which doesn't help in our case. PKCS#11 comes to the rescue here! Apple ship a module called ssh-keychain.dylib, a PKCS#11 module that's intended to allow SSH to use keys that are present in the system keychain. Unfortunately it's not super well maintained - it got broken when Big Sur moved all the system libraries into a cache, but got fixed up a few releases later. Unfortunately every time I tested it with our CryptoTokenKit provider (and also when I retried with SecureEnclaveToken to make sure it wasn't just our code being broken), ssh would tell me "provider /usr/lib/ssh-keychain.dylib returned no slots" which is not especially helpful. Finally I realised that it was actually generating more debug output, but it was being sent to the system debug logs rather than the ssh debug output. Well, when I say "more debug output", I mean "Certificate []: algorithm is not supported, ignoring it", which still doesn't tell me all that much. So I stuck it in Ghidra and searched for that string, and the line above it was
a check on the key's algorithm type, with it immediately failing if the key isn't RSA. Which it isn't, since the Secure Enclave doesn't support RSA. Apple's PKCS#11 module appears incapable of making use of keys generated on Apple's hardware.
There's a couple of ways of dealing with this. The first, which is taken by projects like Secretive, is to implement the SSH agent protocol and have SSH delegate key management to that agent, which can then speak to the keychain. But if you want this to work in all cases you need to implement all the functionality in the existing ssh-agent, and that seems like a bunch of work. The second is to implement a PKCS#11 module, which sounds like less work but probably more mental anguish. I'll figure that out tomorrow.
Needless to say h264/5 weren't my real goals in life for video decoding. Lynne and myself decided to see what we could do to drive AV1 decode forward by creating our own extension called VK_MESA_video_decode_av1. This is a radv-only extension so far, and may expose some peculiarities of AMD hardware/firmware.
Lynne's blog entry[1] has all the gory details, so go read that first. (really read it first).
Now that you've read and understood all that, I'll just rant here a bit. Figuring out the DPB management and hw frame ref and curr_pic_idx fields was a bit of a nightmare. I spent a few days hacking up a lot of wrong things before landing on the thing we agreed was the least wrong which was having the ffmpeg code allocate a frame index in the same fashion as the vaapi radeon implementation did. I had another hacky solution that involved overloading the slotIndex value to mean something that wasn't DPB slot index, but it wasn't really any better. I think there may be something about the hw I don't understand so hopefully we can achieve clarity later.
Long-term Linux users may remember that Alan Cox used to write an online diary. This was before the concept of a "Weblog" had really become a thing, and there certainly weren't any expectations around what one was used for - while now blogging tends to imply a reasonably long-form piece on a specific topic, Alan was just sitting there noting small life concerns or particular technical details in interesting problems he'd solved that day. For me, that was fascinating. I was trying to figure out how to get into kernel development, and was trying to read as much LKML as I could to figure out how kernel developers did stuff. But when you see discussion on LKML, you're frequently missing the early stages. If an LKML patch is a picture of an owl, I wanted to know how to draw the owl, and most of the conversations about starting in kernel development were very "Draw two circles. Now draw the rest of the owl". Alan's musings gave me insight into the thought processes involved in getting from "Here's the bug" to "Here's the patch" in ways that really wouldn't have worked in a more long-form medium.
For the past decade or so, as I moved away from just doing kernel development and focused more on security work instead, Twitter's filled a similar role for me. I've seen people just dumping their thought process as they work through a problem, helping me come up with effective models for solving similar problems. I've learned that the smartest people in the field will spend hours (if not days) working on an issue before realising that they misread something back at the beginning and that's helped me feel like I'm not unusually bad at any of this. It's helped me learn more about my peers, about my field, and about myself.
Twitter's now under new ownership that appears to think all the worst bits of Twitter were actually the good bits, so I've mostly bailed to the Fediverse instead. There's no intrinsic length limit on posts there - Mastodon defaults to 500 characters per post, but that's configurable per instance. But even at 500 characters, it means there's more room to provide thoughtful context than there is on Twitter, and what I've seen so far is more detailed conversation and higher levels of meaningful engagement. Which is great! Except it also seems to discourage some of the posting style that I found so valuable on Twitter - if your timeline is full of nuanced discourse, it feels kind of rude to just scream "THIS FUCKING PIECE OF SHIT IGNORES THE HIGH ADDRESS BIT ON EVERY OTHER WRITE" even though that's exactly the sort of content I'm there for.
And, yeah, not everything has to be for me. But I worry that as Twitter's relevance fades for the people I'm most interested in, we're replacing it with something that's not equivalent - something that doesn't encourage just dropping 50 characters or so of your current thought process into a space where it can be seen by thousands of people. And I think that's a shame.
I've written about bearer tokens and how much pain they cause me before, but sadly wishing for a better world doesn't make it happen so I'm making do with what's available. Okta has a feature called Device Trust which allows you to configure access control policies that prevent people from obtaining tokens unless they're using a trusted device. This doesn't actually bind the tokens to the hardware in any way, so if a device is compromised or if a user is untrustworthy this doesn't prevent the token ending up on an unmonitored system with no security policies. But it's an incremental improvement, other than the fact that for desktop it's only supported on Windows and MacOS, which really doesn't line up well with my interests.
Obviously there's nothing fundamentally magic about these platforms, so it seemed fairly likely that it would be possible to make this work elsewhere. I spent a while staring at the implementation using Charles Proxy and the Chrome developer tools network tab and had worked out a lot, and then Okta published a paper describing a lot of what I'd just laboriously figured out. But it did also help clear up some points of confusion and clarified some design choices. I'm not going to give a full description of the details (with luck there'll be code shared for that before too long), but here's an outline of how all of this works. Also, to be clear, I'm only going to talk about the desktop support here - mobile is a bunch of related but distinct things that I haven't looked at in detail yet.
Okta's Device Trust (as officially supported) relies on Okta Verify, a local agent. When initially installed, Verify authenticates as the user, obtains a token with a scope that allows it to manage devices, and then registers the user's computer as an additional MFA factor. This involves it generating a JWT that embeds a number of custom claims about the device and its state, including things like the serial number. This JWT is signed with a locally generated (and hardware-backed, using a TPM or Secure Enclave) key, which allows Okta to determine that any future updates from a device claiming the same identity are genuinely from the same device (you could construct an update with a spoofed serial number, but you can't copy the key out of a TPM so you can't sign it appropriately). This is sufficient to get a device registered with Okta, at which point it can be used with Fastpass, Okta's hardware-backed MFA mechanism.
As outlined in the aforementioned deep dive paper, Fastpass is implemented via multiple mechanisms. I'm going to focus on the loopback one, since it's the one that has the strongest security properties. In this mode, Verify listens on one of a list of 10 or so ports on localhost. When you hit the Okta signin widget, choosing Fastpass triggers the widget into hitting each of these ports in turn until it finds one that speaks Fastpass and then submits a challenge to it (along with the URL that's making the request). Verify then constructs a response that includes the challenge and signs it with the hardware-backed key, along with information about whether this was done automatically or whether it included forcing the user to prove their presence. Verify then submits this back to Okta, and if that checks out Okta completes the authentication.
Doing this via loopback from the browser has a bunch of nice properties, primarily around the browser providing information about which site triggered the request. This means the Verify agent can make a decision about whether to submit something there (ie, if a fake login widget requests your creds, the agent will ignore it), and also allows the issued token to be cross-checked against the site that requested it (eg, if g1thub.com requests a token that's valid for github.com, that's a red flag). It's not quite at the same level as a hardware WebAuthn token, but it has many of the anti-phishing properties.
But none of this actually validates the device identity! The entire registration process is up to the client, and clients are in a position to lie. Someone could simply reimplement Verify to lie about, say, a device serial number when registering, and there'd be no proof to the contrary. Thankfully there's another level to this to provide stronger assurances. Okta allows you to provide a CA root[1]. When Okta issues a Fastpass challenge to a device the challenge includes a list of the trusted CAs. If a client has a certificate that chains back to that, it can embed an additional JWT in the auth JWT, this one containing the certificate and signed with the certificate's private key. This binds the CA-issued identity to the Fastpass validation, and causes the device to start appearing as "Managed" in the Okta device management UI. At that point you can configure policy to restrict various apps to managed devices, ensuring that users are only able to get tokens if they're using a device you've previously issued a certificate to.
I've managed to get Linux tooling working with this, though there's still a few drawbacks. The main issue is that the API only allows you to register devices that declare themselves as Windows or MacOS, followed by the login system sniffing browser user agent and only offering Fastpass if you're on one of the officially supported platforms. This can be worked around with an extension that spoofs user agent specifically on the login page, but that's still going to result in devices being logged as a non-Linux OS which makes interpreting the logs more difficult. There's also no ability to choose which bits of device state you log: there's a couple of existing integrations, and otherwise a fixed set of parameters that are reported. It'd be lovely to be able to log arbitrary material and make policy decisions based on that.
This also doesn't help with ChromeOS. There's no real way to automatically launch something that's bound to localhost (you could probably make this work using Crostini but there's no way to launch a Crostini app at login), and access to hardware-backed keys is kind of a complicated topic in ChromeOS for privacy reasons. I haven't tried this yet, but I think using an enterprise force-installed extension and the chrome.enterprise.platformKeys API to obtain a device identity cert and then intercepting requests to the appropriate port range on localhost ought to be enough to do that? But I've literally never written any Javascript so I don't know. Okta supports falling back from the loopback protocol to calling a custom URI scheme, but once you allow that you're also losing a bunch of the phishing protection, so I'd prefer not to take that approach.
Like I said, none of this prevents exfiltration of bearer tokens once they've been issued, and there's still a lot of ecosystem work to do there. But ensuring that tokens can't be issued to unmanaged machines in the first place is still a step forwards, and with luck we'll be able to make use of this on Linux systems without relying on proprietary client-side tooling.
(Time taken to code this implementation: about two days, and under 1000 lines of new code. Time taken to figure out what the fuck to write: rather a lot longer)
[1] There's also support for having Okta issue certificates, but then you're kind of back to the "How do I know this is my device" situation
If you read more than a few articles in my blog you’ve probably figured out that I’m pretty much a public cloud Luddite: I run my own cloud (including my own email server) and don’t really have much of my data in any public cloud. I still have public cloud logins: everyone wants to share documents with Google nowadays, but Google regards people who don’t use its services “properly” with extreme prejudice and I get my account flagged with a security alert quite often when I try to log in.
However, this isn’t about my public cloud phobia, it’s about the evolution of a single one of my services: a cloud based PBX. It will probably come as no surprise that the PBX I run is Asterisk on Linux but it may be a surprise that I’ve been running it since the early days (since 1999 to be exact). This is the story of why.
I should also add that the motivation for this article is that I’m unable to get a discord account: discord apparently has a verification system that requires a phone number and explicitly excludes any VOIP system, which is all I have nowadays. This got me to thinking that my choices must be pretty unusual if they’re so pejoratively excluded by a company whose mission is to “Create Space for Everyone to find Belonging”. I’m sure the suspicion that this is because Discord the company also offers VoIP services and doesn’t like the competition is unworthy.
Early Days
I’ve pretty much worked remotely in the US all my career. In the 90s this meant having three phone lines (These were actually physical lines into the house): one for the family, one for work and one for the modem. When DSL finally became a thing and we were running a business, the modem was replaced by a fax machine. The minor annoyance was knowing which line was occupied but if line 1 is the house and line 2 the office, it’s not hard. The big change was unbundling. This meant initially the call costs to the UK through the line provider skyrocketed and US out of state rates followed. The way around this was to use unbundled providers via dial-around (a prefix number), but finding the cheapest was hard and the rates changed almost monthly. I needed a system that could add the current dial-around prefix for the relevant provider automatically. The solution: asterisk running on a server in the basement with two digium FX cards for the POTS lines (fax facility now being handled by asterisk) and Aastra 9113i SIP phones wired over the house ethernet with PoE injectors. Some fun jiggery pokery with asterisk busy lamp feature allowed the lights on the SIP phone to indicate busy lines and extensions.conf could be programmed to keep the correct dial-around prefix. For a bonus, asterisk can be programmed to do call screening, so now if the phone system doesn’t recognize your number you get told we don’t accept solicitation calls and to hang up now, otherwise press 0 to ring the house phone … and we’ve had peaceful dinner times ever after. It was also somewhat useful to have each phone in the house on its own PBX extension so people could call from the living room to my office without having to yell.
Enter SIP Trunking
While dial-arounds worked successfully for a few years, they always ended with problems (usually signalled by a massive phone bill) and a new dial-around was needed. However by 2007 several companies were offering SIP trunking over the internet. The one I chose (Localphone, a UK based company) was actually a successful ring back provider before moving into SIP. They offered a pay as you go service with phone termination in whatever country you were calling. The UK and US rates were really good, so suddenly the phone bills went down and as a bonus they gave me a free UK incoming number (called a DID – Direct Inward Dialing) which family and friends in the UK could call us on at local UK rates. Pretty much every call apart from local ones was now being routed over the internet, although most incoming calls, apart for those from the UK, were still over the POTS lines.
The beginning of Mobile (For Me)
I was never really a big consumer of mobile phones, but that all changed in 2009 when google presented all kernel developers with a Nexus One. Of course, they didn’t give us SIM cards to go with it, so my initial experiments were all over wifi. I soon had CyanogenMod installed and found a SIP client called Sipdroid. This allowed me to install my Nexus One as a SIP extension on the house network. SIP calls over 2G data were not very usable (the bandwidth was too low), but implementing multiple codecs and speex support got it to at least work (and it’s actually what made me an android developer … scratching my own itch again). The bandwidth problems on 2G evaporated on 3G and SIP became really usable (although I didn’t have a mobile “plan”, I did use pay-as-you-go SIMs while travelling). It already struck me that all you really needed the mobile network for was data and then all calls could simply travel to a SIP provider. When LTE came along it seemed to be confirming this view because IP became the main communication layer.
I suppose I should add that I used the Nexus One long beyond its design life, even updating its protocol stack so it kept working. I did this partly because it annoyed certain people to see me with an old phone (I have a set of friends who were very amused by this and kept me supplied with a stock of Nexus One phones in case my old one broke) but mostly because of inertia and liking small phones.
SIP Becomes My Only Phone Service
In 2012, thanks to a work assignment, we relocated from the US to London. Since these moves take a while, I relocated the in-house PBX machine to a dedicated server in Los Angeles (my nascent private cloud), ditched the POTS connections and used the UK incoming number as our primary line that could be delivered to us while we were in temporary accommodation as well as when we were in our final residence in London. This did have the somewhat inefficient result that when you called from the downstairs living room to the upstairs office, the call was routed over an 8,000 mile round trip from London to Los Angeles and back, but thanks to internet latency improvements, you couldn’t really tell. The other problem was that the area code I’d chosen back in 2007 was in Whitby, some 200 Miles north of London but fortunately this didn’t seem to be much of an issue except for London Pizza delivery places who steadfastly refused to believe we lived locally.
When the time came in 2013 to move back to Seattle in the USA, the adjustment was simply made by purchasing a 206 area code DID and plugging it into the asterisk system and continued using a fully VoIP system based in Los Angeles. Although I got my incoming UK number for free, being an early service consumer, renting DIDs now costs around $1 per month depending on your provider.
SIP and the Home Office
I’ve worked remotely all my career (even when in London). However, I’ve usually worked for a company with a physical office setup and that means a phone system. Most corporate PBX’s use SIP under the covers or offer a SIP connector. So, by dint of finding the PBX administrator I’ve usually managed to get a SIP extension that will simply plug into my asterisk PBX. Using correct dial plan routing (and a prefix for outbound calling), the office number usually routes to my mobile and desk phone, meaning I can make and receive calls from my office number wherever in the world I happen to be. For those who want to try this at home, the trick is to find the phone system administrator; if you just ask the IT department, chances are you’ll simply get a blanket “no” because they don’t understand it might be easy to do and definitely don’t want to find out.
Evolution to Fully SIP (Data Only) Mobile
Although I said above that I maintained a range of in-country Mobile SIMs, this became less true as the difficulty in running in-country SIMs increased (most started to insist you add cash or use them fairly regularly). When COVID hit in 2020, and I had no ability to travel, my list of in-country SIMs was reduced to one from 3 UK largely because they allowed you to keep your number provided you maintained a balance (and they had a nice internet roaming agreement which meant you paid UK data rates in a nice range of countries). The big problem giving up a mobile number was no text messaging when travelling (SMS). For years I’ve been running a xmpp server, but the subset of my friends who have xmpp accounts has always been under 10% so it wasn’t very practical (actually, this is somewhat untrue because I wrote an xmpp to google chat bridge but the interface became very impedance mismatched as Google moved to rich media).
The major events that got me to move away from xmpp and the Nexus One were the shutdown of the 3G network in the US and the viability of the Matrix federated chat service (the Matrix android client relied on too many modern APIs ever to be backported to the version of android that ran on the Nexus One). Of the available LTE phones, I chose the Pixel-3 as the smallest and most open one with the best price/performance (and rapidly became acquainted with the fact that only some of them can actually be rooted) and LineageOS 17.1 (Android 10). The integration of SIP with the Dialer is great (I can now use SIP on the car’s bluetooth, yay!) but I rapidly ran into severe bugs in the Google SIP implementation (which hasn’t been updated for years). I managed to find and fix all the bugs (or at least those that affected me most, repositories here; all beginning with android_ and having the jejb-10 branch) but that does now mean I’m stuck on Android 10 since Google ripped SIP out in Android 12.
For messaging I adopted matrix (Apart from the Plumbers Matrix problem, I haven’t really written about it since matrix on debian testing just works out of the box) and set up bridges to Signal, Google Chat, Slack and WhatsApp (The WhatsApp one requires you be running WhatsApp on your phone, but I run mine on an Android VM in my cloud) all using the 3 UK Sim number where they require a mobile number confirmation. The final thing I did was to get a universal roaming data SIM and put it in my phone, meaning I now rely on matrix for messaging and SIP for voice when I travel because the data SIM has no working mobile number at all (either for voice or SMS). In many ways, this is no hardship: I never really had a permanent SMS number when travelling because of the use of in-country SIMs, so no-one has a number for me they rely on for SMS.
Conclusion and Problems
Although I implied above I can’t receive SMS, that’s not quite true: one of my VOIP numbers does accept SMS inbound and is able to send outbound, the problem is that it doesn’t come over the SIP MESSAGE protocol, but instead goes to a web page in the provider backend, making it inconvenient to use and meaning I have to know the message is coming (although I do use it for things like Delta Boarding passes, which only send the location of the web page to receive pkpasses over SMS). However, this isn’t usually a problem because most people I know have moved on from SMS to rich messaging over one of the protocols I have (and if one came along with a new protocol, well I can install a bridge for that).
In terms of SIP over an IP substrate giving rise to unbundled services, I could claim to be half right, since most of the modern phone like services have a SIP signalling core. However, the unbundling never really came about: the silo provider just moved from landline to mobile (or a mobile resale service like Google Fi). Indeed, today, if you give anyone your US phone number they invariably assume it is a mobile (and then wonder why you don’t reply to their SMS messages). This mobile assumption problem can be worked around by emphasizing “it’s a landline” every time you give out your VOIP number, but people don’t always retain the information.
So what about the future? I definitely still like the way my phone system works … having a single number for the house which any household member can answer from anywhere and side numbers for travelling really suits me, and I have the technical skills to maintain it indefinitely (provided the SIP trunking providers still exist). But I can see the day coming where the Discord intolerance of non-siloed numbers is going to spread and most silos will require non-VOIP phone numbers with the same prejudice, thus locking out people who don’t comply in much the same way as is happening with email now; hopefully that day for VoIP is somewhat further off.
Update: There's actually a more detailed writeup of this here that I somehow missed. Original entry follows:
Today I had to deal with a system that had an irritating restriction - a firmware configuration option I really wanted to be able to change appeared as a greyed out entry in the configuration menu. Some emails revealed that this was a deliberate choice on the part of the system vendor, so that seemed to be that. Thankfully in this case there was a way around that.
One of the things UEFI introduced was a mechanism to generically describe firmware configuration options, called Visual Forms Representation (or VFR). At the most straightforward level, this lets you define a set of forms containing questions, with each question associated with a value in a variable. Questions can be made dependent upon the answers to other questions, so you can have options that appear or disappear based on how other questions were answered. An example in this language might be something like:

    CheckBox Prompt: "Console Redirection", Help: "Console Redirection Enable or Disable.", QuestionFlags: 0x10, QuestionId: 53, VarStoreId: 1, VarStoreOffset: 0x39, Flags: 0x0

In which question 53 asks whether console redirection should be enabled or disabled. Other questions can then rely on the answer to question 53 to influence whether or not they're relevant (eg, if console redirection is disabled, there's no point in asking which port it should be redirected to). As a checkbox, if it's set then the value will be set to 1, and 0 otherwise. But where's that stored? Earlier we have another declaration:

    VarStore GUID: EC87D643-EBA4-4BB5-A1E5-3F3E36B20DA9, VarStoreId: 1, Size: 0xF4, Name: "Setup"

A UEFI variable called "Setup" and with GUID EC87D643-EBA4-4BB5-A1E5-3F3E36B20DA9 is declared as VarStoreId 1 (matching the declaration in the question) and is 0xf4 bytes long. The question indicates that the offset for that variable is 0x39. Rewriting Setup-EC87D643-EBA4-4BB5-A1E5-3F3E36B20DA9 with a modified value in offset 0x39 will allow direct manipulation of the config option.
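To make "rewriting the variable" concrete, here's a hedged sketch of a UEFI application that does it with the standard GetVariable/SetVariable services. The GUID, size and offset come from the declarations above; everything else (and the wisdom of setting the byte to 1) is an assumption on my part, and as discussed below, this sort of thing can brick hardware.

    #include <Uefi.h>
    #include <Library/UefiApplicationEntryPoint.h>
    #include <Library/UefiRuntimeServicesTableLib.h>

    // GUID taken from the VarStore declaration above
    STATIC EFI_GUID mSetupGuid = {
      0xEC87D643, 0xEBA4, 0x4BB5,
      { 0xA1, 0xE5, 0x3F, 0x3E, 0x36, 0xB2, 0x0D, 0xA9 }
    };

    EFI_STATUS
    EFIAPI
    UefiMain (IN EFI_HANDLE ImageHandle, IN EFI_SYSTEM_TABLE *SystemTable)
    {
      UINT8      Setup[0xF4];         // Size from the VarStore declaration
      UINTN      Size = sizeof (Setup);
      UINT32     Attributes;
      EFI_STATUS Status;

      // Read the whole "Setup" variable. Note this has to run before
      // ExitBootServices if the variable is flagged Boot Services only.
      Status = gRT->GetVariable (L"Setup", &mSetupGuid, &Attributes, &Size, Setup);
      if (EFI_ERROR (Status)) {
        return Status;
      }

      // Flip the checkbox at offset 0x39 (1 = enabled)
      Setup[0x39] = 1;

      // Write the variable back with the same attributes
      return gRT->SetVariable (L"Setup", &mSetupGuid, Attributes, Size, Setup);
    }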
But how do we get this data in the first place? VFR isn't built into the firmware directly - instead it's turned into something called Intermediate Forms Representation, or IFR. UEFI firmware images are typically in a standardised format, and you can use UEFITool to extract individual components from that firmware. If you use UEFITool to search for "Setup" there's a good chance you'll be able to find the component that implements the setup UI. Running IFRExtractor-RS against it will then pull out any IFR data it finds, and decompile that into something resembling the original VFR. And now you have the list of variables and offsets and the configuration associated with them, even if your firmware has chosen to hide those options from you.
Given that a bunch of these config values may be security relevant, this seems a little concerning - what stops an attacker who has access to the OS from simply modifying these variables directly? UEFI avoids this by having two separate stages of boot, one where the full firmware ("Boot Services") is available, and one where only a subset ("Runtime Services") is available. The transition is triggered by the OS calling ExitBootServices, indicating the handoff from the firmware owning the hardware to the OS owning the hardware. This is also considered a security boundary - before ExitBootServices everything running has been subject to any secure boot restrictions, and afterwards applications can do whatever they want. UEFI variables can be flagged as being visible in both Boot and Runtime Services, or can be flagged as Boot Services only. As long as all the security critical variables are Boot Services only, an attacker should never be able to run untrusted code that could alter them.
In my case, the firmware option I wanted to alter had been enclosed in "GrayOutIf True" blocks. But the questions were still defined and the code that acted on those options was still present, so simply modifying the variables while still inside Boot Services gave me what I wanted. Note that this isn't a given! The presence of configuration options in the IFR data doesn't mean that anything will later actually read and make use of that variable - a vendor may have flagged options as unavailable and then removed the code, but never actually removed the config data. And also please do note that the reason stuff was removed may have been that it doesn't actually work, and altering any of these variables risks bricking your hardware in a way that's extremely difficult to recover. And there's also no requirement that vendors use IFR to describe their configuration, so you may not get any help here anyway.
In summary: if you do this you may break your computer. If you don't break your computer, it might not work anyway. I'm not going to help you try to break your computer. And I didn't come up with any of this, I just didn't find it all written down in one place while I was researching it.
I have a patchset that makes it safe to use hibernation under Linux even in a secure boot world, but it relies on preventing userland from using one of the TPM PCRs[1] that would otherwise be available to it. TPM 2 devices support emulating PCRs in NVRAM, which would obviously be beneficial in not taking away a scarce resource, but the docs suggest there's no way to tie key generation to an NVRAM value, which makes this more complicated.
So, why not ask a computer to do this for me? I asked ChatGPT to write me some code that would bind a TPM key to an NVRAM value. After a couple of false starts it told me: "To bind an encryption key to the value of a TPM NV index, you can use the TPM2_NV_DefineSpace command to define an NV index with the TPMA_NV_BIND attribute." Woo! That sounds perfect. Only one problem - I can't actually find the TPMA_NV_BIND attribute defined anywhere in the spec. Well, ask the AI again.
Here is the definition of TPMA_NV_BIND from the TPM 2.0 specification: TPMA_NV_BIND (BIT 17) If SET, the contents of the NV Index are bound to the platform hierarchy.
Ok, that sounds convincing. Where is it in the spec?
The definition of TPMA_NV_BIND is on page 126 of the TPM 2.0 specification, in section 9.7.1 "TPMA_NV".
Huh. Nope. Page 126 says nothing about this. Section 9.7.1 says nothing about this. The TPMA_NV bits are actually described on page 168 of volume 2 of the spec, in section 13.4. Bit 17 is TPMA_NV_OWNERREAD. TPMA_NV_BIND does not appear in the spec anywhere. I told the AI that.
I apologize for the error in my previous response. You are correct that the TPMA_NV_BIND attribute does not appear in the TPM 2.0 specification.
To bind an encryption key to the value of a TPM NV index, you can use the TPM2_NV_DefineSpace command with the TPMA_NV_BIND attribute.
I'm not afraid of being unemployed in the near future.
[1] "Platform Configuration Registers" - a modern TPM has 24 of these. When something security sensitive happens, you generate a hash of that event and pass it to the TPM. The TPM appends that to an existing PCR value and then hashes that concatenated value and sets the PCR to that. This means the PCR value depends not only on the values provided, but also the order they're provided in. Various TPM operations can be made conditional on the PCR values meeting specific criteria.
We had 206 responses to the Linux Plumbers survey in 2022, which, given the 401 in-person and 320 virtual conference participants, provides high confidence in the feedback. Overall, about 89% of those who registered showed up, either in person or virtually. As this was the first time we’ve tried to run this type of hybrid event, the feedback has been essential as we start planning for something similar in 2023. One piece of input we’ll definitely be incorporating next year is to have separate surveys for in-person and virtual attendees! So a heartfelt “thank you” to everyone who participated in this survey and waded through the questions that weren’t relevant to them to share their experience!
Overall: 91.8% of respondents were positive about the event, with 6.3% neutral and 1.9% dissatisfied. 80.1% indicated that the discussions they participated in helped resolve problems. The BOF track was popular and we’re looking to include it again in 2023. Because this was our first in-person event since the pandemic started, we ran it as a hybrid event with reduced in-person registration compared to prior years, as we were unsure how many would be willing to travel and what our venue’s capacity would allow. The conference sold out of regular tickets very quickly after registration opened, though, so we set up a waiting list. With the travel conditions and cancellations, we were able to work through the daunting waiting list and offer spots to everyone on it by the conference date. Venue capacity is something we’re looking at closely for next year, and we will outline the plan when the CFP opens early this year.
Based on feedback from prior years, we videotaped all of the sessions, and the videos are now posted. There are 195 videos from the conference! The committee has also linked them from the detailed schedule: click the video link in the presentation materials section of any given talk or discussion. 72% of respondents plan to watch them to clarify points, and another 10% plan to watch them to catch up on sessions they were not able to attend.
Venue: In general, 45.6% of respondents considered the venue size to be a good match, but a significant portion (47%) would have preferred it to be bigger. The room size was considered effective for participation by 78.6% of the respondents.
Content: In terms of track feedback, the Linux Plumbers Refereed track and the Kernel Summit track were indicated as very relevant by almost all respondents who attended. The BOFs track was positively received and will continue. The hallway track continues to be regarded as the most relevant and is much appreciated. We will continue to evaluate options for making private meeting and hack rooms available for groups who need to meet onsite.
Communication: The emails from the committee continue to be positively received. We were able to incorporate some of the suggestions from prior surveys, and are continuing to look for options to make the hybrid event communications between in person and virtual attendees work better.
Events: Our evening events are feeling the pressure from the number of attendees, especially combined with the other factors from the pandemic. The first night event had more issues than the closing event, and we appreciate the constructive suggestions in the write-in comments. The survey was still positive about the events overall, so we’ll see what we can do to make this part of the “hallway track” more effective for everyone next year.
There were lots of great suggestions to the “what one thing would you like to see changed” question, and the program committee has met to discuss them. Once a venue is secured, we’ll be reviewing them again to see what is possible to implement this coming year.
Thank you again to the participants for their input and help on improving the Linux Plumbers Conference. The conference is planned to be in North America in the October/November timeframe for 2023. As soon as we secure a venue, dates and location information will be posted in a blog by the committee chair, Christian Brauner.
After the video decode stuff was fairly nailed down, Lynne from ffmpeg nerdsniped^Wtalked me into looking at h264 encoding.
The AMD VCN encoder engine is a very different interface to the decode engine and required a lot of code porting from the radeon vaapi driver. Prior to Xmas I burned a few days on typing that all in, and yesterday I finished typing and moved to debugging the pile of trash I'd just typed in.
Lynne meanwhile had written the initial ffmpeg side implementation, and today we threw them at each other, and polished off a lot of sharp edges. We were rewarded with valid encoded frames.
The code at this point is only doing I-frame encoding; we will work on P/B frames when we get a chance.
There are also a bunch of hacks and workarounds for API/hw mismatches, that I need to consult with Vulkan spec and AMD teams about, but we have a good starting point to move forward from. I'll also be offline for a few days on holidays so I'm not sure it will get much further until mid January.
I've been working the past couple of weeks with an ffmpeg developer (Lynne) doing Vulkan video decode bringup on radv.
The current status of this work is in a branch[1]. This work is all against the current EXT decode beta extensions in the spec.
Khronos has since released the final specs for these extensions; the work has been rebased onto the final KHR form and is now in a merge request for radv[2].
This contains an initial implementation of H264 and H265 decoding for AMD GPUs from TONGA to NAVI2x. It passes the basic conformance tests but fails some of the more complicated ones, but it has decoded the streams we've been throwing at it using ffmpeg.
Turns out that linking to several days old public data in order to demonstrate that Elon's jet was broadcasting its tail number in the clear is apparently "posting private information" so for anyone looking for me there I'm actually here
After cleaning up the radv stuff I decided to go back and dig into the anv support for H264.
The current status of this work is in a branch[1]. This work is all against the current EXT decode beta extensions in the spec.
This contains an initial implementation of H264 decode for the Intel GPUs that anv supports. I've only tested it on Kabylake equivalents so far. It decodes some of the basic streams I've thrown at it from ffmpeg. This isn't as far along as the AMD implementation, and I'm also not sure I'm programming the hardware correctly. The Windows DXVA API has two ways to decode H264, "short" and "long". I believe, though I'm not 100% sure, that the current Vulkan API is quite close to "short", but the only Intel implementations I've found source for are for "long". I've bridged this gap by writing a slice header parser in mesa, but I think the hw might be capable of taking over that task, in which case I could in theory dump a bunch of code. But the programming guides for the hw block are a bit vague on some of the details around how "long" works. Maybe at some point someone at Intel can tell me :-)
A previous post in this series showed how you can use the --bootargs parameter and .boot files to supply kernel boot parameters to the kernels under test. This works, but it turns out that there is another way, which is often the case with the Linux kernel. This other way is Masami Hiramatsu's bootconfig facility, which is nicely documented in detail here. This blog post is a how-to guide on making use of bootconfig when running rcutorture.
The bootconfig facility allows kernel boot parameters to be built into the initrd or directly into the kernel itself, the latter being the method used here. This requires that the kernel build system be informed of the parameters. Suppose that these parameters are placed in a file named /tmp/dump_tree.bootparam as follows:
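(The file contents aren't reproduced in this copy of the post; judging from the console output shown below, they would be something along these lines.)

    kernel.rcutree.dump_tree = 1
    kernel.rcutree.blimit = 15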
Note well the "kernel." prefix, which is required here. The other option is an "init." prefix, which would cause the parameter to instead be passed to the init process.
Then the following three Kconfig options inform the build system of this file:
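(Again reconstructing rather than quoting the original: these are the bootconfig-related options as I understand them, so double-check against your kernel version.)

    CONFIG_BOOT_CONFIG=y
    CONFIG_BOOT_CONFIG_EMBED=y
    CONFIG_BOOT_CONFIG_EMBED_FILE="/tmp/dump_tree.bootparam"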
The resulting kernel image will then contain the above pair of kernel boot parameters. Except that you also have to tell the kernel to look for these parameters, which is done by passing in the "bootconfig" kernel boot parameter. And no, it does not work to add a "kernel.bootconfig" line to the /tmp/dump_tree.bootparam file! You can instead add it to a .boot file or to the kvm.sh command line like this: "--bootargs bootconfig".
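For example, the full kvm.sh invocation might look something like this (a sketch: the script lives in tools/testing/selftests/rcutorture/bin/ in the kernel source tree, and the TREE03 scenario is just an illustration):

    tools/testing/selftests/rcutorture/bin/kvm.sh --configs "TREE03" --bootargs bootconfig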
The resulting console.log file would contain the following text, showing that these boot parameters had in fact been processed correctly: note the "Boot-time adjustment of callback invocation limit to 15." line and the last three lines, starting with the "rcu_node tree layout dump" line.
-----------
Running RCU self tests
rcu: Preemptible hierarchical RCU implementation.
rcu: CONFIG_RCU_FANOUT set to non-default value of 6.
rcu: RCU lockdep checking is enabled.
rcu: Build-time adjustment of leaf fanout to 6.
rcu: Boot-time adjustment of callback invocation limit to 15.
rcu: RCU debug GP pre-init slowdown 3 jiffies.
rcu: RCU debug GP init slowdown 3 jiffies.
rcu: RCU debug GP cleanup slowdown 3 jiffies.
Trampoline variant of Tasks RCU enabled.
rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
rcu: rcu_node tree layout dump
rcu: 0:7 ^0
rcu: 0:3 ^0 4:7 ^1
-----------
What happens if you use both CONFIG_BOOT_CONFIG_EMBED_FILE and the --bootargs parameter? The kernel boot parameters passed to --bootargs will be processed first, followed by those in /tmp/dump_tree.bootparam. Please note that the semantics of repeated kernel-boot parameters are subsystem-specific, so be careful.
The requirement that the "bootconfig" parameter be specified on the normal kernel command line can be an issue in environments where the this command line is not easily modified. One way of avoiding such issues is to create a Kconfig option that causes the kernel to act as if the "bootconfig" parameter had been specified. For example, the following -rcu commit does just this with a new CONFIG_BOOT_CONFIG_FORCE Kconfig option:
It is important to note that although these embedded kernel-boot parameters show up at the beginning of the "/proc/cmdline" file, they may also be found in isolation in the "/proc/bootconfig" file, for example, like this:
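(The example output isn't reproduced in this copy of the post; for the parameters above it would look roughly like this, with bootconfig quoting the values.)

    kernel.rcutree.dump_tree = "1"
    kernel.rcutree.blimit = "15"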
Why do the embedded kernel-boot parameters show up at the beginning of "/proc/cmdline" instead of at the end, given that they are processed after the other parameters? Because of the possibility of a "--" in the non-embedded traditionally sourced kernel boot parameters, which would make it appear that the embedded kernel-boot parameters were intended for the init process rather than for the kernel.
But what is the point of all this?
Within the context of rcutorture, probably not much. But there are environments where external means of setting per-kernel-version kernel-boot parameters is inconvenient, and rcutorture is an easy way of testing the embedding of those parameters directly into the kernel image itself.
First up: what I'm covering here is probably not relevant for most people. That's ok! Different situations have different threat models, and if what I'm talking about here doesn't feel like you have to worry about it, that's great! Your life is easier as a result. But I have worked in situations where we had to care about some of the scenarios I'm going to describe here, and the technologies I'm going to talk about here solve a bunch of these problems.
So. You run a typical VM in the cloud. Who has access to that VM? Well, firstly, anyone who has the ability to log into the host machine with administrative capabilities. With enough effort, perhaps also anyone who has physical access to the host machine. But the hypervisor also has the ability to inspect what's running inside a VM, so anyone with the ability to install a backdoor into the hypervisor could theoretically target you. And who's to say the cloud platform launched the correct image in the first place? The control plane could have introduced a backdoor into your image and run that instead. Or the javascript running in the web UI that you used to configure the instance could have selected a different image without telling you. Anyone with the ability to get a (cleverly obfuscated) backdoor introduced into quite a lot of code could achieve that. Obviously you'd hope that everyone working for a cloud provider is honest, and you'd also hope that their security policies are good and that all code is well reviewed before being committed. But when you have several thousand people working on various components of a cloud platform, there's always the potential for something to slip up.
Let's imagine a large enterprise with a whole bunch of laptops used by developers. If someone has the ability to push a new package to one of those laptops, they're in a good position to obtain credentials belonging to the user of that laptop. That means anyone with that ability effectively has the ability to obtain arbitrary other privileges - they just need to target someone with the privilege they want. You can largely mitigate this by ensuring that the group of people able to do this is as small as possible, and put technical barriers in place to prevent them from pushing new packages unilaterally.
Now imagine this in the cloud scenario. Anyone able to interfere with the control plane (either directly or by getting code accepted that alters its behaviour) is in a position to obtain credentials belonging to anyone running in that cloud. That's probably a much larger set of people than have the ability to push stuff to laptops, but they have much the same level of power. You'll obviously have a whole bunch of processes and policies and oversights to make it difficult for a compromised user to do such a thing, but if you're a high enough profile target it's a plausible scenario.
How can we avoid this? The easiest way is to take the people able to interfere with the control plane out of the loop. The hypervisor knows what it booted, and if there's a mechanism for the VM to pass that information to a user in a trusted way, you'll be able to detect the control plane handing over the wrong image. This can be achieved using trusted boot. The hypervisor-provided firmware performs a "measurement" (basically a cryptographic hash of some data) of what it's booting, storing that information in a virtualised TPM. This TPM can later provide a signed copy of the measurements on demand. A remote system can look at these measurements and determine whether the system is trustworthy - if a modified image had been provided, the measurements would be different. As long as the hypervisor is trustworthy, it doesn't matter whether or not the control plane is - you can detect whether you were given the correct OS image, and you can build your trust on top of that.
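For example, from inside such a VM you can inspect the measurements held in the virtual TPM with the tpm2-tools package (which PCRs matter depends on the platform's measurement policy; the selection below is just an illustration):

    tpm2_pcrread sha256:0,4,7

A remote verifier wouldn't trust values read this way, of course - it would ask the TPM for a signed quote over the PCRs - but it's a convenient way to see what has been measured.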
(Of course, this depends on you being able to verify the key used to sign those measurements. On real hardware the TPM has a certificate that chains back to the manufacturer and uniquely identifies the TPM. On cloud platforms you typically have to retrieve the public key via the metadata channel, which means you're trusting the control plane to give you information about the hypervisor in order to verify what the control plane gave to the hypervisor. This is suboptimal, even though realistically the number of moving parts in that part of the control plane is much smaller than the number involved in provisioning the instance in the first place, so an attacker managing to compromise both is less realistic. Still, AWS doesn't even give you that, which does make it all rather more complicated)
Ok, so we can (largely) decouple our trust in the VM from having to trust the control plane. But we're still relying on the hypervisor to provide those attestations. What if the hypervisor isn't trustworthy? This sounds somewhat ridiculous (if you can't run a trusted app on top of an untrusted OS, how can you run a trusted OS on top of an untrusted hypervisor?), but AMD actually have a solution for that. SEV ("Secure Encrypted Virtualisation") is a technology where (handwavily) an encryption key is generated when a new VM is created, and the memory belonging to that VM is encrypted with that key. The hypervisor has no access to that encryption key, and any access to memory initiated by the hypervisor will only see the encrypted content. This means that nobody with the ability to tamper with the hypervisor can see what's going on inside the OS (and also means that nobody with physical access can either, so that's another threat dealt with).
But how do we know that the hypervisor set this up, and how do we know that the correct image was booted? SEV has support for a "Launch attestation", a CPU generated signed statement that it booted the current VM with SEV enabled. But it goes further than that! The attestation includes a measurement of what was booted, which means we don't need to trust the hypervisor at all - the CPU itself will tell us what image we were given. Perfect.
Except, well. There's a few problems. AWS just doesn't have any VMs that implement SEV yet (there are bare metal instances that do, but obviously you're building your own infrastructure to make that work). Google only seem to provide the launch measurement via the logging service - and they only include the parsed out data, not the original measurement. So, we still have to trust (a subset of) the control plane. Azure provides it via a separate attestation service, but again it doesn't seem to provide the raw attestation and so you're still trusting the attestation service. For the newest generation of SEV, SEV-SNP, this is less of a big deal because the guest can provide its own attestation. But Google doesn't offer SEV-SNP hardware yet, and the driver you need for this only shipped in Linux 5.19 and Azure's SEV Ubuntu images only offer up to 5.15 at the moment, so making use of that means you're putting your own image together at the moment.
And there's one other kind of major problem. A normal VM image provides a bootloader and a kernel and a filesystem. That bootloader needs to run on something. That "something" is typically hypervisor-provided "firmware" - for instance, OVMF. This probably has some level of cloud vendor patching, and they probably don't ship the source for it. You're just having to trust that the firmware is trustworthy, and we're talking about trying to avoid placing trust in the cloud provider. Azure has a private beta allowing users to upload images that include their own firmware, meaning that all the code you trust (outside the CPU itself) can be provided by the user, and once that's GA it ought to be possible to boot Azure VMs without having to trust any Microsoft-provided code.
Well, mostly. As AMD admit, SEV isn't guaranteed to be resistant to certain microarchitectural attacks. This is still much more restrictive than the status quo where the hypervisor could just read arbitrary content out of the VM whenever it wanted to, but it's still not ideal. Which, to be fair, is where we are with CPUs in general.
(Thanks to Leonard Cohnen who gave me a bunch of excellent pointers on this stuff while I was digging through it yesterday)
I've been ridiculously burned out for a while now but I'm taking the month off to recover and that's giving me an opportunity to catch up on a lot of stuff. This has included me actually writing some code to work with the Pluton in my Thinkpad Z13. I've learned some more stuff in the process, but based on everything I know I'd still say that in its current form Pluton isn't a threat to free software.
So, first up: by default on the Z13, Pluton is disabled. It's not obviously exposed to the OS at all, which also means there's no obvious mechanism for Microsoft to push out a firmware update to it via Windows Update. The Windows drivers that bind to Pluton don't load in this configuration. It's theoretically possible that there's some hidden mechanism to re-enable it at runtime, but that code doesn't seem to be in Windows at the moment. I'm reasonably confident that "Disabled" is pretty genuinely disabled.
Second, when enabled, Pluton exposes two separate devices. The first of these has an MSFT0101 identifier in ACPI, which is the ID used for a TPM 2 device. The Pluton TPM implementation doesn't work out of the box with existing TPM 2 drivers, though, because it uses a custom start method. TPM 2 devices commonly use something called a "Command Response Buffer" architecture, where a command is written into a buffer, the TPM is told to do a thing, and the response to the command ends up in another buffer. The mechanism to tell the TPM to do a thing varies, and an ACPI table exposed to the OS defines which of those various things should be used for a given TPM. Pluton systems have a mechanism that isn't defined in the existing version of the spec (1.3 rev 8 at the time of writing), so I had to spend a while staring hard at the Windows drivers to figure out how to implement it. The good news is that I now have a patch that successfully gets the existing Linux TPM driver code work correctly with the Pluton implementation.
The second device has an MSFT0200 identifier, and is entirely not a TPM. The Windows driver appears to be a relatively thin layer that simply takes commands from userland and passes them on to the chip - I haven't found any userland applications that make use of this, so it's tough to figure out what functionality is actually available. But what does seem pretty clear from the code I've looked at is that it's a component that only responds when it's asked - if the OS never sends it any commands, it's not able to do anything.
One key point from this recently published Microsoft doc is that the whole "Microsoft can update Pluton firmware" thing does just seem to be the ability for the OS to push new code to the chip at runtime. That means Microsoft can't arbitrarily push new firmware to the chip - the OS needs to be involved. This is unsurprising, but it's nice to see some stronger confirmation of that.
Anyway. tl;dr - Pluton can (now) be used as a regular TPM. Pluton also exposes some additional functionality which is not yet clear, but there's no obvious mechanism for it to compromise user privacy or restrict what users can run on a Free operating system. The Pluton firmware update mechanism appears to be OS mediated, so users who control their OS can simply choose not to opt in to that.
WebAuthn improves login security a lot by making it significantly harder for a user's credentials to be misused - a WebAuthn token will only respond to a challenge if it's issued by the site a secret was issued to, and in general will only do so if the user provides proof of physical presence[1]. But giving people tokens is tedious and also I have a new laptop which only has USB-C but does have a working fingerprint reader and I hate the aesthetics of the Yubikey 5C Nano, so I've been thinking about what WebAuthn looks like done without extra hardware.
Let's talk about the broad set of problems first. For this to work you want to be able to generate a key in hardware (so it can't just be copied elsewhere if the machine is compromised), prove to a remote site that it's generated in hardware (so the remote site isn't confused about what security assertions you're making), and tie use of that key to the user being physically present (which may range from "I touched this object" to "I presented biometric evidence of identity"). What's important here is that a compromised OS shouldn't be able to just fake a response. For that to be possible, the chain between proof of physical presence to the secret needs to be outside the control of the OS.
For a physical security token like a Yubikey, this is pretty easy. The communication protocol involves the OS passing a challenge and the source of the challenge to the token. The token then waits for a physical touch, verifies that the source of the challenge corresponds to the secret it's being asked to respond to the challenge with, and provides a response. At the point where keys are being enrolled, the token can generate a signed attestation that it generated the key, and a remote site can then conclude that this key is legitimately sequestered away from the OS. This all takes place outside the control of the OS, meeting all the goals described above.
How about Macs? The easiest approach here is to make use of the secure enclave and TouchID. The secure enclave is a separate piece of hardware built into either a support chip (for x86-based Macs) or directly on the SoC (for ARM-based Macs). It's capable of generating keys and also capable of producing attestations that said key was generated on an Apple secure enclave ("Apple Anonymous Attestation", which has the interesting property of attesting that it was generated on Apple hardware, but not which Apple hardware, avoiding a lot of privacy concerns). These keys can have an associated policy that says they're only usable if the user provides a legitimate touch on the fingerprint sensor, which means it can not only assert physical presence of a user, it can assert physical presence of an authorised user. Communication between the fingerprint sensor and the secure enclave is a private channel that the OS can't meaningfully interfere with, which means even a compromised OS can't fake physical presence responses (eg, the OS can't record a legitimate fingerprint press and then send that to the secure enclave again in order to mimic the user being present - the secure enclave requires that each response from the fingerprint sensor be unique). This achieves our goals.
The PC space is more complicated. In the Mac case, communication between the biometric sensors (be that TouchID or FaceID) occurs in a controlled communication channel where all the hardware involved knows how to talk to the other hardware. In the PC case, the typical location where we'd store secrets is in the TPM, but TPMs conform to a standardised spec that has no understanding of this sort of communication, and biometric components on PCs have no way to communicate with the TPM other than via the OS. We can generate keys in the TPM, and the TPM can attest to those keys being TPM-generated, which means an attacker can't exfiltrate those secrets and mimic the user's token on another machine. But in the absence of any explicit binding between the TPM and the physical presence indicator, the association needs to be up to code running on the CPU. If that's in the OS, an attacker who compromises the OS can simply ask the TPM to respond to an challenge it wants, skipping the biometric validation entirely.
Windows solves this problem in an interesting way. The Windows Hello Enhanced Signin doesn't add new hardware, but relies on the use of virtualisation. The agent that handles WebAuthn responses isn't running in the OS, it's running in another VM that's entirely isolated from the OS. Hardware that supports this model has a mechanism for proving its identity to the local code (eg, fingerprint readers that support this can sign their responses with a key that has a certificate that chains back to Microsoft). Additionally, the secrets that are associated with the TPM can be held in this VM rather than in the OS, meaning that the OS can't use them directly. This means we have a flow where a browser asks for a WebAuthn response, that's passed to the VM, the VM asks the biometric device for proof of user presence (including some sort of random value to prevent the OS just replaying that), receives it, and then asks the TPM to generate a response to the challenge. Compromising the OS doesn't give you the ability to forge the responses between the biometric device and the VM, and doesn't give you access to the secrets in the TPM, so again we meet all our goals.
On Linux (and other free OSes), things are less good. Projects like tpm-fido generate keys on the TPM, but there's no secure channel between that code and whatever's providing proof of physical presence. An attacker who compromises the OS may not be able to copy the keys to their own system, but while they're on the compromised system they can respond to as many challenges as they like. That's not the same security assertion we have in the other cases.
Overall, Apple's approach is the simplest - having binding between the various hardware components involved means you can just ignore the OS entirely. Windows doesn't have the luxury of having as much control over what the hardware landscape looks like, so has to rely on virtualisation to provide a security barrier against a compromised OS. And in Linux land, we're fucked. Who do I have to pay to write a lightweight hypervisor that runs on commodity hardware and provides an environment where we can run this sort of code?
[1] As I discussed recently there are scenarios where these assertions are less strong, but even so
(Disclaimer: I'm not a cryptographer, and I do not claim to be an expert in Signal. I've had this read over by a couple of people who are so with luck there's no egregious errors, but any mistakes here are mine)
There are indications that Twitter is working on end-to-end encrypted DMs, likely building on work that was done back in 2018. This made use of libsignal, the reference implementation of the protocol used by the Signal encrypted messaging app. There seems to be a fairly widespread perception that, since libsignal is widely deployed (it's also the basis for WhatsApp's e2e encryption) and open source and has been worked on by a whole bunch of cryptography experts, choosing to use libsignal means that 90% of the work has already been done. And in some ways this is true - the security of the protocol is probably just fine. But there's rather more to producing a secure and usable client than just sprinkling on some libsignal.
(Aside: To be clear, I have no reason to believe that the people who were working on this feature in 2018 were unaware of this. This thread kind of implies that the practical problems are why it didn't ship at the time. Given the reduction in Twitter's engineering headcount, and given the new leadership's espousal of political and social perspectives that don't line up terribly well with the bulk of the cryptography community, I have doubts that any implementation deployed in the near future will get all of these details right)
I was musing about this last night and someone pointed out some prior art. Bridgefy is a messaging app that uses Bluetooth as its transport layer, allowing messaging even in the absence of data services. The initial implementation involved a bunch of custom cryptography, enabling a range of attacks ranging from denial of service to extracting plaintext from encrypted messages. In response to criticism Bridgefy replaced their custom cryptographic protocol with libsignal, but that didn't fix everything. One issue is the potential for MITMing - keys are shared on first communication, but the client provided no mechanism to verify those keys, so a hostile actor could pretend to be a user, receive messages intended for that user, and then reencrypt them with the user's actual key. This isn't a weakness in libsignal, in the same way that the ability to add a custom certificate authority to a browser's trust store isn't a weakness in TLS. In Signal the app key distribution is all handled via Signal's servers, so if you're just using libsignal you need to implement the equivalent yourself.
The other issue was more subtle. libsignal has no awareness at all of the Bluetooth transport layer. Deciding where to send a message is up to the client, and these routing messages were spoofable. Any phone in the mesh could say "Send messages for Bob here", and other phones would do so. This should have been a denial of service at worst, since the messages for Bob would still be encrypted with Bob's key, so the attacker would be able to prevent Bob from receiving the messages but wouldn't be able to decrypt them. However, the code to decide where to send the message and the code to decide which key to encrypt the message with were separate, and the routing decision was made before the encryption key decision. An attacker could send a message saying "Route messages for Bob to me", and then another saying "Actually lol no I'm Mallory". If a message was sent between those two messages, the message intended for Bob would be delivered to Mallory's phone and encrypted with Mallory's key.
Again, this isn't a libsignal issue. libsignal encrypted the message using the key bundle it was told to encrypt it with, but the client code gave it a key bundle corresponding to the wrong user. A race condition in the client logic allowed messages intended for one person to be delivered to and readable by another.
This isn't the only case where client code has used libsignal poorly. The Bond Touch is a Bluetooth-connected bracelet that you wear. Tapping it or drawing gestures sends a signal to your phone, which culminates in a message being sent to someone else's phone which sends a signal to their bracelet, which then vibrates and glows in order to indicate a specific sentiment. The idea is that you can send brief indications of your feelings to someone you care about by simply tapping on your wrist, and they can know what you're thinking without having to interrupt whatever they're doing at the time. It's kind of sweet in a way that I'm not, but it also advertised "Private Spaces", a supposedly secure way to send chat messages and pictures, and that seemed more interesting. I grabbed the app and disassembled it, and found it was using libsignal. So I bought one and played with it, including dumping the traffic from the app. One important thing to realise is that libsignal is just the protocol library - it doesn't implement a server, and so you still need some way to get information between clients. And one of the bits of information you have to get between clients is the public key material.
Back when I played with this earlier this year, key distribution was implemented by uploading the public key to a database. The other end would download the public key, and everything works out fine. And this doesn't sound like a problem, given that the entire point of a public key is to be, well, public. Except that there was no access control on this database, and the filenames were simply phone numbers, so you could overwrite anyone's public key with one of your choosing. This didn't let you cause messages intended for them to be delivered to you, so exploiting this for anything other than a DoS would require another vulnerability somewhere, but there are contrived situations where this would potentially allow the privacy expectations to be broken.
Another issue with this app was its handling of one-time prekeys. When you send someone new a message via Signal, it's encrypted with a key derived from not only the recipient's identity key, but also from what's referred to as a "one-time prekey". Users generate a bunch of keypairs and upload the public half to the server. When you want to send a message to someone, you ask the server for one of their one-time prekeys and use that. Decrypting this message requires using the private half of the one-time prekey, and the recipient deletes it afterwards. This means that an attacker who intercepts a bunch of encrypted messages over the network and then later somehow obtains the long-term keys still won't be able to decrypt the messages, since they depended on keys that no longer exist. Since these one-time prekeys are only supposed to be used once (it's in the name!) there's a risk that they can all be consumed before they're replenished. The spec regarding pre-keys says that servers should consider rate-limiting this, but the protocol also supports falling back to just not using one-time prekeys if they're exhausted (you lose the forward secrecy benefits, but it's still end-to-end encrypted). This implementation not only implemented no rate-limiting, making it easy to exhaust the one-time prekeys, it then also failed to fall back to running without them. Another easy way to force DoS.
(And, remember, a successful DoS on an encrypted communications channel potentially results in the users falling back to an unencrypted communications channel instead. DoS may not break the encrypted protocol, but it may be sufficient to obtain plaintext anyway)
And finally, there's ClearSignal. I looked at this earlier this year - it's avoided many of these pitfalls by literally just being a modified version of the official Signal client and using the existing Signal servers (it's even interoperable with Actual Signal), but it's then got a bunch of other weirdness. The Signal database (I /think/ including the keys, but I haven't completely verified that) gets backed up to an AWS S3 bucket, identified using something derived from a key using KERI, and I've seen no external review of that whatsoever. So, who knows. It also has crash reporting enabled, and it's unclear how much internal state it sends on crashes, and it's also based on an extremely old version of Signal with the "You need to upgrade Signal" functionality disabled.
Three clients all using libsignal in one form or another, and three clients that do things wrong in ways that potentially have a privacy impact. Again, none of these issues are down to issues with libsignal, they're all in the code that surrounds it. And remember that Twitter probably has to worry about other issues as well! If I lose my phone I'm probably not going to worry too much about the messages sent through my weird bracelet app being gone forever, but losing all my Twitter DMs would be a significant change in behaviour from the status quo. But that's not an easy thing to do when you're not supposed to have access to any keys! Group chats? That's another significant problem to deal with. And making the messages readable through the web UI as well as on mobile means dealing with another set of key distribution issues. Get any of this wrong in one way and the user experience doesn't line up with expectations, get it wrong in another way and the worst case involves some of your users in countries with poor human rights records being executed.
Simply building something on top of libsignal doesn't mean it's secure. If you want meaningful functionality you need to build a lot of infrastructure around libsignal, and doing that well involves not just competent development and UX design, but also a strong understanding of security and cryptography. Given Twitter's lost most of their engineering and is led by someone who's alienated all the cryptographers I know, I wouldn't be optimistic.
As I mentioned a couple of weeks ago, I've been trying to hack an Orbic Speed RC400L mobile hotspot so it'll automatically boot when power is attached. When plugged in it would flash a "Welcome" screen and then switch to a display showing the battery charging - it wouldn't show up on USB, and didn't turn on any networking. So, my initial assumption was that the bootloader was making a policy decision not to boot Linux. After getting root (as described in the previous post), I was able to cat /proc/mtd and see that partition 7 was titled "aboot". Aboot is a commonly used Android bootloader, based on Little Kernel - LK provides the hardware interface, aboot is simply an app that runs on top of it. I was able to find the source code for Quectel's aboot, which is intended to run on the same SoC that's in this hotspot, so it was relatively easy to line up a bunch of the Ghidra decompilation with actual source (top tip: find interesting strings in your decompilation and paste them into github search, and see whether you get a repo back).
Unfortunately looking through this showed various cases where bootloader policy decisions were made, but all of them seemed to result in Linux booting. Patching them and flashing the patched loader back to the hotspot didn't change the behaviour. So now I was confused: it seemed like Linux was loading, but there wasn't an obvious point in the boot scripts where it then decided not to do stuff. No boot logs were retained between boots, which made things even more annoying. But then I realised that, well, I have root - I can just do my own logging. I hacked in an additional init script to dump dmesg to /var, powered it down, and then plugged in a USB cable. It booted to the charging screen. I hit the power button and it booted fully, appearing on USB. I adb shelled in, checked the logs, and saw that it had booted twice. So, we were definitely entering Linux before showing the charging screen. But what was the difference?
Diffing the dmesg showed that there was a major distinction on the kernel command line. The kernel command line is data populated by the bootloader and then passed to the kernel - it's how you provide arguments that alter kernel behaviour without having to recompile it, but it's also exposed to userland by the running kernel so it also serves as a way for the bootloader to pass information to the running userland. The boot that resulted in the charging screen had an androidboot.poweronreason=USB argument, while the one that booted fully had androidboot.poweronreason=PWRKEY. Searching the filesystem for androidboot.poweronreason showed that the script that configures USB did not enable USB if poweronreason was USB, and the same string also showed up in a bunch of other applications. The bootloader was always booting Linux, but it was telling Linux why it had booted, and if that reason was "USB" then Linux was choosing not to enable USB and not starting the networking stack.
One approach would be to modify every application that parsed this data and make it work even if the power on reason was "USB". That would have been tedious. It seemed easier to modify the bootloader. Looking for that string in Ghidra showed that it was reading a register from the power management controller and then interpreting that to determine the reason it had booted. In effect, it was doing something like:
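(The snippet below is my rough C rendering of that decompiled logic rather than the actual code; the helper names and every constant other than the 0x20 are guesses.)

    #include <stdint.h>

    /* Helpers assumed to exist in aboot; the names are made up */
    extern uint32_t pmic_read_power_on_reason(void);
    extern void cmdline_append(const char *arg);

    static void set_poweron_reason(void)
    {
        uint32_t reason = pmic_read_power_on_reason();

        if (reason == 0x20) {
            /* Power-on came from USB being attached */
            cmdline_append("androidboot.poweronreason=USB");
        } else if (reason == 0x80) {
            /* Power key pressed (this constant is a guess) */
            cmdline_append("androidboot.poweronreason=PWRKEY");
        } else {
            /* Anything unrecognised is reported as a hard reset */
            cmdline_append("androidboot.poweronreason=Hard_Reset");
        }
    }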
Changing the 0x20 to 0xff meant that the USB case would never be detected, and it would fall through to the default. All the userland code was happy to accept "Hard_Reset" as a legitimate reason to boot, and now plugging in USB results in the modem booting to a functional state. Woo.
If you want to do this yourself, dump the aboot partition from your device, and search for the hex sequence "03 02 00 0a 20". Change the final 0x20 to 0xff, copy it back to the device, and write it to mtdblock7. If it doesn't work, feel free to curse me, but I'm almost certainly going to be no use to you whatsoever. Also, please, do not just attempt to mechanically apply this to other hotspots. It's not going to end well.
Thank you to everyone who attended Linux Plumbers 2022, whether in person or virtually. After two years of being 100% virtual due to the pandemic, we were able to have a very successful hybrid conference, with 418 people registering in person, of whom 401 attended (96%), and 361 registering virtually, of whom 320 actually participated online (89%), not counting all those who used the free YouTube service. After two years of being fully remote, we decided to keep this year’s in-person count lower than normal due to the unknowns caused by the pandemic. To compensate for the smaller venue, we tried something new and created a virtual attendance as well. We took a different approach than other hybrid conferences and treated this one as a virtual event with an in-person component, where the in-room attendees were simply participants of the virtual event. This required all presentations to be uploaded to Big Blue Button, and the presenters presented through the virtual platform even though they were doing so on stage. This allowed the virtual attendees to be treated as first-class citizens of the conference. Although we found this format a success, it wasn’t without technical difficulties, like having no sound at the beginning of the first day, but that’s to be expected when attempting something for the first time. Overall, we found it to be a better experience and will continue with this format in future conferences.
We had a total of 18 microconferences (with patches already going out on the mailing lists as a result of discussions that happened there), 16 Refereed track talks, 8 Kernel Summit talks, 29 Networking and BPF Summit track talks, and 9 Toolchain track talks. There were also 17 birds-of-a-feather talks, several of which were added at the last minute to address issues that had just come up. Most of these presentations can still be seen on video.
Stay tuned for the feedback report from our attendees.
Next year Linux Plumbers will take place in North America (but not necessarily in the United States). We are still locking down the location. As is the custom, Linux Plumbers changes chairs every year, and next year will be chaired by Christian Brauner. It seems we like to have the chair live on a different continent from where the conference takes place. We are hoping to find a venue that can hold at least 600 people, which will allow us to increase the number of in-person attendees.
Finally, I want to thank all those that were involved in making Linux Plumbers the best technical conference there is. This would not have happened without the hard work from the planning committee (Alice Ferrazzi, Christian Brauner, David Woodhouse, Guy Lunardi, James Bottomley, Kate Stewart, Mike Rapoport, and Paul E. McKenney), the runners of the Networking and BPF Summit track, the Toolchain track, Kernel Summit, and those that put together the very productive microconferences. I would also like to thank all those that presented as well as those who attended both in-person and virtually. I want to thank our sponsors for their continued support, and hope that this year’s conference was well worth it for them. I want to give special thanks to the Linux Foundation and their staff, who went above and beyond to make this conference run smoothly. They do a lot of work behind the scenes and the planning committee greatly appreciates it.
Before signing off from 2022, I would like to ask if anyone would be interested in volunteering to help out at next year’s conference. We are especially looking for those who could help on a technical level, as we found that running a virtual component alongside a live event requires a few more people than we currently have. If you are interested, please send an email to contact@linuxplumbersconf.org.
Sincerely,
Steven Rostedt
Linux Plumbers 2022 Conference chair
One of the huge benefits of WebAuthn is that it makes traditional phishing attacks impossible. An attacker sends you a link to a site that looks legitimate but isn't, and you type in your credentials. With SMS or TOTP-based 2FA, you type in your second factor as well, and the attacker now has both your credentials and a legitimate (if time-limited) second factor token to log in with. WebAuthn prevents this by verifying that the site it's sending the secret to is the one that issued it in the first place - visit an attacker-controlled site and said attacker may get your username and password, but they won't be able to obtain a valid WebAuthn response.
But what if there was a mechanism for an attacker to direct a user to a legitimate login page, resulting in a happy WebAuthn flow, and obtain valid credentials for that user anyway? This seems like the lead-in to someone saying "The Aristocrats", but unfortunately it's (a) real, (b) RFC-defined, and (c) implemented in a whole bunch of places that handle sensitive credentials. The villain of this piece is RFC 8628, and while it exists for good reasons it can be used in a whole bunch of ways that have unfortunate security consequences.
What is the RFC 8628-defined Device Authorization Grant, and why does it exist? Imagine a device that you don't want to type a password into - either it has no input devices at all (eg, some IoT thing) or it's awkward to type a complicated password (eg, a TV with an on-screen keyboard). You want that device to be able to access resources on behalf of a user, so you want to ensure that that user authenticates the device. RFC 8628 describes an approach where the device requests the credentials, and then presents a code to the user (either on screen or over Bluetooth or something), and starts polling an endpoint for a result. The user visits a URL and types in that code (or is given a URL that has the code pre-populated) and is then guided through a standard auth process. The key distinction is that if the user authenticates correctly, the issued credentials are passed back to the device rather than the user - on successful auth, the endpoint the device is polling will return an oauth token.
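To illustrate the flow, the exchange looks roughly like this. The field names come from RFC 8628, but the host, client_id and values below are made up for the example:

    POST /device_authorization HTTP/1.1
    Host: auth.example.com
    Content-Type: application/x-www-form-urlencoded

    client_id=example-cli

    HTTP/1.1 200 OK
    Content-Type: application/json

    {
      "device_code": "GmRhmhcxhwAzkoEqiMEg_DnyEysNkuNhszIySk9eS",
      "user_code": "WDJB-MJHT",
      "verification_uri": "https://auth.example.com/device",
      "verification_uri_complete": "https://auth.example.com/device?user_code=WDJB-MJHT",
      "expires_in": 900,
      "interval": 5
    }

The requester then polls the token endpoint with grant_type=urn:ietf:params:oauth:grant-type:device_code until the user completes the login, at which point the oauth token is handed to whoever initiated the request - which is exactly the property the attack described next abuses.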
But what happens if it's not a device that requests the credentials, but an attacker? What if said attacker obfuscates the URL in some way and tricks a user into clicking it? The user will be presented with their legitimate ID provider login screen, and if they're using a WebAuthn token for second factor it'll work correctly (because it's genuinely talking to the real ID provider!). The user will then typically be prompted to approve the request, but in every example I've seen the language used here is very generic and doesn't describe what's going on or what the user is actually being asked to approve. AWS simply says "An application or device requested authorization using your AWS sign-in" and has a big "Allow" button, giving the user no indication at all that hitting "Allow" may give a third party their credentials.
This isn't novel! Christoph Tafani-Dereeper has an excellent writeup on this topic from last year, which builds on Nestori Syynimaa's earlier work. But whenever I've talked about this, people seem surprised at the consequences. WebAuthn is supposed to protect against phishing attacks, but this approach subverts that protection by presenting the user with a legitimate login page and then handing their credentials to someone else.
RFC 8628 actually recognises this vector and presents a set of mitigations. Unfortunately nobody actually seems to implement these, and most of the mitigations are based around the idea that this flow will only be used for physical devices. Sadly, AWS uses this for initial authentication for the aws-cli tool, so there's no device in that scenario. Another mitigation is that there's a relatively short window where the code is valid, and so sending a link via email is likely to result in it expiring before the user clicks it. An attacker could avoid this by directing the user to a domain under their control that triggers the flow and then redirects the user to the login page, ensuring that the code is only generated after the user has clicked the link.
Can this be avoided? The best way to do so is to ensure that you don't support this token issuance flow anywhere, or if you do then ensure that any tokens issued that way are extremely narrowly scoped. Unfortunately if you're an AWS user, that's probably not viable - this flow is required for the cli tool to perform SSO login, and users are going to end up with broadly scoped tokens as a result. The logs are also not terribly useful.
The infuriating thing is that this isn't necessary for CLI tooling. The reason this approach is taken is that you need a way to get the token to a local process even if the user is doing authentication in a browser. This can be avoided by having the process listen on localhost, and then have the login flow redirect to localhost (including the token) on successful completion. In this scenario the attacker can't get access to the token without having access to the user's machine, and if they have that they probably have access to the token anyway.
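As a rough sketch of what that looks like - the authorization URL and parameter names here are illustrative, not any particular provider's API:
// Sketch of the localhost-redirect alternative: the CLI listens on a
// loopback port and the browser-based login flow redirects back to it.
package main

import (
	"fmt"
	"net"
	"net/http"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	redirect := fmt.Sprintf("http://%s/callback", ln.Addr())

	code := make(chan string, 1)
	mux := http.NewServeMux()
	mux.HandleFunc("/callback", func(w http.ResponseWriter, r *http.Request) {
		// Only something running on this machine can receive this.
		code <- r.URL.Query().Get("code")
		fmt.Fprintln(w, "You can close this tab now.")
	})
	go http.Serve(ln, mux)

	fmt.Println("Open this in your browser:")
	fmt.Printf("https://idp.example.com/authorize?client_id=example-client&redirect_uri=%s\n", redirect)

	fmt.Println("got authorization code:", <-code)
	// The CLI would now exchange the code for a token as normal.
}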
There's no real moral here other than "Security is hard". Sorry.
I've been playing with an Orbic Speed, a relatively outdated device that only speaks LTE Cat 4, but the towers I can see from here are, uh, not well provisioned so throughput really isn't a concern (and refurbs are $18, so). As usual I'm pretty terrible at just buying devices and using them for their intended purpose, and in this case it has the irritating behaviour that if there's a power cut and the battery runs out it doesn't boot again when power returns, so here's what I've learned so far.
First, it's clearly running Linux (nmap indicates that, as do the headers from the built-in webserver). The login page for the web interface has some text reading "Open Source Notice" that highlights when you move the mouse over it, but that's it - there's code to make the text light up, but it's not actually a link. There are no exposed license notices at all, although there is a copy on the filesystem that doesn't seem to be reachable from anywhere. The notice tells you to email them to receive source code, but doesn't actually provide an email address.
Still! Let's see what else we can figure out. There's no open ports other than the web server, but there is an update utility that includes some interesting components. First, there's a copy of adb, the Android Debug Bridge. That doesn't mean the device is running Android, it's common for embedded devices from various vendors to use a bunch of Android infrastructure (including the bootloader) while having a non-Android userland on top. But this is still slightly surprising, because the device isn't exposing an adb interface over USB. There's also drivers for various Qualcomm endpoints that are, again, not exposed. Running the utility under Windows while the modem is connected results in the modem rebooting and Windows talking about new hardware being detected, and watching the device manager shows a bunch of COM ports being detected and bound by Qualcomm drivers. So, what's it doing?
Sticking the utility into Ghidra and looking for strings that correspond to the output that the tool conveniently leaves in the logs subdirectory shows that after finding a device it calls vendor_device_send_cmd(). This is implemented in a copy of libusb-win32 that, again, comes with no offer of source code. But it's also easy to drop that into Ghidra and discover that vendor_device_send_cmd() is just a wrapper for usb_control_msg(dev,0xc0,0xa0,0,0,NULL,0,1000);. Sending that from Linux results in the device rebooting and suddenly exposing some more USB endpoints, including a functional adb interface. Although, annoyingly, the rndis interface that enables USB tethering via the modem is now missing.
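If you want to send the same control request from Linux yourself, something along these lines should work using the gousb bindings - the vendor/product IDs are placeholders for whatever lsusb reports, and this is my reconstruction rather than anything shipped by the vendor:
// Rough equivalent of usb_control_msg(dev, 0xc0, 0xa0, 0, 0, NULL, 0, 1000).
package main

import (
	"log"

	"github.com/google/gousb"
)

func main() {
	ctx := gousb.NewContext()
	defer ctx.Close()

	// Placeholder IDs: substitute the modem's actual VID/PID.
	dev, err := ctx.OpenDeviceWithVIDPID(0x05c6, 0xf601)
	if err != nil || dev == nil {
		log.Fatalf("device not found: %v", err)
	}
	defer dev.Close()

	// bmRequestType 0xc0 = device-to-host, vendor-specific, device recipient;
	// bRequest 0xa0, wValue 0, wIndex 0, no data stage.
	if _, err := dev.Control(0xc0, 0xa0, 0, 0, nil); err != nil {
		// The device reboots into the new mode, so an error or disconnect
		// here isn't necessarily a failure.
		log.Printf("control transfer returned: %v", err)
	}
}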
Unfortunately the adb user is unprivileged, but most files on the system are world-readable. data/logs/atfwd.log is especially interesting. This modem has an application processor built into the modem chipset itself, and while the modem implements the Hayes Command Set there's also a mechanism for userland to register that certain AT commands should be pushed up to userland. These are handled by the atfwd_daemon that runs as root, and conveniently logs everything it's up to. This includes having logged all the communications executed when the update tool was run earlier, so let's dig into that.
The system sends a bunch of AT+SYSCMD= commands, each of which is in the form of echo (stuff) >>/usrdata/sec/chipid. Once that's all done, it sends AT+CHIPID, receives a response of CHIPID:PASS, and then AT+SER=3,1, at which point the modem reboots back into the normal mode - adb is gone, but rndis is back. But the logs also reveal that between the CHIPID request and the response is a security check that involves RSA. The logs on the client side show that the text being written to the chipid file is a single block of base64 encoded data. Decoding it just gives apparently random binary. Heading back to Ghidra shows that atfwd_daemon is reading the chipid file and then decrypting it with an RSA key. The key is obtained by calling a series of functions, each of which returns a long base64-encoded string. Decoding each of these gives 1028 bytes of high entropy data, which is then passed to another function that decrypts it using AES CBC using a key of 000102030405060708090a0b0c0d0e0f and an initialization vector of all 0s. This is somewhat weird, since there's 1028 bytes of data and 128 bit AES works on blocks of 16 bytes. The behaviour of OpenSSL is apparently to just pad the data out to a multiple of 16 bytes, but that implies that we're going to end up with a block of garbage at the end. It turns out not to matter - despite the fact that we decrypt 1028 bytes of input only the first 200 bytes mean anything, with the rest just being garbage. Concatenating all of that together gives us a PKCS#8 private key blob in PEM format. Which means we have not only the private key, but also the public key.
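For the record, the decryption step looks roughly like this in Go - the chunk list is a placeholder for the base64 strings recovered from the binary, and the 200-byte cutoff is just the observation above:
// Sketch of the key recovery: base64-decode each chunk, AES-128-CBC decrypt
// it with the fixed key and an all-zero IV, and keep the first 200 bytes.
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

func main() {
	key, _ := hex.DecodeString("000102030405060708090a0b0c0d0e0f")
	iv := make([]byte, aes.BlockSize) // all zeroes

	chunks := []string{ /* base64 strings extracted from atfwd_daemon */ }

	var pem []byte
	for _, c := range chunks {
		ct, err := base64.StdEncoding.DecodeString(c)
		if err != nil {
			panic(err)
		}
		// The blobs are 1028 bytes, not a multiple of the 16-byte AES block
		// size; pad to a full block (the trailing block decrypts to garbage,
		// but only the first 200 bytes matter anyway).
		if rem := len(ct) % aes.BlockSize; rem != 0 {
			ct = append(ct, make([]byte, aes.BlockSize-rem)...)
		}
		block, _ := aes.NewCipher(key)
		pt := make([]byte, len(ct))
		cipher.NewCBCDecrypter(block, iv).CryptBlocks(pt, ct)
		pem = append(pem, pt[:200]...) // only the first 200 bytes mean anything
	}
	fmt.Printf("%s\n", pem) // concatenated PKCS#8 private key in PEM format
}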
So, what's in the encrypted data, and where did it come from in the first place? It turns out to be a JSON blob that contains the IMEI and the serial number of the modem. This is information that can be read from the modem anyway, so it's not secret. The modem decrypts it, compares the values in the blob to its own values, and if they match sets a flag indicating that validation has succeeded. But what encrypted it in the first place? It turns out that the json blob is just POSTed to http://pro.w.ifelman.com/api/encrypt and an encrypted blob returned. Of course, the fact that it's encrypted on the server with the public key and decrypted on the modem with the private key means that having access to the modem gives us the public key as well, which means we can just encrypt our own blobs.
What does that buy us? Digging through the code shows the only case that it seems to matter is when parsing the AT+SER command. The first argument to this is the serial mode to transition to, and the second is whether this should be a temporary transition or a permanent one. Once parsed, these arguments are passed to /sbin/usb/compositions/switch_usb which just writes the mode out to /usrdata/mode.cfg (if permanent) or /usrdata/mode_tmp.cfg (if temporary). On boot, /data/usb/boot_hsusb_composition reads the number from this file and chooses which USB profile to apply. This requires no special permissions, except if the number is 3 - if so, the RSA verification has to be performed first. This is somewhat strange, since mode 9 gives the same rndis functionality as mode 3, but also still leaves the debug and diagnostic interfaces enabled.
So what's the point of all of this? I'm honestly not sure! It doesn't seem like any sort of effective enforcement mechanism (even ignoring the fact that you can just create your own blobs, if you change the IMEI on the device somehow you can just POST the new values to the server and get back a new blob), so the best explanation I've been able to come up with is that it's intended to ensure that there's some mapping between IMEI and serial number before the device can be transitioned into production mode during manufacturing.
But, uh, we can just ignore all of this anyway. Remember that AT+SYSCMD= stuff that was writing the data to /usrdata/sec/chipid in the first place? Anything that's passed to AT+SYSCMD is just executed as root. Which means we can just write a new value (including 3) to /usrdata/mode.cfg in the first place, without needing to jump through any of these hoops. Which also means we can just adb push a shell onto there and then use the AT interface to make it suid root, which avoids needing to figure out how to exploit any of the bugs that are just sitting there given it's running a 3.18.48 kernel.
Anyway, I've now got a modem that's got working USB tethering and also exposes a working adb interface, and I've got root on it. Which let me dump the bootloader and discover that it implements fastboot and has an oem off-mode-charge command which solves the problem I wanted to solve of having the device boot when it gets power again. Unfortunately I still need to get into fastboot mode. I haven't found a way to do it through software (adb reboot bootloader doesn't do anything), but this post suggests it's just a matter of grounding a test pad, at which point I should just be able to run fastboot oem off-mode-charge and it'll be all set. But that's a job for tomorrow.
Edit: Got into fastboot mode and ran fastboot oem off-mode-charge 0 but sadly it doesn't actually do anything, so I guess next is going to involve patching the bootloader binary. Since it's signed with a cert titled "General Use Test Key (for testing only)" it apparently doesn't have secure boot enabled, so this should be easy enough.
A few years ago, I posted on the challenges of maintaining low weight as one ages. I have managed to stay near my target weight, with the occasional excursion in either direction, though admittedly more often up than down. My suspicion that maintaining weight would prove 90% as difficult as losing it has proven to be all too well founded. As has the observation that exercise is inherently damaging to muscles (see for example here), especially as one's body's ability to repair itself decreases inexorably with age.
It can be helpful to refer back to those old college physics courses. One helpful formula is the well-worn Newtonian formula for kinetic energy, which is equal to half your mass times the square of your velocity. Now, the human body does not maintain precisely the same speed while moving (that is after all what bicycles are for), and the faster you are going, the more energy your body must absorb when decreasing your velocity by a set amount on each footfall. In fact, this amount of energy increases linearly with your average velocity. So you can reduce the energy absorption (and thus the muscle and joint damage) by decreasing your speed. And here you were wondering why old people often move much more slowly than do young people!
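In symbols (a quick back-of-the-envelope check using that same half-m-v-squared formula): slowing from v to v − Δv on a single footfall dissipates

\Delta E = \tfrac{1}{2}mv^2 - \tfrac{1}{2}m(v-\Delta v)^2 = mv\,\Delta v - \tfrac{1}{2}m(\Delta v)^2 \approx mv\,\Delta v,

which, for a fixed per-step Δv, does indeed grow linearly with v.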
But moving more slowly decreases the quality of the exercise, for example, requiring more time to gain the same cardiovascular benefits. One approach is to switch from (say) running to hiking uphill, thus decreasing velocity while increasing exertion. This works quite well, at least until it comes time to hike back down the hill.
At this point, another formula comes into play, that for potential energy. The energy released by lowering your elevation is your mass times the acceleration due to gravity times the difference in elevation. With each step you take downhill, your body must dissipate this amount of energy. Alternatively, you can transfer the potential energy into kinetic energy, but please see the previous discussion. And again, this is what bicycles are for, at least for those retaining sufficiently fast reflexes to operate them safely under those conditions. (Not me!!!)
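In the same notation, each downhill step of height Δh leaves your body

\Delta E = m\,g\,\Delta h

of energy to deal with, where g is the acceleration due to gravity.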
The potential energy can be dissipated by your joints or by your muscles, with muscular dissipation normally being less damaging. In other words, bend your knee and hip before, during, and after the time that your foot strikes the ground. This gives your leg muscles more time to dissipate that step's worth of potential energy. Walking backwards helps by bringing your ankle joint into play and also by increasing the extent to which your hip and knee can flex. Just be careful to watch where you are going, as falling backwards down a hill is not normally what you want to be doing. (Me, I walk backwards down the steepest slopes, which allow me to see behind myself just by looking down. It is also helpful to have someone else watching out for you.)
Also, take small steps. This reduces the difference in elevation, thus reducing the amount of energy that must be dissipated per step.
But wait! This also increases the number of steps, so that the effect of reducing your stride cancels out, right?
Wrong.
First, longer stride tends to result in higher velocity, the damaging effects of which were described above. Second, the damage your muscles incur while dissipating energy is non-linear with both the force that your muscles are exerting and the energy per unit time (also known as "power") that they are dissipating. To see this, recall that a certain level of force/power will cause your muscle to rupture completely, so that a (say) 10x reduction in force/power results in much greater than a 10x reduction in damage.
These approaches allow you to get good exercise with minimal damage to your body. Other strategies include the aforementioned bicycling as well as swimming. Although I am fond of swimming, I recognize that it is my exercise endgame, and that I will therefore need to learn to like it. But not just yet.
To circle back to the subject of the earlier blog post, one common term in the formulas for both kinetic and potential energy is one's mass. And I do find hiking easier than it was when I weighed 30 pounds more than I do now. Should I lose more weight? On this, I defer to my wife, who is a dietitian. She assures me that 180 pounds is my target weight.
So here I am! And here I will endeavor to stay, despite my body's continued fear of the food-free winter that it has never directly experienced.
One of the main points made is that common device frameworks are normally about targeting a common userspace (e.g. mesa for GPUs). Since a common userspace doesn't exist for accelerators, this presents a problem of what sort of common things can be targeted. There was discussion of tensorflow and pytorch as being the userspace, but also camera image processing and OpenCL. OpenXLA was also named as a userspace API that might be of interest to use as a target for implementations.
There was a discussion on what to call the subsystem and where to place it in the tree. It was agreed that the drivers would likely need to use DRM subsystem functionality, but having things live in drivers/gpu/drm would not be great. Moving things around now for current drivers is too hard to deal with for backports etc. Adding a new directory for accel drivers would be a good plan, even if they used the drm framework. There was a lot of naming discussion; I think we landed on drivers/skynet or drivers/accel (Greg and I like skynet more).
We had a discussion about RAS (Reliability, Availability, Serviceability), which is how hardware is monitored in data centers. GPU and acceleration drivers for datacentre operations define their own RAS interfaces that get plugged into monitoring systems. This seems like an area that could be standardised across drivers. Netlink was suggested as a possible solution for this area.
Namespacing for devices was brought up. I think the advice was that if you ever think you are going to namespace something in the future, you should probably consider creating a namespace for it up front, as designing one in later might prove difficult to secure properly.
We should use the drm framework with another major number to avoid some of the pain points and lifetime issues other frameworks see.
There was discussion about who could drive this forward, and Oded Gabbay from Intel's Habana Labs team was the obvious and best placed person to move it forward; Oded said he would endeavor to make it happen.
This is mostly a summary of the notes. I think we have a fair idea of a path forward; we just need to start bringing the pieces together upstream now.
How does the rcu_synchronize() wait-for-readers operation work?
But the rcu_domain class contains lock() and unlock() member functions!!!
Lockless things make me nervous!
There was limited time for questions, and each question's answer could easily have consumed the full 50 minutes allotted for the full talk. Therefore, I address these questions in the following sections.
How Does rcu_synchronize() Work?
There are a great many ways to make this work. Very roughly speaking, userspace RCU implementations usually have per-thread counters that are updated by readers and sampled by updaters, with the updaters waiting for all of the counters to reach zero. There are a large number of pitfalls and optimizations, some of which are covered in the 2012 Transactions On Parallel and Distributed Systems paper (non-paywalled draft). The most detailed discussion is in the supplementary materials.
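As a deliberately toy illustration of that counter scheme - this ignores every one of the pitfalls and optimizations the paper covers, the names are mine rather than from any real RCU implementation, and it exists only to show the basic shape:
// Each reader bumps its own counter inside read-side critical sections,
// and synchronize() waits until it has observed every counter at zero at
// least once, i.e. until every pre-existing reader has finished.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

type toyRCU struct {
	counters []atomic.Int64 // one slot per registered reader
}

func newToyRCU(readers int) *toyRCU {
	return &toyRCU{counters: make([]atomic.Int64, readers)}
}

// readLock/readUnlock bracket a read-side critical section for reader id.
func (r *toyRCU) readLock(id int)   { r.counters[id].Add(1) }
func (r *toyRCU) readUnlock(id int) { r.counters[id].Add(-1) }

// synchronize returns once every reader that was inside a critical
// section when it was called has left that critical section. Readers
// that start afterwards don't need to be waited for.
func (r *toyRCU) synchronize() {
	for i := range r.counters {
		for r.counters[i].Load() != 0 {
			runtime.Gosched() // wait for this reader to go quiescent
		}
	}
}

func main() {
	rcu := newToyRCU(2)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		rcu.readLock(0)
		fmt.Println("reader 0 in critical section")
		rcu.readUnlock(0)
	}()
	rcu.synchronize() // waits for any pre-existing readers
	fmt.Println("grace period elapsed")
	wg.Wait()
}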
The rcu_domain Class Contains lock() and unlock() Member Functions?
Indeed it does!
But names notwithstanding, these lock() and unlock() member functions need not contain memory-barrier instructions, let alone read-modify-write atomic operations, let alone acquisition and release of actual locks.
So why these misleading names??? These misleading names exist so that the rcu_domain class meets the requirements of Cpp17BasicLockable, which provides RAII capability for C++ RCU readers. Earlier versions of the RCU proposal for the C++ standard rolled their own RAII capability, but the committee wisely insisted that Cpp17BasicLockable's existing RAII capabilities be used instead.
So it is that rcu_domain::lock() simply enters an RCU read-side critical section and rcu_domain::unlock() exits that critical section. Yes, RCU read-side critical sections can be nested.
Lockless Things Make Me Nervous!!!
As well they should!
The wise developer will be at least somewhat nervous when implementing lockless code because that nervousness will help motivate the developer to be careful, to test and stress-test carefully, and, when necessary, make good use of formal verification.
In fact, one of the purposes of RCU is to package lockless code so as to make it easier to use. This presentation dove into one RCU use case, and other materials called out in this CPP Summit presentation looked into many other RCU use cases.
So proper use of RCU should enable developers to be less nervous. But hopefully not completely lacking in nervousness! :-)
Fast laptops are expensive, cheap laptops are slow. But even a fast laptop is slower than a decent workstation, and if your developers want a local build environment they're probably going to want a decent workstation. They'll want a fast (and expensive) laptop as well, though, because they're not going to carry their workstation home with them and obviously you expect them to be able to work from home. And in two or three years they'll probably want a new laptop and a new workstation, and that's even more money. Not to mention the risks associated with them doing development work on their laptop and then drunkenly leaving it in a bar or having it stolen or the contents being copied off it while they're passing through immigration at an airport. Surely there's a better way?
This is the thinking that leads to "Let's give developers a Chromebook and a VM running in the cloud". And it's an appealing option! You spend far less on the laptop, and the VM is probably cheaper than the workstation - you can shut it down when it's idle, you can upgrade it to have more CPUs and RAM as necessary, and you get to impose all sorts of additional neat security policies because you have full control over the network. You can run a full desktop environment on the VM, stream it to a cheap laptop, and get the fast workstation experience on something that weighs about a kilogram. Your developers get the benefit of a fast machine wherever they are, and everyone's happy.
But having worked at more than one company that's tried this approach, my experience is that very few people end up happy. I'm going to give a few reasons here, but I can't guarantee that they cover everything - and, to be clear, many (possibly most) of the reasons I'm going to describe aren't impossible to fix, they're simply not priorities. I'm also going to restrict this discussion to the case of "We run a full graphical environment on the VM, and stream that to the laptop" - an approach that only offers SSH access is much more manageable, but also significantly more restricted in certain ways. With those details mentioned, let's begin.
The first thing to note is that the overall experience is heavily tied to the protocol you use for the remote display. Chrome Remote Desktop is extremely appealing from a simplicity perspective, but is also lacking some extremely key features (eg, letting you use multiple displays on the local system), so from a developer perspective it's suboptimal. If you read the rest of this post and want to try this anyway, spend some time working with your users to find out what their requirements are and figure out which technology best suits them.
Second, let's talk about GPUs. Trying to run a modern desktop environment without any GPU acceleration is going to be a miserable experience. Sure, throwing enough CPU at the problem will get you past the worst of this, but you're still going to end up with users who need to do 3D visualisation, or who are doing VR development, or who expect WebGL to work without burning up every single one of the CPU cores you so graciously allocated to their VM. Cloud providers will happily give you GPU instances, but that's going to cost more and you're going to need to re-run your numbers to verify that this is still a financial win. "But most of my users don't need that!" you might say, and we'll get to that later on.
Next! Video quality! This seems like a trivial point, but if you're giving your users a VM as their primary interface, then they're going to do things like try to use Youtube inside it because there's a conference presentation that's relevant to their interests. The obvious solution here is "Do your video streaming in a browser on the local system, not on the VM" but from personal experience that's a super awkward pain point! If I click on a link inside the VM it's going to open a browser there, and now I have a browser in the VM and a local browser and which of them contains the tab I'm looking for WHO CAN SAY. So your users are going to watch stuff inside their VM, and re-compressing decompressed video is going to look like shit unless you're throwing a huge amount of bandwidth at the problem. And this is ignoring the additional irritation of your browser being unreadable while you're rapidly scrolling through pages, or terminal output from build processes being a muddy blur of artifacts, or the corner case of "I work for Youtube and I need to be able to examine 4K streams to determine whether changes have resulted in a degraded experience" which is a very real job and one that becomes impossible when you pass their lovingly crafted optimisations through whatever codec your remote desktop protocol has decided to pick based on some random guesses about the local network, and look everyone is going to have a bad time.
The browser experience. As mentioned before, you'll have local browsers and remote browsers. Do they have the same security policy? Who knows! Are all the third party services you depend on going to be ok with the same user being logged in from two different IPs simultaneously because they lost track of which browser they had an open session in? Who knows! Are your users going to become frustrated? Who knows oh wait no I know the answer to this one, it's "yes".
Accessibility! More of your users than you expect rely on various accessibility interfaces, be those mechanisms for increasing contrast, screen magnifiers, text-to-speech, speech-to-text, alternative input mechanisms and so on. And you probably don't know this, but most of these mechanisms involve having accessibility software be able to introspect the UI of applications in order to provide appropriate input or expose available options and the like. So, I'm running a local text-to-speech agent. How does it know what's happening in the remote VM? It doesn't, because it's just getting an a/v stream, so you need to run another accessibility stack inside the remote VM and the two of them are unaware of each other's existence and this works just as badly as you'd think. Alternative input mechanism? Good fucking luck with that, you're at best going to fall back to "Send synthesized keyboard inputs" and that is nowhere near as good as "Set the contents of this text box to this unicode string" and yeah I used to work on accessibility software maybe you can tell. And how is the VM going to send data to a braille output device? Anyway, good luck with the lawsuits over arbitrarily making life harder for a bunch of members of a protected class.
One of the benefits here is supposed to be a security improvement, so let's talk about WebAuthn. I'm a big fan of WebAuthn, given that it's a multi-factor authentication mechanism that actually does a good job of protecting against phishing, but if my users are running stuff inside a VM, how do I use it? If you work at Google there's a solution, but that does mean limiting yourself to Chrome Remote Desktop (there are extremely good reasons why this isn't generally available). Microsoft have apparently just specced a mechanism for doing this over RDP, but otherwise you're left doing stuff like forwarding USB over IP, and that means that your USB WebAuthn no longer works locally. It also doesn't work for any other type of WebAuthn token, such as a bluetooth device, or an Apple TouchID sensor, or any of the Windows Hello support. If you're planning on moving to WebAuthn and also planning on moving to remote VM desktops, you're going to have a bad time.
That's the stuff that comes to mind immediately. And sure, maybe each of these issues is irrelevant to most of your users. But the actual question you need to ask is what percentage of your users will hit one or more of these, because if that's more than an insignificant percentage you'll still be staffing all the teams that dealt with hardware, handling local OS installs, worrying about lost or stolen devices, and the glorious future of just being able to stop worrying about this is going to be gone and the financial benefits you promised would appear are probably not going to work out in the same way.
A lot of this falls back to the usual story of corporate IT - understand the needs of your users and whether what you're proposing actually meets them. Almost everything I've described here is a corner case, but if your company is larger than about 20 people there's a high probability that at least one person is going to fall into at least one of these corner cases. You're going to need to spend a lot of time understanding your user population to have a real understanding of what the actual costs are here, and I haven't seen anyone do that work before trying to launch this and (inevitably) going back to just giving people actual computers.
There are alternatives! Modern IDEs tend to support SSHing out to remote hosts to perform builds there, so as long as you're ok with source code being visible on laptops you can at least shift the "I need a workstation with a bunch of CPU" problem out to the cloud. The laptops are going to need to be more expensive because they're also going to need to run more software locally, but it wouldn't surprise me if this ends up being cheaper than the full-on cloud desktop experience in most cases.
Overall, the most important thing to take into account here is that your users almost certainly have more use cases than you expect, and this sort of change is going to have direct impact on the workflow of every single one of your users. Make sure you know how much that's going to be, and take that into consideration when suggesting it'll save you money.
Over the last few years it’s become popular to suggest that open source maintainers should be paid. There are a couple of stated motivations for this: one is that it would improve the security and reliability of the ecosystem (a position pioneered by several companies, like Tidelift), another is that it would be a solution to maintainer burnout, and finally that it would solve the open source free rider problem. The purpose of this blog is to examine each of these in turn to see whether paying maintainers actually would solve the problem (or, for some, whether the problem even exists in the first place).
Free Riders
The free rider problem is simply expressed: There’s a class of corporations which consume open source, for free, as the foundation of their profits but don’t give back enough of their allegedly ill gotten gains. In fact, a version of this problem is as old as time: the “workers” don’t get paid enough (or at all) by the “bosses”; greedy people disproportionately exploit the free but limited resources of the planet. Open Source is uniquely vulnerable to this problem because of the free (as in beer) nature of the software: people who don’t have to pay for something often don’t. Part of the problem also comes from the general philosophy of open source which tries to explain that it’s free (as in freedom) which matters not free (as in beer) and everyone, both producers and consumers should care about the former. In fact, in economic terms, the biggest impact open source has had on industry is from the free (as in beer) effect.
Open Source as a Destroyer of Market Value
Open Source is often portrayed as a “disrupter” of the market, but it’s not often appreciated that a huge part of that disruption is value destruction. Consider one of the older Open Source systems: Linux. As an operating system (when coupled with GNU or other user space software) it competed in the early days with proprietary UNIX. However, it’s impossible to maintain your margin competing against free, and the net result was that one by one the existing players were forced out of the market or refocussed on other offerings and now, other than for historical or niche markets, there’s really no proprietary UNIX maker left … essentially the value contained within the OS market was destroyed. This value destruction effect was exploited brilliantly by Google with Android: entering and disrupting an existing lucrative smart phone market, created and owned by Apple, with a free OS based on Open Source successfully created a load of undercutting handset manufacturers eager to be cheaper than Apple, who went on to carve out an 80% market share. Here, the value isn’t completely destroyed, but it has been significantly reduced (smart phones going from a huge margin business to a medium to low margin one).
All of this value destruction is achieved by the free (as in beer) effect of open source: the innovator who uses it doesn’t have to pay the full economic cost for developing everything from scratch, they just have to pay the innovation expense of adapting it (such adaptation being made far easier by access to the source code). This effect is also the reason why Microsoft and other companies railed about Open Source being a cancer on intellectual property: because it is1. However, this view is also the product of rigid and incorrect thinking: by destroying value in existing markets, open source presents far more varied and unique opportunities in newly created ones. The cardinal economic benefit of value destruction is that it lowers the barrier to entry (as Google demonstrated with Android) thus opening the market up to new and varied competition (or turning monopoly markets into competitive ones).
Envy isn’t a Good Look
If you follow the above, you’ll see the supposed “free rider” problem is simply a natural consequence of open source being free as in beer (someone is creating a market out of the thing you offered for free precisely because they didn’t have to pay for it): it’s not a problem to be solved, it’s a consequence to be embraced and exploited (if you’re clever enough). Not all of us possess the business acumen to exploit market opportunities like this, but if you don’t, envying those who do definitely won’t cause your quality of life to improve.
The bottom line is that having a mechanism to pay maintainers isn’t going to do anything about this supposed “free rider” problem because the companies that exploit open source and don’t give back have no motivation to use it.
Maintainer Burnout
This has become a hot topic over recent years with many blog posts and support groups devoted to it. From my observation it seems to matter what kind of maintainer you are: if you only have hobby projects you maintain on an as-time-becomes-available basis, it seems the chances of burn out aren’t high. On the other hand, if you’re effectively a full time Maintainer, burn out becomes a distinct possibility. I should point out I’m the former not the latter type of maintainer, so this is observation not experience, but it does seem to me that burn out at any job (not just that of a Maintainer) seems to happen when delivery expectations exceed your ability to deliver and you start to get depressed about the ever increasing backlog and vocal complaints. In industry when someone starts to burn out, the usual way of rectifying it is either to lighten the load or to provide assistance. I have noticed that full time Maintainers are remarkably reluctant to give up projects (presumably because each one is part of their core value), so helping with tooling to decrease the load is about the only possible intervention here.
As an aside about tooling, from parallels with Industry, although tools correctly used can provide useful assistance, there are sure fire ways to increase the possibility of burn out with inappropriate demands:
It does strike me that some of our much venerated open source systems, like github, have some of these same management anti-patterns, like encouraging Maintainers to chase repository stars to show value, having a daily reminder of outstanding PRs and Issues, showing everyone who visits your home page your contribution records for every project over the last year.
To get back to the main point, again by parallel with Industry, paying people more doesn’t decrease industrial burn out; it may produce a temporary feel good high, but the backlog pile eventually overcomes this. If someone is already working at full stretch at something they want to do giving them more money isn’t going to make them stretch further. For hobby maintainers like me, even if you could find a way to pay me that my current employer wouldn’t object to, I’m already devoting as much time as I can spare to my Maintainer projects, so I’m unlikely to find more (although I’m not going to refuse free money …).
Security and Reliability
Everyone wants Maintainers to code securely and reliably and also to respond to bug reports within a fixed SLA. Obviously open source Maintainers are already trying to code securely and reliably, and they aren’t going to do the SLA thing because they don’t have to (as the licence says “NO WARRANTY …”), so paying them won’t improve the former, and if they’re already devoting all the time they can to Maintenance, it won’t achieve the latter either. So how could Security and Reliability be improved? All a maintainer can really do is keep up to date with current coding techniques (when was the last time someone offered a free course to Maintainers to help with this?). Suggesting to a project that if they truly believed in security they’d rewrite it in Rust from tens of thousands of lines of C really is annoying and unhelpful.
One of the key ways to keep software secure and reliable is to run checkers over the code and do extensive unit and integration testing. The great thing about this is that it can be done as a side project from the main Maintenance task provided someone triages and fixes the issues generated. This latter is key; simply dumping issue reports on an overloaded maintainer makes the overload problem worse and adds to a pile of things they might never get around to. So if you are thinking of providing checker or tester resources, please also think how any generated issues might get resolved without generating more work for a Maintainer.
Business Models around Security and Reliability
A pretty old business model for code checking and testing is to have a distribution do it. The good ones tend to send patches upstream and their business model is to sell the software (or at least a support licence to it) which gives the recipients a SLA as well. So what’s the problem? Mainly the economics of this tried and trusted model. Firstly what you want supported must be shipped by a distribution, which means it must have a big enough audience for a distribution to consider it a fairly essential item. Secondly you end up paying a per instance use cost that’s an average of everything the distribution ships. The main killer is this per instance cost, particularly if you are a hyperscaler, so it’s no wonder there’s a lot of pressure to shift the cost from being per instance to per project.
As I said above, Maintainers often really need more help than more money. One good way to start would potentially be to add testing and checking (including bug fixing and upstreaming) services to a project. This would necessarily involve liaising with the maintainer (and could involve an honorarium) but the object should be to be assistive (and thus scalably added) to what the Maintainer is already doing and prevent the service becoming a Maintainer time sink.
Professional Maintainers
Most of the above analysis assumed Maintainers are giving all the time they have available to the project. However, in the case where a Maintainer is doing the project in their spare time or is an Employee of a Company and being paid partly to work on the project and partly on other things, paying them to become a full time Maintainer (thus leaving their current employment) has the potential to add the hours spent on “other things” to the time spent on the project and would thus be a net win. However, you have to also remember that turning from employee to independent contractor also comes with costs in terms of red tape (like health care, tax filings, accounting and so on), which can become a significant time sink, so the net gain in hours to the project might not be as many as one would think. In an ideal world, entities paying maintainers would also consider this problem and offer services to offload the burden (although none currently seem to consider this). Additionally, turning from part time to full time can increase the problem of burn out, particularly if you spend increasing portions of your newly acquired time worrying about admin issues or other problems associated with running your own consulting business.
Conclusions
The obvious conclusion from the above analysis is that paying maintainers mostly doesn’t achieve its stated goals. However, you have to remember that this is looking at the problem through the prism of claimed end results. One thing paying maintainers definitely does do is increase the mechanisms by which maintainers themselves make a living (which is a kind of essential existential precursor). Before paying maintainers became a thing, the only real way of making a living as a maintainer was reputation monetization (corporations paid you to have a maintainer on staff or because being a maintainer demonstrated a skill set they needed in other aspects of their business), but now a Maintainer also has the option to turn Professional. Increasing the number of ways of rewarding Maintainership therefore should be a net benefit to attracting people into all ecosystems.
In general, I think that paying maintainers is a good thing, but should be the beginning of the search for ways of remunerating Open Source contributors, not the end.
This version boasts an expanded index and API index, and also adds a number of improvements, perhaps most notably boldface for the most pertinent pages for a given index entry, courtesy of Akira Yokosawa. Akira also further improved the new ebook-friendly PDFs, expanded the list of acronyms, updated the build system to allow different perfbook formats to be built concurrently, adjusted for Ghostscript changes, carried out per-Linux-version updates, and did a great deal of formatting and other cleanup.
One of the code samples now uses C11 thread-local storage instead of the GCC __thread storage class, courtesy of Elad Lahav. Elad also added support for building code samples on QNX.
Johann Klähn, SeongJae Park, Xuwei Fu, and Zhouyi Zhou provided many welcome fixes throughout the book.
This release also includes a number of updates to RCU, memory ordering, locking, and non-blocking synchronization, as well as additional information on the combined use of synchronization mechanisms.
Being able to SSH into remote machines and do work there is great. Using hardware security tokens for 2FA is also great. But trying to use them both at the same time doesn't work super well, because if you hit a WebAuthn request on the remote machine it doesn't matter how much you mash your token - it's not going to work.
But could it?
The SSH agent protocol abstracts key management out of SSH itself and into a separate process. When you run "ssh-add .ssh/id_rsa", that key is being loaded into the SSH agent. When SSH wants to use that key to authenticate to a remote system, it asks the SSH agent to perform the cryptographic signatures on its behalf. SSH also supports forwarding the SSH agent protocol over SSH itself, so if you SSH into a remote system then remote clients can also access your keys - this allows you to bounce through one remote system into another without having to copy your keys to those remote systems.
More recently, SSH gained the ability to store SSH keys on hardware tokens such as Yubikeys. If configured appropriately, this means that even if you forward your agent to a remote site, that site can't do anything with your keys unless you physically touch the token. But out of the box, this is only useful for SSH keys - you can't do anything else with this support.
Well, that's what I thought, at least. And then I looked at the code and realised that SSH is communicating with the security tokens using the same library that a browser would, except it ensures that any signature request starts with the string "ssh:" (which a genuine WebAuthn request never will). This constraint can actually be disabled by passing -O no-restrict-websafe to ssh-agent, except that was broken until this weekend. But let's assume there's a glorious future where that patch gets backported everywhere, and see what we can do with it.
First we need to load the key into the security token. For this I ended up hacking up the Go SSH agent support. Annoyingly it doesn't seem to be possible to make calls to the agent without going via one of the exported methods here, so I don't think this logic can be implemented without modifying the agent module itself. But this is basically as simple as adding another key message type that looks something like:
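(The struct below is my reconstruction based on the field descriptions that follow; the exact field names, order, and message-type tags in the real agent protocol may differ.)
// Approximate shape of the add-identity message for an sk-ecdsa key.
// 17 and 25 are the SSH agent add-identity message numbers; treat this
// as a sketch rather than a copy of the real definition.
type skECDSAKeyMsg struct {
	Type        string `sshtype:"17|25"`
	Curve       string
	PubKeyBytes []byte
	RpId        string
	Flags       uint8
	KeyHandle   []byte
	Reserved    []byte
	Comments    string
	Constraints []byte `ssh:"rest"`
}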
Where Type is ssh.KeyAlgoSKECDSA256, Curve is "nistp256", RpId is the identity of the relying party (eg, "webauthn.io"), Flags is 0x1 if you want the user to have to touch the key, KeyHandle is the hardware token's representation of the key (basically an opaque blob that's sufficient for the token to regenerate the keypair - this is generally stored by the remote site and handed back to you when it wants you to authenticate). The other fields can be ignored, other than PubKeyBytes, which is supposed to be the public half of the keypair.
This causes an obvious problem. We have an opaque blob that represents a keypair. We don't have the public key. And OpenSSH verifies that PubKeyBytes is a legitimate ecdsa public key before it'll load the key. Fortunately it only verifies that it's a legitimate ecdsa public key, and does nothing to verify that it's related to the private key in any way. So, just generate a new ECDSA key (ecdsa.GenerateKey(elliptic.P256(), rand.Reader)) and marshal it (elliptic.Marshal(ecKey.Curve, ecKey.X, ecKey.Y)) and we're good. Pass that struct to ssh.Marshal() and then make an agent call.
Now you can use the standard agent interfaces to trigger a signature event. You want to pass the raw challenge (not the hash of the challenge!) - the SSH code will do the hashing itself. If you're using agent forwarding this will be forwarded from the remote system to your local one, and your security token should start blinking - touch it and you'll get back an ssh.Signature blob. ssh.Unmarshal() the Blob member to a struct like
type ecSig struct {
R *big.Int
S *big.Int
}
and then ssh.Unmarshal the Rest member to
type authData struct {
Flags uint8
SigCount uint32
}
The signature needs to be converted back to a DER-encoded ASN.1 structure (eg,
var b cryptobyte.Builder
b.AddASN1(asn1.SEQUENCE, func(b *cryptobyte.Builder) {
b.AddASN1BigInt(ecSig.R)
b.AddASN1BigInt(ecSig.S)
})
signatureDER, _ := b.Bytes()
), and then you need to construct the Authenticator Data structure. For this, take the RpId used earlier and generate the sha256. Append the one byte Flags variable, and then convert SigCount to big endian and append those 4 bytes. You should now have a 37 byte structure. This needs to be CBOR encoded (I used github.com/fxamacker/cbor and just called cbor.Marshal(data, cbor.EncOptions{})).
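Pulling those pieces together, the assembly step looks roughly like this - a sketch under the assumptions above, with the cbor call being the one just mentioned:
package sketch

import (
	"crypto/sha256"
	"encoding/binary"

	"github.com/fxamacker/cbor"
)

// buildAuthData assembles the 37-byte authenticator data described above
// (sha256 of the RpId, the one-byte Flags value, then SigCount as 4
// big-endian bytes) and CBOR-encodes the result.
func buildAuthData(rpId string, flags uint8, sigCount uint32) ([]byte, error) {
	rpHash := sha256.Sum256([]byte(rpId)) // 32 bytes
	data := make([]byte, 0, 37)
	data = append(data, rpHash[:]...)
	data = append(data, flags)
	var count [4]byte
	binary.BigEndian.PutUint32(count[:], sigCount)
	data = append(data, count[:]...)
	return cbor.Marshal(data, cbor.EncOptions{})
}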
Now base64 encode the sha256 of the challenge data, the DER-encoded signature and the CBOR-encoded authenticator data and you've got everything you need to provide to the remote site to satisfy the challenge.
There are alternative approaches - you can use USB/IP to forward the hardware token directly to the remote system. But that means you can't use it locally, so it's less than ideal. Or you could implement a proxy that communicates with the key locally and have that tunneled through to the remote host, but at that point you're just reinventing ssh-agent.
And you should bear in mind that the default behaviour of blocking this sort of request is for a good reason! If someone is able to compromise a remote system that you're SSHed into, they can potentially trick you into hitting the key to sign a request they've made on behalf of an arbitrary site. Obviously they could do the same without any of this if they've compromised your local system, but there is some additional risk to this. It would be nice to have sensible MAC policies that default-denied access to the SSH agent socket and only allowed trustworthy binaries to do so, or maybe have some sort of reasonable flatpak-style portal to gate access. For my threat model I think it's a worthwhile security tradeoff, but you should evaluate that carefully yourself.
Anyway. Now to figure out whether there's a reasonable way to get browsers to work with this.
After my last post, someone suggested that having employers be able to restrict keys to machines they control is a bad thing. So here's why I think Bring Your Own Device (BYOD) scenarios are bad not only for employers, but also for users.
There's obvious mutual appeal to having developers use their own hardware rather than rely on employer-provided hardware. The user gets to use hardware they're familiar with, and which matches their ergonomic desires. The employer gets to save on the money required to buy new hardware for the employee. From this perspective, there's a clear win-win outcome.
But once you start thinking about security, it gets more complicated. If I, as an employer, want to ensure that any systems that can access my resources meet a certain security baseline (eg, I don't want my developers using unpatched Windows ME), I need some of my own software installed on there. And that software doesn't magically go away when the user is doing their own thing. If a user lends their machine to their partner, is the partner fully informed about what level of access I have? Are they going to feel that their privacy has been violated if they find out afterwards?
But it's not just about monitoring. If an employee's machine is compromised and the compromise is detected, what happens next? If the employer owns the system then it's easy - you pick up the device for forensic analysis and give the employee a new machine to use while that's going on. If the employee owns the system, they're probably not going to be super enthusiastic about handing over a machine that also contains a bunch of their personal data. In much of the world the law is probably on their side, and even if it isn't then telling the employee that they have a choice between handing over their laptop or getting fired probably isn't going to end well.
But obviously this is all predicated on the idea that an employer needs visibility into what's happening on systems that have access to their systems, or which are used to develop code that they'll be deploying. And I think it's fair to say that not everyone needs that! But if you hold any sort of personal data (including passwords) for any external users, I really do think you need to protect against compromised employee machines, and that does mean having some degree of insight into what's happening on those machines. If you don't want to deal with the complicated consequences of allowing employees to use their own hardware, it's rational to ensure that only employer-owned hardware can be used.
But what about the employers that don't currently need that? If there's no plausible future where you'll host user data, or where you'll sell products to others who'll host user data, then sure! But if that might happen in future (even if it doesn't right now), what's your transition plan? How are you going to deal with employees who are happily using their personal systems right now? At what point are you going to buy new laptops for everyone? BYOD might work for you now, but will it always?
And if your employer insists on employees using their own hardware, those employees should ask what happens in the event of a security breach. Whose responsibility is it to ensure that hardware is kept up to date? Is there an expectation that security can insist on the hardware being handed over for investigation? What information about the employee's use of their own hardware is going to be logged, who has access to those logs, and how long are those logs going to be kept for? If those questions can't be answered in a reasonable way, it's a huge red flag. You shouldn't have to give up your privacy and (potentially) your hardware for a job.
Using technical mechanisms to ensure that employees only use employer-provided hardware is understandably icky, but it's something that allows employers to impose appropriate security policies without violating employee privacy.