Kernel Planet

May 27, 2022

Dave Airlie (blogspot): lavapipe Vulkan 1.2 conformant

The software Vulkan renderer in Mesa, lavapipe, has achieved official Vulkan 1.2 conformance. The non-obvious entry in the table is here.

Thanks to all the Mesa team who helped achieve this. Shout-outs to Mike of Zink fame, who drove a bunch of pieces over the line, and to Roland, who helped review some of the funkier changes.

We will be submitting for 1.3 conformance soon; there are just a few things to iron out.

May 27, 2022 07:58 PM

Paul E. McKenney: Stupid RCU Tricks: Is RCU Watching?

It is just as easy to ask why RCU wouldn't be watching all the time. After all, you never know when you might need to synchronize!

Unfortunately, an eternally watchful RCU is impractical in the Linux kernel due to energy-efficiency considerations. The problem is that if RCU watches an idle CPU, RCU needs that CPU to execute instructions. And making an idle CPU unnecessarily execute instructions (for a rather broad definition of the word “unnecessary”) will terminally annoy a great many people in the battery-powered embedded world. And for good reason: Making RCU avoid watching idle CPUs can provide 30-40% increases in battery lifetime.

In this, CPUs are not all that different from people. Interrupting someone who is deep in thought can cause them to lose 20 minutes of work. Similarly, when a CPU is deeply idle, asking it to execute instructions will consume not only the energy required for those instructions, but also much more energy to work its way out of that deep idle state, and then to return back to that deep idle state.

And this is why CPUs must tell RCU to stop watching them when they go idle. This allows RCU to ignore them completely, in particular, to refrain from asking them to execute instructions.

In some kernel configurations, RCU also ignores portions of the kernel's entry/exit code, that is, the last bits of kernel code before switching to userspace and the first bits of kernel code after switching away from userspace. This happens only in kernels built with CONFIG_NO_HZ_FULL=y, and even then only on CPUs mentioned in the CPU list passed to the nohz_full kernel parameter. This enables carefully configured HPC applications and CPU-bound real-time applications to get near-bare-metal performance from such CPUs, while still having the entire Linux kernel at their beck and call. Because RCU is not watching such applications, the scheduling-clock interrupt can be turned off entirely, thus avoiding disturbing such performance-critical applications.

But if RCU is not watching a given CPU, rcu_read_lock() has no effect on that CPU, which can come as a nasty surprise to the corresponding RCU read-side critical section, which naively expected to be able to safely traverse an RCU-protected data structure. This can be a trap for the unwary, which is why kernels built with CONFIG_PROVE_LOCKING=y (lockdep) complain bitterly when rcu_read_lock() is invoked on CPUs that RCU is not watching.

But suppose that you have code using RCU that is invoked both from deep within the idle loop and from normal tasks.

Back in the day, this was not much of a problem. True to its name, the idle loop was not much more than a loop, and the deep architecture-specific code on the kernel entry/exit paths had no need of RCU. This has changed, especially with the advent of idle drivers and governors, to say nothing of tracing. So what can you do?

First, you can invoke rcu_is_watching(), which will return true if RCU is watching. And, as you might expect, lockdep uses this function to figure out when it should complain bitterly. So suppose that you invoke rcu_is_watching() and it helpfully returns false, indicating that you cannot invoke rcu_read_lock() and friends. What now?

You could do what the v5.18 Linux kernel's kernel_text_address() does, which can be abbreviated as follows:

no_rcu = !rcu_is_watching();
if (no_rcu)
        rcu_nmi_enter(); // Make RCU watch!!!
do_rcu_traversals();
if (no_rcu)
        rcu_nmi_exit(); // Return RCU to its prior watchfulness state.

If your code is not so performance-critical, you can do what the arm64 implementation of cpu_suspend() does:

RCU_NONIDLE(__cpu_suspend_exit());

This macro forces RCU to watch while it executes its argument as follows:

#define RCU_NONIDLE(a) \
        do { \
                rcu_irq_enter_irqson(); \
                do { a; } while (0); \
                rcu_irq_exit_irqson(); \
        } while (0)

The rcu_irq_enter_irqson() and rcu_irq_exit_irqson() functions are essentially wrappers around the aforementioned rcu_nmi_enter() and rcu_nmi_exit().

Although RCU_NONIDLE() is more compact than the approach used by kernel_text_address(), it is still annoying to have to decorate your code. And this is why Peter Zijlstra has been reworking the various idle loops to cause RCU to be watching a much greater fraction of their code. This might well be an ongoing process as the idle loops continue gaining functionality, but Peter's good work thus far at least makes RCU watch the idle governors and a much larger fraction of the idle loop's trace events. When combined with the kernel entry/exit work by Peter, Thomas Gleixner, Mark Rutland, and many others, it is hoped that the functions not watched by RCU will all eventually be decorated with something like noinstr, for example:

static noinline noinstr unsigned long rcu_dynticks_inc(int incby)
{
        return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
}

We don't need to worry about exactly what this function does. For this blog entry, it is enough to know that its noinstr tag prevents tracing this function, making it less problematic for RCU to not be watching it.

What exactly are you prohibited from doing while RCU is not watching you?

As noted before, RCU readers are a no-go. If you try invoking rcu_read_lock(), rcu_read_unlock(), rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), or rcu_read_unlock_sched() while rcu_is_watching() would return false, lockdep will complain.

On the other hand, using SRCU (srcu_read_lock() and srcu_read_unlock()) is just fine, as is RCU Tasks Trace (rcu_read_lock_trace() and rcu_read_unlock_trace()). RCU Tasks Rude does not have explicit read-side markers, but anything that disables preemption acts as an RCU Tasks Rude reader no matter what rcu_is_watching() would return at the time.
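For example, an SRCU reader looks the same whether or not RCU is watching the CPU it runs on. Here is a minimal sketch (the srcu_struct and the reader function are illustrative, not taken from any particular driver):

#include <linux/srcu.h>

DEFINE_STATIC_SRCU(my_srcu); // Illustrative SRCU domain, not a real kernel symbol.

static void my_reader(void)
{
        int idx;

        idx = srcu_read_lock(&my_srcu); // Legal even where rcu_is_watching() returns false.
        /* ... traverse data whose updaters use synchronize_srcu(&my_srcu) ... */
        srcu_read_unlock(&my_srcu, idx);
}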

RCU Tasks is an odd special case. Like RCU Tasks Rude, RCU Tasks has implicit read-side markers, which are any region of non-idle-task kernel code that does not do a voluntary context switch (the idle tasks being handled by RCU Tasks Rude). Except that in kernels built with CONFIG_PREEMPTION=n and without any of RCU's test suite, the RCU Tasks API maps to plain old RCU. This means that code not watched by RCU is ignored by the remapped RCU Tasks in such kernels. Given that RCU Tasks ignores the idle tasks, this affects only user entry/exit code in kernels built with CONFIG_NO_HZ_FULL=y, and even then, only on CPUs mentioned in the list of CPUs given to the nohz_full kernel boot parameter, but it can nevertheless be a trap for the unwary.

In post-v5.18 mainline, you can build your kernel with CONFIG_FORCE_TASKS_RCU=y, in which case RCU Tasks will always be built into your kernel, avoiding this trap.

In summary, energy-efficiency, battery-lifetime, and application-performance/latency concerns force RCU to avert its gaze from idle CPUs, and, in kernels built with CONFIG_NO_HZ_FULL=y, also from CPUs on the low-level kernel entry/exit code paths. Fortunately, the recent changes have allowed RCU to be watching more code, but this being the kernel, corner cases will always be with us. This corner-case code from which RCU averts its gaze requires the special handling described in this post.

May 27, 2022 12:43 AM

May 26, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Containers and Checkpoint/Restore

Linux Plumbers Conference 2022 is pleased to host the Containers and Checkpoint/Restore Microconference

The Containers and Checkpoint/Restore Microconference focuses on both userspace and kernel related work. The microconference targets the wider container ecosystem, ideally with participants from all major container runtimes as well as init system developers.

Potential discussion topics include:

Please come and join the discussion centered on what holds “The Cloud” together.

We hope to see you there!

May 26, 2022 09:55 AM

May 23, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Kernel Testing & Dependability

Linux Plumbers Conference 2022 is pleased to host the Kernel Testing & Dependability Microconference

The Kernel Testing & Dependability Microconference focuses on advancing the state of testing of the Linux kernel and testing on Linux in general. The main purpose is to improve software quality and dependability for applications that require predictability and trust. The microconference aims to create connections between folks working on similar projects, and to help individual projects make progress.

This microconference is a merge of the Testing and Fuzzing and the Kernel Dependability and Assurance microconferences into a single session. There was a lot of overlap in the topics and attendees of these MCs, and combining the two tracks will promote collaboration between all the interested communities and people.

The Microconference is open to all topics related to testing on Linux, not necessarily in the kernel space.

Please come and join us in the discussion on how we can assure that Linux becomes the most trusted and dependable software in the world!

We hope to see you there!

May 23, 2022 09:23 AM

May 20, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: linux/arch

Linux Plumbers Conference 2022 is pleased to host the linux/arch Microconference

The linux/arch microconference aims to bring architecture maintainers together in one room to discuss how the code in arch/ can be improved, consolidated, and generalized.

Potential topics for the discussion are:

Please come and join us in the discussion about improving architectures' integration with generic kernel code!

We hope to see you there!

May 20, 2022 08:48 AM

May 17, 2022

Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Confidential Computing

Linux Plumbers Conference 2022 is pleased to host the Confidential Computing Microconference.

The Confidential Computing Microconference brings together plumbers enabling secure execution features across hypervisors, firmware, and the Linux kernel, through low-level user space, up to container runtimes.

Good progress was made on a couple of topics since last year, but enabling Confidential Computing in the Linux ecosystem is an ongoing process, and there are still many problems to solve. The most important ones are:

Please come and join us in the discussion for solutions to the open problems for supporting these technologies!

We hope to see you there!

May 17, 2022 02:56 PM

May 16, 2022

Matthew Garrett: Can we fix bearer tokens?

Last month I wrote about how bearer tokens are just awful, and a week later Github announced that someone had managed to exfiltrate bearer tokens from Heroku that gave them access to, well, a lot of Github repositories. This has inevitably resulted in a whole bunch of discussion about a number of things, but people seem to be largely ignoring the fundamental issue that maybe we just shouldn't have magical blobs that grant you access to basically everything even if you've copied them from a legitimate holder to Honest John's Totally Legitimate API Consumer.

To make it clearer what the problem is here, let's use an analogy. You have a safety deposit box. To gain access to it, you simply need to be able to open it with a key you were given. Anyone who turns up with the key can open the box and do whatever they want with the contents. Unfortunately, the key is extremely easy to copy - anyone who is able to get hold of your keyring for a moment is in a position to duplicate it, and then they have access to the box. Wouldn't it be better if something could be done to ensure that whoever showed up with a working key was someone who was actually authorised to have that key?

To achieve that we need some way to verify the identity of the person holding the key. In the physical world we have a range of ways to achieve this, from simply checking whether someone has a piece of ID that associates them with the safety deposit box all the way up to invasive biometric measurements that supposedly verify that they're definitely the same person. But computers don't have passports or fingerprints, so we need another way to identify them.

When you open a browser and try to connect to your bank, the bank's website provides a TLS certificate that lets your browser know that you're talking to your bank instead of someone pretending to be your bank. The spec allows this to be a bi-directional transaction - you can also prove your identity to the remote website. This is referred to as "mutual TLS", or mTLS, and a successful mTLS transaction ends up with both ends knowing who they're talking to, as long as they have a reason to trust the certificate they were presented with.

That's actually a pretty big constraint! We have a reasonable model for the server - it's something that's issued by a trusted third party and it's tied to the DNS name for the server in question. Clients don't tend to have stable DNS identity, and that makes the entire thing sort of awkward. But, thankfully, maybe we don't need to? We don't need the client to be able to prove its identity to arbitrary third party sites here - we just need the client to be able to prove it's a legitimate holder of whichever bearer token it's presenting to that site. And that's a much easier problem.

Here's the simple solution - clients generate a TLS cert. This can be self-signed, because all we want to do here is be able to verify whether the machine talking to us is the same one that had a token issued to it. The client contacts a service that's going to give it a bearer token. The service requests mTLS auth without being picky about the certificate that's presented. The service embeds a hash of that certificate in the token before handing it back to the client. Whenever the client presents that token to any other service, the service ensures that the mTLS cert the client presented matches the hash in the bearer token. Copy the token without copying the mTLS certificate and the token gets rejected. Hurrah hurrah hats for everyone.
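As a rough sketch of that final check (illustrative only; the function name and the choice of a SHA-256 digest over the DER-encoded certificate are assumptions here, not a description of any particular service), the service just hashes the presented mTLS client certificate and compares it against the hash embedded in the token:

#include <stdbool.h>
#include <stddef.h>
#include <openssl/crypto.h>
#include <openssl/sha.h>

/* cert_der/cert_len: the client certificate presented on this mTLS connection.
 * token_cert_hash: the SHA-256 digest embedded in the bearer token at issuance. */
static bool token_bound_to_cert(const unsigned char *cert_der, size_t cert_len,
                                const unsigned char token_cert_hash[SHA256_DIGEST_LENGTH])
{
        unsigned char presented_hash[SHA256_DIGEST_LENGTH];

        SHA256(cert_der, cert_len, presented_hash);

        /* Constant-time compare; a token shown without its certificate fails here. */
        return CRYPTO_memcmp(presented_hash, token_cert_hash, SHA256_DIGEST_LENGTH) == 0;
}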

Well except for the obvious problem that if you're in a position to exfiltrate the bearer tokens you can probably just steal the client certificates and keys as well, and now you can pretend to be the original client and this is not adding much additional security. Fortunately pretty much everything we care about has the ability to store the private half of an asymmetric key in hardware (TPMs on Linux and Windows systems, the Secure Enclave on Macs and iPhones, either a piece of magical hardware or Trustzone on Android) in a way that avoids anyone being able to just steal the key.

How do we know that the key is actually in hardware? Here's the fun bit - it doesn't matter. If you're issuing a bearer token to a system then you're already asserting that the system is trusted. If the system is lying to you about whether or not the key it's presenting is hardware-backed then you've already lost. If it lied and the system is later compromised then sure all your apes get stolen, but maybe don't run systems that lie and avoid that situation as a result?

Anyway. This is covered in RFC 8705 so why aren't we all doing this already? From the client side, the largest generic issue is that TPMs are astonishingly slow in comparison to doing a TLS handshake on the CPU. RSA signing operations on TPMs can take around half a second, which doesn't sound too bad, except your browser is probably establishing multiple TLS connections to subdomains on the site it's connecting to and performance is going to tank. Fixing this involves doing whatever's necessary to convince the browser to pipe everything over a single TLS connection, and that's just not really where the web is right at the moment. Using EC keys instead helps a lot (~0.1 seconds per signature on modern TPMs), but it's still going to be a bottleneck.

The other problem, of course, is that ecosystem support for hardware-backed certificates is just awful. Windows lets you stick them into the standard platform certificate store, but the docs for this are hidden in a random PDF in a Github repo. Macs require you to do some weird bridging between the Secure Enclave API and the keychain API. Linux? Well, the standard answer is to do PKCS#11, and I have literally never met anybody who likes PKCS#11 and I have spent a bunch of time in standards meetings with the sort of people you might expect to like PKCS#11 and even they don't like it. It turns out that loading a bunch of random C bullshit that has strong feelings about function pointers into your security critical process is not necessarily something that is going to improve your quality of life, so instead you should use something like this and just have enough C to bridge to a language that isn't secretly plotting to kill your pets the moment you turn your back.

And, uh, obviously none of this matters at all unless people actually support it. Github has no support at all for validating the identity of whoever holds a bearer token. Most issuers of bearer tokens have no support for embedding holder identity into the token. This is not good! As of last week, all three of the big cloud providers support virtualised TPMs in their VMs - we should be running CI on systems that can do that, and tying any issued tokens to the VMs that are supposed to be making use of them.

So sure this isn't trivial. But it's also not impossible, and making this stuff work would improve the security of, well, everything. We literally have the technology to prevent attacks like Github suffered. What do we have to do to get people to actually start working on implementing that?


May 16, 2022 07:48 AM

May 09, 2022

Rusty Russell: Pickhardt Payments Implementation: Finding μ!

So, I’ve finally started implementing Pickhardt Payments in Core Lightning (#cln) and there are some practical complications beyond the paper which are worth noting for others who consider this!

In particular, the cost function in the paper cleverly combines the probability of success, with the fee charged by the channel, giving a cost function of:

-log((ce + 1 - fe) / (ce + 1)) + μ · fe · fee(e)

Which is great: bigger μ means fees matter more, smaller means they matter less. And the paper suggests various ways of adjusting them if you don’t like the initial results.

But, what’s a reasonable μ value? 1? 1000? 0.00001? Since the left term is the negative log of a probability, and the right is a value in millisats, it’s deeply unclear to me!

So it’s useful to look at the typical ranges of the first term, and the typical fees (the rest of the second term which is not μ), using stats from the real network.

If we want these two terms to be equal, we get:

-log((ce + 1 - fe) / (ce + 1)) = μ · fe · fee(e)
=> μ = -log((ce + 1 - fe) / (ce + 1)) / (fe · fee(e))

Let’s assume that fee(e) is the median fee: 51 parts per million. I chose to look at amounts of 1sat, 10sat, 100sat, 1000sat, 10,000sat, 100,000sat and 1M sat, and calculated the μ values for each channel. It turns out that, for almost all those values, the 10th percentile μ value is 0.125 times the median, and the 90th percentile μ value is 12.5 times the median, though for 1M sats it’s 0.21x and 51x, which probably reflects that the median fee is not 51 for these channels!

Nonetheless, this suggests we can calculate the “expected μ” using the median capacity of channels we could use for a payment (i.e. those with capacity >= amount), and the median feerate of those channels. We can then bias it by a factor of 10 or so either way, to reasonably promote certainty over fees or vice versa.

So, in the internal API for the moment I accept a frugality factor, generally 0.1 (not frugal, prefer certainty to fees) to 10 (frugal, prefer fees to certainty), and derive μ:

μ = -log((median_capacity_msat + 1 - amount_msat) / (median_capacity_msat + 1)) * frugality / (median_fee + 1)

The median is selected only from the channels with capacity > amount, and the +1 on the median_fee covers the case where median fee turns out to be 0 (such as in one of my tests!).

Note that it’s possible to try to send a payment larger than any channel in the network, using MPP. This is a corner case, where you generally care less about fees, so I set median_capacity_msat in the “no channels” case to amount_msat, and the resulting μ is really large, but at that point you can’t be fussy about fees!
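Putting that together as code (a sketch that simply restates the formula above, not the actual Core Lightning implementation; the function and parameter names are mine):

#include <math.h>

/* frugality ~0.1 prefers certainty over fees, ~10 prefers fees over certainty.
 * If no channel has capacity >= amount, the caller passes
 * median_capacity_msat == amount_msat, which makes mu very large. */
static double derive_mu(double median_capacity_msat, double amount_msat,
                        double median_fee, double frugality)
{
        return -log((median_capacity_msat + 1 - amount_msat)
                    / (median_capacity_msat + 1))
               * frugality / (median_fee + 1);
}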

May 09, 2022 05:37 AM

May 02, 2022

Brendan Gregg: Brendan@Intel.com

I'm thrilled to be joining Intel to work on the performance of everything, apps to metal, with a focus on cloud computing. It's an exciting time to be joining: The geeks are back with [Pat Gelsinger] and [Greg Lavender] as the CEO and CTO; new products are launching including the Sapphire Rapids processor; there are more competitors, which will drive innovation and move the whole industry forward more quickly; and Intel are building new fabs on US soil. It's a critical time to join, and an honour to do so as an Intel fellow, based in Australia.

My dream is to turn computer performance analysis into a science, one where we can completely understand the performance of everything: of applications, libraries, kernels, hypervisors, firmware, and hardware. These were the opening words of my 2019 [AWS re:Invent talk], which I followed by demonstrating rapid on-the-fly dynamic instrumentation of the Intel wireless driver. With the growing complexities of our industry, both hardware and software offerings, it has become increasingly challenging to find the root causes for system performance problems. I dream of solving this: to be able to observe everything and to provide complete answers to any performance question, for any workload, any operating system, and any hardware type.

A while ago I began exploring the idea of building this performance analysis capability for a major cloud, and use it to help find performance improvements to make that cloud industry-leading. The question was: Which cloud should I make number one? I'm grateful for the companies who explored this idea with me and offered good opportunities. I wasn't thinking Intel to start with, but after spending much time talking to Greg and other Intel leaders, I realized the massive opportunity and scope of an Intel role: I can work on new performance and debugging technologies for everything from apps down to silicon, across all xPUs (CPUs, GPUs, IPUs, etc.), and have a massive impact on the world. It was the most challenging of the options before me, just as Netflix was when I joined it, and at that point it gets hard for me to say no. I want to know if I can do this. Why climb the highest mountain?

Intel is a deeply technical company and a leader in high performance computing, and I'm excited to be working with others who have a similar appetite for deep technical work. I'll also have the opportunity to hire and mentor staff and build a team of the world's best performance and debugging engineers. My work will still involve hands-on engineering, but this time with resources.

I was reminded of this while interviewing with other companies: One interviewer who had studied my work asked "How many staff report to you?" "None." He kept returning to this question. I got the feeling that he didn't actually believe it, and thought if he asked enough times I'd confess to having a team. (I'm reminded of my days at Sun Microsystems, where the joke was that I must be a team of clones to get so much done – so I made a [picture].) People have helped me build things (I have detailed Acknowledgement lists in my books), but I've always been an individual contributor. I now have an opportunity at Intel to grow further in my career, and to help grow other people. Teaching others is another passion of mine – it's what drives me to write books, create training materials, and write blog posts here.

I'll still be working on eBPF and other open source projects, and I'm glad that Intel's leadership has committed to continue supporting [open source]. Intel has historically been a top contributor to the Linux kernel, and supports countless other projects. Many of the performance tools I've worked on are open source, in particular eBPF, which plays an important role for understanding the performance of everything. eBPF is in development for Windows as well (not just Linux where it originated).

For cloud computing performance, I'll be working on projects such as the [Intel DevCloud], run by [Markus Flierl], Corporate VP, General Manager, Intel Developer Cloud Platforms. I know Markus from Sun and I'm glad to be working for him at Intel.

While at Netflix in the last few years I've had more regular meetings with Intel than any other company, to the point where we had started joking that I already worked for Intel. I've found them to be not only the deepest technical company – capable of analysis and debugging at a mind-blowing atomic depth – but also professional and a pleasure to work with. This became especially clear after I recently worked with another hardware vendor, who were initially friendly and supportive but after evaluations of their technology went poorly became bullying and misleading. You never know a company (or person) until you see them on their worst day. Over the years I've seen Intel on good days and bad, and they have always been professional and respectful, and work hard to do right by the customer.

A bonus of the close relationship between Intel and Netflix, and my focus on cloud computing at Intel, is that I'll likely continue to help the performance of the Netflix cloud, as well as other clouds (that might mean you!). I'm looking forward to meeting new people in this bigger ecosystem, making computers faster everywhere, and understanding the performance of everything.

The title of this post is indeed my email address (I also used to be Brendan@Sun.com).

[picture]: https://www.brendangregg.com/Images/brendan_clones2006.jpg
[open source]: https://www.linkedin.com/pulse/open-letter-ecosystem-pat-gelsinger
[Pat Gelsinger]: https://twitter.com/PGelsinger
[Greg Lavender]: https://twitter.com/GregL_Intel
[Intel DevCloud]: https://www.intel.com/content/www/us/en/developer/tools/devcloud/overview.html
[AWS re:Invent talk]: https://www.youtube.com/watch?v=16slh29iN1g
[Markus Flierl]: https://www.linkedin.com/in/markus-flierl-375185/

May 02, 2022 12:00 AM

April 17, 2022

Matthew Garrett: The Freedom Phone is not great at privacy

The Freedom Phone advertises itself as a "Free speech and privacy first focused phone". As documented on the features page, it runs ClearOS, an Android-based OS produced by Clear United (or maybe one of the bewildering array of associated companies, we'll come back to that later). It's advertised as including Signal, but what's shipped is not the version available from the Signal website or any official app store - instead it's this fork called "ClearSignal".

The first thing to note about ClearSignal is that the privacy policy link from that page 404s, which is not a great start. The second thing is that it has a version number of 5.8.14, which is strange because upstream went from 5.8.10 to 5.9.0. The third is that, despite Signal being GPL 3, there's no source code available. So, I grabbed jadx and started looking for differences between ClearSignal and the upstream 5.8.10 release. The results were, uh, surprising.

First up is that they seem to have integrated ACRA, a crash reporting framework. This feels a little odd - in the absence of a privacy policy, it's unclear what information this gathers or how it'll be stored. Having a piece of privacy software automatically uploading information about what you were doing in the event of a crash with no notification other than a toast that appears saying "Crash Report" feels a little dubious.

Next is that Signal (for fairly obvious reasons) warns you if your version is out of date and eventually refuses to work unless you upgrade. ClearSignal has dealt with this problem by, uh, simply removing that code. The MacOS version of the desktop app they provide for download seems to be derived from a release from last September, which for an Electron-based app feels like a pretty terrible idea. Weirdly, for Windows they link to an official binary release from February 2021, and for Linux they tell you how to use the upstream repo properly. I have no idea what's going on here.

They've also added support for network backups of your Signal data. This involves the backups being pushed to an S3 bucket using credentials that are statically available in the app. It's ok, though, each upload has some sort of nominally unique identifier associated with it, so it's not trivial to just download other people's backups. But, uh, where does this identifier come from? It turns out that Clear Center, another of the Clear family of companies, employs a bunch of people to work on a ClearID[1], some sort of decentralised something or other that seems to be based on KERI. There's an overview slide deck here which didn't really answer any of my questions and as far as I can tell this is entirely lacking any sort of peer review, but hey it's only the one thing that stops anyone on the internet being able to grab your Signal backups so how important can it be.

The final thing, though? They've extended Signal's invitation support to encourage users to get others to sign up for Clear United. There's an exposed API endpoint called "get_user_email_by_mobile_number" which does exactly what you'd expect - if you give it a registered phone number, it gives you back the associated email address. This requires no authentication. But it gets better! The API to generate a referral link to send to others sends the name and phone number of everyone in your phone's contact list. There does not appear to be any indication that this is going to happen.

So, from a privacy perspective, I'm going to go with things being some distance from ideal. But what's going on with all these Clear companies anyway? They all seem to be related to Michael Proper, who founded the Clear Foundation in 2009. They are, perhaps unsurprisingly, heavily invested in blockchain stuff, while Clear United also appears to be some sort of multi-level marketing scheme which has a membership agreement that includes the somewhat astonishing claim that:

Specifically, the initial focus of the Association will provide members with supplements and technologies for:

9a. Frequency Evaluation, Scans, Reports;

9b. Remote Frequency Health Tuning through Quantum Entanglement;

9c. General and Customized Frequency Optimizations;


- there's more discussion of this and other weirdness here. Clear Center, meanwhile, has a Chief Physics Officer? I have a lot of questions.

Anyway. We have a company that seems to be combining blockchain and MLM, has some opinions about Quantum Entanglement, bases the security of its platform on a set of novel cryptographic primitives that seem to have had no external review, has implemented an API that just hands out personal information without any authentication and an app that appears more than happy to upload all your contact details without telling you first, has failed to update this app to keep up with upstream security updates, and is violating the upstream license. If this is their idea of "privacy first", I really hate to think what their code looks like when privacy comes further down the list.

[1] Pointed out to me here


April 17, 2022 12:23 AM

April 15, 2022

Brendan Gregg: Netflix End of Series 1

A large and unexpected opportunity has come my way outside of Netflix that I've decided to try. Netflix has been the best job of my career so far, and I'll miss my colleagues and the culture.


[Images: offer letter logo (2014); flame graphs (2014); eBPF tools (2014-2019); PMC analysis (2017); my pandemic-abandoned desk (2020); office wall]
I joined Netflix in 2014, a company at the forefront of cloud computing with an attractive [work culture]. It was the most challenging job among those I interviewed for. On the Netflix Java/Linux/EC2 stack there were no working mixed-mode flame graphs, no production safe dynamic tracer, and no PMCs: all tools I used extensively for advanced performance analysis. How would I do my job? I realized that this was a challenge I was best suited to fix. I could help not only Netflix but all customers of the cloud.

Since then I've done just that. I developed the original JVM changes to allow [mixed-mode flame graphs], I pioneered using [eBPF for observability] and helped develop the [front-ends and tools], and I worked with Amazon to get [PMCs enabled] and developed tools to use them. Low-level performance analysis is now possible in the cloud, and with it I've helped Netflix save a very large amount of money, mostly from service teams using flame graphs. There is also now a flourishing industry of observability products based on my work.

Apart from developing tools, much of my time has been spent helping teams with performance issues and evaluations. The Netflix stack is more diverse than I was expecting, and is explained in detail in the [Netflix tech blog]: the production cloud is AWS EC2, Ubuntu Linux, Intel x86, mostly Java with some Node.js (and other languages), microservices, Cassandra (storage), EVCache (caching), Spinnaker (deployment), Titus (containers), Apache Spark (analytics), Atlas (monitoring), FlameCommander (profiling), and at least a dozen more applications and workloads (but no 3rd party agents in the BaseAMI). The Netflix CDN runs FreeBSD and NGINX (not Linux: I published a Netflix-approved [footnote] in my last book to explain why). This diverse environment has always provided me with interesting things to explore, to understand, analyze, debug, and improve.

I've also used and helped develop many other technologies for debugging, primarily perf, Ftrace, eBPF (bcc and bpftrace), PMCs, MSRs, Intel vTune, and of course, [flame graphs] and [heat maps]. Martin Spier and I also created [Flame Scope] while at Netflix, to analyze perturbations and variation in profiles.

I've also had the chance to do other types of work. For 18 months I joined the [CORE SRE team] rotation, and was the primary contact for Netflix outages. It was difficult and fascinating work. I've also created internal training materials and classes, apart from my books. I've worked with awesome colleagues not just in cloud engineering, but also in open connect, studio, DVD, NTech, talent, immigration, HR, PR/comms, legal, and most recently ANZ content.

Last time I quit a job, I wanted to share publicly the reasons why I left, but I ultimately did not. I've since been asked many times why I resigned that job (not unlike [The Prisoner]) along with much speculation (none true). I wouldn't want the same thing happening here, and having people wondering if something bad happened at Netflix that caused me to leave: I had a great time and it's a great company!

I'm thankful for the opportunities and support I've had, especially from my former managers [Coburn] and [Ed]. I'm also grateful for the support for my work by other companies, technical communities, social communities (Twitter, HackerNews), conference organizers, and all who have liked my work, developed it further, and shared it with others. Thank you.

I hope my last two books, [Systems Performance 2nd Ed] and [BPF Performance Tools], serve Netflix well in my absence and everyone else who reads them. I'll still be posting here in my next job. More on that soon...

[work culture]: /blog/2017-11-13/brilliant-jerks.html
[work culture2]: https://www.slideshare.net/kevibak/netflix-culture-deck-77978007
[Systems Performance 2nd Ed]: /systems-performance-2nd-edition-book.html
[BPF Performance Tools]: /bpf-performance-tools-book.html
[mixed-mode flame graphs]: http://techblog.netflix.com/2015/07/java-in-flames.html
[eBPF for observability]: /blog/2015-05-15/ebpf-one-small-step.html
[front-ends and tools]: /Perf/bcc_tracing_tools.png
[front-ends and tools2]: /blog/2019-01-01/learn-ebpf-tracing.html
[PMCs enabled]: /blog/2017-05-04/the-pmcs-of-ec2.html
[CORE SRE team]: /blog/2016-05-04/srecon2016-perf-checklists-for-sres.html
[footnote]: /blog/images/2022/netflixcdn.png
[Coburn]: https://www.linkedin.com/in/coburnw/
[Ed]: https://www.linkedin.com/in/edwhunter/
[Netflix tech blog]: https://netflixtechblog.com/
[Why Don't You Use ...]: /blog/2022-03-19/why-dont-you-use.html
[The Prisoner]: https://en.wikipedia.org/wiki/The_Prisoner
[flame graphs]: /flamegraphs.html
[heat maps]: /heatmaps.html
[Flame Scope]: https://netflixtechblog.com/netflix-flamescope-a57ca19d47bb

April 15, 2022 12:00 AM

April 09, 2022

Brendan Gregg: TensorFlow Library Performance

A while ago I helped a colleague, Vadim, debug a performance issue with TensorFlow in an unexpected location. I thought this was a bit interesting so I've been meaning to share it; here's a rough post of the details.

## 1. The Expert's Eye

Vadim had spotted something unusual in this CPU flamegraph (redacted); do you see it?:

I'm impressed he found it so quickly, but then if you look at enough flame graphs the smaller unusual patterns start to jump out. In this case there's an orange tower (kernel code) that's unusual, and the cause is what I've highlighted here: 10% of total CPU time in page faults. At Netflix, 10% of CPU time somewhere unexpected can be a large costly issue across thousands of server instances. We'll use flame graphs to chase down the 1%s.

## 2. Why is it Still Faulting?

Browsing the stack trace shows these are from __memcpy_avx_unaligned(). Well, at least that makes sense: memcpy would be faulting in data segment mappings. But this process had been up and running for hours, and shouldn't still be doing so much page fault growth of its RSS. You see that early on when segments and the heap are fresh and don't have mappings yet, but after hours they are mostly faulted in (mapped to physical memory) and you see the faults dwindle.

Sometimes processes page fault a lot because they call mmap()/munmap() to add/remove memory. I used my eBPF tools to trace them ([mmapsnoop.py]) but there was no activity. So how is it still page faulting? Is it doing madvise() and dropping memory? A search of madvise in the flame graph showed it was 0.8% of CPU, so it definitely was, and madvise() was calling zap_page_range() which was calling the faults. (Click on the [flame graph] and try Ctrl-F searching for "madvise" and zooming in.)

## 3. Premature Optimization

I read the kernel code related to madvise() and zap_page_range() in mm/madvise.c. That showed it calls zap_page_range() for the MADV_DONTNEED flag. (I could have also traced sys_madvise() flags using kprobes/tracepoints.) This seemed to be a premature optimization gone bad: the allocator was calling dontneed on memory pages that it did in fact need. The dontneed dropped the virtual-to-physical mapping, which would soon afterwards cause a page fault to bring the mapping back.
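The refault pattern is easy to reproduce in isolation. Here's a minimal userspace sketch (illustrative only, nothing to do with the TensorFlow or jemalloc code) that page-faults the same range twice:

#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL * 1024 * 1024;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        memset(p, 1, len);              // first touch: faults every page in
        madvise(p, len, MADV_DONTNEED); // drops the mappings (zap_page_range())
        memset(p, 1, len);              // faults every page in all over again
        munmap(p, len);
        return 0;
}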
## 4. Allocator Issue

I suggested looking into the allocator, and Vadim said it was jemalloc, a configure option. He rebuilt with glibc, and the problem was fixed! Here's the fixed flame graph:

Initial testing showed only a 3% win (which can be verified by the flame graphs). We were hoping for 10%!

[mmapsnoop.py]: https://github.com/brendangregg/bpf-perf-tools-book/blob/master/originals/Ch07_Memory/mmapsnoop.py
[flame graph]: /blog/images/2022/cpuflamegraph.tensorflow0-red.svg

April 09, 2022 12:00 AM

April 05, 2022

Matthew Garrett: Bearer tokens are just awful

As I mentioned last time, bearer tokens are not super compatible with a model in which every access is verified to ensure it's coming from a trusted device. Let's talk about that in a bit more detail.

First off, what is a bearer token? In its simplest form, it's simply an opaque blob that you give to a user after an authentication or authorisation challenge, and then they show it to you to prove that they should be allowed access to a resource. In theory you could just hand someone a randomly generated blob, but then you'd need to keep track of which blobs you've issued and when they should be expired and who they correspond to, so frequently this is actually done using JWTs which contain some base64 encoded JSON that describes the user and group membership and so on and then have a signature associated with them so whenever the user presents one you can just validate the signature and then assume that the contents of the JSON are trustworthy.

One thing to note here is that the crypto is purely between whoever issued the token and whoever validates the token - as far as the server is concerned, any client who can just show it the token is just fine as long as the signature is verified. There's no way to verify the client's state, so one of the core ideas of Zero Trust (that we verify that the client is in a trustworthy state on every access) is already violated.

Can we make things not terrible? Sure! We may not be able to validate the client state on every access, but we can validate the client state when we issue the token in the first place. When the user hits a login page, we do state validation according to whatever policy we want to enforce, and if the client violates that policy we refuse to issue a token to it. If the token has a sufficiently short lifetime then an attacker is only going to have a short period of time to use that token before it expires and then (with luck) they won't be able to get a new one because the state validation will fail.

Except! This is fine for cases where we control the issuance flow. What if we have a scenario where a third party authenticates the client (by verifying that they have a valid token issued by their ID provider) and then uses that to issue their own token that's much longer lived? Well, now the client has a long-lived token sitting on it. And if anyone copies that token to another device, they can now pretend to be that client.

This is, sadly, depressingly common. A lot of services will verify the user, and then issue an oauth token that'll expire some time around the heat death of the universe. If a client system is compromised and an attacker just copies that token to another system, they can continue to pretend to be the legitimate user until someone notices (which, depending on whether or not the service in question has any sort of audit logs, and whether you're paying any attention to them, may be once screenshots of your data show up on Twitter).

This is a problem! There's no way to fit a hosted service that behaves this way into a Zero Trust model - the best you can say is that a token was issued to a device that was, around that time, apparently trustworthy, and now it's some time later and you have literally no idea whether the device is still trustworthy or if the token is still even on that device.

But wait, there's more! Even if you're nowhere near doing any sort of Zero Trust stuff, imagine the case of a user having a bunch of tokens from multiple services on their laptop, and then they leave their laptop unlocked in a cafe while they head to the toilet and whoops it's not there any more, better assume that someone has access to all the data on there. How many services has our opportunistic new laptop owner gained access to as a result? How do we revoke all of the tokens that are sitting there on the local disk? Do you even have a policy for dealing with that?

There isn't a simple answer to all of these problems. Replacing bearer tokens with some sort of asymmetric cryptographic challenge to the client would at least let us tie the tokens to a TPM or other secure enclave, and then we wouldn't have to worry about them being copied elsewhere. But that wouldn't help us if the client is compromised and the attacker simply keeps using the compromised client. The entire model of simply proving knowledge of a secret being sufficient to gain access to a resource is inherently incompatible with a desire for fine-grained trust verification on every access, but I don't see anything changing until we have a standard for third party services to be able to perform that trust verification against a customer's policy.

Still, at least this means I can just run weird Android IoT apps through mitmproxy, pull the bearer token out of the request headers and then start poking the remote API with curl. It may all be broken, but it's also got me a bunch of bug bounty credit, so, it;s impossible to say if its bad or not,

(Addendum: this suggestion that we solve the hardware binding problem by simply passing all the network traffic through some sort of local enclave that could see tokens being set and would then sequester them and reinject them into later requests is OBVIOUSLY HORRIFYING and is also probably going to be at least three startup pitches by the end of next week)


April 05, 2022 06:59 AM

Kees Cook: security things in Linux v5.10

Previously: v5.9

Linux v5.10 was released in December, 2020. Here’s my summary of various security things that I found interesting:

AMD SEV-ES
While guest VM memory encryption with AMD SEV has been supported for a while, Joerg Roedel, Thomas Lendacky, and others added register state encryption (SEV-ES). This means it’s even harder for a VM host to reconstruct a guest VM’s state.

x86 static calls
Josh Poimboeuf and Peter Zijlstra implemented static calls for x86, which operates very similarly to the “static branch” infrastructure in the kernel. With static branches, an if/else choice can be hard-coded, instead of being run-time evaluated every time. Such branches can be updated too (the kernel just rewrites the code to switch around the “branch”). All these principles apply to static calls as well, but they’re for replacing indirect function calls (i.e. a call through a function pointer) with a direct call (i.e. a hard-coded call address). This eliminates the need for Spectre mitigations (e.g. RETPOLINE) for these indirect calls, and avoids a memory lookup for the pointer. For hot-path code (like the scheduler), this has a measurable performance impact. It also serves as a kind of Control Flow Integrity implementation: an indirect call got removed, and the potential destinations have been explicitly identified at compile-time.
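To make the API concrete, here is a minimal sketch (the my_hook and do_default names are made up for illustration):

#include <linux/static_call.h>

static int do_default(int x) // Illustrative default target.
{
        return x + 1;
}

DEFINE_STATIC_CALL(my_hook, do_default); // Defines the call site, initially aimed at do_default().

static int caller(int x)
{
        return static_call(my_hook)(x); // Emitted as a direct call: no pointer load, no retpoline.
}

static void retarget(int (*new_fn)(int))
{
        static_call_update(my_hook, new_fn); // Patches the call site(s) to the new target.
}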

network RNG improvements
In an effort to improve the pseudo-random number generator used by the network subsystem (for things like port numbers and packet sequence numbers), Linux’s home-grown pRNG has been replaced by the SipHash round function, and perturbed by (hopefully) hard-to-predict internal kernel states. This should make it very hard to brute force the internal state of the pRNG and make predictions about future random numbers just from examining network traffic. Similarly, ICMP’s global rate limiter was adjusted to avoid leaking details of network state, as a start to fixing recent DNS Cache Poisoning attacks.

SafeSetID handles GID
Thomas Cedeno improved the SafeSetID LSM to handle group IDs (which required teaching the kernel about which syscalls were actually performing setgid.) Like the earlier setuid policy, this lets the system owner define an explicit list of allowed group ID transitions under CAP_SETGID (instead of to just any group), providing a way to keep the power of granting this capability much more limited. (This isn’t complete yet, though, since handling setgroups() is still needed.)

improve kernel’s internal checking of file contents
The kernel provides LSMs (like the Integrity subsystem) with details about files as they’re loaded. (For example, loading modules, new kernel images for kexec, and firmware.) There wasn’t very good coverage for cases where the contents were coming from things that weren’t files. To deal with this, new hooks were added that allow the LSMs to introspect the contents directly, and to do partial reads. This will give the LSMs much finer grain visibility into these kinds of operations.

set_fs removal continues
With the earlier work landed to free the core kernel code from set_fs(), Christoph Hellwig made it possible for set_fs() to be optional for an architecture. Subsequently, he then removed set_fs() entirely for x86, riscv, and powerpc. These architectures will now be free from the entire class of “kernel address limit” attacks that only needed to corrupt a single value in struct thread_info.

sysfs_emit() replaces sprintf() in /sys
Joe Perches tackled one of the most common bug classes with sprintf() and snprintf() in /sys handlers by creating a new helper, sysfs_emit(). This will handle the cases where kernel code was not correctly dealing with the length results from sprintf() calls, which might lead to buffer overflows in the PAGE_SIZE buffer that /sys handlers operate on. With the helper in place, it was possible to start the refactoring of the many sprintf() callers.
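A typical conversion in a show handler looks like this (a sketch; my_value and the attribute are placeholders, not real kernel symbols):

#include <linux/device.h>
#include <linux/sysfs.h>

static int my_value; // Placeholder state for the example.

static ssize_t my_value_show(struct device *dev,
                             struct device_attribute *attr, char *buf)
{
        /* Previously: return sprintf(buf, "%d\n", my_value);
         * sysfs_emit() knows the buffer is PAGE_SIZE and clamps the output. */
        return sysfs_emit(buf, "%d\n", my_value);
}
static DEVICE_ATTR_RO(my_value);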

nosymfollow mount option
Mattias Nissler and Ross Zwisler implemented the nosymfollow mount option. This entirely disables symlink resolution for the given filesystem, similar to other mount options where noexec disallows execve(), nosuid disallows setid bits, and nodev disallows device files. Quoting the patch, it is “useful as a defensive measure for systems that need to deal with untrusted file systems in privileged contexts.” (i.e. for when /proc/sys/fs/protected_symlinks isn’t a big enough hammer.) Chrome OS uses this option for its stateful filesystem, as symlink traversal has been a common attack-persistence vector.
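For illustration, the option can be set from userspace either with "mount -o nosymfollow" or via the mount(2) flag; here is a sketch of the latter (the device and mountpoint paths are placeholders, and MS_NOSYMFOLLOW only appears in uapi headers from Linux 5.10 onward):

#include <stdio.h>
#include <sys/mount.h>

#ifndef MS_NOSYMFOLLOW
#define MS_NOSYMFOLLOW 256 /* from linux/mount.h; older libc headers may lack it */
#endif

int main(void)
{
        /* Placeholder device and mountpoint. */
        if (mount("/dev/sdb1", "/mnt/untrusted", "ext4", MS_NOSYMFOLLOW, NULL)) {
                perror("mount");
                return 1;
        }
        return 0;
}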

ARMv8.5 Memory Tagging Extension support
Vincenzo Frascino added support to arm64 for the coming Memory Tagging Extension, which will be available for ARMv8.5 and later chips. It provides 4 bits of tags (covering multiples of 16 byte spans of the address space). This is enough to deterministically eliminate all linear heap buffer overflow flaws (1 tag for “free”, and then rotate even values and odd values for neighboring allocations), which is probably one of the most common bugs being currently exploited. It also makes use-after-free and over/under indexing much more difficult for attackers (but still possible if the target’s tag bits can be exposed). Maybe some day we can switch to 128 bit virtual memory addresses and have fully versioned allocations. But for now, 16 tag values is better than none, though we do still need to wait for anyone to actually be shipping ARMv8.5 hardware.

fixes for flaws found by UBSAN
The work to make UBSAN generally usable under syzkaller continues to bear fruit, with various fixes all over the kernel for stuff like shift-out-of-bounds, divide-by-zero, and integer overflow. Seeing these kinds of patches land reinforces the rationale of shifting the burden of these kinds of checks to the toolchain: these run-time bugs continue to pop up.

flexible array conversions
The work on flexible array conversions continues. Gustavo A. R. Silva and others continued to grind on the conversions, getting the kernel ever closer to being able to enable the -Warray-bounds compiler flag and clear the path for saner bounds checking of array indexes and memcpy() usage.
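The conversions themselves are mechanical; a typical before/after (an illustrative struct, not a specific kernel patch) looks like:

/* Before: zero-length trailing array (GNU extension), invisible to bounds checking. */
struct report_old {
        unsigned int len;
        unsigned char data[0];
};

/* After: a C99 flexible array member, which -Warray-bounds and struct_size()
 * can reason about, e.g. kzalloc(struct_size(r, data, n), GFP_KERNEL). */
struct report {
        unsigned int len;
        unsigned char data[];
};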

That’s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.11.

© 2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

April 05, 2022 12:01 AM

March 31, 2022

Matthew Garrett: ZTA doesn't solve all problems, but partial implementations solve fewer

Traditional network access controls work by assuming that something is trustworthy based on some other factor - for example, if a computer is on your office network, it's trustworthy because only trustworthy people should be able to gain physical access to plug something in. If you restrict access to your services to requests coming from trusted networks, then you can assert that it's coming from a trusted device.

Of course, this isn't necessarily true. A machine on your office network may be compromised. An attacker may obtain valid VPN credentials. Someone could leave a hostile device plugged in under a desk in a meeting room. Trust is being placed in devices that may not be trustworthy.

A Zero Trust Architecture (ZTA) is one where a device is granted no inherent trust. Instead, each access to a service is validated against some policy - if the policy is satisfied, the access is permitted. A typical implementation involves granting each device some sort of cryptographic identity (typically a TLS client certificate) and placing the protected services behind a proxy. The proxy verifies the device identity, queries another service to obtain the current device state (we'll come back to that in a moment), compares the state against a policy and either passes the request through to the service or rejects it. Different services can have different policies (eg, you probably want a lax policy around whatever's hosting the documentation for how to fix your system if it's being refused access to something for being in the wrong state), and if you want you can also tie it to proof of user identity in some way.
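As a very rough sketch of that per-request decision (everything here is illustrative: the field names, the ten-minute freshness window, and where the state comes from are assumptions, not any vendor's implementation):

#include <stdbool.h>
#include <string.h>
#include <time.h>

struct device_state {
        char   cert_fingerprint[65]; // hex SHA-256 of the enrolled client certificate
        time_t last_reported;        // when the device last reported its state
        bool   monitoring_running;   // endpoint monitoring still alive?
        bool   marked_lost;          // inventory says lost or stolen
};

static bool policy_allows(const struct device_state *st, const char *presented_fp)
{
        if (!st || strcmp(st->cert_fingerprint, presented_fp) != 0)
                return false;                        // unknown or mismatched identity
        if (st->marked_lost || !st->monitoring_running)
                return false;                        // inventory or runtime state revokes trust
        if (time(NULL) - st->last_reported > 10 * 60)
                return false;                        // state data too stale to trust
        return true;                                 // proxy passes the request through
}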

From a user perspective, this is entirely transparent. The proxy is made available on the public internet, DNS for the services points to the proxy, and every time your users try to access the service they hit the proxy instead and (if everything's ok) gain access to it no matter which network they're on. There's no need to connect to a VPN first, and there's no worries about accidentally leaking information over the public internet instead of over a secure link.

It's also notable that traditional solutions tend to be all-or-nothing. If I have some services that are more sensitive than others, the only way I can really enforce this is by having multiple different VPNs and only granting access to sensitive services from specific VPNs. This obviously risks combinatorial explosion once I have more than a couple of policies, and it's a terrible user experience.

Overall, ZTA approaches provide more security and an improved user experience. So why are we still using VPNs? Primarily because this is all extremely difficult. Let's take a look at an extremely recent scenario. A device used by customer support technicians was compromised. The vendor in question has a solution that can tie authentication decisions to whether or not a device has a cryptographic identity. If this was in use, and if the cryptographic identity was tied to the device hardware (eg, by being generated in a TPM), the attacker would not simply be able to obtain the user credentials and log in from their own device. This is good - if the attacker wanted to maintain access to the service, they needed to stay on the device in question. This increases the probability of the monitoring tooling on the compromised device noticing them.

Unfortunately, the attacker simply disabled the monitoring tooling on the compromised device. If device state was being verified on each access then this would be noticed before too long - the last data received from the device would be flagged as too old, and the requests would no longer satisfy any reasonable access control policy. Instead, the device was assumed to be trustworthy simply because it could demonstrate its identity. There's an important point here: just because a device belongs to you doesn't mean it's a trustworthy device.

So, if ZTA approaches are so powerful and user-friendly, why aren't we all using one? There's a few problems, but the single biggest is that there's no standardised way to verify device state in any meaningful way. Remote Attestation can both prove device identity and the device boot state, but the only product on the market that does much with this is Microsoft's Device Health Attestation. DHA doesn't solve the broader problem of also reporting runtime state - it may be able to verify that endpoint monitoring was launched, but it doesn't make assertions about whether it's still running. Right now, people are left trying to scrape this information from whatever tooling they're running. The absence of any standardised approach to this problem means anyone who wants to deploy a strong ZTA has to integrate with whatever tooling they're already running, and that then increases the cost of migrating to any other tooling later.

But even device identity is hard! Knowing whether a machine should be given a certificate or not depends on knowing whether or not you own it, and inventory control is a surprisingly difficult problem in a lot of environments. It's not even just a matter of whether a machine should be given a certificate in the first place - if a machine is reported as lost or stolen, its trust should be revoked. Your inventory system needs to tie into your device state store in order to ensure that your proxies drop access.

And, worse, all of this depends on you being able to put stuff behind a proxy in the first place! If you're using third-party hosted services, that's a problem. In the absence of a proxy, trust decisions are probably made at login time. It's possible to tie user auth decisions to device identity and state (eg, a self-hosted SAML endpoint could do that before passing through to the actual ID provider), but that's still going to end up providing a bearer token of some sort that can potentially be exfiltrated, and will continue to be trusted even if the device state becomes invalid.

ZTA doesn't solve all problems, and there isn't a clear path to it doing so without significantly greater industry support. But a complete ZTA solution is significantly more powerful than a partial one. Verifying device identity is a step on the path to ZTA, but in the absence of device state verification it's only a step.

March 31, 2022 11:06 PM

March 23, 2022

Matthew Garrett: AMD's Pluton implementation seems to be controllable

I've been digging through the firmware for an AMD laptop with a Ryzen 6000 that incorporates Pluton for the past couple of weeks, and I've got some rough conclusions. Note that these are extremely preliminary and may not be accurate, but I'm going to try to encourage others to look into this in more detail. For those of you at home, I'm using an image from here, specifically version 309. The installer is happy to run under Wine, and if you tell it to "Extract" rather than "Install" it'll leave a file sitting in C:\DRIVERS\ASUS_GA402RK_309_BIOS_Update_20220322235241 which seems to have an additional 2K of header on it. Strip that and you should have something approximating a flash image.
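If you want to follow along, something like the following should work (the input file name is a placeholder, and I'm assuming "2K" means 2048 bytes):

# Strip the 2048-byte vendor header to get an approximate flash image,
# then look for UTF-16 (little-endian) strings in it.
dd if=extracted_bios_update of=flash.bin bs=2048 skip=1
strings -e l flash.bin | less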

Looking for UTF16 strings in this reveals something interesting:

Pluton (HSP) X86 Firmware Support
Enable/Disable X86 firmware HSP related code path, including AGESA HSP module, SBIOS HSP related drivers.
Auto - Depends on PcdAmdHspCoreEnable build value
NOTE: PSP directory entry 0xB BIT36 have the highest priority.
NOTE: This option will NOT put HSP hardware in disable state, to disable HSP hardware, you need setup PSP directory entry 0xB, BIT36 to 1.
// EntryValue[36] = 0: Enable, HSP core is enabled.
// EntryValue[36] = 1: Disable, HSP core is disabled then PSP will gate the HSP clock, no further PSP to HSP commands. System will boot without HSP.

"HSP" here means "Hardware Security Processor" - a generic term that refers to Pluton in this case. This is a configuration setting that determines whether Pluton is "enabled" or not - my interpretation of this is that it doesn't directly influence Pluton, but disables all mechanisms that would allow the OS to communicate with it. In this scenario, Pluton has its firmware loaded and could conceivably be functional if the OS knew how to speak to it directly, but the firmware will never speak to it itself. I took a quick look at the Windows drivers for Pluton and it looks like they won't do anything unless the firmware wants to expose Pluton, so this should mean that Windows will do nothing.

So what about the reference to "PSP directory entry 0xB BIT36 have the highest priority"? The PSP is the AMD Platform Security Processor - it's an ARM core on the CPU package that boots before the x86. The PSP firmware lives in the same flash image as the x86 firmware, so the PSP looks for a header that points it towards the firmware it should execute. This gives a pointer to a "directory" - a list of different object types and where they're located in flash (there's a description of this for slightly older AMDs here). Type 0xb is treated slightly specially. Where most types contain the address of where the actual object is, type 0xb contains a 64-bit value that's interpreted as enabling or disabling various features - something AMD calls "soft fusing" (Intel have something similar that involves setting bits in the Firmware Interface Table). The PSP looks at the bits that are set here and alters its behaviour. If bit 36 is set, the PSP tells Pluton to turn itself off and will no longer send any commands to it.

So, we have two mechanisms to disable Pluton - the PSP can tell it to turn itself off, or the x86 firmware can simply never speak to it or admit that it exists. Both of these imply that Pluton has started executing before it's shut down, so it's reasonable to wonder whether it can still do stuff. In the image I'm looking at, there's a blob starting at 0x0069b610 that appears to be firmware for Pluton - it contains chunks that appear to be the reference TPM2 implementation, and it broadly decompiles as valid ARM code. It should be viable to figure out whether it can do anything in the face of being "disabled" via either of the above mechanisms.
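If you want to poke at that blob yourself, carving it out of the image is straightforward (file names are placeholders, and since the blob's length isn't known here this just takes everything from the offset onwards):

# Carve out the (presumed) Pluton firmware blob for inspection.
dd if=flash.bin of=pluton_fw.bin bs=4k iflag=skip_bytes skip=$((0x0069b610))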

Unfortunately for me, the system I'm looking at does set bit 36 in the 0xb entry - as a result, Pluton is disabled before x86 code starts running and I can't investigate further in any straightforward way. The implication that the user-controllable mechanism for disabling Pluton merely disables x86 communication with it rather than turning it off entirely is a little concerning, although (assuming Pluton is behaving as a TPM rather than having an enhanced set of capabilities) skipping any firmware communication means the OS has no way to know what happened before it started running even if it has a mechanism to communicate with Pluton without firmware assistance. In that scenario it'd be viable to write a bootloader shim that just faked up the firmware measurements before handing control to the OS.

The bit 36 disabling mechanism seems more solid? Again, it should be possible to analyse the Pluton firmware to determine whether it actually pays attention to a disable command being sent. But even if it chooses to ignore that, if the PSP is in a position to just cut the clock to Pluton, it's not going to be able to do a lot. At that point we're trusting AMD rather than trusting Microsoft, but given that you're also trusting AMD to execute the code you're giving them to execute, it's hard to avoid placing trust in them.

Overall: I'm reasonably confident that systems that ship with Pluton disabled via setting bit 36 in the soft fuses are going to disable it sufficiently hard that the OS can't do anything about it. Systems that give the user an option to enable or disable it are a little less clear in that respect, and it's possible (but not yet demonstrated) that an OS could communicate with Pluton anyway. However, if that's true, and if the firmware never communicates with Pluton itself, the user could install a stub loader in UEFI that mimics the firmware behaviour and leaves the OS thinking everything was good when it absolutely is not.

So, assuming that Pluton in its current form on AMD has no capabilities outside those we know about, the disabling mechanisms are probably good enough. It's tough to make a firm statement on this before I have access to a system that doesn't just disable it immediately, so stay tuned for updates.

March 23, 2022 08:42 AM

March 19, 2022

Brendan Gregg: Why Don't You Use ...

Working for a famous tech company, I get asked a lot "Why don't you use technology X?" X may be an application, programming language, operating system, hypervisor, processor, or tool. It may be because:

- It performs poorly.
- It is too expensive.
- It is not open source.
- It lacks features.
- It lacks a community.
- It lacks debug tools.
- It has serious bugs.
- It is poorly documented.
- It lacks timely security fixes.
- It lacks subject matter expertise.
- It's developed for the wrong audience.
- Our custom internal solution is good enough.
- Its longevity is uncertain: Its startup may be dead or sold soon.
- We know under NDA of a better solution.
- We know other bad things under NDA.
- Key contributors told us it was doomed.
- It made sense a decade ago but doesn't today.
- It made false claims in articles/blogs/talks and has lost credibility.
- It tolerates brilliant jerks and has no effective CoC.
- Our lawyers won't accept its T&Cs or license.

It's rarely because we don't know about it or don't understand it. Sometimes we do know about it, but have yet to find time to check it out. For big technical choices, it is often the result of an internal evaluation that involved various teams, and for some combination of the above reasons. Companies typically do not share these internal product evaluations publicly.

It's easier to say what you do use and why, than what you don't use and why not.
I used to be the outsider asking the big companies, and their silence would drive me nuts. Do they not know about this technology? Why wouldn't they use it?... I finally get it now, having seen it from the other side. They are usually well aware of various technologies but have reasons not to use them which they usually won't share.

## Why companies won't say why

**Private reasons**:

- It is too expensive.
- We know under NDA of a better solution.
- We know other bad things under NDA.
- Key contributors told us it was doomed.

These have all happened to me. Technology X may be too expensive because we're using another technology with a special discount that's confidential. If the reasons are under NDA, then I also cannot share them. In one case I was interested in a technology, only to have the CEO of the company that developed it tell me, under NDA, that he was abandoning it. They had not announced this publicly. I've also privately chatted with key technology contributors at conferences who are looking for something different to work on, because they believe their own technology is doomed. At some companies, the reason, whatever it is, may just be considered competitive knowledge and thus confidential.

**Complicated reasons**:

- It performs poorly.
- Our lawyers won't accept its T&Cs or license ([example]).
- Our custom internal solution is good enough.
- It made sense a decade ago but doesn't today.
- It tolerates brilliant jerks and has no effective CoC.
- It lacks subject matter expertise.
- It's developed for the wrong audience.

These are complicated and time-consuming to explain properly, and it may not be a good use of engineering time. If you rush a quick answer instead, you can put the company in a worse position than by saying nothing: that of providing a _weak_ explanation. Those vested in the technology will find it easy to attack your response – and have more time, energy, and interest in doing so – which can make your company look worse.

I've found one of these reasons is common: discovering that a product was built without subject matter expertise. I've seen startups that do nothing new and nothing well in a space that's crying for new or better solutions. I usually find out later that the development team has no prior domain experience. It's complicated to explain, even to the development team, as they lack the background to best understand why they missed the mark.

A common reason I've seen at smaller companies is that an internal solution is good enough. Adopting new technologies isn't "free": It takes (their limited) engineering time away from other projects, it adds technical debt, and it may add security risk (agents running as root). In this case, the other technology just doesn't have enough features or improved performance to justify a switch.

**Safer reasons**:

- It is poorly documented.
- It lacks features.
- It lacks debug tools.
- It has serious bugs.
- etc.

Those are objective and easy to discuss, right? Sort of. There are usually multiple reasons not to use a product, which might include all three: "safe", "complicated", and "private." It can be misleading to only mention the safe reasons. Also, if you do, and they get fixed, people will feel let down when you still don't switch.

**Bad reasons**:

- Not invented here syndrome.
- Pride/ego.
- Corporate politics.
- etc.

I didn't include them in the above list, but there can be some poor reasons behind technology choices.
At large tech companies (with many staff and their collective expertise) I'd expect that most technical choices have valid reasons, even if the company won't share them; poor reasons are the exception and not the rule. People assume it's poor technical choices ten times more often than it actually is. Just as an example, I've been asked many times why my company uses FreeBSD for its CDN, with the assumption that there must be some dumb reason we didn't choose Linux. No, we do regular Linux vs FreeBSD production tests (which I've helped analyze) and FreeBSD is faster for that workload. For a long time we just avoided this question. I finally got approval to mention it in a footnote in my last book (SysPerf2 page 124).

The only sure way to find out why a company doesn't use a given technology is to join that company and then ask the right team. Otherwise, try asking questions that are safer to answer, for example: "Are you aware of technology X?" and "Are you aware that technology X claims to be better at Y?". This can at least rule out ignorance.

Update: This was discussed on [Hacker News], which brought up some other reasons, including:

- Engineers don't know X and can't hire X engineers (certainly true at smaller companies).
- The cost of switching outweighs the benefits.
- Legacy.

And I liked the comment about "...random drive-by comments where the person is really just advertising x, rather than being genuinely curious."

[example]: /blog/2014-05-17/free-as-in-we-own-your-ip.html
[Hacker News]: https://news.ycombinator.com/item?id=30755861

March 19, 2022 12:00 AM

March 15, 2022

Lucas De Marchi: Tracepoints and kernel modules

While debugging, it’s invaluable to be able to trace functions in the Linux kernel. There are several tools for that task: perf, ftrace (with helper tools like trace-cmd), bpftrace, etc.

For my own debugging of kernel internals, and in the last few years to debug the i915 module for Intel graphics, I usually use tracepoints or dynamic tracepoints (with perf probe) as the trigger for whatever I want to trace.

One question I get sometimes is “how do I trace the initial probe of a kernel module?”. At that point you (usually) don’t have the module loaded yet and, as a consequence, the symbols of that module are not yet available for use with tracepoints. For example:

# trace-cmd list -e '^i915.*'
#

One thing to realize is that a lot of the time we don’t want to debug the module loading, but rather the module binding to a device. These are two separate things that happen in a cascade of events. It probably also adds to the confusion that the userspace tools to load modules are not named very consistently (modprobe and insmod), and they both end up calling the [f]init_module syscall.

The various buses available in the kernel have mechanisms for stopping the autoprobe (and hence the binding to a device) when a module is loaded. With i915, all we have to do is set a few things in /sys/bus/pci/:

# echo 0 > /sys/bus/pci/drivers_autoprobe
# modprobe i915

With the i915 module loaded, but not attached to any device, we are ready to attach to any tracepoint or create new dynamic probes. Let’s suppose I want to debug the i915_driver_probe() function, and any function called by it during initialization. This is one of the functions that initialize the GPU in i915, called when we are binding to a device.

F=i915_driver_probe

# Load the module without letting it bind to any device yet
echo 0 > /sys/bus/pci/drivers_autoprobe
modprobe i915

cd /sys/kernel/debug/tracing
# Clear any previous trace data
cat /dev/null > trace
# Graph $F and everything it calls, except _printk, up to 10 levels deep
echo $F > set_graph_function
echo _printk > set_graph_notrace
echo 10 > max_graph_depth
echo function_graph > current_tracer
echo 1 > tracing_on

# Bind the device manually and save the resulting trace
echo 0000:03:00.0 | tee /sys/bus/pci/drivers/i915/bind
cp trace /tmp/trace.txt

echo 0 > tracing_on
cat /dev/null > trace

With the snippet above we will start tracing whenever the i915_driver_probe() function is executed. We also set a few additional parameters: the maximum call depth to graph, and disabling graphing of printk() since it’s usually not very interesting. Depending on the size of your trace you may also need to increase buffer_size_kb to avoid the ring buffer rotating, or have something pumping data out of the ring buffer to a file.
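For example (the value and the destination file here are arbitrary; adjust them to your workload):

# Give the per-CPU ring buffers more room (value is in KiB per CPU)
echo 65536 > buffer_size_kb
# ...or stream events out continuously instead of copying "trace" at the end
cat trace_pipe > /tmp/trace.txt &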

Even after we enable tracing by echoing 1 to the tracing_on file, nothing happens, as we are not automatically binding to the device. In the snippet above we bind it manually, by writing the PCI slot to the /sys/bus/pci/drivers/i915/bind file. We should then start seeing the function being traced:

# head /tmp/trace.txt
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
  3)               |  i915_driver_probe [i915]() {
  3)               |    __devm_drm_dev_alloc() {
  3)               |      __kmalloc() {
  3)               |        kmalloc_order_trace() {
  3)               |          kmalloc_order() {
  3)               |            __alloc_pages() {

Another very useful thing to do is to attach to a specific tracepoint. For example, to trace MMIO reads/writes we can use the commands below. This time I’m using trace-cmd directly and omitting the device bind, which we should do after starting the recording:

# trace-cmd record -e i915:i915_reg_rw
Hit Ctrl^C to stop recording
^C
# trace-cmd report
...
tee-2239  [009]   778.693172: i915_reg_rw:          read reg=0x51000, len=4, val=(0xc1900, 0x0)
tee-2239  [009]   778.693174: i915_reg_rw:          write reg=0xc6204, len=4, val=(0x10251000, 0x0)
tee-2239  [009]   778.693796: i915_reg_rw:          read reg=0xa26c, len=4, val=(0x4, 0x0)
tee-2239  [009]   778.693805: i915_reg_rw:          read reg=0xd00, len=4, val=(0x14, 0x0)
tee-2239  [009]   778.693808: bprint:               intel_gt_init_clock_frequency: 0000:03:00.0 Using clock frequency: 19200kHz, period: 53ns, wrap: 113816ms
tee-2239  [009]   778.693817: i915_reg_rw:          read reg=0x913c, len=4, val=(0xff00, 0x0)
tee-2239  [009]   778.693825: i915_reg_rw:          read reg=0x9144, len=4, val=(0xff00, 0x0)
tee-2239  [009]   778.693908: i915_reg_rw:          read reg=0x9134, len=4, val=(0xff, 0x0)
tee-2239  [009]   778.693918: i915_reg_rw:          read reg=0x9118, len=4, val=(0xd02, 0x0)
tee-2239  [009]   778.693933: i915_reg_rw:          read reg=0x9140, len=4, val=(0x30005, 0x0)
tee-2239  [009]   778.693941: i915_reg_rw:          read reg=0x911c, len=4, val=(0x3000000, 0x0)

March 15, 2022 05:00 PM

March 12, 2022

Linux Plumbers Conference: Proposed Microconferences

We are pleased to announce the first batch of proposed microconferences at the Linux Plumbers Conference (LPC) 2022:

The call for microconference proposals will close on Saturday, 2 April 2022. The slots are filling up fast, so we strongly encourage everyone thinking about submitting a microconference to do so as soon as possible!

LPC 2022 is currently planned to take place in Dublin, Ireland from 12 September to 14 September. For details about the location and co-location with other events, see our website and social media for updates.

We do hope that LPC 2022 will be mainly an in-person event. Ideally, microconference runners should be willing and able to attend in person.

March 12, 2022 01:03 PM

February 17, 2022

Dave Airlie (blogspot): optimizing llvmpipe vertex/fragment processing.

Around 2 years ago, while I was working on tessellation support for llvmpipe and running the heaven benchmark on my Ryzen, I noticed that heaven, despite running slowly, wasn't saturating all the cores. I dug in a bit, and found that llvmpipe, despite threading the rasterization, fragment shading and blending stages, never did anything else while those were happening.

I dug into the code, as I clearly remembered seeing a concept of a "scene" that all the primitives were binned into and then dispatched from. It turned out the "scene" was always executed synchronously.

At the time I wrote support to allow multiple scenes to exist, so that while one scene was executing, the vertex shading and binning for the next scene could proceed and be queued up. For heaven at the time I saw some places where it would build 36 scenes. However, heaven was still 1fps with tess, regressions in other areas were rampant, and I mostly left the patches in a branch.

The reason so many things were broken by the patches was that large parts of llvmpipe, and also lavapipe, weren't ready for async pipeline processing. The concept of a fence after the pipeline finished was there, but wasn't used properly everywhere. A lot of operations assumed there was nothing going on behind the scenes, so they never fenced. Lots of things like queries broke due to the fact that a query would always be ready in the old model, but now query availability could return unavailable like on a real hw driver. Resource tracking existed but was incomplete, so knowing when to flush wasn't always accurate. Presentation was broken due to incorrect waiting, for both GL and Lavapipe. Lavapipe needed semaphore support that actually did something, as apps use semaphores between the render and present pipeline pieces.

Mesa CI recently got some paraview traces added to it, and I was doing some perf traces with them. Paraview is a data visualization tool, and it generates vertex-heavy workloads, as opposed to compositors and even games. It turned out binning was most of the overhead, and I realized the overlapping series could help this sort of workload. I dusted off the patch series and nailed down all the issues.

Emma Anholt ran some benchmarks on the results with the paraview traces and got

  • pv-waveletvolume fps +13.9279% +/- 4.91667% (n=15)
  • pv-waveletcountour fps +67.8306% +/- 11.4762% (n=3)
which seems like a good return on the investment.

I've got it all lined up in a merge request and it doesn't break CI anymore, so hopefully I'll get it landed in the next while, once I clean up any misc bits.

February 17, 2022 03:33 AM

February 11, 2022

Linux Plumbers Conference: CFP Open – Refereed Presentations

The Call for Refereed Presentation Proposals for the 2022 edition of the Linux Plumbers Conference (LPC) is now open.  We plan to hold LPC in Dublin, Ireland on September 12-14 in conjunction with The Linux Foundation Open Source Summit. 

If an in-person conference should prove to be impossible due to the circumstances at that time, Linux Plumbers will switch to a virtual-only conference. Submitters should ideally be able to give their presentation in person if circumstances permit, although presenting remotely will be possible in either case. Please see our website or social media for regular updates.

Refereed Presentations are 45 minutes in length and should focus on a specific aspect of the “plumbing” in a Linux system. Examples of Linux plumbing include core kernel subsystems, init systems, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problem statements, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

The Refereed Presentations track will be running throughout all three days of the conference. Note that the current Linux Plumbers Refereed track may overlap with the Open Source Summit.

Linux Plumbers Conference Program Committee members will be reviewing all submitted proposals.  High-quality submissions that cannot be accepted due to the limited number of slots will be forwarded to both the Open Source Summit and to organizers of suitable Linux Plumbers Microconferences for further consideration.

To submit a Refereed Track Presentation proposal, follow the instructions here [1].

Submissions are due on or before 11:59PM UTC on Sunday, June 12, 2022.

[1] https://lpc.events/event/16/abstracts/

February 11, 2022 07:24 PM

February 04, 2022

Linux Plumbers Conference: CFP Open – Microconferences

We are pleased to announce the call for papers (cfp) for microconferences at the Linux Plumbers Conference (LPC) 2022.

LPC 2022 is currently planned to take place in Dublin, Ireland from 12 September to 14 September. For details about the location and co-location with other events, see our website and social media for updates.

We do hope that LPC 2022 will be mainly an in-person event. Ideally, microconference runners should be willing and able to attend in person.

As the name suggests, LPC is concerned with Linux plumbing encompassing topics from kernel and userspace. A microconference is a set of sessions organized around a particular topic. The topic can be a kernel subsystem or a specific problem area in either kernel or userspace.

A microconference is supposed to be research and development in action and an abstract for a microconference should be thought of as a set of research questions and problem statements.

The sessions in each microconference are expected to address specific problems and should generate new ideas, solutions, and patches. Sessions should be focussed on discussion. Presentations should always aim to aid or kick off a discussion. If your presentation feels like a talk, we would recommend considering a submission to the LPC refereed track instead.

In past years microconferences were organized around topics such as security, scalability, energy efficiency, toolchains, containers, printing, system boot, Android, scheduling, filesystems, tracing, or real-time. The LPC microconference track is open to a wide variety of topics as long as it is focussed, concerned with interesting problems, and is related to open source and the wider Linux ecosystem. We are happy about a wide range of topics!

A microconference submission should outline the overall topic and list key people and problems which can be discussed. The list of problems and specific topics in a microconference can be continuously updated until fairly late. This will allow microconferences to cover topics that pop up after submission and to address new developments or problems.

Microconferences that have been at previous LPCs should list results and accomplishments in the submission and should make sure to cover follow-up work and new topics.

After a microconference has been accepted, microconference organizers are expected to write a short blogpost for the LPC website to announce and advertise their topic.

February 04, 2022 03:48 PM

February 03, 2022

Pete Zaitcev: Cura on Fedora is dead, use Slic3r

Was enjoying my Prusa i3S for a few months, but had to use my Lulzbot Mini today, and it was something else.

In the past, I used the cura-lulzbot package. It went through difficult times, with a Russian take-over and Qtfication. But I persisted in suffering, because, well, it was turnkey and I was a complete novice.

So, I went to install Cura on Fedora 35, and found that package cura-lulzbot is gone. Probably failed to build, and with spot no longer at Red Hat, nobody was motivated enough to keep it going.

The "cura" package is the Ultimaker Cura. It's an absolute dumpster fire of pretend open source. Tons of plug-ins, but basic materials are missing. I print in BASF Ultrafuse ABS, but the nearest available material is the PC/ABS mix.

The material problem is fixable with configuration, but a more serious problem is that the UI is absolutely bonkers with crazy flashing - and it does not work. They have menus that cannot be reached: as I move the cursor into a submenu, it disappears. Something is seriously broken in Qt on F35.

BTW, OpenSCAD suffers from incorrect refresh too on F35. It's super annoying, but at least it works, mostly.

Fortunately, "dnf remove cura" also removes 743 trash packages that it pulls in.

Then, I installed Slic3r, and that turned out to be pretty dope. It's a well put together package, and it has a graphical UI nowadays, operation of which is mostly bug-free and makes sense.

However, my first print popped off. As it turned out, Lulzbot requires the initial sequence that auto-levels it, and I missed that. I could extract it from my old dot files, but in the end I downloaded a settings package from Lulzbot website.

February 03, 2022 12:55 AM

January 20, 2022

Linux Plumbers Conference: Welcome to the 2022 Linux Plumbers Conference

Planning for the 2022 Linux Plumbers Conference is well underway. The hope is to be in Dublin co-located with OSS EU (although with hopefully non-overlapping dates). However, the Linux Foundation is still negotiating for a suitable venue so we can’t fully confirm the location yet.

There is an outside (and hopefully receding) chance that we may have to go back to being fully on-line this year, but if that happens, we’ll be sure to alert you through the usual channels of this blog and twitter.

January 20, 2022 05:55 PM

January 17, 2022

Matthew Garrett: Boot Guard and PSB have user-hostile defaults

Compromising an OS without it being detectable is hard. Modern operating systems support the imposition of a security policy or the launch of some sort of monitoring agent sufficiently early in boot that even if you compromise the OS, you're probably going to have left some sort of detectable trace[1]. You can avoid this by attacking the lower layers - if you compromise the bootloader then it can just hotpatch a backdoor into the kernel before executing it, for instance.

This is avoided via one of two mechanisms. Measured boot (such as TPM-based Trusted Boot) makes a tamper-proof cryptographic record of what the system booted, with each component in turn creating a measurement of the next component in the boot chain. If a component is tampered with, its measurement will be different. This can be used to either prevent the release of a cryptographic secret if the boot chain is modified (for instance, using the TPM to encrypt the disk encryption key), or can be used to attest the boot state to another device which can tell you whether you're safe or not. The other approach is verified boot (such as UEFI Secure Boot), where each component in the boot chain verifies the next component before executing it. If the verification fails, execution halts.

In both cases, each component in the boot chain measures and/or verifies the next. But something needs to be the first link in this chain, and traditionally this was the system firmware. Which means you could tamper with the system firmware and subvert the entire process - either have the firmware patch the bootloader in RAM after measuring or verifying it, or just load a modified bootloader and lie about the measurements or ignore the verification. Attackers had already been targeting the firmware (Hacking Team had something along these lines, although this was pre-secure boot so just dropped a rootkit into the OS), and given a well-implemented measured and verified boot chain, the firmware becomes an even more attractive target.

Intel's Boot Guard and AMD's Platform Secure Boot attempt to solve this problem by moving the validation of the core system firmware to an (approximately) immutable environment. Intel's solution involves the Management Engine, a separate x86 core integrated into the motherboard chipset. The ME's boot ROM verifies a signature on its firmware before executing it, and once the ME is up it verifies that the system firmware's bootblock is signed using a public key that corresponds to a hash blown into one-time programmable fuses in the chipset. What happens next depends on policy - it can either prevent the system from booting, allow the system to boot to recover the firmware but automatically shut it down after a while, or flag the failure but allow the system to boot anyway. Most policies will also involve a measurement of the bootblock being pushed into the TPM.

AMD's Platform Secure Boot is slightly different. Rather than the root of trust living in the motherboard chipset, it's in AMD's Platform Security Processor which is incorporated directly onto the CPU die. Similar to Boot Guard, the PSP has ROM that verifies the PSP's own firmware, and then that firmware verifies the system firmware signature against a set of blown fuses in the CPU. If that fails, system boot is halted. I'm having trouble finding decent technical documentation about PSB, and what I have found doesn't mention measuring anything into the TPM - if this is the case, PSB only implements verified boot, not measured boot.

What's the practical upshot of this? The first is that you can't replace the system firmware with anything that doesn't have a valid signature, which effectively means you're locked into firmware the vendor chooses to sign. This prevents replacing the system firmware with either a replacement implementation (such as Coreboot) or a modified version of the original implementation (such as firmware that disables locking of CPU functionality or removes hardware allowlists). In this respect, enforcing system firmware verification works against the user rather than benefiting them.
Of course, it also prevents an attacker from doing the same thing, but while this is a real threat to some users, I think it's hard to say that it's a realistic threat for most users.

The problem is that vendors are shipping with Boot Guard and (increasingly) PSB enabled by default. In the AMD case this causes another problem - because the fuses are in the CPU itself, a CPU that's had PSB enabled is no longer compatible with any motherboards running firmware that wasn't signed with the same key. If a user wants to upgrade their system's CPU, they're effectively unable to sell the old one. But in both scenarios, the user's ability to control what their system is running is reduced.

As I said, the threat that these technologies seek to protect against is real. If you're a large company that handles a lot of sensitive data, you should probably worry about it. If you're a journalist or an activist dealing with governments that have a track record of targeting people like you, it should probably be part of your threat model. But otherwise, the probability of you being hit by a purely userland attack is so ludicrously high compared to you being targeted this way that it's just not a big deal.

I think there's a more reasonable tradeoff than where we've ended up. Tying things like disk encryption secrets to TPM state means that if the system firmware is measured into the TPM prior to being executed, we can at least detect that the firmware has been tampered with. In this case nothing prevents the firmware being modified, there's just a record in your TPM that it's no longer the same as it was when you encrypted the secret. So, here's what I'd suggest:

1) The default behaviour of technologies like Boot Guard or PSB should be to measure the firmware signing key and whether the firmware has a valid signature into PCR 7 (the TPM register that is also used to record which UEFI Secure Boot signing key is used to verify the bootloader).
2) If the PCR 7 value changes, the disk encryption key release will be blocked, and the user will be redirected to a key recovery process. This should include remote attestation, allowing the user to be informed that their firmware signing situation has changed.
3) Tooling should be provided to switch the policy from merely measuring to verifying, and users at meaningful risk of firmware-based attacks should be encouraged to make use of this tooling

This would allow users to replace their system firmware at will, at the cost of having to re-seal their disk encryption keys against the new TPM measurements. It would provide enough information that, in the (unlikely for most users) scenario that their firmware has actually been modified without their knowledge, they can identify that. And it would allow users who are at high risk to switch to a higher security state, and for hardware that is explicitly intended to be resilient against attacks to have different defaults.
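To make the re-sealing step concrete, here's a rough sketch using tpm2-tools of sealing a disk encryption key against the current PCR 7 value (file names are arbitrary, and option syntax varies a little between tpm2-tools versions):

# Create a primary key, a policy bound to PCR 7, and a sealed object under it.
tpm2_createprimary -C o -c primary.ctx
tpm2_createpolicy --policy-pcr -l sha256:7 -L pcr7.policy
tpm2_create -C primary.ctx -L pcr7.policy -i disk-key.bin -u seal.pub -r seal.priv
tpm2_load -C primary.ctx -u seal.pub -r seal.priv -c seal.ctx
# Unsealing only succeeds while PCR 7 still matches the value it was sealed against.
tpm2_unseal -c seal.ctx -p pcr:sha256:7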

This is frustratingly close to possible with Boot Guard, but I don't think it's quite there. Before you've blown the Boot Guard fuses, the Boot Guard policy can be read out of flash. This means that you can drop a Boot Guard configuration into flash telling the ME to measure the firmware but not prevent it from running. But there are two problems remaining:

1) The measurement is made into PCR 0, and PCR 0 changes every time your firmware is updated. That makes it a bad default for sealing encryption keys.
2) It doesn't look like the policy is measured before being enforced. This means that an attacker can simply reflash modified firmware with a policy that disables measurement and then make a fake measurement that makes it look like the firmware is ok.

Fixing this seems simple enough - the Boot Guard policy should always be measured, and measurements of the policy and the signing key should be made into a PCR other than PCR 0. If an attacker modified the policy, the PCR value would change. If an attacker modified the firmware without modifying the policy, the PCR value would also change. People who are at high risk would run an app that would blow the Boot Guard policy into fuses rather than just relying on the copy in flash, and enable verification as well as measurement. Now if an attacker tampers with the firmware, the system simply refuses to boot and the attacker doesn't get anything.

Things are harder on the AMD side. I can't find any indication that PSB supports measuring the firmware at all, which obviously makes this approach impossible. I'm somewhat surprised by that, and so wouldn't be surprised if it does do a measurement somewhere. If it doesn't, there's a rather more significant problem - if a system has a socketed CPU, and someone has sufficient physical access to replace the firmware, they can just swap out the CPU as well with one that doesn't have PSB enabled. Under normal circumstances the system firmware can detect this and prompt the user, but given that the attacker has just replaced the firmware we can assume that they'd do so with firmware that doesn't decide to tell the user what just happened. In the absence of better documentation, it's extremely hard to say that PSB actually provides meaningful security benefits.

So, overall: I think Boot Guard protects against a real-world attack that matters to a small but important set of targets. I think most of its benefits could be provided in a way that still gave users control over their system firmware, while also permitting high-risk targets to opt-in to stronger guarantees. Based on what's publicly documented about PSB, it's hard to say that it provides real-world security benefits for anyone at present. In both cases, what's actually shipping reduces the control people have over their systems, and should be considered user-hostile.

[1] Assuming that someone's both turning this on and actually looking at the data produced

January 17, 2022 04:37 AM

January 09, 2022

Matthew Garrett: Pluton is not (currently) a threat to software freedom

At CES this week, Lenovo announced that their new Z-series laptops would ship with AMD processors that incorporate Microsoft's Pluton security chip. There's a fair degree of cynicism around whether Microsoft have the interests of the industry as a whole at heart or not, so unsurprisingly people have voiced concerns about Pluton allowing for platform lock-in and future devices no longer booting non-Windows operating systems. Based on what we currently know, I think those concerns are understandable but misplaced.

But first it's helpful to know what Pluton actually is, and that's hard because Microsoft haven't actually provided much in the way of technical detail. The best I've found is a discussion of Pluton in the context of Azure Sphere, Microsoft's IoT security platform. This, in association with the block diagrams on page 12 and 13 of this slidedeck, suggest that Pluton is a general purpose security processor in a similar vein to Google's Titan chip. It has a relatively low powered CPU core, an RNG, and various hardware cryptography engines - there's nothing terribly surprising here, and it's pretty much the same set of components that you'd find in a standard Trusted Platform Module of the sort shipped in pretty much every modern x86 PC. But unlike Titan, Pluton seems to have been designed with the explicit goal of being incorporated into other chips, rather than being a standalone component. In the Azure Sphere case, we see it directly incorporated into a Mediatek chip. In the Xbox Series devices, it's incorporated into the SoC. And now, we're seeing it arrive on general purpose AMD CPUs.

Microsoft's announcement says that Pluton can be shipped in three configurations: as the Trusted Platform Module; as a security processor used for non-TPM scenarios like platform resiliency; or OEMs can choose to ship with Pluton turned off. What we're likely to see to begin with is the former - Pluton will run firmware that exposes a Trusted Computing Group compatible TPM interface. This is almost identical to the status quo. Microsoft have required that all Windows certified hardware ship with a TPM for years now, but for cost reasons this is often not in the form of a separate hardware component. Instead, both Intel and AMD provide support for running the TPM stack on a component separate from the main execution cores on the system - for Intel, this TPM code runs on the Management Engine integrated into the chipset, and for AMD on the Platform Security Processor that's integrated into the CPU package itself.

So in this respect, Pluton changes very little; the only difference is that the TPM code is running on hardware dedicated to that purpose, rather than alongside other code. Importantly, in this mode Pluton will not do anything unless the system firmware or OS ask it to. Pluton cannot independently block the execution of any other code - it knows nothing about the code the CPU is executing unless explicitly told about it. What the OS can certainly do is ask Pluton to verify a signature before executing code, but the OS could also just verify that signature itself. Windows can already be configured to reject software that doesn't have a valid signature. If Microsoft wanted to enforce that they could just change the default today, there's no need to wait until everyone has hardware with Pluton built-in.

The two things that seem to cause people concerns are remote attestation and the fact that Microsoft will be able to ship firmware updates to Pluton via Windows Update. I've written about remote attestation before, so won't go into too many details here, but the short summary is that it's a mechanism that allows your system to prove to a remote site that it booted a specific set of code. What's important to note here is that the TPM (Pluton, in the scenario we're talking about) can't do this on its own - remote attestation can only be triggered with the aid of the operating system. Microsoft's Device Health Attestation is an example of remote attestation in action, and the technology definitely allows remote sites to refuse to grant you access unless you booted a specific set of software. But there are two important things to note here: first, remote attestation cannot prevent you from booting whatever software you want, and second, as evidenced by Microsoft already having a remote attestation product, you don't need Pluton to do this! Remote attestation has been possible since TPMs started shipping over two decades ago.

The other concern is Microsoft having control over the firmware updates. The context here is that TPMs are not magically free of bugs, and sometimes these can have security consequences. One example is Infineon TPMs producing weak RSA keys, a vulnerability that could be rectified by a firmware update to the TPM. Unfortunately these updates had to be issued by the device manufacturer rather than Infineon being able to do so directly. This meant users had to wait for their vendor to get around to shipping an update, something that might not happen at all if the machine was sufficiently old. From a security perspective, being able to ship firmware updates for the TPM without them having to go through the device manufacturer is a huge win.

Microsoft's obviously in a position to ship a firmware update that modifies the TPM's behaviour - there would be no technical barrier to them shipping code that resulted in the TPM just handing out your disk encryption secret on demand. But Microsoft already control the operating system, so they already have your disk encryption secret. There's no need for them to backdoor the TPM to give them something that the TPM's happy to give them anyway. If you don't trust Microsoft then you probably shouldn't be running Windows, and if you're not running Windows Microsoft can't update the firmware on your TPM.

So, as of now, Pluton running firmware that makes it look like a TPM just isn't a terribly interesting change to where we are already. It can't block you running software (either apps or operating systems). It doesn't enable any new privacy concerns. There's no mechanism for Microsoft to forcibly push updates to it if you're not running Windows.

Could this change in future? Potentially. Microsoft mention another use-case for Pluton "as a security processor used for non-TPM scenarios like platform resiliency", but don't go into any more detail. At this point, we don't know the full set of capabilities that Pluton has. Can it DMA? Could it play a role in firmware authentication? There are scenarios where, in theory, a component such as Pluton could be used in ways that would make it more difficult to run arbitrary code. It would be reassuring to hear more about what the non-TPM scenarios are expected to look like and what capabilities Pluton actually has.

But let's not lose sight of something more fundamental here. If Microsoft wanted to block free operating systems from new hardware, they could simply mandate that vendors remove the ability to disable secure boot or modify the key databases. If Microsoft wanted to prevent users from being able to run arbitrary applications, they could just ship an update to Windows that enforced signing requirements. If they want to be hostile to free software, they don't need Pluton to do it.

(Edit: it's been pointed out that I kind of gloss over the fact that remote attestation is a potential threat to free software, as it theoretically allows sites to block access based on which OS you're running. There's various reasons I don't think this is realistic - one is that there's just way too much variability in measurements for it to be practical to write a policy that's strict enough to offer useful guarantees without also blocking a number of legitimate users, and the other is that you can just pass the request through to a machine that is running the appropriate software and have it attest for you. The fact that nobody has actually bothered to use remote attestation for this purpose even though most consumer systems already ship with TPMs suggests that people generally agree with me on that)

January 09, 2022 02:35 AM

January 04, 2022

Pete Zaitcev: PyPI is not trustworthy

I was dealing with a codebase S at work that uses a certain Python package N (I'll name it in the end, because its identity is so odious that it will distract from the topic at hand). Anyhow, S failed tests because N didn't work on my Fedora 35. That happened because S installed N with pip(1), which pulls from PyPI, and the archive at PyPI contained broken code.

The code for N in its source repository was fine, only PyPI was bad.
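For anyone who wants to reproduce that sort of comparison for their own dependencies, a rough sketch (the package name and paths below are placeholders) is to pull exactly what PyPI serves and diff it against a checkout of the source repository:

# Download the sdist exactly as PyPI serves it, then compare it with the repo.
pip download --no-binary :all: --no-deps PACKAGE -d /tmp/pypi-sdist
tar xf /tmp/pypi-sdist/PACKAGE-*.tar.gz -C /tmp/pypi-sdist
diff -r /tmp/pypi-sdist/PACKAGE-*/ /path/to/source/checkout/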

When I tried to find out what happened, it turned out that there is no audit trail for the code in PyPI. In addition, it is not possible to contact the listed maintainers of N on PyPI, and there is no way to report the problem: the problem tracking system of PyPI is plastered with warnings not to use it for problems with packages, only for problems with PyPI software itself.

By fuzzy-matching provided personal details with git logs, I was able to contact the maintainers. To my great surprise, two out of three even responded, but they disclaimed any knowledge of what went on.

So, an unknown entity was able to insert code into a package at PyPI, and pip(1) was downloading it for years. This only came to light because the inserted code failed on my Fedora test box.

At this point I can only conclude that PyPI is not trustworthy.

Oh, yeah. The package N is actually nose. I am aware that it was dead and unmaintained, and nobody should be using it anymore, least of all S. I'm working on it.

January 04, 2022 08:34 PM

December 31, 2021

Matthew Garrett: Update on Linux hibernation support when lockdown is enabled

Some time back I wrote up a description of my proposed (and implemented) solution for making hibernation work under Linux even within the bounds of the integrity model. It's been a while, so here's an update.

The first is that localities just aren't an option. It turns out that they're optional in the spec, and TPMs are entirely permitted to say they don't support them. The only time they're likely to work is on platforms that support DRTM implementations like TXT. Most consumer hardware doesn't fall into that category, so we don't get to use that solution. Unfortunate, but, well.

The second is that I'd ignored an attack vector. If the kernel is configured to restrict access to PCR 23, then yes, an attacker is never able to modify PCR 23 to be in the same state it would be if hibernation were occurring and the key certification data will fail to validate. Unfortunately, an attacker could simply boot into an older kernel that didn't implement the PCR 23 restriction, and could fake things up there (yes, this is getting a bit convoluted, but the entire point here is to make this impossible rather than just awkward). Once PCR 23 was in the correct state, they would then be able to write out a new swap image, boot into a new kernel that supported the secure hibernation solution, and have that resume successfully in the (incorrect) belief that the image was written out in a secure environment.

This felt like an awkward problem to fix. We need to be able to distinguish between the kernel having modified the PCRs and userland having modified the PCRs, and we need to be able to do this without modifying any kernels that have already been released[1]. The normal approach to determining whether an event occurred in a specific phase of the boot process is to "cap" the PCR - extend it with a known value that indicates a transition between stages of the boot process. Any events that occur before the cap event must have occurred in the previous stage of boot, and since the final PCR value depends on the order of measurements and not just the contents of those measurements, if a PCR is capped before userland runs, userland can't fake the same PCR value afterwards. If Linux capped a PCR before userland started running, we'd be able to place a measurement there before the cap occurred and then prove that that extension occurred before userland had the opportunity to interfere. We could simply place a statement that the kernel supported the PCR 23 restrictions there, and we'd be fine.

Unfortunately Linux doesn't currently do this, and adding support for doing so doesn't fix the problem - if an attacker boots a kernel that doesn't cap a PCR, they can just cap it themselves from userland. So, we're faced with the same problem: booting an older kernel allows the system to be placed in an identical state to the current kernel, and a fake hibernation image can be written out. Solving this required a PCR that was being modified after kernel code was running, but before userland was started, even with existing kernels.

Thankfully, there is one! PCR 5 is defined as containing measurements related to boot management configuration and data. One of the measurements it contains is the result of the UEFI ExitBootServices() call. ExitBootServices() is called at the transition from the UEFI boot environment to the running OS, and the kernel contains code that executes before it. So, if we measure an assertion regarding whether or not we support restricted access to PCR 23 into PCR 5 before we call ExitBootServices(), this will prevent userspace from spoofing us (because userspace will only be able to extend PCR 5 after the firmware extended PCR 5 in response to ExitBootServices() being called). Obviously this depends on the firmware actually performing the PCR 5 extension when ExitBootServices() is called, but if firmware's out of spec then I don't think there's any real expectation of it being secure enough for any of this to buy you anything anyway.
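For the curious, both the PCR values and the PCR 5 extension made around ExitBootServices() time are visible from userspace with tpm2-tools (assuming a TPM 2.0 system with the usual sysfs event log; paths and tool versions may differ):

# Read the PCRs involved here.
tpm2_pcrread sha256:5,23
# The measurement log shows the individual PCR 5 events, including the
# "Exit Boot Services Invocation" action measured by the firmware.
tpm2_eventlog /sys/kernel/security/tpm0/binary_bios_measurements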

My current tree is here, but there's a couple of things I want to do before submitting it, including ensuring that the key material is wiped from RAM after use (otherwise it could potentially be scraped out and used to generate another image afterwards) and, uh, actually making sure this works (I no longer have the machine I was previously using for testing, and switching my other dev machine over to TPM 2 firmware is proving troublesome, so I need to pull another machine out of the stack and reimage it).

[1] The linear nature of time makes feature development much more frustrating

December 31, 2021 03:36 AM

December 24, 2021

Paul E. Mc Kenney: Parallel Programming: December 2021 Update

It is past time for another release of Is Parallel Programming Hard, And, If So, What Can You Do About It?. But first, what is the difference between an edition and a release?

The main difference is the level of validation. For example, during the several months leading up to the second edition, I read the entire book, fixing issues as I found them. So where an edition is a completed work, a release is primarily for the benefit of people who would like to see a recent snapshot of this book, but without having to install and run LaTeX and its dependencies.

Having cleared that up, here are the big-animal changes since the second edition:


  1. A lot of detailed work to make the new ebook-sized PDF usable, along with other formatting improvements, courtesy of Akira Yokosawa, SeongJae Park, and Balbir Singh.
  2. The nq build now places quick quizzes at the end of each chapter, courtesy of Akira Yokosawa.
  3. There are a few new sections in the “Locking” chapter, the “Putting It All Together” chapter, the “Advanced Synchronization: Memory Ordering” chapter, and the “Important Questions” appendix.
  4. There is now more validation information associated with much of the sample code throughout the book.
  5. The “RCU Usage” section has received a much-needed upgrade.
  6. There is now an index and a list of acronyms, courtesy of Akira Yokosawa.
  7. A new install_latex_package.sh script, courtesy of SeongJae Park.
  8. Greatly improved error handling, including scripts that check for common LaTeX formatting mistakes, courtesy of Akira Yokosawa.


Yang Lu and Zhouyi Zhou are translating the Second Edition to Chinese, and would greatly appreciate additional help.

Everyone mentioned above contributed a great many wordsmithing fixes, as did Chin En Lin, Elad Lahav, Zhouyi Zhou, and GitHub user “rootbeer”. A grateful “thank you” to everyone who contributed!

December 24, 2021 08:22 PM

December 23, 2021

Pete Zaitcev: Adventures in tech support

OVH was pestering me about migrating my VPS from its previous range to the new (and more expensive) one. I finally agreed to that. The VM was migrated to the new host, and it launched with no networking. Not entirely unexpected, but it gets better.

The root cause is the DHCP server at OVH returning a lease with a /32 netmask. In that situation, it's not possible to add a default route in the usual way, because the next hop is outside of the netmask.
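There is a manual workaround for a /32 lease: tell the kernel the gateway is reachable on-link before adding the default route. A sketch, with made-up TEST-NET addresses and an assumed interface name:

# Hypothetical addresses: the lease gives 192.0.2.10/32, the gateway is 192.0.2.1.
ip addr add 192.0.2.10/32 dev eth0
ip route add 192.0.2.1 dev eth0              # make the gateway reachable on-link
ip route add default via 192.0.2.1 dev eth0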

Seems like a simple enough problem, so I filed a ticket in OVH support, basically saying "your DHCP server supplies incorrect netmask, please fix it." Their ultimate answer was, and I quote:

Please note that the VPS IP and our failover IPs are a static IPs, configuring the DHCP server may cause a network issues.

I suppose I knew what I was buying when I saw the price for these OVH VMs. It was too cheap to be true. Still, disappointing.

December 23, 2021 08:17 PM

December 20, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: Removing CONFIG_RCU_FAST_NO_HZ

The CONFIG_RCU_FAST_NO_HZ Kconfig option was added many years ago to improve energy efficiency for systems having significant numbers of short bursts of idle time. Prior to the addition of CONFIG_RCU_FAST_NO_HZ, RCU would insist on keeping a given idle CPU's scheduling-clock tick enabled until all of that CPU's RCU callbacks had been invoked. On certain types of battery-powered embedded systems, these few additional scheduling-clock ticks would consume up to 40% of the battery lifetime. The people working on such systems were not amused, and were not shy about letting me know of their dissatisfaction with RCU's life choices. Please note that “letting me know” did not take the form of flaming me on LKML. Instead, they called me on the telephone and yelled at me.

Given that history, why on earth would I even be thinking about removing CONFIG_RCU_FAST_NO_HZ, let alone queuing a patch series intended for the v5.17 merge window???

The reason is that everyone I know of who builds their kernels with CONFIG_RCU_FAST_NO_HZ=y also boots those systems with each and every CPU designated as a rcu_nocbs CPU. With this combination, CONFIG_RCU_FAST_NO_HZ=y is doing nothing but placing a never-taken branch in the fastpath to and from idle. Such systems should therefore run slightly faster and with slightly better battery lifetime if their kernels were instead built with CONFIG_RCU_FAST_NO_HZ=n, which would get rid of that never-taken branch.

But given that battery-powered embedded folks badly wanted CONFIG_RCU_FAST_NO_HZ=y, and given that they are no longer getting any benefit from it, why on earth haven't they noticed?

They have not noticed because rcu_nocbs CPUs do not invoke their own RCU callbacks. This work is instead delegated to a set of per-CPU rcuoc kthreads, with a smaller set of rcuog kthreads managing those callbacks and requesting grace periods as needed. By default, these rcuoc and rcuog kthreads are not bound, which allows the scheduler (and, for that matter, the systems administrator) to take both performance and energy efficiency into account and to run those kthreads wherever is appropriate at any given time. In contrast, non-rcu_nocbs CPUs will always run their own callbacks, even if that means powering up an inconveniently placed portion of the system at an inconvenient time. This includes CONFIG_RCU_FAST_NO_HZ=y kernels, whose only advantage is that they power up inconveniently placed portions of systems at inconvenient times only 25% as often as would a non-rcu_nocbs CPU in a CONFIG_RCU_FAST_NO_HZ=n kernel.

In short, the rcu_nocbs CPUs' practice of letting the scheduler decide where to run the callbacks is especially helpful on asymmetric systems (AKA big.LITTLE systems), as shown by data collected by Dietmar Eggemann and Robin Randhawa. This point is emphasized by the aforementioned fact that everyone I know of who builds their kernels with CONFIG_RCU_FAST_NO_HZ=y also boots those systems with each and every CPU designated as a rcu_nocbs CPU.

So if no one is getting any benefit from building their kernels with CONFIG_RCU_FAST_NO_HZ=y, why keep that added complexity in the Linux kernel? Why indeed, and hence the patch series intended for the v5.17 merge window.

So if you know of someone who is getting significant benefit from CONFIG_RCU_FAST_NO_HZ=y who could not get that benefit from booting with rcu_nocbs CPUs, this would be a most excellent time to let me know!

December 20, 2021 10:14 PM

December 13, 2021

Vegard Nossum: Using C-Reduce to debug LaTeX errors

My wife is currently writing her HDR thesis (in France, this is an "accreditation to supervise research"). As part of this, she asked me if it would be possible to split her bibliography into two parts: one containing her own publications and another for the rest of her references.

After a tiny bit of searching, I found this stackoverflow answer: https://tex.stackexchange.com/a/407363

However, the answer uses the biblatex package, and my wife was using plain BibTeX. (Dun dun duuun!)

No matter, we can probably switch to biblatex, right? We only had about 6k lines of LaTeX source code and 3k lines worth of BibTeX data, how hard could it be?

To cut a slightly long story short, I ended up using the biber backend; see https://tex.stackexchange.com/a/25702/ for a great overview of the various LaTeX bibliography tools and packages. Biber can use the same bibliography (.bib) file, but has a few differences in how it does its processing and default formatting. Which in turn means that slightly different bits of your bibliography file end up getting output in the final document.

Because of the way that LaTeX bibliographies work, the data from your .bib file doesn't end up getting output directly into your document -- there is a roundabout way where you first have to run LaTeX (pdflatex in my case) to find out which references are used, then you have to run biber to (I think) extract the relevant references from your .bib into an auxiliary .bbl file, and then finally you run pdflatex once more to have it actually include the data from the .bbl into your "References" section.

The bottom line is: This whole process means that if you have a weird .bib entry that perhaps contains some special bit of markup that cannot be used in the final context where the bibliography appears in the output document, you will get errors. Errors that just point to your main LaTeX file where your \printbibliography command is. Which, for a .bib file 3k lines long, is rather inconvenient to debug.

So what do you do when your pdflatex run ends with something like this:

[]\OT1/bch/m/n/12 BSI. |   []. AIS 20 / AIS
! Extra }, or forgotten \endgroup.
\UL@stop ...z@ \else \UL@putbox \fi \else \egroup
\egroup \UL@putbox \fi \if...
l.474

?

Enter C-Reduce...

C-Reduce

According to its website, "C-Reduce is a tool that takes a large C, C++, or OpenCL file that has a property of interest (such as triggering a compiler bug) and automatically produces a much smaller C/C++ file that has the same property".

But how is that relevant here, given that we are dealing with LaTeX? Well, it turns out that C-Reduce can also work with non-C/C++ files, meaning that we now have a way to "reduce" our document (or, well, bibliography file) until it contains ONLY the bits that are actually causing us problems.

The way C-Reduce works is that it takes two inputs: an "interestingness" test (which is really just a shell script) and the file that you would like to reduce. The interestingness test should return either 0 (success) or 1 (failure) depending on whether the document C-Reduce gave it has the property you are searching for.

In our case, the property we want is that LaTeX prints the error we originally encountered. We can find all those errors simply by grepping the pdflatex log file. Note that the first pdflatex run, as well as the biber run, will both succeed without errors, as the error only appears when the bibliography is actually printed in the final document:

$ pdflatex main
[...]
$ biber main
[...]
$ pdflatex -interaction=nonstopmode main
[...]
$ grep -A1 '^!' main.log
! Extra }, or forgotten \endgroup.
\UL@stop ...z@ \else \UL@putbox \fi \else \egroup
--
! Extra }, or forgotten \endgroup.
\UL@stop ... \UL@putbox \fi \else \egroup \egroup
--
! Missing } inserted.
<inserted text>
--
! Missing } inserted.
<inserted text>
--
! Undefined control sequence.
\namepartfamily ->\addtext
--
! Undefined control sequence.
<argument> \addtext

Since we want the errors to remain the same, we can make our interestingness test check that this output remains stable. A quick way to do that is to just hash the output of the command above and ensure the hash doesn't change:

$ grep -A1 '^!' main.log | sha1sum
8ab121373e6b0232f8789f093db4bf20f3bb32c9 -

In the interestingness test shell script we'd then put:

[ "$(grep -A1 '^!' main.log | sha1sum | cut -d ' ' -f 1)" == "8ab121373e6b0232f8789f093db4bf20f3bb32c9" ] || exit 1

This will succeed when the grep output is what we expect -- and return 1 when it changes.

It can be worth playing with different combinations of grep options until the output captures exactly the errors you care about.

If there are contextual clues that should remain the same (for example the []\OT1/bch/m/n/12 BSI. | []. AIS 20 / AIS line in the original error I got), then you can adjust the grep command accordingly.

Multi-file projects

C-Reduce only knows how to reduce a single file at a time, which poses a small problem for our multi-file project. However, it's merely a small problem, and it's easy to solve. C-Reduce will start your interestingness test shell script in a new (temporary) directory every time, so all we need to do is to copy in the extra files at the start of the script. In my case I only needed the main .tex file (as the file I was minimizing was the .bib file, and C-Reduce will take care to get that one for you on its own):

# get whatever extra files you need to build
cp /home/vegard/hdr/main.tex .

That said, it can be worthwhile to hand-optimize your document a little bit at the start to reduce the compilation times of files that you know are irrelevant to the error and which won't be reduced by C-Reduce. In my particular case, chapters were split out into separate files and it was easy enough to comment out the lines that said \input{chapter1}, etc. -- meaning that we don't actually need C-Reduce to compile the full document every run; I already knew it was a problem with the line that said \printbibliography right at the end of the document. However, removing the citations meant that the printed bibliography would be empty as well, so I also had to add \nocite{*}, which includes all bibliography entries whether they are cited or not.

Running C-Reduce

Putting it all together:

$ cat test.sh
#! /bin/bash

# error out by default
set -e

# get whatever extra files you need to build
cp /home/vegard/hdr/main.tex .

# try to compile the document
pdflatex main
biber main
# this run is expected to report errors, so keep "set -e" from aborting here
pdflatex -interaction=nonstopmode main || true

# check that the original errors are still present
[ "$(grep -A1 '^!' main.log | sha1sum | cut -d ' ' -f 1)" == "8ab121373e6b0232f8789f093db4bf20f3bb32c9" ] || exit 1

We can then run C-Reduce with:

creduce --not-c test.sh bibliography.bib

After about 20 minutes, the 3,400-line bibliography.bib had been reduced down to about 47 lines where it was quite easy to spot the problems by hand: \addtext around an author name, a stray ~ in a journal name, and a stray # in a month name.

Conclusion

C-Reduce was not made for LaTeX or BibTeX, but was surprisingly efficient at locating hard-to-find sources of compilation errors. It's true that writing interestingness tests can be unintuitive (AKA "Why is my testcase empty?"). Fortunately, I've used C-Reduce quite a bit in the past for C and C++ so it was straightforward to see how to apply it to this particular problem.

One interesting thing to note is that we didn't ask the tool to fix our problem, quite the opposite: We asked it to remove as much as possible that didn't have anything to do with the errors we were seeing, effectively isolating the problem to just the problematic few lines of code.

In general I think isolation is a very powerful debugging technique. It brings clarity to a problem where you can only see the symptoms. That's why stackoverflow generally asks for "MWEs" (Minimal Working Examples) -- remove confounding variables and everything that is immaterial to the problem at hand; get to the essence of the thing.

On Twitter, some people pointed out a couple of other tools that are like C-Reduce in that they can also minimize files/testcases.

I didn't try either of these tools for this specific problem, but I have used halfempty in the past and it's a good tool that's worth getting familiar with. A few years ago I did a simple benchmark of C-Reduce vs. halfempty on C++ source and -- without putting too much into this simplistic comparison -- I think the main takeaway was that halfempty seems to have the potential to be faster when run on fewer cores.

December 13, 2021 12:12 PM

December 03, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: Creating Branches For the -rcu Tree

Several people have expressed interest in how I go about creating the topic branches in the -rcu tree. So when I created branches earlier this week, I actually kept track.

But why bother with topic branches? In my case, the reason is to allow anyone reviewing RCU patches to focus on a particular topic and to be able to keep that topic “in cache” while reviewing all of that topic's patches. Even more important, it makes it easy for reviewers to completely ignore patches that are not of interest to them.

Preparation

If you intend to follow along by actually executing the commands (highly recommended), first run “git checkout dev.2021.11.30b” in your clone of the -rcu tree.

Before creating branches, you need some essential tools. First, you absolutely must be able to see what you are doing. For this, there is the gitk tool, or, for those forswearing GUIs, the tig tool. If you cannot see what you are doing, your picture of your repository will diverge from that of git, and git always wins. If you can tolerate GUIs at all, install gitk. It will change your life.

Once you have gitk or tig installed, use it to review the commits that you are looking to organize into branches. This allows checking for bugs and also helps identify potential topics. Pro tip: Expand the tool's window to the full height of the screen. Another pro tip: Restrict gitk to the commits of interest, in my case by running “gitk v5.16-rc1..&”.

From here on out, I will just say “gitk”. If you are smart enough to use a non-GUI tool like tig, you are smart enough to translate. ;–)

Sorting Commits Into Topic Branches

The next step is to run this script, placing the output in a file:

#!/bin/bash

git log --pretty=format:"%h %s" $1 | \
        awk '{ print NR, "pick " $0 }' | \
        sort -k1nr

I have this script in my bin directory under the name git-rebase-prep.sh, so I simply type:

git-rebase-prep.sh v5.16-rc1..dev.2021.11.30b > /tmp/rebase.txt

The first five lines of this file are as follows:

126 pick 12637ec9a1505 tools/memory-model:  Document locking corner cases
125 pick 5665cde49ec08 tools/memory-model: Make judgelitmus.sh note timeouts
124 pick 2e8007e79af5b tools/memory-model: Make cmplitmushist.sh note timeouts
123 pick 7a31810649915 tools/memory-model: Make judgelitmus.sh identify bad macros
122 pick 4e11469f67f2e tools/memory-model: Make judgelitmus.sh detect hard deadlocks

The first column is the commit number, with lower numbers meaning newer commits (that is, numbered in “git log” order), but the commits are ordered with the older commits first (that is, in the order that “git rebase” expects them). The second column is the commit's SHA-1 ID, and the remaining columns are the commit-log subject line.

Edit this file in your choice of editor, but in a window that is the full height of your monitor. Adjust your editor so that you have access to two different regions of this file, for example, in vim, type control-W followed by the letter “s”. The upper pane will eventually have topic-branch names each followed by the commits in that topic branch.

This time was a bit of a special case because this past merge window's change to the CONFIG_PREEMPT_DYNAMIC Kconfig option broke all non-preemptible CONFIG_SMP=n torture-test scenarios. (The default choice of CONFIG_PREEMPT_DYNAMIC=y forces CONFIG_PREEMPTION=y, which breaks tests of Tiny RCU and Tiny SRCU.) Therefore, the first “topic branch” is a single commit that adds CONFIG_PREEMPT_DYNAMIC=n to all affected torture-test scenarios. All RCU-related topic branches are then placed on top of this commit, thus allowing each topic branch to be torture-tested separately.

This means that the first few lines of the “/tmp/rebase.txt” file are as follows:

Underneath:
26 pick 37cf3c8c1ebda rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios

@@@

And this line has been removed from the remainder of this file. The “@@@” is a marker separating the topic-branched commits from those still waiting to be classified.

From my scan of the commits, I tentatively created the following topic branches, so that the first few lines of the “/tmp/rebase.txt” file are now as follows:

Underneath:
26 pick 37cf3c8c1ebda rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios

clocksource.2021.11.30c
doc.2021.11.30c
exp.2021.11.30c
fastnohz.2021.11.30c
fixes.2021.11.30c
nocb.2021.11.30c
nolibc.2021.11.30c
rcutorture.2021.11.30c
srcu.2021.11.30c
tasks.2021.11.30c
torture.2021.11.30c
torturescript.2021.11.30c
Merge above.
kcsan.2021.11.30c (merge)
lkmm.2021.11.30c (merge, then merge lkmm-dev)
clocksource.2021.11.30c (merge)
@@@

In other words, we will cherry-pick the first commit, rebase 11 branches on top of it (clocksource.2021.11.30c through torturescript.2021.11.30c, that is, the commits subject to some form of torture testing), and then merge the last ten of these branches together. The kcsan commits will be rebased on top of v5.16-rc1, as will the lkmm commits (but with the lkmm-dev commits rebased on top of the lkmm commits). Each of the kcsan, lkmm, and lkmm-dev branches will be merged separately on top of the ten-way merge point, and finally the clocksource.2021.11.30c branch will be merged on top of all of that. There are always a few commits not ready for the next merge window or that are to be upstreamed elsewhere, and these will be rebased on top of the clocksource.2021.11.30c merge point.

Please feel free to run “gitk v5.16-rc1..dev.2021.11.30c” in order to get a firm picture of the desired state.

The next step is to go through the remaining commits and assign them to branches. This process brought home the fact that there are no kcsan commits this time around (but there were plenty last time and likely will be plenty more next time), so I removed the “kcsan.2021.11.30c (merge)” line from /tmp/rebase.txt. In addition, there were not all that many commits combined for the rcutorture.2021.11.30c and torture.2021.11.30c topic branches, so I merged them all into torture.2021.11.30c. This is why the numbers in the first column of each commit line in /tmp/rebase.txt are important: In some cases, it is necessary to rearrange the topic branches, and it is sometimes important to get the commits in the correct order.

At this point, the /tmp/rebase.txt file looks (almost) like this.

Time to start actually creating the topic branches!

Create Topic Branches

We need two gitk instances, one for the old state (“gitk v5.16-rc1..dev.2021.11.30b”) and another for the new state which we will launch shortly:

git checkout v5.16-rc1
git cherry-pick 37cf3c8c1ebda
gitk v5.16-rc1..&

The first of the above commands put us at the desired v5.16-rc1 base commit for the topic branches, which just so happens to be where dev.2021.11.30b is based. Important safety tip: Don't try creating branches while also moving the pile of commits to some other base commit at the same time. There are just too many things that can go wrong when you try that.

The second of the above commands rebases (or “cherry-picks”) the aforementioned CONFIG_PREEMPT_DYNAMIC-compensation commit. The final command launches the other gitk instance that will show us the evolving state of the topic branches.

Note that my rebased CONFIG_PREEMPT_DYNAMIC-compensation commit happened to have the SHA-1 commit ID 8c0abfd6d2f6b0221194241ac2908751a2a0385f. If you are following along, you will need to change the git rebase argument for --onto to the corresponding SHA-1 commit ID in your tree. If you instead use my ID, everything will work, except that you will be rebasing on my rebased CONFIG_PREEMPT_DYNAMIC-compensation commit instead of your own. Rebasing some on mine and some on yours will probably actually work, but I leave the decision as to whether to carry out that experiment to your better judgment.

I suggest placing the old-state gitk at the left of your screen, the rebase.txt and your command window in the middle, and the new-state gitk at the right of your screen. Being able to see everything all the time is key to successful creation of topic branches.

The next set of commands rebases the clocksource-related commits on top of the rebased CONFIG_PREEMPT_DYNAMIC-compensation commit:

git branch clocksource.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a clocksource.2021.11.30c

The -i argument to the “git rebase” command specifies interactive mode, which will place you in an editor with a very long series of commits. In vim, type “c}” to delete all of those commits and enter vim insertion mode. Then copy and paste the two lines following clocksource.2021.11.30c from the rebase.txt file. Hit the “escape” key to exit vim insertion mode and then strip the commit numbers from all of the lines, for example, by using the “:1,.s/^[0-9]* //” vim command. Write out the file and exit the editor (in vim, your choice of ZZ, :x, :wq, and probably others as well), at which point git will commence rebasing.

Don't forget to hit the <F5> key in the new gitk window after the completion of each rebase command.

The next several branches are constructed in a similar manner:

git branch doc.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a doc.2021.11.30c
git branch exp.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a exp.2021.11.30c
git branch fastnohz.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a fastnohz.2021.11.30c
git branch fixes.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a fixes.2021.11.30c
git branch nocb.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a nocb.2021.11.30c
git branch nolibc.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a nolibc.2021.11.30c

For each “git rebase” command, copy and paste the corresponding section of the rebase.txt file. Don't forget to hit <F5> in the new-state gitk window after running each “git rebase” command.

Remember that “almost” I mentioned above about the rebase.txt file? We now get to one of the deviations. At this point, I belatedly realized that there was only one SRCU commit (441a467cd9793 “srcu: Prevent redundant __srcu_read_unlock() wakeup”) and that it would be better to add that single commit to the fixes.2021.11.30c topic branch:

git checkout fixes.2021.11.30c
git cherry-pick 441a467cd9793

Note that I ignored commit ordering in this case. It is safe to do so because this is the only commit in the whole pile that touches kernel/rcu/srcutiny.c, it touches no other file, and there are no other types of dependencies against the other commits. Sometimes you get lucky!

Next, the rest of the RCU-related branches:

git branch tasks.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a tasks.2021.11.30c
git branch torture.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a torture.2021.11.30c
git branch torturescript.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a torturescript.2021.11.30c

Once again, copy and paste the commits from the corresponding section of the rebase.txt file and once again do not forget to hit <F5> after each “git rebase” command.

It is now time to do a merge! Unfortunately, I messed up the copy-and-paste. Twice. And then forgot that “git reset --hard” takes the current branch with it. Again, more than once. Then I accidentally included clocksource.2021.11.30c in the merge.

Perhaps building branches late in the day was a mistake, but here are the mistaken commands:

git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
        fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
        torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git checkout -B torturescript.2021.11.30c 90b21bcfb2846625c4f2e4a0bf5969543cef8ba7
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
        fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
        torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git checkout -B torturescript.2021.11.30c 90b21bcfb2846625c4f2e4a0bf5969543cef8ba7
git checkout 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
        fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
        torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f

If nothing else, this sorry story shows that you can recover from mistakes. And making sure to hit <F5> in the new-state gitk means that all the commits are right there, enabling you to piece things back together.

But I suggest that you instead execute the following single command:

git checkout 8c0abfd6d2f6b0221194241ac2908751a2a0385f

Then hit <F5> in the new-state gitk window and verify that each topic branch has its branch name in place, that none of those branch names are in bold face, and that none of the little per-commit circles are yellow, except for the v5.16-rc1 commit. If all of that is in place, it is time to do the merge:

git merge doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c fixes.2021.11.30c \
        nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c torture.2021.11.30c \
        torturescript.2021.11.30c

This will put you into your editor, where you can fill out a commit log for the merge. This is what I did, documenting the topics:

Merge branches 'doc.2021.11.30c', 'exp.2021.11.30c', 'fastnohz.2021.11.30c', 'fixes.2021.11.30c', 'nocb.2021.11.30c', 'nolibc.2021.11.30c', 'tasks.2021.11.30c', 'torture.2021.11.30c' and 'torturescript.2021.11.30c' into HEAD
    
doc.2021.11.30c: Documentation updates.
exp.2021.11.30c: Expedited-grace-period fixes.
fastnohz.2021.11.30c: Remove CONFIG_RCU_FAST_NO_HZ.
fixes.2021.11.30c: Miscellaneous fixes.
nocb.2021.11.30c: No-CB CPU updates.
nolibc.2021.11.30c: Tiny in-kernel library updates.
tasks.2021.11.30c: RCU-tasks updates, including update-side scalability.
torture.2021.11.30c: Torture-test in-kernel module updates.
torturescript.2021.11.30c: Torture-test scripting updates.

The text starting with “Merge branches” that extends up to but not including the blank line is a single long line, just so you know.

Yet again, be sure to hit <F5> in the new-state gitk to see the results of the merge. Copy the SHA-1 commit ID of this merge point somewhere, keeping in mind that clicking on a commit in gitk places that commit's SHA-1 ID in your clipboard. Alternatively, you can use bash variables: “mergepoint=`git rev-parse HEAD`”.

The next task is to rebase the various Linux-kernel memory model (LKMM) commits, as always substituting the commits from the corresponding portions of the rebase.txt file:

git branch lkmm.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto v5.16-rc1 d069e38e66dbfeca541d7a2f9790ef8e8d3c911a lkmm.2021.11.30c
git branch lkmm-dev.2021.11.30c d069e38e66dbfeca541d7a2f9790ef8e8d3c911a
git rebase --onto c438b7d860b4c1acb4ebff6d8d946d593ca5fe1e v5.16-rc1 lkmm-dev.2021.11.30c

Don't forget <F5>!

The next step is to check out the previous merge point (you will need to substitute the SHA-1 ID recorded above) and do the remaining merges:

git checkout 32e5555b62e6a5f7304b13470cd1881a49a38d18
git merge lkmm.2021.11.30c
git merge lkmm-dev.2021.11.30c
git merge clocksource.2021.11.30c

Finally, rebase the five remaining commits from the “On top:” section of the rebase.txt file:

git branch dev.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto ce410b77746063e1c03f6e5bf85d68cca3e1c7ea \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a dev.2021.11.30c

Alternatively, you might instead choose to check out the new branch and then cherry-pick the five commits as follows:

git checkout -b dev.2021.11.30c
git cherry-pick 2a1d7ed8553da ca507a89ffbb9 0ebafffe561ac e824a258c37d4 74f780ac279b0

This checkout-and-cherry-pick operation is completely equivalent to the prior branch-and-rebase operation.

Either way, as always, don't forget to hit <F5> in your new-state gitk.

Check Your Work!

And finally, the most important thing is to check your work:

git diff dev.2021.11.30b

If these diffs are non-empty, there was some mistake somewhere in the process.

Possible Complications

This topic-branch exercise went quite smoothly and took about an hour to complete. Sometimes life is harder:

  1. There might be rebase and/or merge conflicts. I normally take this as a hint that my assignment of commits to topic branches was suboptimal. The commit numbers assigned by the git-rebase-prep.sh script are quite helpful when rejuggling commits. Yes, sometimes the resulting conflicts must be resolved manually, but that is a blog post for another day.
  2. One of the topic branches might depend on another. In this case, I rebase the dependent topic branch on top of the topic branch that it depends on. Looking at it another way, sometimes the topic branches are arranged in series instead of in parallel.
  3. A commit might depend on a number of topics, and the rest of the topics might depend on that commit. This is rare, but it does happen. The trick in this case is to create and merge the topic branches that the commit depends on, cherry-pick that commit on top of the resulting merge point, and then create and merge the remaining topic branches on top of that commit.

If you made it all the way down here, thank you for your time and attention, and I hope that this material was helpful. If you remember nothing else from this blog post, let it be gitk (or tig, as the case may be). Being able to see what you are doing allows you to learn to git much faster and allows you to safely take on much more complicated git-related tasks.

December 03, 2021 01:46 AM

November 24, 2021

Dave Airlie (blogspot): video decode: crossing the streams

I was interested in how much work a vaapi on top of vulkan video proof of concept would be.

My main reason for being interested is actually video encoding: there is no good vulkan video encoding demo yet, and I'm not experienced enough in the area to write one, but I can hack stuff. I think it is probably easier to hack a vaapi encode to vulkan video encode than write a demo app myself.

With that in mind I decided to see what decode would look like first. I talked to Mike B (most famous zink author) before he left for holidays, then I ignored everything he told me and wrote a super hack.

This morning I convinced zink vaapi on top of anv with iris GL doing the presents in mpv to show me some useful frames of video. However zink vaapi on anv with zink GL is failing miserably (well, green jellyfish).

I'm not sure how much more I'll push on the decode side at this stage, I really wanted it to validate the driver side code, and I've found a few bugs in there already.

The WIP hacks are at [1]. I might push on to the encode side and see if I can work out what it entails, though the encode spec work is a lot more changeable at the moment.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/zink-video-wip

November 24, 2021 03:39 AM

November 15, 2021

Dave Airlie (blogspot): h264: more AMD hw worked on

Previously I mentioned having AMD VCN h264 support. Today I added initial support for the older UVD engine[1]. This is found on chips from Vega back to SI.

I've only tested it on my Vega so far.

I also worked out the "correct" answer to how to send the reset command correctly; however, the nvidia player I'm using as a demo doesn't do things that way yet, so I've forked it for now[2].

The answer is to use vkCmdControlVideoCodingKHR to send a reset the first time a session is used. However, I can't see how the app is meant to know this is necessary, so I've asked the appropriate people.

The initial anv branch I mentioned last week is now here[3].


[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-uvd-h264

[2] https://github.com/airlied/vk_video_samples/tree/radv-fixes

[3] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode

November 15, 2021 05:17 AM

November 12, 2021

Dave Airlie (blogspot): h264 video decoding: i-frames strike back

Last week I mentioned I had the basics of h264 decode using the proposed vulkan video on radv. This week I attempted to do the same thing with Intel's Mesa vulkan driver "anv".

Now I'd previously unsuccessfully tried to get vaapi on crocus working but got sidetracked back into other projects. The Intel h264 decoder hasn't changed a lot across the ivb/hsw/gen8/gen9 era. I ported what I had from crocus to anv and started trying to get something to decode on my WhiskeyLake.

I wrote the code pretty early on and figured out all the things I had to send to the hardware.

The first anv-side bridge to cross was that Vulkan exposes an H264 picture-level decode API, which means you get handed the encoded slice data. However, to program the Intel hw you need to decode the slice header, so I wrote a slice header decoder in some common code. The other thing you need to give the Intel hw is the number of bits in the slice header, which in some encoding schemes is rounded to bytes and in some isn't. Slice headers also have a 3-byte header on them, which the Intel hardware wants you to discard or skip before handing the data over.

Once I'd fixed up that sort of thing in anv + crocus, I started getting grey I-frames decoded, with later B/P frames using the grey frames as references, so you'd see this kinda weird motion.

That was, I think, 3 days ago. I have stared at this intently for those 3 days, blaming everything from bitstream encoding to rechecking all my packets (not enough times though). I had someone else verify they could see grey frames.

Today, after a long discussion about possibilities, I was randomly comparing a frame from the intel-vaapi-driver and one from crocus, and I spotted a packet header that the docs say is 34 dwords long, but that intel-vaapi-driver was encoding as only 16 dwords. I switched crocus to explicitly state a 16-dword length and started seeing my I-frames.

Now the B/P frames still have issues. I don't think I'm getting the ref frames logic right yet, but it felt like a decent win after 3 days of staring at it.

The crocus code is [1]. The anv code isn't cleaned up enough to post a pointer to yet, enterprising people might find it. Next week I'll clean it all up, and then start to ponder upstream paths and shared code for radv + anv. Then h265 maybe.

[1]https://gitlab.freedesktop.org/airlied/mesa/-/tree/crocus-media-wip

November 12, 2021 03:04 AM

November 05, 2021

Dave Airlie (blogspot): What do you know about video decoding/encoding?

A few weeks ago I watched Victor's excellent talk on Vulkan Video. This made me question my skills in this area. I'm pretty vague on video processing hardware, and I really have no understanding of H264 or any of the standards. I've been loosely following the Vulkan video group inside of Khronos, but I can't say I've understood it or been useful.

radeonsi has a gallium vaapi driver that talks to the firmware-driven encoder on the hardware; surely copying what it is programming can't be that hard. I got an mpv/vaapi setup running and tested some videos on that setup just to get comfortable. I looked at what sort of data was being pushed about.

The thing is, the firmware is doing all the work here; the driver is mostly just responsible for taking semi-parsed h264 bitstream data structures and handing them to the fw API in memory buffers. Then the resulting decoded image should be magically in a buffer.

I then got the demo nvidia video decoder application mentioned in Victor's talk.

I ported the code to radv in a couple of days, but then began a long journey into the unknown. The firmware is quite particular about exactly what it wants and when it wants it. After fixing some interactions with the video player, I started to dig.

Now vaapi and DXVA (Windows) are context-based APIs. This means they are like OpenGL, where you create a context, do a bunch of work, and tear it down; the driver does all the hw queuing of commands internally. All the state is held in the context. Vulkan is a command-buffer-based API. The application records command buffers and then enqueues those command buffers to the hardware itself.

So the vaapi driver works like this for a video

create hw ctx, flush, decode, flush, decode, flush, decode, flush, decode, flush, destroy hw ctx, flush

However Vulkan wants things to be more like

Create Session, record command buffer with (begin, decode, end) send to hw, (begin, decode, end), send to hw, End Session

There is no way at the Create/End session time to submit things to the hardware.

After a week or two of hair removal and insightful irc chats I stumbled over a decent enough workaround to avoid the hw dying and managed to decode a H264 video of some jellyfish.

The work is based on a bunch of other stuff, and is in no way suitable for upstreaming yet, not to mention the Vulkan specification is only beta/provisional so can't be used anywhere outside of development.

The preliminary code is in my gitlab repo here[1]. It has a start on h265 decode, but it's not working at all yet, and I think the h264 code is a bit hangy randomly.

I'm not sure where this is going yet, but it was definitely an interesting experiment.

[1]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-prelim-decode

November 05, 2021 07:31 AM

November 03, 2021

Paul E. Mc Kenney: What Memory Model Should the Rust Language Use?

UNDER CONSTRUCTION



This blog post discusses a few alternative Rust-language memory models. I hope that this discussion is of value to the Rust community, but in the end, it is their language, so it is also their choice of memory model.

This discussion takes the Rust fearless-concurrency viewpoint, tempered by discussions I have observed and participated in while creating this blog series. Different members of that community of course have different viewpoints, and thus might reasonably advocate for different choices. Those who know me will understand that these viewpoints differ significantly from my own. However, my viewpoint is dictated by my long-standing privilege of living at the edge of what is possible in terms of performance, scalability, real-time response, energy efficiency, robustness, and much else besides. Where I live, significant levels of fear are not just wise, but also necessary for survival. To borrow an old saying from aviation, there are old pilots, and there are bold pilots, but there are no old bold pilots.

Nevertheless, I expect that my more than three decades of experience with concurrency, my work on the C/C++ memory model (memory_order_consume notwithstanding), and my role as lead maintainer of the Linux-kernel memory model (LKMM) will help provide a good starting point for the more work-a-day situation that I believe that the Rust community wishes to support.

But Doesn't Rust Already Have a Memory Model?

Kinda sorta?

Some in academia have assumed that Rust's memory model will be based on that of C/C++, for example, see here. And the Rustonomicon agrees, though the author of that text does not appear to be very happy about it:

Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not due to this model being particularly excellent or easy to understand. Indeed, this model is quite complex and known to have several flaws. Rather, it is a pragmatic concession to the fact that everyone is pretty bad at modeling atomics. At very least, we can benefit from existing tooling and research around the C/C++ memory model. (You'll often see this model referred to as "C/C++11" or just "C11". C just copies the C++ memory model; and C++11 was the first version of the model but it has received some bugfixes since then.)

Trying to fully explain the model in this book is fairly hopeless. It's defined in terms of madness-inducing causality graphs that require a full book to properly understand in a practical way. If you want all the nitty-gritty details, you should check out the C++ specification. Still, we'll try to cover the basics and some of the problems Rust developers face.

The C++ memory model is fundamentally about trying to bridge the gap between the semantics we want, the optimizations compilers want, and the inconsistent chaos our hardware wants. We would like to just write programs and have them do exactly what we said but, you know, fast. Wouldn't that be great?

I would argue that compiler optimizations are the source of much more inconsistent chaos than hardware would ever dream of providing, but that argument has been going on for many years and I doubt that we will be able to settle it here.

Leaving that argument aside, entertaining though it might be, the wording of this passage of the Rustonomicon certainly invites alternatives to the C/C++ memory model. Let's see what we can do.

Where to Start?

Let's first eliminate a number of candidate memory models. Rust's presumed portability goals rule out any of the hardware memory models. Rust's growing connection to the deep embedded world rules out the memory models of dynamic languages such as Java and Javascript. The growing number of memory models derived from the C/C++ memory model are represented by the actual C/C++ memory model, so we will not consider those variants separately.

This leaves the aforementioned C/C++ memory model and of course LKMM, the latter inspired by the Rust community's ambitions within the Linux kernel. Because this blog series focuses on the Linux kernel, this post will start with LKMM, move to the C/C++ memory model, and end with a specific recommendation.

Linux-Kernel Memory Model (LKMM)

This section will consider aspects of LKMM, starting with the most fear-inducing aspects and moving to the more work-a-day aspects.

Control dependencies are not the most fearless portion of LKMM, in fact, in my role as lead maintainer of LKMM, I need people using control dependencies not to merely feel fearful, but instead to feel downright terrified. Control dependencies are fragile and easy for control-dependency-oblivious compiler optimizations to destroy. Therefore, for the time being, control dependencies should be excluded from any Rust memory model. Perhaps the day will come when we have a good way to communicate the intent of control dependencies to compilers, but today is not that day.

Address and data dependencies carried by pointers are not free of risk, but they do provide dependency-oblivious compilers significantly less scope for destructive optimizations. They should thus require much less fear than do control dependencies, little though that might be saying. Nevertheless, address and data dependencies are mainly used in conjunction with RCU, and we do not yet have a known good way of expressing all RCU use cases within the confines of Rust's ownership model. Therefore, address and data dependencies should also be excluded from Rust's memory model until such time as significant RCU use cases are handled or some other clear and present use case militates for address and data dependencies. It would be increasingly reasonable to permit control and data dependencies within Rust unsafe mode should use of RCU within Rust increase, keeping in mind that things like epoch-based reclamation (EBR) are particular classes of implementations of RCU.
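For illustration, the following is a minimal sketch of the Linux-kernel idiom in question, in which an address dependency is carried from the rcu_dereference() load of an RCU-protected pointer to the subsequent field access. The structure and variable names are hypothetical, chosen only for this example:

struct foo {
        int a;
};
static struct foo __rcu *gp;    /* RCU-protected global pointer. */

int read_foo(void)
{
        struct foo *p;
        int ret = -1;

        rcu_read_lock();
        p = rcu_dereference(gp);        /* Load the pointer... */
        if (p)
                ret = p->a;             /* ...then the dependent field. */
        rcu_read_unlock();
        return ret;
}

The hardware orders the load of p->a after the load of gp by virtue of that address dependency, which is exactly the ordering that dependency-oblivious tooling and compilers cannot be trusted to preserve.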

At first glance, it seems entirely reasonable to countenance use of READ_ONCE() and WRITE_ONCE() within prospective Linux-kernel Rust code, but this post is discussing Rust in general, not just Rust in the Linux kernel. And the fact is that address, data, and control dependencies are the only things that prevent the Linux kernel's use of READ_ONCE() and WRITE_ONCE() from resulting in out-of-thin-air (OOTA) behavior. Except that these operations (along with the Linux kernel's unordered atomic read-modify-write (RMW) operations) are implemented so as to prevent the compiler from undertaking code-motion optimizations that might otherwise reorder such operations. Furthermore, all of the underlying hardware memory models of which I am aware preserve dependency orderings. Therefore, one might expect that these unordered operations might reasonably be part of a Rust memory model.

Unfortunately, one important benefit of a memory model is the tooling that analyzes concurrent code fragments, and if this tooling is to exclude OOTA behaviors, it is absolutely necessary for that tooling to understand dependencies. Except that we have already excluded such dependencies from the Rust memory model.

Therefore, the Rust memory model should restrict its support for Linux-kernel atomic operations to those that provide ordering. These would be the value-returning non-relaxed read-modify-write (RMW) atomic operations along with the _acquire() and _release() variants of non-value-returning RMW atomic operations. It might also make sense to allow combinations of unordered RMW operations with combination memory barriers, for example, atomic_inc() followed by smp_mb__after_atomic(), but it would make even more sense to wrapper these in combination as a single Rust-accessible primitive. This combined Rust primitive would no longer be unordered, and could thus be included as an ordered unit in the Rust memory model. Alternatively, unordered atomics might be relegated to Rust's unsafe mode.
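As a concrete illustration of that wrappering, such a combined primitive might look like the following sketch. The name atomic_inc_mb() is hypothetical, not an existing Linux-kernel API:

/*
 * Hypothetical wrapper combining an unordered RMW operation with a
 * combination memory barrier, so that the pair can be exposed to Rust
 * as a single fully ordered primitive.
 */
static inline void atomic_inc_mb(atomic_t *v)
{
        atomic_inc(v);                  /* Unordered increment... */
        smp_mb__after_atomic();         /* ...made fully ordered. */
}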

It is hard to imagine a useful Rust memory model that excludes locking.

Thus, starting from LKMM, we arrive at a model that supports ordered atomic operations and locking, possibly including unordered atomics in unsafe mode.

C/C++ Memory Model

This section will instead start from the C/C++ memory model.

Because memory_order_relaxed accesses can yield OOTA results, they seem inconsistent with Rust's fearless-concurrency goals. These cannot actually occur in practice with any implementation that I am aware of, but fearless concurrency requires accurate tools, and this accuracy requires excluding even the theoretical possibility of OOTA results. Another reasonable approach would permit use of memory_order_relaxed only within Rust unsafe code.

memory_order_consume is primarily useful in conjunction with RCU. In addition, in all implementations that I am aware of, memory_order_consume is simply promoted to memory_order_acquire. It therefore seems unnecessary to include memory_order_consume within a Rust memory model. As before, a reasonable alternative would permit its use only within Rust unsafe code.

In contrast, memory_order_acquire, memory_order_release, and memory_order_acq_rel, are all easily analyzed and have been heavily used in practice for decades. Although the celebrated memory_order_seq_cst (sequentially consistent) memory ordering can be difficult to analyze, its strong ordering is absolutely required by some use cases, including a sadly large fraction of concurrent algorithms published by academics and industrial researchers. In addition, there is a decades-long tradition of proofs and tooling handling sequential consistency, difficulties aside. Use of all four of these orderings should therefore be permitted from safe Rust code.
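For reference, here is the classic message-passing pattern that memory_order_release and memory_order_acquire are designed for, expressed with C11 atomics; the variable names are illustrative only:

#include <stdatomic.h>
#include <stdbool.h>

static int payload;
static atomic_bool ready;

/* Producer: fill in the payload, then publish it with a release store. */
void produce(int v)
{
        payload = v;
        atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: this acquire load pairs with the release store above, so
 * once ready is observed to be true, payload is guaranteed visible. */
int consume(void)
{
        while (!atomic_load_explicit(&ready, memory_order_acquire))
                ;       /* spin until the flag is set */
        return payload;
}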

With the C/C++ memory model as it was with LKMM, it is hard to imagine a useful Rust memory model that excludes locking.

Thus, starting from the C/C++ memory model, we also arrive at a model that supports ordered atomic operations and locking, possibly along with unordered atomic operations in unsafe mode.

Recommendation

The fact that starting with LKMM got us to roughly the same place as did starting with the C/C++ memory model should give us considerable confidence in that destination. Why the “roughly”? Because there are subtle differences, as can be seen by comparing the C and C++ standards to LKMM.

One reaction to this situation would be to go through these memory models one corner case at a time, and for each corner case making a choice for Rust, now and forever. Perhaps a better approach is to pick one and adapt as needed based on real situations as they arise. At this time, there are far more projects living within the C/C++ memory model than within LKMM. Therefore, despite my role as lead maintainer of LKMM, it is my painful duty to recommend the following, based on the C/C++ memory model:

  1. Locking may be used from both safe and unsafe modes.
  2. Atomic operations using memory_order_acquire, memory_order_release, memory_order_acq_rel, and memory_order_seq_cst may be used from both safe and unsafe modes.
  3. Atomic operations using memory_order_relaxed and memory_order_consume may be used only from unsafe mode.
  4. All atomic operations in Rust code should be marked, that is, Rust should avoid following the C/C++ practice of interpreting an unmarked mention of an atomic variable as a memory_order_seq_cst access to that variable. Requiring marking allows accesses to concurrently accessed shared variables to be identified at a glance, and also makes the choice of any default memory_order value much less pressing.

This provides explicit support for the operations and orderings that have a long history of heavy use and for which analysis tools have long been available. It also provides provisional support for operations and orderings that, while also having a long history of heavy use, are beyond the current state of the art for accurate and complete analysis.

Other Options

One could argue that memory_order_relaxed also has many simple use cases, for but one example, constructing distributed counters. However, given the difficulty verifying its full range of use cases, unsafe mode seems the safest bet at the moment. Should any of the current efforts to more straightforwardly verify memory_order_relaxed accesses bear fruit, then perhaps memory_order_relaxed might be permitted in Rust safe mode.
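As a sketch of the distributed-counter use case mentioned above, consider per-thread counters updated with relaxed atomics and summed on read; the names and fixed thread count are assumptions for illustration only:

#include <stdatomic.h>

#define NR_COUNTER_THREADS 16

static atomic_long counters[NR_COUNTER_THREADS];

/* Each thread increments only its own slot; relaxed ordering suffices
 * because no other data is published via these counters. */
void count_event(int tid)
{
        atomic_fetch_add_explicit(&counters[tid], 1, memory_order_relaxed);
}

/* Readers sum all slots, accepting an approximate, possibly stale total. */
long read_count(void)
{
        long sum = 0;

        for (int i = 0; i < NR_COUNTER_THREADS; i++)
                sum += atomic_load_explicit(&counters[i], memory_order_relaxed);
        return sum;
}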

Finally, one could argue that memory_order_consume should be excluded entirely, rather than simply being relegated to unsafe mode. However, some Rust libraries already use RCU internally, and the ability to flag RCU pointer traversals could prove helpful should Rust someday fully support address and data dependencies.

In contrast, as noted earlier, the other four memory_order enum members are heavily used and routinely analyzed. It is therefore reasonable to permit use of memory_order_acquire, memory_order_release, memory_order_acq_rel, and memory_order_seq_cst within Rust safe code.

What Does This Mean for the Linux Kernel?

As it turns out, not much.

Please keep in mind that the Linux kernel currently interoperates with the full range of memory models used by arbitrary userspace code written in arbitrary languages when running on a wide range of hardware memory models. Any adjustments required are currently handled by architecture specific code, for example, in the system-call layer or in exception entry/exit code. Strong ordering is also provided deeper within the Linux kernel, with but one example being the context-switch code, which must provide full ordering when processes migrate from one CPU to another.

It turns out that Rust code in Linux has wrappers around any C code that it might invoke, and it is through these wrappers that Rust code will make use of the non-Rust portions of the Linux kernel. These wrappers will therefore contain calls to whichever memory-ordering primitives might be required at any particular point in Rust's memory-model evolution.

But Does Rust Really Need to Define Its Memory Model Right Now?

This is of course not my decision.

But it is only fair to point out that the longer the Rust community waits, the more the current “kinda sorta” choice of the full C/C++ memory model will become the actual long-term choice. Including those portions of the C/C++ memory model that have proven troublesome, and that are therefore likely to be subject to change.

My recommendation is therefore to adopt the untroublesome portions of the C/C++ memory model in safe mode, and the rest in unsafe mode. This approach allows people to write work-a-day concurrent algorithms in Rust and to be confident that the resulting code will still work in the future. It also allows those needing to live more dangerously to confine their code to unsafe blocks so as to attract an appropriate level of skepticism and attention.

But again, this decision rests not with me, but with the Rust community.

History

November 4, 2021: Unmarked Rust-language accesses are never atomic operations plus some wordsmithing.

November 03, 2021 05:14 PM

November 01, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: Waiting for Grace Periods From NMI Handlers

Suppose that you had a state machine implemented by NMI handlers, and that some of the transitions in this state machine need to wait for an RCU grace period to elapse. How could these state transitions be implemented?

Before we start, let's dispense with a couple of silly options. First, we clearly are not going to invoke synchronize_rcu() from within an NMI handler. This function can sleep, and we cannot sleep even within non-threaded interrupt handlers, let alone within NMI handlers. Second, we are not going to invoke call_rcu() from within an NMI handler. This function disables interrupts to exclude concurrent invocations on a single CPU, and disabling interrupts does nothing to keep NMI handlers from running. Worse yet, when running on rcu_nocbs CPUs (which offload callback invocations to rcuo kthreads), call_rcu() acquires a lock. Therefore, invoking call_rcu() from within an NMI handler would at best result in deadlock, and might also result in corruption of RCU's lists of callbacks.

So what can we do?

One approach would be to use a self-spawning RCU callback, that is, a callback that invokes call_rcu() on itself. (Yes, this is perfectly legal, just as it is with timer handlers. See the function rcu_torture_fwd_prog_cb() in kernel/rcu/rcutorture.c of v5.14 of the Linux kernel for an example.) This callback could also increment a counter that could be checked by the NMI handler. When the NMI handler wished to defer a state transition until the end of a future RCU grace period, it could transition to an additional “waiting for RCU” state while recording the value of that counter. Later, when the counter had advanced sufficiently, a subsequent NMI handler could complete the transition to the desired state.
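Here is a minimal sketch of such a self-spawning callback; the names nmi_gp_seq, nmi_gp_rh, and nmi_gp_start() are hypothetical, and setup/teardown details are omitted:

static atomic_t nmi_gp_seq;             /* Incremented once per grace period. */
static struct rcu_head nmi_gp_rh;

static void nmi_gp_cb(struct rcu_head *rhp)
{
        atomic_inc(&nmi_gp_seq);        /* Another grace period has elapsed. */
        call_rcu(rhp, nmi_gp_cb);       /* Re-arm for the next grace period. */
}

/* Start the self-spawning callback, for example, at initialization time. */
static void nmi_gp_start(void)
{
        call_rcu(&nmi_gp_rh, nmi_gp_cb);
}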

Of course, it is not sufficient to wait for the counter to advance just once. The reason is that the initial NMI might have occurred just before the RCU callback executed, and the next NMI might happen just afterwards. Yes, the counter has advanced, but almost no time has elapsed, much less a full RCU grace period. It is instead necessary to wait for the counter to advance by two, and also to have the needed memory barriers in place.

But there is a better way that makes use of RCU's existing counters and memory barriers. RCU provides these four functions for this purpose, two of which are usable from NMI handlers:


  1. get_state_synchronize_rcu(), which was first used in v4.10 in 2015, returns a “cookie” that can be checked later. SRCU provides a similar get_state_synchronize_srcu() API.
  2. start_poll_synchronize_rcu() returns a cookie as well, but also ensures that the needed RCU grace period gets scheduled. Unfortunately, this last requires locks to be acquired, which precludes its use in an NMI handler. SRCU provides a similar start_poll_synchronize_srcu() API, which was first used in v5.14 in 2021.
  3. poll_state_synchronize_rcu() takes a cookie from the first two functions and returns true if the corresponding RCU grace period has elapsed. SRCU provides a similar poll_state_synchronize_srcu() API, which was first used in v5.14 in 2021.
  4. cond_synchronize_rcu(), which was first used in v4.10 in 2015, also takes a cookie from the first two functions, but waits (if needed) until the corresponding RCU grace period has elapsed. Unfortunately, the possibility of waiting precludes this function's use in an NMI handler.

So the first NMI handler can invoke get_state_synchronize_rcu() to obtain a cookie, then transition to the additional state. Later NMI handlers residing in this additional state could pass that cookie to poll_state_synchronize_rcu(), completing the transition if this function returns true. On busy systems, RCU grace periods are being initiated by many other things, so that there is almost always a grace period in progress, but if this must work on quiet systems, the aforementioned self-spawning callback could be used to force the issue when needed.
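The following minimal sketch shows how such a state machine might use these two functions. The state names and cookie variable are illustrative, and NMI handlers are assumed to be serialized with respect to each other:

enum nmi_state { NMI_STATE_A, NMI_STATE_WAIT_GP, NMI_STATE_B };

static enum nmi_state nmi_state = NMI_STATE_A;
static unsigned long nmi_gp_cookie;

void example_nmi_handler(void)
{
  switch (nmi_state) {
  case NMI_STATE_A:
    nmi_gp_cookie = get_state_synchronize_rcu();  /* Record grace-period state. */
    nmi_state = NMI_STATE_WAIT_GP;
    break;
  case NMI_STATE_WAIT_GP:
    if (poll_state_synchronize_rcu(nmi_gp_cookie))  /* Full grace period elapsed? */
      nmi_state = NMI_STATE_B;
    break;
  case NMI_STATE_B:
    /* Carry out processing appropriate to the new state. */
    break;
  }
}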

Of course, RCU has made internal use of grace-period polling for a very long time, starting prior to the beginning of the Linux-kernel git repository in 2005.

In short, NMI handlers can now work with both RCU and SRCU grace periods without the need to invent counters or to worry about memory-ordering issues!

November 01, 2021 07:40 PM

October 15, 2021

Paul E. Mc Kenney: Verification Challenges

You would like to do some formal verification of C code? Or you would like a challenge for your formal-verification tool? Either way, here you go!

  1. Stupid RCU Tricks: rcutorture Catches an RCU Bug, also known as Verification Challenge 1.
  2. Verification Challenge 2: RCU NO_HZ_FULL_SYSIDLE.
  3. Verification Challenge 3: cbmc.
  4. Verification Challenge 4: Tiny RCU.
  5. Verification Challenge 5: Uses of RCU.
  6. Verification Challenge 6: Linux-Kernel Tree RCU.
  7. Verification Challenge 7: Heavy Modifications to Linux-Kernel Tree RCU.

October 15, 2021 06:22 PM

October 13, 2021

Paul E. Mc Kenney: TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel

These recommendations assume that the initial Linux-kernel targets for Rust developers are device drivers that do not have unusual performance and scalability requirements, meaning that wrappering of small C-language functions is tolerable. (Please note that most device drivers fit into this category.) They also assume that the main goal is to reduce memory-safety bugs, although other bugs might be addressed as well. Or, Murphy being Murphy, created as well. But that is a risk in all software development, not just Rust in the Linux kernel.

Those interested in getting Rust into Linux-kernel device drivers sooner rather than later should look at the short-term recommendations, while those interested in extending Rust's (and, for that matter, C's) concurrency capabilities might be more interested in the long-term recommendations.

Short-Term Recommendations

The goal here is to allow the Rust language considerable memory-model flexibility while also providing Rust considerable freedom in what it might be used for within the Linux kernel. The recommendations are as follows:

  1. Provide wrappers around the existing Linux kernel's locking, MMIO, I/O-barrier, and I/O-access primitives. If I were doing this work, I would add wrappers incrementally as they were actually needed, but I freely admit that there are benefits to providing a full set of wrappers from the get-go.
  2. Either wrapper READ_ONCE() or be careful to take Alpha's, Itanium's, and ARMv8's requirements into account as needed. The situation with WRITE_ONCE() appears more straightforward. The same considerations apply to the atomic_read() and atomic_set() family of primitives.
  3. Atomic read-modify-write primitives should be made available to Rust programs via wrappers around C code. But who knows? Maybe it is once again time to try out intrinsics. Again, if I were doing the work, I would add wrappers/implementations incrementally.
  4. Although there has been significant discussion surrounding how sequence locking and RCU might be handled by Rust code, more work appears to be needed. For one thing, it is not clear that people proposing solutions are aware of the wide range of Linux-kernel use cases for these primitives. In the meantime, I recommend keeping direct use of sequence locking and RCU in C code, and providing Rust wrappers for the resulting higher-level APIs. In this case, it might be wise to make good use of compiler directives in order to limit Rust's ability to apply code-motion optimizations. However, if you need sequence-locking or RCU Rust-language wrappers, there are a number that have been proposed. (Except that you should first take another look or three at wrappering higher-level APIs. Yes, APIs are hard, but there are huge benefits to proper APIs!)
  5. Control dependencies should be confined to C code. If needed, higher-level APIs whose C-language implementations require control dependencies can be wrappered for Rust use. But my best guess is that it will be some time before Rust code needs control dependencies.

Taking this approach should avoid memory-model-mismatch issues between Rust and C code in the Linux kernel. And discussions indicate that much (and maybe all) of the wrappering work called out in the first three items above has already been done.

Of course, situations that do not fit this set of recommendations can be addressed on a case-by-case basis. I would of course be happy to help. Specifically, in my role as lead Linux-kernel RCU maintainer:

  1. My first reaction to submission of Rust wrappers for the RCU API will be to ask hard questions about the possibility of higher-level APIs.
  2. If higher-level APIs are infeasible, I will look carefully at which RCU use cases are supported by the proposed wrappering. I am working to improve documentation of the range of RCU use cases in order to help submitters to do this as well. (This work will likely be helpful elsewhere as well.)
  3. Again, if higher-level APIs are infeasible, I will look carefully at how the Rust wrappers are helping to find bugs. This will clearly require me to continue learning Rust, as it requires a detailed view of Rust's ownership mechanisms and of how things like reference-counting use cases work around limitations in these mechanisms.

This procedure should help buy the time required for me to learn more about the Rust language and for the Rust community to learn more about RCU and its many use cases.

Long-Term Recommendations

This section takes a more utopian view. What would a perfect Rust sequence-locking implementation/wrapper look like? Here are some off-the-cuff desiderata, none of which are met by the current C-code Linux-kernel implementation:

  1. Use of quantities computed from sequence-lock-protected variables in a failed reader should result in a warning. But please note that reliably associating variables with sequence locks may not be easy. One approach suggested in response to this series is to supply a closure containing the read-side critical section, thus restricting such leakage to (unsafe?) side effects.
  2. Improper access to variables not protected by the sequence lock should result in a warning. But please note that it is quite difficult to define "improper" in this context, let alone detect it in real code.
  3. Data races in failed sequence-lock readers should not cause failures. But please note that this is extremely difficult in general if the data races involve non-volatile C-language accesses in the reader. For example, the compiler would be within its rights to refetch the value after the old value had been checked. (This is why the sequence-locking post suggests marked accesses to sequence-lock-protected variables.)
  4. Data races involving sequence-locking updaters should be detected, for example, via KCSAN.

As noted elsewhere, use of a wrapper around the existing C-language implementation allows C and Rust to use the same sequence lock. This might or might not prove to be important.

Similarly, what would a perfect Rust RCU wrapper look like? Again, here are some off-the-cuff desiderata, similarly unmet by existing C-code Linux-kernel implementations:

  1. Make call_rcu() and friends cause the specified object to make an end-of-grace-period transition in other threads' ownership from readers to unowned. Again, not an immediate transition at the time of call_rcu() invocation, but rather a deferred transition that takes place at the end of some future grace period.
  2. Some Rust notion of type safety would be useful in slab caches flagged as SLAB_TYPESAFE_BY_RCU.
  3. Some Rust notion of existence guarantee would be useful for RCU readers.
  4. Detect pointers to RCU-protected objects being improperly leaked from RCU read-side critical sections, where proper leakage is possible via locks and reference counters. Perhaps an emphasis on closures is at least part of the answer.
  5. Detect mishandling of dependency-carrying pointers returned by rcu_dereference() and friends. Or perhaps introduce some marking of such pointers so that the compiler will avoid breaking the ordering. See the address/data dependency post for a list of C++ working papers moving towards this goal.

There has recently been some work attempting to make the C compiler understand control dependencies. Perhaps Rust could make something useful happen in this area, perhaps by providing markings that allow the compiler to associate the reads with the corresponding control-dependent writes.

Perhaps Rust can better understand more of the ownership schemes that are used within the Linux kernel.

Rationale

Please note that middle-term approaches are likely to be useful, for example, those that simply provide wrappers around the C-language RCU and sequence-locking implementations. However, such approaches also carry risks, especially during that time when the intersection of the set of people deeply understanding Rust with the set of people deeply understanding RCU and sequence locking is the empty set. For example, one risk that became apparent during the effort to add RCU to the C++ standard is that people new to RCU and sequence locking will latch onto the first use case that they encounter and focus solely on that use case. This has also been a recurring issue for concurrency experts who encounter sequence locking and RCU for the first time. This of course means that I need to do a better job of documenting RCU, its use cases, and the relationships between them.

This is not to say that per-use-case work is pointless. In fact, such work can be extremely valuable, if nothing else, in helping to build understanding of the overall problem. It is also quite possible that current RCU and sequence-locking use cases will resemble those of the future in the same way that the venerable "while" loop resembles the wide panoply of iterator-like constructs found not only in modern languages, but also in C code in the Linux kernel. In other words, perhaps Rust will focus on specific fearless-concurrency-friendly RCU/sequence-locking use cases. Except that the Linux kernel still contains "while" loops, as does a great deal of other software, which suggests that the low-level RCU and sequence-locking APIs will always be required. There will also be questions as to whether a given new-age use case is there to help developers using RCU and sequence locking on the one hand, or whether its main purpose is instead to work around a shortcoming in Rust on the other. "This should be good clean fun!" ;-)

Experience indicates that it will take significant time to sort out all of these issues. We should therefore make this time available by proceeding initially as described in the short-term recommendations.

Criteria for judging proposals include safety, performance, scalability, build time, development cost, maintenance effort, and so on, not necessarily in that order.

History

October 18, 2021: Add "Rationale" section.
October 20, 2021: Add a few expansions and clarifications.
October 21, 2021: RCU-specific recommendations.

October 13, 2021 09:08 PM

October 06, 2021

Paul E. Mc Kenney: Rusting the Linux Kernel: Summary and Conclusions

We have taken a quick trip through history, through a number of the differences between the Linux kernel and the C/C++ memory models, sequence locks, RCU, ownership, zombie pointers, and KCSAN. I give a big "thank you" to everyone who has contributed to this discussion, both publicly and privately. It has been an excellent learning experience for me, and I hope that it has also been helpful to all of you.

To date, Can Rust Code Own Sequence Locks? has proven the most popular by far. Porting Linux-kernel code using sequence locking to Rust turns out to be trickier than one might expect, in part due to the inherently data-racy nature of this synchronization primitive.

So what are those advocating use of Rust within the Linux kernel to do?

The first thing is to keep in mind that the Linux kernel comprises tens of millions of lines of code. One disadvantage of this situation is that the Linux kernel is not going to be ported to much of anything quickly or easily. One corresponding advantage is that it allows Rust-for-Linux developers to carefully pick their battles, focusing first on those portions of the Linux kernel where conversion to Rust might do the most good. Given that order-of-magnitude performance wins are unlikely, the focus is likely to be on proof of concept and on reduced bug rates. Given Rust's heavy focus on bugs stemming from undefined behavior, and given the Linux kernel's use of the -fno-strict-aliasing and -fno-strict-overflow compiler command-line options, bug-reduction choices will need to be made quite carefully. In addition, order-of-magnitude bug-rate differences across the source base provide a high noise floor that makes it more difficult to measure small bug-reduction rates. But perhaps enabling the movement of code into mainline and out of staging can be another useful goal. Or perhaps there is an especially buggy driver out there somewhere that could make good use of some Rust code.

Secondly, careful placement of interfaces between Rust and C code is necessary, especially if there is truth to the rumors that Rust does not inline C functions without the assistance of LTO. In addition, devilish details of Linux-kernel synchronization primitives and dependency handling may further constrain interface boundaries.

Finally, keep in mind that the Linux kernel is not written in standard C, but rather in a dialect that relies on gcc extensions, code-style standards, and external tools. Mastering these extensions, standards, and tools will of course require substantial time and effort, but on the other hand this situation provides precedent for Rust developers to also rely on extensions, style standards, and tools.

But what about the question that motivated this blog in the first place? What memory model should Linux-kernel Rust code use?

For Rust outside of the Linux kernel, the current state of compiler backends and the path of least resistance likely leads to something resembling the C/C++ memory model. Use of Rust in the Linux kernel is therefore not a substitute for continued participation in the C/C++ standards committees, much though some might wish otherwise.

Rust non-unsafe code does not depend much on the underlying memory model, at least assuming that unsafe Rust code and C code avoid undermining the data-race-free assumption of non-unsafe code. The need to preserve this assumption was in fact what inspired the blog post discussing KCSAN.

In contrast, Rust unsafe code must pay close attention to the underlying memory model, and in the case of the Linux kernel, the only reasonable choice is of course the Linux-kernel memory model. That said, any useful memory-ordering tool will also need to pay attention to safe Rust code in order to correctly evaluate outcomes. However, there is substantial flexibility, depending on exactly where the interfaces are placed:


  1. As noted above, forbidding use of Rust unsafe code within the Linux kernel would render this memory-model question moot. Data-race freedom for the win!
  2. Allowing Rust code to access shared variables only via atomic operations with acquire (for loads), release (for stores) or stronger ordering would allow Rust unsafe code to use that subset of the Linux-kernel memory model that mirrors the C/C++ memory model.
  3. Restricting use of sequence locks, RCU, control dependencies, lockless atomics, and non-standard locking (e.g., stores to shared variables within read-side reader-writer-locking critical sections) to C code would allow Rust unsafe code to use that subset of the Linux-kernel memory model that closely (but sadly, not exactly) mirrors the C/C++ memory model.
  4. Case-by-case relaxation of the restrictions called out in the preceding pair of items pulls in the corresponding portions of the Linux-kernel memory model. For example, adding sequence locking, but only in cases where readers access only a limited co-located set of objects might permit some of the simpler Rust sequence-locking implementations to be used (see for example the comments to the sequence-locking post).
  5. Insisting that Rust be able to do everything that Linux-kernel C code currently does pulls in the entirety of the Linux-kernel memory model, including those parts that have motivated many of my years of C/C++ standards-committee fun and excitement.

With the first three options, a Rust compiler adhering to the C/C++ memory model will work just fine for Linux-kernel C code. In contrast, the last two options would require code-style restrictions (preferably automated), just as they are in current Linux-kernel C code. See the recommendations post for more detail.

In short, choose wisely and be very careful what you wish for! ;-)

History

October 7, 2021: Added atomic-operation-only option to memory-model spectrum.
October 8, 2021: Fix typo noted by Miguel Ojeda
October 12, 2021: Self-review.
October 13, 2021: Add reference to recommendations post.

October 06, 2021 09:48 PM

Paul E. Mc Kenney: Can the Kernel Concurrency Sanitizer Own Rust Code?

Given the data-race-freedom guarantees of Rust's non-unsafe code, one might reasonably argue that there is no point in the Kernel Concurrency Sanitizer (KCSAN) analyzing such code. However, the Linux kernel is going to need unsafe Rust code, and although there has been significant progress in formal verification of such code, one of the leading researchers in this area says “we hope to eventually develop formal methods that can be put directly in the hands of programmers”. We would be wise to take careful note of the word “eventually”. Furthermore, even given unanticipated universal acclamation of Rust within the Linux kernel community combined with equally unanticipated advances in C-to-Rust translation capabilities, a significant fraction of the existing tens of millions of lines of Linux-kernel C code will persist for some time to come. Both the unsafe Rust code and the C code can interfere with Rust non-unsafe code, and furthermore there are special cases where safe code can violate unsafe code's assumptions. Therefore, run-time analysis of Rust code (safe code included) is likely to be able to find issues that compile-time analysis cannot.

As a result, Rust seems unlikely to render KCSAN obsolete any time soon. This suggests that Rust code should include KCSAN instrumentation, both via the compiler and via atomic and volatile primitives, as is currently the case for the C compilers and primitives currently in use. The public KCSAN interface is as follows:


  1. __kcsan_check_access(): This function makes a relevant memory access known to KCSAN. It takes a pointer to the object being accessed, the size of the object, and an access type (for example, zero for read, KCSAN_ACCESS_WRITE for write, and KCSAN_ACCESS_ATOMIC for atomic read-modify-write).
  2. The __tsan_{read,write}{1,2,4,8}() family of functions serve the same purpose as __kcsan_check_access(), but the size and access type are implicit in the function name instead of being passed as arguments, which can provide better performance. This family of functions is used by KCSAN-enabled compilers.
  3. ASSERT_EXCLUSIVE_ACCESS() is invoked from C code and causes KCSAN to complain if there is a concurrent access to the specified object. A companion ASSERT_EXCLUSIVE_ACCESS_SCOPED() expands the scope of the concurrent-access complaint to the full extent of the compound statement invoking this macro.
  4. ASSERT_EXCLUSIVE_WRITER() is invoked from C code and causes KCSAN to complain if there is a concurrent write to the specified object. A companion ASSERT_EXCLUSIVE_WRITER_SCOPED() expands the scope of the concurrent-writer complaint to the full extent of the compound statement invoking this macro.
  5. ASSERT_EXCLUSIVE_BITS() is similar to ASSERT_EXCLUSIVE_WRITER(), but focuses KCSAN's attention on changes to bits set in the mask passed as this macro's second argument.

These last three categories of interface members are designed to be directly invoked in order to check concurrency designs. See for example the direct call to ASSERT_EXCLUSIVE_WRITER() from the rcu_cpu_starting() function in Linux-kernel RCU. It would of course be quite useful for these interface members to be available from within Rust code.

KCSAN also provides a data_race() primitive that suppresses data-race diagnostics, but allows the compiler free rein on optimization. This is used in the Linux kernel for things like infrequent diagnostics that are not part of the overall concurrency design.
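As a rough illustration of how these interface members fit into a concurrency design, consider the following sketch, in which the counter, its lock, and the functions are hypothetical:

static DEFINE_SPINLOCK(ctr_lock);
static unsigned long ctr;

void ctr_update(void)
{
  spin_lock(&ctr_lock);
  ASSERT_EXCLUSIVE_WRITER(ctr);  /* KCSAN complains about any concurrent writer. */
  WRITE_ONCE(ctr, ctr + 1);      /* Marked write, so lockless readers are permitted. */
  spin_unlock(&ctr_lock);
}

void ctr_show(void)
{
  pr_info("ctr = %lu\n", data_race(ctr));  /* Diagnostic-only read, KCSAN reports suppressed. */
}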

KCSAN has helped me fix a number of embarrassing bugs in Linux-kernel RCU that might otherwise have inconvenienced end users, so it just might be helpful to people introducing Rust code into the Linux kernel. It is therefore quite encouraging that Marco Elver (KCSAN lead developer and maintainer) points out off-list that the rustc compiler already has sanitizer support, which should suffice not only for KCSAN, but for the equally helpful Kernel Address Sanitizer (KASAN) as well. He also notes that some care is required to pass in the relevant -mllvm options to rustc and to ensure that it is not attempting to link against compiler-rt because doing so is not appropriate when building the Linux kernel. Marco also noted that KCSAN relies on the front-end to properly handle volatile accesses, and Miguel Ojeda kindly confirmed that it does. So there is hope!

For more information on KCSAN and its reason for being:

  1. "Concurrency bugs should fear the big bad data-race detector" (Part 1 and Part 2).
  2. Who's afraid of a big bad optimizing compiler?.
  3. Calibrating your fear of big bad optimizing compilers.
  4. Sections 4.3.4 ("Accessing Shared Variables"), 15.2 ("Tricks and Traps"), and 15.3 ("Compile-Time Consternation") of perfbook.


History

October 7, 2021: Add current state of KCSAN/Rust integration. Clarify Rust module per Gary Guo email.
October 12, 2021: Self-review.
October 28, 2021: Add data_race() primitive.

October 06, 2021 07:11 PM

October 05, 2021

Paul E. Mc Kenney: Will Your Rust Code Survive the Attack of the Zombie Pointers?

Some of the previous posts in this series have been said to be quite difficult, so I figured I owed you all an easy one. And the zombie-pointer problem really does have a trivial solution, at least in the context of the Linux kernel. In other environments, all bets are off.

But first, what on earth is a zombie pointer?

To answer that question, let's start with a last-in-first-out (LIFO) stack. We don't know when this algorithm was invented or who invented it, but a very similar algorithm was described in R. J. Treiber's classic 1986 technical report (IBM Almaden Research Center RJ 5118), and it was referred to as prior art in (expired) US Patent 3,886,525, which was filed in 1973 (hat trick to Maged Michael). Fun though it might be to analyze this code's original IBM assembly-language implementation, let's instead look at a C11 implementation:

 1 struct node_t* _Atomic top;
 2
 3 void list_push(value_t v)
 4 {
 5   struct node_t *newnode = (struct node_t *) malloc(sizeof(*newnode));
 6
 7   set_value(newnode, v);
 8   newnode->next = atomic_load(&top);
 9   do {
10     // newnode->next may have become invalid
11   } while (!atomic_compare_exchange_weak(&top, &newnode->next, newnode));
12 }
13
14 void list_pop_all()
15 {
16   struct node_t *p = atomic_exchange(&top, NULL);
17
18   while (p) {
19     struct node_t *next = p->next;
20      
21     foo(p);
22     free(p);
23     p = next;
24   }
25 }

Those used to the Linux-kernel cmpxchg() atomic operation should keep in mind that when the C11 atomic_compare_exchange_weak() fails, it writes the current value referenced by its first argument to the pointer passed as its second argument. In other words, instead of passing atomic_compare_exchange_weak() the new value, you instead pass it a pointer to the new value. Each time atomic_compare_exchange_weak() fails, it updates the pointed-to new value. This sometimes allows this function to be used in a tight loop, as in line 11 above.

With these atomic_compare_exchange_weak() semantics in mind, let's step through execution of list_push(C) on a stack initially containing objects A and B:

  1. Line 5 allocates memory for the new object C.
  2. Line 7 initializes this newly allocated memory.
  3. Line 8 sets the new object's ->next pointer to the current top-of-stack pointer.
  4. Line 11 invokes atomic_compare_exchange_weak(), which atomically compares the value of newnode->next to the value of top, and if they are equal, stores the pointer newnode into top and returns true. (As noted before, if the comparison is instead not-equal, atomic_compare_exchange_weak() instead stores the value of top into newnode->next and returns false, repeating until such time as the pointers compare equal.)
  5. One way or another, the stack ends up containing objects C, A, and B, ordered from the top of stack down.

Oddly enough, this code would still work even if line 8 were omitted; however, that would result in a high probability that the first call to atomic_compare_exchange_weak() would fail, which would not be so good for common-case performance. It would also result in compile-time complaints about uninitialized variables, so line 8 is doubly unlikely to be omitted in real life.

While we are at it, let's step through list_pop_all():

  1. Line 16 atomically stores NULL into top and returns its previous value. After this line executes, the stack is empty and its previous contents are referenced by local variable p.
  2. Lines 18-24 execute for each object on list p:
    1. Line 19 prefetches a pointer to the next object.
    2. Line 21 passes the current object to function foo().
    3. Line 22 frees the current object.
    4. Line 23 advances to the next object.

Thus far, we have seen no zombie pointers. Let's therefore consider the following sequence of events, with the stack initially containing objects A and B:

  1. Thread 1 starts pushing object C, but is interrupted, preempted, or otherwise impeded just after executing line 8.
  2. Thread 2 pops the entire list and frees all of the objects. Object C's ->next pointer is now invalid because it points to now-freed memory formerly known as object A.
  3. Thread 2 pushes an object, and happens to allocate memory for it at the same address as the old object A, so let's call it A'.
  4. Object C's ->next pointer is still invalid as far as an omniscient compiler is concerned, but from an assembly language viewpoint happens to reference the type-compatible object A'. This pointer is now a "zombie pointer" that has come back from the dead, or that has at least come to have a more entertaining form of invalidity.
  5. Thread 1 resumes and executes the atomic_compare_exchange_weak() on line 11. Now this primitive, like its Linux-kernel counterpart cmpxchg(), operates not on the compiler's idea of what the pointer is, but rather on the actual bits making up that pointer. Therefore, that atomic_compare_exchange_weak() succeeds despite the fact that the object C's ->next pointer is invalid.
  6. Thread 1 has thus completed its push of object C, and the resulting stack looks great from an assembly-language viewpoint. But the pointer from object C to A' is a zombie pointer, and compilers are therefore within their rights to do arbitrarily strange things with it.

As noted earlier, this has a trivial solution within the Linux kernel. Please do not peek too early. ;-)

But what can be done outside of the Linux kernel?

The traditional solution has been to hide the allocator from the compiler. After all, if the compiler cannot see the calls to malloc() and free(), it cannot figure out that a given pointer is invalid (or, in C-standard parlance, indeterminate). Creating your own allocator with its own function names suffices, and back in the day you often needed to create your own allocator if performance and scalability were important to you.

However, this would also deprive Rust of information that it needs to check pointer operations. So maybe Rust needs to know about allocation and deallocation, but needs to be very careful not to pass this information on to the optimizer, assuming that this even makes sense in the context of the implementation of Rust.

Perhaps code working with invalid and zombie pointers needs to be carefully written in C. And perhaps there is some way to write it safely in unsafe Rust.

The final approach is to wait until backend support for safe handling of invalid/zombie pointers arrives, and then add Rust syntax as appropriate. Here is some documentation of current efforts towards adding such support to C++:

  1. CPPCON 2020 presentation
  2. Pointer lifetime-end zap (informational/historical)
  3. Pointer lifetime-end zap proposed solutions

So there is good progress, but it might still be some time before this is supported by either the standard or by compilers.

History

October 6, 2021: Expand description per feedback from Jonathan Corbet
October 12, 2021: Self-review.

October 05, 2021 06:40 PM

October 04, 2021

Paul E. Mc Kenney: How Much of the Kernel Can Rust Own?

Rust concurrency makes heavy use of ownership and borrowing. The purpose of this post is not to give an exposition of Rust's capabilities and limitations in this area, but rather to give a series of examples of ownership in the Linux kernel.

The first example involves Linux-kernel per-CPU variables. In some cases, such variables are protected by per-CPU locks, for example, a number of fields in the per-CPU rcu_data structure are used by the kernel threads that manage grace periods for offloaded callbacks, and these fields are protected by the ->nocb_gp_lock field in the same instance of that same structure. In other cases, access to a given per-CPU variable is permitted only by the corresponding CPU, and even then only if that CPU has disabled preemption. For example, the per-CPU rcu_data structure's ->ticks_this_gp field may be updated only from the corresponding CPU, and only when preemption is disabled. In this particular case, preemption is disabled as a side-effect of having disabled interrupts.

The second example builds on the first. In kernels built with CONFIG_RCU_NOCB_CPU=n, the per-CPU rcu_data structure's ->cblist field may be updated only from the corresponding CPU, and only when preemption is disabled. However, it is also allowed from some other CPU when the corresponding CPU has been taken offline, but only when that other CPU is the one orchestrating the offlining of the corresponding CPU.

(What about kernels built with CONFIG_RCU_NOCB_CPU=y? They must also acquire a ->nocb_lock that is also contained within the per-CPU rcu_data structure.)

The third example moves to the non-per-CPU rcu_node structures, which are arranged into a combining tree that processes events from an arbitrarily large number of CPUs while maintaining bounded lock contention. Each rcu_node structure has a ->qsmask bitmask that tracks which of its children need to report a quiescent state for the current grace period, and a ->gp_tasks pointer that, when non-NULL, references a list of tasks that blocked while in an RCU read-side critical section that blocks the current grace period. The children of each leaf rcu_node structure are the rcu_data structures feeding into that rcu_node structure. Finally, there is a singleton rcu_state structure that contains a ->gp_seq field that numbers the current grace period and also indicates whether or not it is in progress.

We now have enough information to state the ownership rule for the rcu_state structure's ->gp_seq field. This field may be updated only if all of the following hold:


  1. The current CPU holds the root rcu_node structure's ->lock.
  2. All rcu_node structures' ->qsmask fields are zero.
  3. All rcu_node structures' ->gp_tasks pointers are NULL.

This might seem excessively ornate, but it is the natural consequence of RCU's semantics and design:

  1. The next grace period is not permitted to start until after the previous one has ended. This is a design choice that allows RCU to function well in the face of update-side overload from large numbers of CPUs.
  2. A grace period is not allowed to end until all CPUs have been observed in a quiescent state, that is, until all rcu_node structures' ->qsmask fields have become zero.
  3. A grace period is not allowed to end until all tasks that were preempted while executing within a critical section blocking that grace period have resumed and exited their critical sections, that is, until all rcu_node structures' ->gp_tasks pointers have become NULL.
  4. If multiple CPUs would like to start a grace period at the same time, there has to be something that works out which CPU is actually going to do the starting, and the last part of that something is the root rcu_node structure's ->lock.

Trust me, there are far more complex ownership models in the Linux kernel!

The fourth example involves per-CPU data that is protected by a reader-writer lock. A CPU is permitted to both read and update its own data only if it read-holds the lock. A CPU is permitted to read other CPUs' data only if it write-holds the lock. That is right, CPUs are allowed to write when read-holding the lock and are allowed to read while write-holding the lock. Perhaps reader-writer locks should instead have been called exclusive-shared locks, but it is some decades too late now!

The fifth and final example involves multiple locks, for example, when some readers must traverse the data structure from an interrupt handler and others must sleep while traversing that same structure. One way to implement this is to have one interrupt-disabled spinlock (for readers in interrupt handlers) and a mutex (for readers that must sleep during the traversal). Updaters must then hold both locks.
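A minimal sketch of this last example appears below, with illustrative lock names and with the assumption that updaters run in process context with interrupts enabled:

static DEFINE_SPINLOCK(irq_reader_lock);   /* Taken by readers in interrupt handlers. */
static DEFINE_MUTEX(sleeping_reader_lock); /* Taken by readers that might sleep. */

static void example_update(void)
{
  mutex_lock(&sleeping_reader_lock);
  spin_lock_irq(&irq_reader_lock);
  /* ... carry out the update, having excluded both classes of readers ... */
  spin_unlock_irq(&irq_reader_lock);
  mutex_unlock(&sleeping_reader_lock);
}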

So how on earth does the current C-language Linux kernel code keep all this straight???

One indispensable tool is the assertion, which in the Linux kernel includes WARN(), BUG(), and friends. For example, RCU uses these to verify the values of the aforementioned ->qsmask and ->gp_tasks fields during between-grace-periods traversals of the rcu_node combining tree.

Another indispensable tool is lockdep, which, in addition to checking for deadlocks, allows code to verify that conditions are right. A few of the many available lockdep assertions are shown below:

  1. lockdep_assert_held(): Complains if the specified lock is not held by the current CPU/task.
  2. lockdep_assert_held_write(): Complains if the specified reader-writer lock is not write-held by the current CPU/task.
  3. lockdep_assert_held_read(): Complains if the specified reader-writer lock is not read-held by the current CPU/task.
  4. lockdep_assert_none_held_once(): Complains if the current CPU/task holds any locks.
  5. lockdep_assert_irqs_disabled(): Complains if the current CPU has interrupts enabled.
  6. rcu_read_lock_held(): Complains if the current CPU/task is not within an RCU read-side critical section.

Of course, none of these assertions are going to do anything for you unless you make use of them. On the other hand, some of the more complex ownership criteria are going to be quite difficult for compilers or other tooling to intuit without significant help from the developer.
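For example, a function that must be invoked with a given lock held might use lockdep_assert_held() as in the following sketch, in which the foo structure and its update function are hypothetical:

struct foo {
  spinlock_t lock;
  int value;
};

static void foo_update(struct foo *fp, int v)
{
  lockdep_assert_held(&fp->lock);  /* Complain unless the caller holds fp->lock. */
  fp->value = v;
}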

A more entertaining ownership scenario is described in Section 5.4.6 ("Applying Exact Limit Counters") of perfbook. Chapter 8 ("Data Ownership") describes several other ownership scenarios.

History

October 12, 2021: Self-review.

October 04, 2021 11:37 PM

Paul E. Mc Kenney: Can Rust Code Own RCU?

Read-copy update (RCU) vaguely resembles a reader-writer lock [1], but one in which readers do not exclude writers. This change in semantics permits RCU readers to be exceedingly fast and scalable. In fact, in the most aggressive case, rcu_read_lock() and rcu_read_unlock() generate no code, and rcu_dereference() emits but a single load instruction. This most aggressive case is achieved in production via Linux-kernel CONFIG_PREEMPT_NONE=y builds. Additional information on RCU is presented at the end of this post.

Can Rust Ownership be Adapted to RCU?

The fact that RCU readers do not exclude writers opens the door to data races between RCU readers and updaters. This in turn suggests that at least some RCU use cases are likely to pose challenges to Rust's ownership semantics. However, discussions with Wedson Almeida Filho at Linux Plumbers Conference suggested that these semantics might be flexed a little bit, perhaps in a manner similar to the way that Linux-kernel RCU flexes lockdep's semantics. In addition, the data races between the read-side rcu_dereference() primitive and the update-side rcu_assign_pointer() primitive might reasonably be resolved by having the Rust implementation of these two primitives use unsafe mode. Or, perhaps better yet, the Rust implementation of these two primitives might simply wrapper the Linux-kernel primitives in order to get the benefit of existing tools such as the aforementioned lockdep and the sparse static-analysis checker. One downside of this approach is potential performance degradation due to the extra function call implied by the wrappering. Longer term, LTO might eliminate this function call, but if the initial uses of Rust are confined to performance-insensitive device drivers, the extra overhead should not be a problem.

Linux-Kernel Tooling and RCU

Here are a few examples of these Linux-kernel tools:

  1. A pointer can be marked __rcu, which will cause the sparse static-analysis checker to complain if that pointer is accessed directly, instead of via rcu_dereference() and rcu_assign_pointer(). This checking can help developers avoid accidental destruction of the address and data dependencies on which many RCU use cases depend.
  2. In most RCU use cases, the rcu_dereference() primitive that accesses RCU-protected pointers must itself be protected by rcu_read_lock(). The Linux-kernel lockdep facility has therefore been adapted to complain about unprotected uses of rcu_dereference().
  3. Passing the same pointer twice in quick succession to a pair of call_rcu() invocations is just as bad as doing the same with a pair of kfree() invocations: Both situations are a double free. The Linux kernel's debug-objects facility checks for this sort of abuse.

This is by no means a complete list. Although I am by no means claiming that these sorts of tools provide the same degree of safety as does Rust safe mode, it is important to understand that Rust unsafe mode is not to be compared to standard C, but rather to the dialect of C used in the Linux kernel, augmented by the associated tools, processes, and coding guidelines.

RCU Use Cases

There are in fact Rust implementations of variants of RCU, including:

  1. crossbeam-rs
  2. ArcSwapAny
I currently have no opinion on whether or not these could be used within the Linux kernel, and any such opinion is irrelevant. You see, Rust's use of RCU within the Linux kernel would need to access RCU-protected data structures provided by existing C code. In this case, it will not help for Rust code to define its own RCU. It must instead interoperate with the existing Linux-kernel RCU implementations.

But even interoperating with Linux-kernel RCU is not trivial. To help with this task, this section presents a few RCU use cases within the Linux kernel.

The first use case is the textbook example where once an RCU-protected object is exposed to readers, its data fields remain constant. This approach allows Rust's normal read-only mode of ownership to operate as intended. The pointers will of course change as objects are inserted or deleted, but these are handled by rcu_dereference() and rcu_assign_pointer() and friends. These special primitives might be implemented using unsafe Rust code or C code, as the case might be.
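A minimal C-language sketch of this textbook use case follows. The conf structure, its mutex, and the access functions are illustrative rather than taken from existing kernel code:

struct conf {
  int a;
  int b;
};

static struct conf __rcu *cur_conf;
static DEFINE_MUTEX(conf_mutex);

int conf_read_a(void)
{
  struct conf *p;
  int ret = -1;

  rcu_read_lock();
  p = rcu_dereference(cur_conf);
  if (p)
    ret = p->a;  /* Data fields are constant once the object is published. */
  rcu_read_unlock();
  return ret;
}

void conf_update(int a, int b)
{
  struct conf *newp = kzalloc(sizeof(*newp), GFP_KERNEL);
  struct conf *oldp;

  if (!newp)
    return;
  newp->a = a;
  newp->b = b;
  mutex_lock(&conf_mutex);
  oldp = rcu_dereference_protected(cur_conf, lockdep_is_held(&conf_mutex));
  rcu_assign_pointer(cur_conf, newp);
  mutex_unlock(&conf_mutex);
  synchronize_rcu();  /* Wait for pre-existing readers to complete. */
  kfree(oldp);
}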

In-place updates of RCU-protected objects are sometimes handled using the list_replace_rcu() primitive. In this use case, a new version of the object is allocated, the old version is copied to the new version, any required updates are carried out on the new version, and then list_replace_rcu() is used to make the new version available to RCU readers. Readers then see either the old version or the new version, but either way they see an object with constant data fields, again allowing Rust ownership to work as intended.

However, one complication arises for objects removed from an RCU-protected data structure. Such objects are not owned by the updater until after a grace period elapses, and they can in fact be accessed by readers in the meantime. It is not clear to me how Rust would handle this. For purposes of comparison, within the Linux kernel, the Kernel Concurrency Sanitizer (KCSAN) handles this situation using dynamic runtime checking. With this checking enabled, when an updater takes full ownership of a structure before readers are done with it, KCSAN reports a data race. These issues should be most immediately apparent to anyone attempting to create a Rust wrapper for the call_rcu() function.
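For reference, here is a minimal sketch of such a removal. The elem structure, its list, and its lock are illustrative, and the point is that the updater does not gain full ownership of the removed object until the grace period following the call_rcu() invocation has ended:

struct elem {
  struct list_head node;
  struct rcu_head rh;
  int data;
};

static LIST_HEAD(elem_list);
static DEFINE_SPINLOCK(elem_lock);

static void elem_free_cb(struct rcu_head *rhp)
{
  kfree(container_of(rhp, struct elem, rh));  /* Now safe: pre-existing readers are done. */
}

void elem_remove(struct elem *p)
{
  spin_lock(&elem_lock);
  list_del_rcu(&p->node);  /* Concurrent readers might still be referencing p. */
  spin_unlock(&elem_lock);
  call_rcu(&p->rh, elem_free_cb);  /* Defer the free until a grace period elapses. */
}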

Because RCU readers do not exclude RCU updaters, it is possible for an RCU reader to upgrade itself to an updater while traversing an RCU-protected data structure. This is usually handled by a lock embedded in the object itself. From what I understand, the easiest way to apply Rust ownership to these situations is to make the RCU-protected object contain a structure that in turn contains the data protected by the lock. People wishing to apply Rust to these sorts of situations are invited to review the Linux-kernel source code to check for possible complications, for example, cases where readers locklessly read fields that are updated under the protection of the lock, and cases where multiple locks provide one type of protection or another to overlapping subsets of fields in the RCU-protected object.

Again because RCU readers do not exclude RCU updaters, there can also be lockless in-place updates (either from readers or updaters) to RCU-protected objects, for example, to set flags, update statistics, or to provide type-safe memory for any number of lockless algorithms (for example, using SLAB_TYPESAFE_BY_RCU). Rust could presumably use appropriate atomic or volatile operations in this case.

One important special case of RCU readers locklessly reading updated-in-place data is the case where readers cannot be allowed to operate on an object that has been removed from the RCU-protected data structure. These situations normally involve a per-object flag and lock. Updaters acquire the object's lock, remove the object from the structure, set the object's flag, and release the lock. Readers check the flag, and if the flag is set, pretend that they did not find the corresponding object. This class of use cases illustrates how segregating read-side and update-side fields of an RCU-protected object can be non-trivial.
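The following sketch illustrates both the reader-to-updater upgrade described above and the removed-object flag check. The obj structure and the obj_find() and obj_do_something() helpers are hypothetical:

struct obj {
  spinlock_t lock;
  bool dead;  /* Set by updaters when the object is removed. */
};

struct obj *obj_find(int key);         /* Hypothetical RCU-protected lookup. */
void obj_do_something(struct obj *p);  /* Hypothetical per-object update. */

bool obj_use(int key)
{
  struct obj *p;
  bool ret = false;

  rcu_read_lock();
  p = obj_find(key);
  if (p) {
    spin_lock(&p->lock);  /* Upgrade from reader to updater. */
    if (!p->dead) {       /* Pretend removed objects were not found. */
      obj_do_something(p);
      ret = true;
    }
    spin_unlock(&p->lock);
  }
  rcu_read_unlock();
  return ret;
}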

RCU is sometimes used to protect combinations of data structures, sometimes involving nested RCU readers. In some cases, a given RCU read-side critical section might span a great many functions across several translation units.

It is not unusual for RCU use cases to include a search function that is invoked both by readers and updaters. This function will therefore sometimes be within an RCU read-side critical section and other times be protected by the update-side lock (or by some other update-side synchronization mechanism).

Sequence locking is sometimes used in conjunction with RCU, so that RCU protects traversal of the data structure and sequence locking detects updates profound enough to cause problems for the RCU traversals. The poster boy for this use case is in the Linux kernel's directory-entry cache. In this case, certain sequences of rename operations could fool readers into traversing pathnames that never actually existed. Using the sequence lock named rename_lock to protect such traversals allows readers to reject such bogus pathnames. More information on the directory-entry cache is available in Neil Brown's excellent Linux Weekly News series (part 1, part 2, part 3). One key take-away of this use case is that sequence locking is sometimes used to protect arbitrarily large data structures.

Sequence locking can be used to easily provide readers with a consistent view of a data structure, for a very wide range of definitions of "consistent". However, all of these techniques allow updaters to block readers. There are a number of other approaches to read-side consistency both within and across data structures, including the Issaquah Challenge.

One final use case is phased state change, where RCU readers check a state variable and take different actions depending on the current state. Updaters can change the state variable, but must wait for the completion of all readers that might have seen the old value before taking further action. The rcu-sync functionality (rcu_sync_init() and friends) implements a variant of RCU-protected phased state change. This use case is notable in that there are no linked data structures and nothing is ever freed.
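Setting the rcu-sync machinery aside, a minimal sketch of phased state change using vanilla RCU might look as follows, where the state variable and the do_old_thing()/do_new_thing() helpers are hypothetical:

static int new_state;

void do_old_thing(void);  /* Hypothetical old-state action. */
void do_new_thing(void);  /* Hypothetical new-state action. */

void reader(void)
{
  rcu_read_lock();
  if (READ_ONCE(new_state))
    do_new_thing();
  else
    do_old_thing();
  rcu_read_unlock();
}

void switch_to_new_state(void)
{
  WRITE_ONCE(new_state, 1);
  synchronize_rcu();  /* Wait for readers that might have seen the old state. */
  /* It is now safe to take actions incompatible with the old state. */
}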

This is by no means an exhaustive list of RCU use cases. RCU has been in the Linux kernel for almost 20 years, and during that time kernel developers have come up with any number of new ways of applying RCU to concurrency problems.

Rust RCU Options

The most straightforward approach is to provide Rust wrappers for the relevant C-language RCU API members. As noted earlier, wrappering the low-overhead read-side API members will introduce function-call overhead in non-LTO kernels, but this should not be a problem for performance-insensitive device drivers. However, fitting RCU into Rust's ownership model might not be as pretty as one might hope.

A more utopian approach would be to extend Rust's ownership model, one way or another, to understand RCU. One approach would be for Rust to gain "type safe" and "guaranteed existence" ownership modes, which would allow Rust to better diagnose abuses of the RCU API. For example, this might help with pointers to RCU-protected objects being erroneously leaked from RCU read-side critical sections.

So how might a pointer to an RCU-protected object be non-erroneously leaked from an RCU read-side critical section, you ask? One way is to acquire a reference via a reference count within that object before exiting that critical section. Another way is to acquire a lock within that object before exiting that critical section.

So what does the Linux kernel currently do to find this type of pointer leak? One approach is to enable KCSAN and build the kernel with CONFIG_RCU_STRICT_GRACE_PERIOD=y, and to run this on a small system. Nevertheless, this might be one area where Rust could improve Linux-kernel diagnostics, albeit not without effort.

For More Information

Additional information on RCU may be found here:

  1. Linux-kernel documentation, with special emphasis on Linux-kernel RCU's requirements.
  2. The following sections of perfbook discuss other aspects of RCU:

    1. Section 9.5 ("Read-Copy Update").
    2. Section 10.3 ("Read-Mostly Data Structures").
    3. Section 13.5 ("RCU Rescues").
    4. Section 15.4.2 ("RCU [memory-ordering properties ]").
  3. The RCU API, 2019 Edition.
  4. Userspace RCU.
  5. Folly-library RCU.
  6. Section 9.6.3.3 of perfbook lists several other RCU implementations.
  7. Proposed Wording for Concurrent Data Structures: Read-Copy-Update (RCU) (C++ Working Paper).

RCU's semantics ordered by increasing formality:

  1. The "Fundamental Requirements" section of the aforementioned Linux-kernel RCU requirements. Extremely informal, but you have to start somewhere!
  2. RCU Semantics: A First Attempt. Also quite informal, but with a bit more math involved.
  3. Verifying Highly Concurrent Algorithms with Grace (extended version). Formal semantics expressed in separation logic.
  4. Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel. Executable formal semantics expressed in Cat language. This paper also includes a proof that the traditional "wait for all pre-existing readers" semantic is equivalent to these executable formal semantics within the context of the Linux-kernel memory model.
  5. Linux-kernel memory model (LKMM), see especially the Sync-rcu and Sync-srcu relations in file linux-kernel.cat.

Furthermore, significant portions of Linux-kernel RCU have been subjected to mechanical proofs of correctness:

  1. Lihao Liang applied the C Bounded Model Checker (CBMC) to Tree RCU (paper).
  2. Michalis Kokologiannakis applied Nidhugg to Tree RCU (paper).
  3. Lance Roy applied CBMC to Classic SRCU, represented by Linux-kernel commit 418b2977b343 ("rcutorture: Add CBMC-based formal verification for SRCU"). Sadly, the scripting has not aged well.

For more information, see Verification Challenge 6.

For those interested in the userspace RCU library, there have been both manual and mechanical proofs of correctness. A manual proof of correctness of userspace RCU in the context of a linked list has also been carried out.

Endnotes

[1]  At its core, RCU is nothing more nor less than a way of waiting for already-started things to finish. Within the Linux kernel, something to be waited on is delimited by rcu_read_lock() and rcu_read_unlock(). For their part, synchronize_rcu() and call_rcu() do the waiting, synchronously and asynchronously, respectively.

It turns out that you can do quite a lot with this simple capability, including approximating some reader-writer-lock use cases, emulating a number of reference-counting use cases, providing type-safe memory, and providing existence guarantees, this last sometimes being pressed into service as a poor-man's garbage collector. Section 9.3.3 ("RCU Usage") of perfbook provides more detail on these and other RCU use cases.


History

October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add RCU-protected phased state change use case.
October 18, 2021: Add read/update common code and Tasseroti citation.

October 04, 2021 07:41 PM

October 02, 2021

Paul E. Mc Kenney: Can Rust Code Own Sequence Locks?

What Are Sequence Locks?

Sequence locks vaguely resemble reader-writer locks except that readers cannot block writers. If a writer runs concurrently with a reader, that reader is forced to retry its critical section. If a reader successfully completes its critical section (as in there were no concurrent writers), then its accesses will have been data-race free. Of course, an incessant stream of writers could starve all readers, and the Linux-kernel sequence-lock implementation provides extensions to deal with this in cases where incessant writing is a possibility. The usual approach is to instead arrange things so that writers are never incessant.

In the simple case where readers loop until they succeed, with no provisions for incessant writers, a reader looks like this:

do {
  seq = read_seqbegin(&test_seqlock);
  /* read-side access. */
} while (read_seqretry(&test_seqlock, seq));

A writer looks like this:

write_seqlock(&test_seqlock);
  /* Update */
write_sequnlock(&test_seqlock);

The read-side access can run concurrently with the writer's update, which is not consistent with the data-race-free nature of non-unsafe Rust. How can this be handled?

Linux-Kernel Sequence-Lock Use Cases

First, please note that if Rust-language sequence-locking readers are to interoperate with C-language sequence-locking updaters (or vice versa), then the Rust-language and C-language implementations must be compatible. One easy way to achieve such compatibility is to provide Rust code wrappers around the C-language implementation, or perhaps better yet, around higher-level C-language functions making use of sequence locking. Of course, if Rust-language use of sequence locks only ever accesses data that is never accessed by C-language code, then Rust could provide its own implementation of sequence locking. As always, choose carefully!

Next, let's look at a few classes of sequence-locking use cases within the Linux kernel.

One common use case is the textbook case where the read-side loop picks up a few values, all within a single function. For example, sequence locking is sometimes used to fetch a 64-bit value on a 32-bit system (see the sock_read_timestamp() function in include/net/sock.h). The remainder of this section looks at some additional use cases that receive significant Linux-kernel use:

  1. Readers can gather data from multiple sources, including functions in other translation units that are invoked via function pointers. See for example the timer_cs_read() function in arch/sparc/kernel/time_32.c.
  2. Readers often perform significant computations. See for example the badblocks_check() function in block/badblocks.c.
  3. Readers can contain sequence-lock writers for that same sequence lock, as is done in the sdma_flush() function in drivers/infiniband/hw/hfi1/sdma.c. Although one could reasonably argue that this could be structured in other ways, this approach does avoid an added check, which could be valuable on fastpaths.
  4. The call to read_seqretry() is sometimes buried in a helper function, for example, in the sdma_progress() function in drivers/infiniband/hw/hfi1/sdma.h and the follow_dotdot_rcu() function in fs/namei.c.
  5. Readers sometimes use memcpy(), as might be expected by those suggesting that sequence-lock readers should clone/copy the data. See for example the neigh_hh_output() function in include/net/neighbour.h.
  6. Sequence-locking readers are often used in conjunction with RCU readers, as is described in the Can Rust Code Own RCU? posting.

Some of these use cases impose significant constraints on sequence-locking implementations. If the constraints rule out all reasonable Rust implementations, one approach would be to provide different Rust implementations for different use cases. However, it would be far better to avoid further API explosion!

Rust Sequence-Lock Options

One approach would be for all sequence-lock critical sections to be marked unsafe, but this removes at least some of the rationale for switching from C to Rust. In addition, some believe that this approach can pose severe Rust-syntax challenges in some use cases.

Another approach is to require all sequence-lock critical sections to use marked accesses. This avoids the data races and presumably also the need for unsafe, but it prevents the kernel concurrency sanitizer (KCSAN) from locating data races involving sequence-lock write-side critical sections. Or might, if it were not for the use of kcsan_flat_atomic_begin() and kcsan_flat_atomic_end() in sequence-locking read-side primitives and the use of kcsan_nestable_atomic_begin() and kcsan_nestable_atomic_end() in sequence-locking write-side primitives. So perhaps Rust could make use of similar hooks.

Yet another approach is to place the data referenced by readers into a separate structure and to switch back and forth between a pair of instances of such structures. This has been attempted in the Linux kernel with unsatisfactory results. In fact, in those cases where it is feasible to use multiple instances of a separate structure, Linux-kernel RCU is usually a better choice.

One more approach would be to create something like a "tentatively readable" Rust ownership class that could directly handle sequence-locking read-side critical sections. For example, if a quantity computed from a read within a failed critical section were used subsequently, Rust could emit a warning, hopefully without much in the way of false positives. (We can dream, can't we?)

One might instead search the Linux-kernel source tree for uses of sequence locking and note that it is used only by a very few device drivers:

     16 drivers/dma-buf/
      1 drivers/firmware/efi/
      5 drivers/gpu/drm/
      2 drivers/gpu/drm/amd/amdgpu/
      2 drivers/gpu/drm/i915/gem/
     15 drivers/gpu/drm/i915/gt/
      5 drivers/hwmon/
     72 drivers/infiniband/hw/hfi1/
     11 drivers/md/
     15 drivers/net/ethernet/mellanox/mlx4/
     19 drivers/net/ethernet/mellanox/mlx5/core/lib/

These device drivers (or at least those portions of them that make use of sequence locking) could then be relegated to C code.

However, the use of sequence locking has been increasing, and it is likely that it will continue to increase, including within device drivers. Longer term, Rust's ownership model should therefore be extended to cover sequence locking in a natural and safe manner.

For More Information

For more on sequence locking, see the seqlock.rst documentation. Sequence locking is also covered in Sections 9.4 ("Sequence Locking") and 13.4 ("Sequence-Locking Specials") of perfbook.

History

October 6, 2021: Updated to add sectioning, Linux-kernel use cases, and updated KCSAN commentary.
October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add a note discussing possible interoperability between Rust-language and C-language sequence locking.

October 02, 2021 12:16 AM

October 01, 2021

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (OOTA)

Out-of-thin-air (OOTA) values are a troubling theoretical problem, but thus far do not actually occur in practice. In the words of the Bard: "It is a tale told by an idiot, full of sound and fury, signifying nothing."

But what is life without a little sound and fury? Besides, "mere theory" can loom large for those creating the tools required to analyze concurrent software. This post therefore gives a brief introduction to OOTA, and then provides some simple practical solutions, along with some difficult ones.

Here is the canonical OOTA example, involving two threads each having but one statement:

T1: x.store(y.load(mo_relaxed), mo_relaxed);
T2: y.store(x.load(mo_relaxed), mo_relaxed);

Believe it or not, according to the C/C++ memory model, even if both x and y are initially 0, after both threads complete it might be that x==y==42!

LKMM avoids OOTA by respecting dependencies, at least when they are headed by marked accesses and the dependent access is either itself marked or free of data races. For example:

T1: WRITE_ONCE(x, READ_ONCE(y));
T2: WRITE_ONCE(y, READ_ONCE(x));

If x and y are both initially zero, then they are both guaranteed to stay that way!

Great quantities of ink have been spilled on the OOTA problem, but these C++ working papers give a reasonable overview and cite a few of the more prominent papers:

  1. P0422R0: Out-of-Thin-Air Execution is Vacuous. This presents a fixed-point analysis that to the best of my knowledge is the first scheme capable of differentiating OOTA scenarios from straightforward reordering scenarios. Unfortunately, this analysis cannot reasonably be applied to automated tooling. This paper also includes a number of references to other work, which unfortunately either impose unnecessary overhead on weakly ordered systems on the one hand or are well outside of what compiler developers are willing to countenance on the other.
  2. P1916R0: There might not be an elegant OOTA fix. This diabolically clever paper shows how back-propagation of undefined behavior can further complicate the OOTA question.
  3. P2215R0: "Undefined behavior" and the concurrency memory model. This paper looks at ways of addressing the issues raised in P1916R0.
  4. P2055R0: A Relaxed Guide to memory_order_relaxed. This paper presents memory_order_relaxed use cases that are believed to avoid the OOTA problem.

As with address, data, and control dependencies, the trivial solution is for Rust to simply promote all READ_ONCE() operations to smp_load_acquire(), but presumably while leaving the C-language READ_ONCE() untouched. This approach increases overhead, but keeps the people building analysis tools happy, or at least happier. Unfortunately, this trivial solution incurs overhead on weakly ordered systems, overhead that is justified only in theory, never in practice.
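
As a concrete illustration (a sketch, not a proposal for specific primitives), this promotion would transform the earlier two-thread example into something like the following, with each load upgraded to an acquire load:

T1: WRITE_ONCE(x, smp_load_acquire(&y));
T2: WRITE_ONCE(y, smp_load_acquire(&x));

Intuitively, the acquire ordering prevents either store from being ordered before its thread's load, which breaks the cycle that an OOTA outcome would require.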

But if analysis tools are not an issue, then Rust could simply use volatile loads (perhaps even directly using an appropriately Rust-wrapped instance of the Linux kernel's READ_ONCE() primitive), just as the Linux kernel currently does (but keeping architecture-specific requirements firmly in mind). The problem is after all strictly theoretical.

History

October 12, 2021: Self-review. Note that the comments are specific to earlier versions of this blog post.
November 12, 2021: Add Hans Boehm's P2215R0 C++ working paper.

October 01, 2021 11:27 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (Address/Data)

An address dependency involves a load whose return value directly or indirectly determines the address of a later load or store, which results in the earlier load being ordered before the later load or store. A data dependency involves a load whose return value directly or indirectly determines the value stored by a later store, which results in the load being ordered before the store. These are used heavily by RCU. Although they are not quite as fragile as control dependencies, compilers still do not know about them. Therefore, care is still required, as can be seen in the rcu_dereference.rst Linux-kernel coding guidelines. As with control dependencies, address and data dependencies enjoy very low overheads, but unlike control dependencies, they are used heavily in the Linux kernel via rcu_dereference() and friends.
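
As a reminder of exactly what is being ordered, here is a minimal sketch of the address-dependency pattern that rcu_dereference() supports; the structure and the update-side details are invented for illustration:

struct foo {
	int a;
	struct foo __rcu *next;
};
struct foo __rcu *gp;	/* Updated elsewhere via rcu_assign_pointer(). */

int foo_read_a(void)
{
	struct foo *p;
	int ret = 0;

	rcu_read_lock();
	p = rcu_dereference(gp);	/* This load heads the address dependency. */
	if (p)
		ret = p->a;		/* Dependent load: ordered after the load of gp. */
	rcu_read_unlock();

	return ret;
}

A data dependency instead has the load's return value determine the value stored by a later store (for example, storing a value computed from p->a), rather than the address accessed. Rust's options include the following: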


  1. As with control dependencies, the trivial solution is to promote READ_ONCE() to smp_load_acquire(). Unlike with control dependencies, there are only a few such READ_ONCE() instances, and almost all of them are conveniently located in definitions of the rcu_dereference() family of functions. Because the rest of the kernel might not be happy with the increase in overhead due to such a promotion, it would likely be necessary to provide Rust-specific implementations of rcu_dereference() and friends.
  2. Again as with control dependencies, an even more trivial solution is to classify code containing address and data dependencies as core Linux-kernel code that is outside of Rust's scope. However, given that there are some thousands of instances of rcu_dereference() scattered across the Linux kernel, this solution might be a bit more constraining than many Rust advocates might hope for.
  3. Provide Rust wrappers for the rcu_dereference() family of primitives. This does incur additional function-call overhead, but on the other hand, if initial use of Rust is confined to performance-insensitive device drivers, this added overhead is unlikely to be problem.
  4. But if wrapper overhead nevertheless proves problematic, provide higher-level C-language functions that encapsulate the required address and data dependencies, and that do enough work that the overhead of the wrappering for Rust-language use is insignificant.
  5. And yet again as with control dependencies, the best approach from the Linux-kernel-in-Rust developer's viewpoint is for Rust to enforce the address/data-dependency code-style restrictions documented in the aforementioned rcu_dereference.rst. There is some reason to hope that this enforcement would be significantly easier than for control dependencies.
  6. Wait for compiler backends to learn about address and data dependencies. This might take some time, but there is ongoing work along these lines that is described below.

One might hope that C/C++'s memory_order_consume would correctly handle address and data dependencies, and in fact it does. Unfortunately, in all known compilers, it does so by promoting memory_order_consume to memory_order_acquire, which adds overhead just as surely as does smp_load_acquire(). There has been considerable work done over a period of some years towards remedying this situation, including these working papers:

  1. P0371R1: Temporarily discourage memory_order_consume (for some definition of "temporarily").
  2. P0098R1: Towards Implementation and Use of memory_order_consume, which reviews a number of potential remedies.
  3. P0190R4: Proposal for New memory_order_consume Definition, which selects a solution involving marking pointers carrying dependencies. Note that many Linux-kernel developers would likely demand a compiler command-line argument that caused the compiler to act as if all pointers had been so marked. Akshat Garg prototyped marked dependency-carrying pointers in gcc as part of a Google Summer of Code project.
  4. P0750R1: Consume, which proposes carrying dependencies in an extra bit associated with the pointer. This approach is much more attractive to some compiler developers, but many committee members did not love the resulting doubling of pointer sizes.
  5. P0735R1: Interaction of memory_order_consume with release sequences, which addresses an obscure C/C++ memory-model corner case.

There appears to be a recent uptick in interest in a solution to this problem, so there is some hope for progress. However, much more work is needed.

More information on address and data dependencies may be found in Section 15.3.2 ("Address- and Data-Dependency Difficulties") of perfbook.

History

October 12, 2021: Self-review.

October 01, 2021 09:04 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (Control)

At the assembly language level on many weakly ordered architectures, a conditional branch acts as a very weak, very cheap, but very useful memory-barrier instruction. It orders any load whose return value feeds into the condition codes before all stores that execute after the branch instruction completes, whether the branch is taken or not. ARMv8 also has a conditional-move instruction (CSEL) that provides similar ordering.

Because the ordering properties of the conditional branch involve dependencies from the load to the branch and from the branch to the store, and because the branch is a control-flow instruction, this ordering is said to be due to a control dependency.

Because compilers do not understand them, control dependencies are quite fragile, as documented by the many cautionary tales in the Linux kernel's memory-barriers.txt documentation (search for the "CONTROL DEPENDENCIES" heading). But they are very low cost, so they are used on a few critically important fastpaths in the Linux kernel.
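
As a minimal sketch of the sort of pattern documented there (variable names invented for illustration), the load from x is ordered before the store to y by the conditional branch:

int x, y;

void control_dependency_example(void)
{
	if (READ_ONCE(x))		/* This load heads the control dependency. */
		WRITE_ONCE(y, 1);	/* The conditional branch orders this store after the load. */
}

As memory-barriers.txt explains at length, seemingly innocuous changes (for example, storing the same value on both branches of the "if" statement) allow the compiler to break this ordering, which is exactly what makes control dependencies so fragile.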

Rust could deal with control dependencies in a number of ways:


  1. The trivial solution is to promote the loads heading the control dependencies to smp_load_acquire(). This works, but adds instruction overhead on some architectures and needlessly limits compiler optimizations on all architectures (but to be fair, ARMv8 does exactly this when built with link-time optimizations). Another difficulty is identifying (whether manually or automatically) exactly which READ_ONCE() calls need to be promoted.
  2. An even more trivial solution is to classify code containing control dependencies as core Linux-kernel code that is outside of Rust's scope. Because there are very few uses of control dependencies in the Linux kernel, Rust would not lose much by taking this approach. In addition, there is the possibility of creating higher-level C-language primitives containing the needed control dependencies which are then wrappered for Rust-language use.
  3. The best approach from the Linux-kernel-in-Rust developer's viewpoint is for Rust to enforce the code style restrictions documented in memory-barriers.txt. However, there is some chance that this approach might prove to be non-trivial.
  4. Wait for compiler backends to learn about control dependencies. This might be a bit of a wait, especially given the difficulty even defining control dependencies within the current nomenclature of the C/C++ standards.

More information on control dependencies may be found in Section 15.3.3 ("Control-Dependency Calamities") of perfbook.

History

October 12, 2021: Self-review.

October 01, 2021 08:24 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Atomics and Barriers and Locks, Oh My!

One way to reduce the number of occurrences of unsafe in Rust code in Linux is to push the unsafety down into atomic operations, memory barriers, and locking primitives, which are the topic of this post. But first, here are some materials describing LKMM:


  1. P0124R7: Linux-Kernel Memory Model: A C++ standards-committee working paper comparing the C/C++ memory model to LKMM.
  2. Linux Weekly News series on LKMM (Part 1 and Part 2).
  3. The infamous ASPLOS'18 paper entitled Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel (non-paywalled), with a title-based tip of the hat to the irrepressible Mel Gorman.
  4. Chapter 15 of perfbook ("Advanced Synchronization: Memory Ordering").
  5. The Linux kernel's tools/memory-model directory, featuring an executable version of LKMM.

For all of these references, I give a big "Thank You!!!" to my co-authors.

LKMM is not the most complex memory model out there, but neither is it the simplest. In addition, it is in some ways more strict than the C/C++ memory models, which means that strict adherence to coding guidelines is required in order to prevent compiler optimizations from breaking Linux-kernel code. Many of these optimizations are not localized, but are instead scattered hither and yon throughout the compilers, including throughout the compiler backends. The optimizations in the backends are a special challenge to Rust, which seems to take the approach of layering safety on top of (or perhaps within) the compiler frontend. Later posts in this series will look at several pragmatic options available to Rust Linux-kernel code.

There is one piece of good news: Compilers are forbidden from introducing data races, at least into code that is free of undefined behavior.

With all of that out of the way, let's look at Rust's options for dealing with Linux-kernel atomics and barriers and locks.

The first approach is to carefully read the P0124R7: Linux-Kernel Memory Model working paper and even more carefully follow its advice in selecting C/C++ primitives that best match Linux-kernel atomics, barriers, and locks. This approach works well for data whose definition and use is confined to Rust code, and with sufficient care and ongoing attention can also work for atomic operations and memory barriers involving data shared with C code. However, expecting Rust locking primitives to interoperate with Linux-kernel locking primitives might not be a winning strategy. It seems wise to make direct use of the existing Linux-kernel locking primitives, keeping in mind that this means properly wrappering them in order to make Rust ownership work properly (a minimal sketch of such wrappering appears after the list below). Those who doubt the wisdom of wrappering the C-language Linux-kernel locking primitives should consider the following:

  1. Linux-kernel locks are complex and highly optimized. Keeping two implementations is an excellent way to inject profound bugs into the Linux kernel.
  2. Linux-kernel locks are deeply entwined with the lockdep lock dependency checker. The data structures implementing each lock class would need to be shared between C and Rust code, which is another excellent way to inject bugs.
  3. On some architectures, Linux-kernel locks must interact with memory-mapped I/O (MMIO) accesses. Any Rust-language implementation of Linux-kernel locks must therefore be architecture-dependent and must know quite a bit about Linux-kernel MMIO.
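
For example, the Rust side might invoke thin C-language shims along the following lines, so that lockdep, architecture-specific code, and any MMIO interactions all remain within the one existing implementation. This is only a sketch, and the helper names are hypothetical:

void rust_helper_spin_lock(spinlock_t *lock)
{
	spin_lock(lock);	/* Existing primitive: lockdep and architecture details included. */
}

void rust_helper_spin_unlock(spinlock_t *lock)
{
	spin_unlock(lock);
}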

As described in later sections, it might be useful to promote READ_ONCE() to smp_load_acquire() instead of implementing it as a volatile load. It might also be useful to promote WRITE_ONCE() to smp_store_release() instead of implementing it as a volatile store, depending on what sort of data-race analysis Rust provides for unsafe code. There is some C/C++ work in flight towards providing better definitions for volatile operations, but it is still early days for this work.
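
A minimal sketch of what such promotion might look like, using hypothetical Rust-facing wrapper names and leaving the C-language primitives untouched:

#define rust_read_once(x)	smp_load_acquire(&(x))		/* Hypothetical: volatile load promoted to acquire. */
#define rust_write_once(x, v)	smp_store_release(&(x), (v))	/* Hypothetical: volatile store promoted to release. */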

If READ_ONCE() and WRITE_ONCE() are instead to be implemented as volatile operations in Rust, please take care to check the individual architectures that are affected. DEC Alpha requires a full memory-barrier instruction at the end of READ_ONCE(), Itanium requires promotion of volatile loads to acquire loads (but this is carried out by the compiler), and ARMv8 requires READ_ONCE() to be promoted to acquire (but only in CONFIG_LTO=y builds).

Device drivers make heavy use of volatile accesses and memory barriers for MMIO accesses, and Linux-kernel device drivers are no exception. As noted earlier, some architectures require that these accesses interact with locking primitives. Furthermore, there are many device-specific special cases surrounding device control in general and MMIO in particular. Therefore, Rust-language device drivers should access the existing Linux-kernel C-language primitives rather than creating their own, especially to start with. There might well be exceptions to this rule, for example, Rust might be applied to a device driver that is only used by architectures that do not require interaction with locking primitives. But if you write a driver containing Rust-language MMIO primitives, please carefully and prominently document the resulting architecture restrictions.

This suggests another approach, namely to avoid implementing any of these primitives in Rust, and instead to make direct use of the Linux-kernel implementations, as suggested earlier for locking and MMIO primitives. And again, this requires wrappering them for use by Rust code. However, such wrappering introduces another level of function call, potentially for tiny functions. Although it is expected that LTO will successfully inline tiny functions, not all of the world is yet ready for LTO. In the meantime, where feasible, developers should avoid invoking tiny C functions from Rust-language fastpaths.

This being the real world, we should expect that the Rust/C determination will need to be made on a case-by-case basis, with many devils in the details.

History

October 12, 2021: Self-review changes.
October 13, 2021: Add explicit justification for wrappering the Linux kernel's C-language locks and add a few observations about MMIO accesses

October 01, 2021 07:02 PM

Paul E. Mc Kenney: Rust Concurrency Philosophy: A Historical Perspective

At first glance, Rust's concurrency philosophy resembles that of Sequent's DYNIX and DYNIX/ptx in the 1980s and early 1990s: "Lock data, not code" (see Jack Inman's classic USENIX'85 paper "Implementing Loosely Coupled Functions on Tightly Coupled Engines", sadly invisible to search engines). Of course, Sequent lacked Rust's automatic checking, and Sequent's software engineers made much less disciplined use of ownership than Rust fans recommend. Nevertheless, this resemblance has prompted comparisons of Rust's concurrency philosophy with that of DYNIX and DYNIX/ptx.

Interestingly enough, DYNIX and early versions of DYNIX/ptx used compile-time-allocated arrays for almost all of its data structures. You want your kernel to support up to N tasks? Very well, build your kernel to have its array of N task structures. This worked surprisingly well, perhaps because the important concurrent applications of that time had very predictable resource requirements, including numbers of tasks. Nevertheless, as you might expect, this did become quite the configuration nightmare. So why were arrays used in the first place?

To the best of my knowledge, the earliest published complete articulation of the reason appeared in Gamsa et al.'s landmark paper  "Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System". The key point is that you cannot protect a dynamically allocated object with a lock located within that object. The DYNIX arrays avoided deallocation (or, alternatively, provided a straightforward implementation of type-safe memory), thus allowing these objects to be protected with internal locks. Avoiding the need for global locks or reference counters was an important key to the performance and scalability prized by Sequent's customers.
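
A minimal sketch of the problem (the type and its user are invented for illustration): nothing prevents another CPU from freeing the object between the time a reader obtains its pointer and the time it acquires the lock embedded within that object.

struct obj {
	spinlock_t lock;	/* Protects ->data, but lives inside the object it protects. */
	int data;
};

void buggy_update(struct obj *p)
{
	spin_lock(&p->lock);	/* The object containing this lock may already be freed... */
	p->data++;		/* ...making this a use-after-free. */
	spin_unlock(&p->lock);
}

Compile-time-allocated arrays, type-safe memory, and (later) RCU all sidestep this problem because the memory containing the lock either is never freed or cannot be freed while a reader might still be acquiring it.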

This strategy worked less well when Sequent added a distributed lock manager because the required number of locks was not predictable, nor was there a useful upper bound. This problem was solved in part by the addition of RCU, which provided a high-performance and scalable means of resolving races between acquiring a given object's lock and deletion (and subsequent freeing) of that same object. Given that DEC Alpha famously had difficulty with RCU, it is only reasonable to ask how Rust will do with it. Or must the concurrency designs of those portions of the Linux kernel that are to be written in Rust be "fitted" to a Rust-language ProcRustean bed [1]? Those who prize Rust's fearless-concurrency goal above all else might reasonably argue that this ProcRustean bed is in fact a most excellent thing. However, some Linux-kernel maintainers (including this one) might in their turn reasonably argue that within the context of some portions of the Linux kernel, a proper level of fear is a very healthy thing. As is the ability to do one's job in a reasonably straightforward manner!

One way to avoid this ProcRustean bed is to use the Rust unsafe facility, and in fact "unsafe" has been the answer to a disturbingly large number of my questions about Rust [2]. However, use of this facility introduces the possibility of data races, which in turn raises the question of Rust's memory model. Within the Linux kernel, the answer to this question is of course LKMM, or perhaps some reasonable subset of LKMM.

However, in my personal experience, I have most frequently seen Rust being used to rewrite scripts that became performance problems upon being more widely deployed than expected. In some cases, these rewrites greatly improved user experience as well as performance. This means that Rust is heavily used outside of the Linux kernel, which in turn means that LKMM might not be the right answer for Rust in general, though some in the Rust community have come out strongly in favor of extending the ProcRustean bed. But this blog series is focused on Rust for the Linux kernel, so the question of the memory model for Rust in general is out of scope. Again, for Rust in the Linux kernel, some subset of LKMM is clearly the correct memory model.

Many of the following posts in this series cover ways that Rust might work with specific aspects of LKMM, including some wild speculation about how Rust's ownership model might be generalized in a manner similar to the way that the Linux kernel's lockdep checking has been generalized for cross-released locks and for RCU. Readers wishing to learn more about non-ProcRustean concurrency designs are invited to peruse "Is Parallel Programming Hard, And, If So, What Can You Do About It?", hereinafter called "perfbook". Specific chapters and sections of the Second Edition of this book will be cited as appropriate by later posts in this series.

Endnotes

[1]  Making this 1990s-style concurrency scale usually involves hashed arrays of locks. These are often deadlock-prone, but there are heavily used techniques that (mostly) avoid the deadlocks. See Section 7.1.1.6 ("Acquire Needed Locks First") in perfbook for one such technique. However, hashed arrays of locks are prone to scalability problems, especially on multi-socket systems, due to poor locality of reference. See for example Section 10.2.3 ("Hash-Table Performance") for performance results on hash tables, also in perfbook. As a result, most attempts to apply hashed arrays of locks to the Linux kernel resulted in RCU being used instead. The performance and scalability benefits of RCU (and hazard pointers) are shown in Section 10.3 ("Read-Mostly Data Structures"), again in perfbook.
 
[2]  But please note that Rust's unsafe code has only limited undefined-behavior unsafe superpowers:

  1. Dereferencing a raw pointer (and it is the programmer's responsibility to avoid destructive wild-pointer dereferences).
  2. Calling an unsafe function or method.
  3. Accessing or modifying a mutable static variable (and it is the programmer's responsibility to avoid destructive data races).
  4. Implementing an unsafe trait.
  5. Accessing fields of a union (and it is the programmer's responsibility to avoid accesses that invoke undefined behavior, or, alternatively, understand how the compiler at hand reacts to any undefined behavior that might be invoked).

I do find the Rust community's sharp focus on undefined-behavior-induced bugs over other types of bugs to be rather surprising. Perhaps this is because I have not had so much trouble with undefined-behavior-induced bugs. Conspiracy theorists might imagine an unholy alliance between UB-happy developers of compiler optimizations on the one hand and old-school parallel programmers wishing to turn the clock back to the 1990s on the other hand. I am happy to let such theorists imagine such things. ;-)

Besides, more recent discussions have focused on memory safety rather than the full gamut of undefined behavior.


History

October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add note on memory safety specifically rather than undefined behavior in general.

October 01, 2021 06:08 PM

Paul E. Mc Kenney: So You Want to Rust the Linux Kernel?

There has been much discussion of using the Rust language in the Linux kernel (for example, here, here, and here), including at the Kangrejos Rust for Linux Workshop (here, here, and here). The 2021 Linux Plumbers Conference had a number of sessions on this topic, as did the Maintainers Summit. At least two of these sessions mentioned the question of how Rust is to handle the Linux-kernel memory model (LKMM), and I volunteered to write this blog series on this topic.

This series focuses mostly on use cases and opportunities, rather than on any non-trivial solutions. Please note that I am not in any way attempting to dictate or limit Rust's level of ambition. I am instead noting the memory-model consequences of a few potential levels of ambition, ranging from "portions of a few drivers" through "a few drivers" and "some core code", up to and including "the entire kernel". Greater levels of ambition will require greater willingness to accommodate a wider variety of LKMM requirements.

One could instead argue that portions or even all of the Linux kernel should be hammered into the Rust ownership model. On the other hand, might the rumored sudden merge of the ksmbd driver (https://lwn.net/Articles/871098/) have been due to the implicit threat of its being rewritten in Rust? [1] Nevertheless, in cases where Rust is shown to offer particularly desirable advantages, it is quite possible that Rust and some parts of the Linux kernel might meet somewhere in the middle.

These blog posts will therefore present approaches ranging upwards from trivial workarounds. But be warned that some of the high-quality approaches require profound reworking of compiler backends that have thus far failed to spark joy in the hearts of compiler writers. In addition, Rust enjoys considerable use outside of the Linux kernel, for but one example that I have personally observed, as something into which to rewrite inefficient Python scripts. (A megawatt here, a megawatt there, and pretty soon you are talking about real power consumption!) Therefore, there might well be sharp limits beyond which the core Rust developers are unwilling to go.

The remaining posts in this series (along with their modification dates) are as follows:


  1. Rust Concurrency Philosophy: A Historical Perspective (October 13, 2021)
  2. Atomics and Barriers and Locks, Oh My! (October 13, 2021)
  3. Compiler Writers Hate Dependencies (Control) (October 12, 2021)
  4. Compiler Writers Hate Dependencies (Address/Data) (October 12, 2021)
  5. Compiler Writers Hate Dependencies (OOTA) (November 12, 2021)
  6. Can Rust Code Own Sequence Locks? (October 13, 2021)
  7. Can Rust Code Own RCU? (October 18, 2021)
  8. How Much of the Kernel Can Rust Own? (October 12, 2021)
  9. Will Your Rust Code Survive the Attack of the Zombie Pointers? (October 12, 2021)
  10. Can the Kernel Concurrency Sanitizer Own Rust Code? (October 28, 2021)
  11. Summary and Conclusions (October 13, 2021)
  12. TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel (October 21, 2021)
  13. Bonus Post: What Memory Model Should the Rust Language Use? (November 4, 2021)

Please note that this blog series is not a Rust tutorial. Those wanting to learn how to actually program in Rust might start here, here, here, or of course here.

Endnotes


[1]  Some in the Linux-kernel community might be happy with either outcome: (1) The threat of conversion to Rust caused people to push more code into mainline and (2) Out-of-tree code was converted to Rust by Rust advocates and then pushed into mainline. The latter case might need special care for longer-term maintenance of the resulting Rust code, but perhaps the original authors might be persuaded to declare victory, learn Rust, and maintain the code. Who knows? ;-)


History

October 8, 2021: Fix s/LInux/Linux/ typo noted by Miguel Ojeda
October 12, 2021: Self-review, including making it clear that Rust might have use cases other than rewriting inefficient scripts.
October 13, 2021: Add a link to the recommendations post.
October 22, 2021: This blog series is not a Rust tutorial.
November 3, 2021: Add post on memory model for Rust in general.

The October 12 update affected the whole series, for example, removing the "under construction" markings. Summary of significant updates to other posts:

October 01, 2021 12:39 AM

September 30, 2021

James Bottomley: Linux Plumbers Conference Matrix and BBB integration

The recently completed Linux Plumbers Conference (LPC) 2021 used the Big Blue Button (BBB) project again as its audio/video online conferencing platform and Matrix for IM and chat. Why we chose BBB has been discussed previously. However this year we replaced RocketChat with Matrix to achieve federation, allowing non-registered conference attendees to join the chat. Also, based on feedback from our attendees, we endeavored to replace the BBB chat window with a Matrix one so anyone could see and participate in one contemporaneous chat stream within BBB and beyond. This enabled chat to be available before, during and after each session.

One thing that emerged from our initial disaster with Matrix on the first day is that we failed to learn from the experiences of other open source conferences (i.e. FOSDEM, which used Matrix and ran into the same problems). So, an object of this post is to document for posterity what we did and how to repeat it.

Integrating Matrix Chat into BBB

Most of this integration was done by Guy Lunardi.

It turns out that Chat is fairly deeply embedded into BBB, so replacing the existing chat module is hard. Fortunately, BBB also contains an embedded etherpad which is simply produced via an iFrame redirection. So what we did is to disable the BBB chat panel and replace it with a new iFrame based component that opened an embedded Matrix chat client. The client we chose was riot-embedded, which is a relatively recent project but seemed to work reasonably well. The final problem was to pass through user credentials. Up until three days before the conference, we had been happy with the embedded Matrix client simply creating a one-time numbered guest account every time it was opened, but we worried about this being a security risk and so implemented pass through login credentials at the last minute (life’s no fun unless you live dangerously).

Our custom front end for BBB (lpcfe) was created last year by Jon Corbet. It uses a fairly simple email/registration confirmation code for username/password via LDAP. The lpcfe front end Jon created is here git://git.lwn.net/lpcfe.git; it manages the whole of the conference log in process and presents the current and future sessions (with join buttons) according to the timezone of the browser viewing it.

The credentials are passed through directly using extra parameters to BBB (see commit fc3976e “Pass email and regcode through to BBB”). We eventually passed these through using a GET request. Obviously if we were using a secret password, this would be a problem, but since the password was a registration code handed out by a third party, it’s acceptable. I imagine that if anyone wishes to take this work forward, adding native Matrix device/session support in riot-embedded would be better.

The main change to get this working in riot-embedded is here, and the supporting patch to BBB is here.

Note that the Matrix room ID used by the client was added as an extra parameter to the flat text file that drives the conference track layout of lpcfe. All Matrix rooms were created as public (and published) so anyone going to our :lpc.events matrix domain could see and join them.

Setting up Matrix for the Conference

We used the matrix-synapse server and did a standard python venv pip install on Ubuntu of the latest tag. We created around 30+ public rooms: one for each Microconference and track of the conference and some admin and hallway rooms. We used LDAP to feed the authentication portion of lpcfe/Matrix, but we had a problem using email addresses since the standard matrix user name cannot have an ‘@’ symbol in it. Eventually we opted to transform everyone’s email to a matrix compatible form simply by replacing the ‘@’ with a ‘.’, which is why everyone in our conference appeared with ridiculously long matrix user names like @jejb.ibm.com:lpc.events

This ‘@’ to ‘.’ transformation was a huge source of problems due to the unwillingness of engineers to read instructions, so if we do this over again, we’ll do the transformation silently in the login javascript of our Matrix web client. (we did this in riot-embedded but ran out of time to do it in Element web as well).

Because we used LDAP, the actual matrix account for each user was created the first time they logged into our server, so we chose at this point to use auto-join to add everyone to the 30+ LPC Matrix rooms we’d already created. This turned out to be a huge problem.

Testing our Matrix and BBB integration

We tried to organize a “Town Hall” event where we invited lots of people to test out the infrastructure we’d be using for the conference. Because we wanted this to be open, we couldn’t use the pre-registration/LDAP authentication infrastructure so Jon quickly implemented a guest mode (and we didn’t auto join anyone to any Matrix rooms other than the townhall chat).

In the end we got about 220 users to test during which time the Matrix and BBB infrastructure behaved quite well. Based on this test, we chose a 2 vCPU Linode VM for our Matrix server.

What happened on the Day

Come the Monday of the conference, the first problem we ran into was procrastination: the conference registered about 1,000 attendees, of whom about 500 tried to log on about 5 minutes prior to the first session. Since accounts were created and rooms joined upon the first login, this is clearly a huge thundering herd problem of our own making … oops. The Matrix server itself shot up to 100% CPU on the python synapse process and simply stayed there, adding new users at a rate of about one every 30 seconds. All the chat tabs froze because logins were taking ages as well. The first thing we did was to scale the server up to a 16 CPU bare metal system, but that didn’t help because synapse is single threaded … all we got was the matrix synapse python process running at 100% on one of the CPUs, still taking 30 seconds per first log in.

Fixing the First Day problems

The first thing we realized was that we had to multi-thread the synapse server. This is well known, but the issue is also quite well hidden deep in the Matrix documents. It also happens that the Matrix documents are slightly incomplete. The first scaling attempt we tried, simply adding 16 generic worker apps to scale across all our physical CPUs, failed because the Matrix server stopped federating and then the database crashed with “FATAL: remaining connection slots are reserved for non-replication superuser connections”.

Fixing the connection problem (alter system set max_connections = 1000;) triggered a shared memory too small issue which was eventually fixed by bumping the shared buffer segment to 8GB (alter system set shared_buffers=1024000;). I suspect these parameters were way too large, but the Linode we were on had 32GB of main memory, so fine tuning in this emergency didn’t seem a good use of time.

Fixing the worker problem was way more complex. The way Matrix works, you have to use a haproxy to redirect incoming connections to individual workers and you have to ensure that the same worker always services the same transaction (which you achieve by hashing on IP address). We got a lot of advice from FOSDEM on this aspect, but in the end, instead of using an external haproxy, we went for the built in forward proxy load balancing in nginx. The federation problem seems to be that Matrix simply doesn’t work without a federation sender. In the end, we created 15 generic workers and one each of media server, frontend server and federation sender.

Our configuration files are

once you have all the units enabled in systemd, you can then simply do systemctl start/stop matrix-synapse.target

Finally, to fix the thundering herd problem (for people who hadn’t already logged in), we ran through the entire spreadsheet of email/confirmation numbers doing an automatic login using the user management API on the server itself. At this point we had about half the accounts auto created, so this script created the rest.

emaillist=lpc2021-all-attendees.txt
IFS='   '
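
# For each attendee in the tab-separated registration list, derive the BBB
# login from the email address, transform it into the corresponding Matrix
# login ('@' replaced by '.'), and log that account in once via the client
# API so that account creation and room auto-join happen ahead of time.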

while read first last confirmation email; do
    bbblogin=${email/+*@/@}
    matrixlogin=${bbblogin/@/.}
    curl -XPOST -d '{"type":"m.login.password", "user":"'${matrixlogin}'", "password":"'${confirmation}'"}' "http://localhost:8008/_matrix/client/r0/login"
    sleep 1
done < ${emaillist}

The lpc2021-all-attendees.txt is a tab separated text file used to drive the mass mailings to plumbers attendees, but we adapted it to log everyone in to the matrix server.

Conclusion

With the above modifications, the matrix server on a Dedicated 32GB (16 cores) Linode ran smoothly for the rest of the conference. The peak load got to 17 and the peak total CPU usage never got above 70%. Finally, the peak memory usage was around 16GB including cache (so the server was a bit over provisioned).

In the end, 878 of the 944 registered attendees logged into our BBB servers at one time or another and we got a further 100 external matrix users (who may or may not also have had a conference account).

September 30, 2021 04:24 PM

September 26, 2021

Brendan Gregg: The Speed of Time

How long does it take to read the time? How would you _time_ time? These strange questions came to the fore back in 2014 when Netflix was switching services from CentOS Linux to Ubuntu, and I helped debug several weird performance issues including one I'll describe here. While you're unlikely to run into this specific issue anymore, what is interesting is this type of issue and the simple method of debugging it: a pragmatic mix of observability and experimentation tools. I've shared many posts about superpower observability tools, but often humble hacking is just as effective.

A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. A quick check of basic performance statistics showed over 30% higher CPU consumption. What on Earth is Ubuntu doing that results in 30% higher CPU time!?

## 1. CLI tools

The Cassandra systems were EC2 virtual machine (Xen) instances. I logged into one and went through some basic CLI tools to get started (my [60s checklist]). Was there some other program consuming CPU, like a misbehaving Ubuntu service that wasn't in CentOS? top(1) showed that only the Cassandra database was consuming CPU. What about short-lived processes, like a service restarting in a loop? These can be invisible to top(1). My execsnoop(8) tool (back then my [Ftrace version]) showed nothing. It seemed that the extra CPU time really was in Cassandra, but how?

## 2. CPU profile

Understanding CPU time should be easy by comparing [CPU flame graphs]. Since instances of both CentOS and Ubuntu were running in parallel, I could collect flame graphs at the same time (same time-of-day traffic mix) and compare them side by side. The CentOS flame graph:

The Ubuntu flame graph:
Darn, they didn't work. There's no Java stack—there should be a tower of green Java methods—instead there's only a single green frame or two. This is how Java flame graphs looked at the time. Later that year I prototyped the c2 frame pointer fix that became -XX:+PreserveFramePointer, which fixes Java stacks in these profiles.

Even with the broken Java stacks, I noticed a big difference: On Ubuntu, there's a massive amount of CPU time in a libjvm call: os::javaTimeMillis(). 30.14% in the middle of the flame graph. Searching shows it was elsewhere as well for a total of 32.1%. This server is spending about a third of its CPU cycles just checking the time! This was a weird problem to think about: Time itself had now become a resource and target of performance analysis.

The broken Java stacks turned out to be beneficial: They helped group together the os::javaTimeMillis() calls which otherwise might have been scattered on top of different Java code paths, appearing as thin stacks everywhere. If that were the case, I'd have zoomed in to see what they were, then zoomed out and searched for them for the cumulative percent; or flipped the merge order so it's an icicle graph, merging leaf to root. But in this case I didn't have to do anything as it was mostly merged by accident. (This is one of the motivating reasons for switching to a d3 version of flame graphs, as I want the interactivity of d3 to do things like collapse all the Java frames, all the user-mode frames, etc., to expose different groupings like this.)

## 3. Studying the flame graph

os::javaTimeMillis fetches the current time. Browsing the flame graph shows it is calling the gettimeofday(2) syscall which enters the tracesys() and syscall_trace_enter/exit() kernel functions. This gave me two theories: A) Some syscall tracing is enabled in Ubuntu (auditing? apparmor?). B) Fetching time was somehow slower on Ubuntu, which could be a library change or a kernel/clocksource change.

Theory (A) is most likely based on the frame widths in the flame graph. But I'm not completely sure. As a Xen guest, this profile was gathered using perf(1) and the kernel's software cpu-clock soft interrupts, not the hardware NMI. Without NMI, some kernel code paths (interrupts disabled) can't be profiled. Also, since it's a Xen guest, hypervisor time can never be profiled. These two factors mean that there can be missing kernel and hypervisor time in the flame graph, so the true breakdown of time in os::javaTimeMillis may be a little different.

Note that Ubuntu also has a frame to show entry into vDSO (virtual dynamic shared object). This is a user-mode syscall accelerator, and gettimeofday(2) is a classic use case that is cited in the vdso(7) man page. At the time, the Xen pvclock source didn't support vDSO, so you can see the syscall code above the vdso frame. It's the same on CentOS, although it doesn't include a vdso frame in the flame graph (I'd guess due to a perf(1) difference alone).

## 4. Colleagues/Internet

I love using [Linux performance tools]. But I also love solving issues quickly, and sometimes that means just asking colleagues or searching the Internet. I'm including this step as a reminder for anyone following this kind of analysis. Others I asked hadn't hit this issue, and the Internet at the time had nothing using the search terms os::javaTimeMillis, clocksource, tracesys(), Ubuntu, EC2, Xen, etc. (That changed by the end of the year.)

## 5. Experimentation

To further analyze this with observability tools, I could:
  1. Fix the Java stacks to see if there's a difference in how time is used on Ubuntu. Maybe Java is calling it more often for some reason.
  2. Trace the gettimeofday() and related syscall paths, to see if there's some difference there: E.g., errors.

But as I summarized in my [What is Observability] post, the term observability can be a reminder not to get stuck on that one type of analysis. Here are some experimental approaches I could also explore:
  C. Disable tracesys/syscall_trace.
  D. Microbenchmark os::javaTimeMillis() on both systems.
  E. Try changing the kernel clocksource.

As (C) looked like a kernel rebuild, I started with (D) and (E).

## 6. Measuring the speed of time

Is there already a microbenchmark for os::javaTimeMillis()? This would help confirm that these calls really were slower on Ubuntu. I couldn't find such a microbenchmark so I wrote something simple. I'm not going to try to make before and after time calls to time the duration in time (which I guess would work if you factored in the extra time in the timing calls). Instead, I'm just going to call time millions of times in a loop and time how long it takes (sorry, that's too many different usages of the word "time" in one paragraph):
$ cat TimeBench.java
public class TimeBench {
    public static void main(String[] args) {
        for (int i = 0; i < 100 * 1000 * 1000; i++) {
            long t0 = System.currentTimeMillis();
            if (t0 == 87362) {
                System.out.println("Bingo");
            }
        }
    }
}
This does 100 million calls of currentTimeMillis(). I then executed it via the shell time(1) command to give an overall runtime for those 100 million calls. (There's also a test and println() in the loop to, hopefully, convince the compiler not to optimize-out an otherwise empty loop. This will slow this test a little.) Trying it out:
centos$ time java TimeBench
real	0m12.989s
user	0m3.368s
sys	0m18.561s

ubuntu# time java TimeBench
real	1m8.300s
user	0m38.337s
sys	0m29.875s
How long is each time call? Assuming the loop is dominated by the time call, it works out to be about 0.13 us on CentOS and 0.68 us on Ubuntu. Ubuntu is 5x slower. As I'm interested in the relative comparison I can just compare the total runtimes (the "real" time) for the same result. I also rewrote this in C and called gettimeofday(2) directly:
$ cat gettimeofdaybench.c
#include <sys/time.h>

int
main(int argc, char *argv[])
{
	int i, ret;
	struct timeval tv;
	struct timezone tz;

	for (i = 0; i < 100 * 1000 * 1000; i++) {
		ret = gettimeofday(&tv, &tz);
	}

	return (0);
}
I compiled this with -O0 to avoid dropping the loop. Running this on the two systems showed similar results. I love short benchmarks like this as I can disassemble the resulting binary and ensure that the compiled instructions match my expectations, and the compiler hasn't messed with it.

## 7. clocksource Experimentation

My second experiment was to change the clocksource. Checking those available:
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
xen tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
xen
Ok, so it's defaulted to xen, which we saw in the flame graph (the tower ends with pvclock_clocksource_read()). Let's try tsc, which should be the fastest:
# echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$ time java TimeBench
real    0m3.370s
user    0m3.353s
sys     0m0.026s
The change is immediate, and my Java microbenchmark is now running over 20x faster than before! (And nearly 4x faster than on CentOS.) Now that it's reaching 33 ns, the loop instructions are likely inflating this result. If I wanted more accuracy, I'd partially [unroll the loop] so that the loop instructions become negligible.

## 8. Workaround

The time stamp counter (TSC) clocksource is fast as it retrieves time using just an RDTSC instruction, and with vDSO it can do this without the syscall. TSC traditionally was not the default because of concerns about time drift. Software-based clocksources could fix those issues and provide accurate monotonically-increasing time. I happened to be speaking at a technical conference while still debugging this, and mentioned what I was working on to a processor engineer. He said that tsc had been stable for years, and any advice about avoiding it was old. I asked if he knew of a public reference saying so, but he didn't. That chance encounter, coupled with Netflix's fault-tolerant cloud, gave me enough confidence to suggest trying tsc in production as a workaround for the issue. The change was obvious in the production graphs, showing a drop in write latencies:
Once tested more broadly, it showed the write latencies dropped by 43%, delivering slightly better performance than on CentOS. The CPU flame graph for Ubuntu now looked like:
os::javaTimeMillis() was now 1.6% in total. Note that it now enters "[[vdso]]" and nothing more: No kernel calls above it.

## 9. Aftermath

I provided details to AWS and Canonical, and then moved on to the other performance issues as part of the migration. A colleague, Mike Huang, also hit this for a different service at Netflix and enabled tsc. We ended up setting it in the BaseAMI for all cloud services. Later that year (2014), Anthony Liguori from AWS gave a [re:Invent talk] recommending users switch the clocksource to tsc to improve performance. I also shared setting the clocksource in my talks and in my 2015 [Linux tunables] post. Over the years, more and more articles have been published about clocksource in virtual machines, and it's now a well-known issue. Amazon even provides an official [recommendation] (2021):
"For EC2 instances launched on the AWS Xen Hypervisor, it's a best practice to use the tsc clock source. Other EC2 instance types, such as C5 or M5, use the AWS Nitro Hypervisor. The recommended clock source for the AWS Nitro Hypervisor is kvm-clock."
As this indicates, things have changed with [Nitro] where clocksources are much faster (thanks!). In 2019 others and I tested kvm-clock and found it was only about 20% slower than tsc. That's much better than the xen clocksource, but still slow enough to resist switching over absent a reason (such as a reemergence of tsc clock drift issues). I'm not sure if Intel ever published something to clarify tsc stability on newer processors. If you know they did, please drop a comment. The JMH benchmark suite can also now test System.currentTimeMillis(), so it's no longer necessary to roll your own (unless you want to disassemble it, in which case it's easier to have something short and simple).

As for tracesys: I investigated the overhead for other syscalls and found it to be negligible, and before I returned to work on it further the kernel code paths changed and it was no longer present in the stacks. Did that Ubuntu release have a misconfiguration of auditing that was later fixed? I like to get to the rock bottom of issues, so it was a bit unsatisfying that the problem went away before I did. Even if I did figure it out, we'd still have preferred to go with tsc instead of the xen clocksource for the 4x improvement.

## 10. Summary

Reading time itself can become a bottleneck for some clocksources. This was much worse many years ago on Xen virtual machine guests. For Linux I've been recommending the faster tsc clocksource for years, although I'm not a processor vendor so I can't make assurances about tsc issues of clock drift. At least AWS have now included it in their recommendations. Also, while I often post about superpower tracing tools, sometimes some humble hacking is best. In this case it was a couple of ad hoc microbenchmarks, only several lines of code each. Any time you're investigating performance of some small discrete system component, consider finding or rolling your own microbenchmark to get more information on it experimentally. You have two hands: observation and experimentation.

[both hands]: /blog/2021-05-23/what-is-observability.html
[What is Observability]: /blog/2021-05-23/what-is-observability.html
[Ftrace version]: https://github.com/brendangregg/perf-tools
[CPU flame graphs]: /FlameGraphs/cpuflamegraphs.html
[Linux performance tools]: /linuxperf.html
[Linux tunables]: /blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
[60s checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf
[re:Invent talk]: https://youtu.be/ujGx0tiI1L4?t=2160
[recommendation]: https://aws.amazon.com/premiumsupport/knowledge-center/manage-ec2-linux-clock-source/
[Nitro]: /blog/2017-11-29/aws-ec2-virtualization-2017.html
[unroll the loop]: /blog/2014-04-26/the-noploop-cpu-benchmark.html
[unused]: https://github.com/torvalds/linux/commit/96b6352c12711d5c0bb7157f49c92580248e8146#diff-2a6a11447d66a7c2bfeb5a9a8ee78c3800bb4101fa2a870dd79221dddcec0263
[unused2]: https://linux-mips.linux-mips.narkive.com/3hxhAJiq/patch-v3-0-8-two-phase-seccomp-and-x86-tracing-changes

September 26, 2021 12:00 AM

September 14, 2021

Pete Zaitcev: Scalability of a varying degree

Seen at official site of Qumulo:

Scale

Platforms must be able to serve petabytes of data, billions of files, millions of operations, and thousands of users.

Thousands of users...? Isn't it a little too low? Typical Swift clusters in Telcos have tens of millions of users, of which tens or hundreds of thousands are active simultaneously.

Google's Chubby paper has a little section on the scalability problem with talking to a cluster over TCP/IP. Basically at low tens of thousands you're starting to have serious issues with kernel sockets and TIME_WAIT. So maybe that.

September 14, 2021 08:27 PM

Paul E. Mc Kenney: Stupid RCU Tricks: Making Race Conditions More Probable

Given that it is much more comfortable chasing down race conditions reported by rcutorture than those reported from the field, it would be good to make race conditions more probable during rcutorture runs than in production. A number of tricks are used to make this happen, including making rare events (such as CPU-hotplug operations) happen more frequently, testing the in-kernel RCU API directly from within the kernel, and so on.

Another approach is to change timing. Back at Sequent in the 1990s, one way that this was accomplished was by plugging different-speed CPUs into the same system and then testing on that system. It was observed that for certain types of race conditions, the probability of the race occurring increased by the ratio of the CPU speeds. One such race condition is when a timed event on the slow CPU races with a workload-driven event on the fast CPU. If the fast CPU is (say) two times faster than the slow CPU, then the timed event will provide two times greater “collision cross section” than if the same workload was running on CPUs running at the same speed.

Given that modern CPUs can easily adjust their core clock rates at runtime, it is tempting to try this same trick on present-day systems. Unfortunately, everything and its dog is adjusting CPU clock rates for various purposes, plus a number of modern CPUs are quite happy to let you set their core clock rates to a value sufficient to result in physical damage. Throwing rcutorture into this fray might be entertaining, but it is unlikely to be all that productive.

Another approach is to make use of memory latency. The idea is for the rcutorture scripting to place one pair of a given scenario's vCPUs in the hyperthreads of a single core and to place another pair of that same scenario's vCPUs in the hyperthreads of a different single core, and preferably a core on some other socket. The theory is that the different communications latencies and bandwidths within a core on the one hand and between cores (or, better yet, between sockets) on the other should have roughly the same effect as does varying CPU core clock rates.

OK, theory is all well and good, but what happens in practice?

As it turns out, on dual-socket systems, quite a bit.

With this small change to the rcutorture scripting, RCU Tasks Trace suddenly started triggering assertions. These test failures led to no fewer than 12 fixes, perhaps most notably surrounding proper handling of the count of tasks from which quiescent states are needed. This caused me to undertake a full review of RCU Tasks Trace, greatly assisted by Boqun Feng, Frederic Weisbecker, and Neeraj Upadhyay, with Neeraj providing half of the fixes. There is likely to be another fix or three, but then again isn't that always the case?

More puzzling were the 2,199.0-second RCU CPU stall warnings (described in more detail here). These were puzzling for a number of reasons:


  1. The RCU CPU stall warning timeout is set to only 21 seconds.
  2. There was absolutely no console output during the full stall duration.
  3. The stall duration was never 2,199.1 seconds and never 2,198.9 seconds, but always exactly 2,199.0 seconds, give or take a (very) few tens of milliseconds. (Kudos to Willy Tarreau for pointing out offlist that 2,199.02 seconds is almost exactly 2 to the 41st power worth of nanoseconds. Coincidence? You decide!)
  4. The stalled CPU usually took only a handful of scheduling-clock interrupts during the stall, but would sometimes take them at a rate of 100,000 per second, which seemed just a bit excessive for a kernel built with HZ=1000.
  5. At the end of the stall, the kernel happily continued, usually with no other complaints.
These stall warnings appeared most frequently when running rcutorture's TREE04 scenario.
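As a quick check of Willy's observation in item 3 above (my arithmetic, not part of the original report):

2^{41}\,\mathrm{ns} = 2{,}199{,}023{,}255{,}552\,\mathrm{ns} \approx 2{,}199.02\,\mathrm{s}

which matches the observed stall duration to within the reported tens of milliseconds.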

But perhaps this is not a stall, but instead a case of time jumping forward. This might explain the precision of the stall duration, and would definitely explain the lack of intervening console output, the lack of other complaints, and the kernel's being happy to continue at the end of the stall. Not so much the occasional extreme rate of scheduling-clock interrupts, but perhaps that is a separate problem.

However, running large numbers (as in 200) of concurrent shorter one-hour TREE04 runs often resulted in the run terminating (forcibly) in the middle of the stall. Now this might be due to the host's and the guests' clocks all jumping forward at the same time, except that different guests stalled at different times, and even when running TREE04, most guests didn't stall at all. Therefore, the stalls really did stall, and for a very long time.

But then it should be possible to work out what the CPUs were doing in the meantime. One approach would be to use tracing, but previous experience with massive volumes of trace messages (and thus lost trace messages) suggested a more surgical approach. Furthermore, the last console message before the stall was always of the form “kvm-clock: cpu 3, msr d4a80c1, secondary cpu clock” and the first console message after the stall was always of the form “kvm-guest: stealtime: cpu 3, msr 1f597140”. These are widely separated and are often printed from different CPUs, which also suggests a more surgical approach. This situation also implicates CPU hotplug, but this is not at all unusual.

The first attempt at exploratory surgery used the jiffies counter to check for segments of code taking more than 100 seconds to complete. Unfortunately, these checks never triggered, even in runs having stall warnings. So maybe the jiffies counter is not being updated. It is easy enough to switch to ktime_get_mono_fast_ns(), right? Except that this did not trigger, either.
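A minimal sketch of what such a check might look like (hypothetical code for illustration, not the actual diagnostic):

/*
 * Hypothetical 100-second check (illustration only): time a code segment
 * using both the jiffies counter and ktime_get_mono_fast_ns(), and
 * complain if either says that more than 100 seconds elapsed.
 */
#include <linux/jiffies.h>
#include <linux/ktime.h>
#include <linux/printk.h>

static void check_segment(void (*segment)(void))
{
        unsigned long j_start = jiffies;
        u64 ns_start = ktime_get_mono_fast_ns();

        segment();

        if (time_after(jiffies, j_start + 100 * HZ))
                pr_alert("segment exceeded 100 s according to jiffies\n");
        if (ktime_get_mono_fast_ns() - ns_start > 100ULL * NSEC_PER_SEC)
                pr_alert("segment exceeded 100 s according to ktime\n");
}

As described above, neither form of check fired in the failing runs.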

Maybe there is a long-running interrupt handler? Mark Rutland recently posted a patchset to detect exactly that, so I applied it. But it did not trigger.

I switched to ktime_get() in order to do cross-CPU time comparisons, and out of sheer paranoia added checks for time going backwards. And these backwards-time checks really did trigger just before the stall warnings appeared, once again demonstrating the concurrent-programming value of a healthy level of paranoia, and also explaining why many of my earlier checks were not triggering. Time moved forward, and then jumped backwards, making it appear that no time had passed. (Time did jump forward again, but that happened after the last of my debug code had executed.)
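A minimal sketch of the kind of backwards-time check that did fire (again hypothetical code, simplified to a single global checkpoint rather than the cross-CPU comparisons described above):

/*
 * Hypothetical backwards-time check (illustration only): sample ktime_get()
 * at successive checkpoints and complain if the clock appears to have
 * moved backwards.  A real cross-CPU version would need per-CPU state and
 * some care about ordering; that is omitted here for brevity.
 */
#include <linux/ktime.h>
#include <linux/printk.h>

static ktime_t last_checkpoint;

static void time_checkpoint(const char *where)
{
        ktime_t now = ktime_get();

        if (ktime_before(now, last_checkpoint))
                pr_alert("time went backwards at %s: %lld ns < %lld ns\n",
                         where, ktime_to_ns(now), ktime_to_ns(last_checkpoint));
        last_checkpoint = now;
}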

Adding yet more checks showed that the temporal issues were occurring within stop_machine_from_inactive_cpu(). This invocation takes the mtrr_rendezvous_handler() function as an argument, and it really does take 2,199.0 seconds (that is, about 36 minutes) from the time that stop_machine_from_inactive_cpu() is called until the time that mtrr_rendezvous_handler() is called. But only sometimes.

Further testing confirmed that increasing the frequency of CPU-hotplug operations increased the frequency of 2,199.0-second stall warnings.

An extended stint of code inspection suggested further diagnostics, which showed that one of the CPUs would be stuck in the multi_cpu_stop() state machine. The stuck CPU was never CPU 0 and was never the incoming CPU. Further tests showed that the scheduler always thought that all of the CPUs, including the stuck CPU, were in the TASK_RUNNING state. Even more instrumentation showed that the stuck CPU was failing to advance to state 2 (MULTI_STOP_DISABLE_IRQ), meaning that all of the other CPUs were spinning in a reasonably tight loop with interrupts disabled. This could of course explain the lack of console messages, at least from the non-stuck CPUs.

Might qemu and KVM be to blame? A quick check of the code revealed that vCPUs are preserved across CPU-hotplug events, that is, taking a CPU offline does not cause qemu to terminate the corresponding user-level thread. Furthermore, the distribution of stuck CPUs was uniform across the CPUs other than CPU 0. The next step was to find out where CPUs were getting stuck within the multi_cpu_stop() state machine. The answer was “at random places”. Further testing also showed that the identity of the CPU orchestrating the onlining of the incoming CPU had nothing to do with the problem.

Now TREE04 marks all but CPU 0 as nohz_full CPUs, meaning that they disable their scheduling-clock interrupts when running in userspace when only one task is runnable on that CPU. Maybe the CPUs need to manually enable their scheduling-clock interrupt when starting multi_cpu_stop()? This did not fix the problem, but it did manage to shorten some of the stalls, in a few cases to less than ten minutes.

The next trick was to send an IPI to the stalled CPU every 100 seconds during multi_cpu_stop() execution. To my surprise, this IPI was handled by the stuck CPU, although with surprisingly long delays ranging from just a bit less than one millisecond to more than eight milliseconds.
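One way such a periodic probe might look (a hypothetical sketch, not the actual test code):

/*
 * Hypothetical IPI probe (illustration only): fire an IPI at the
 * apparently stuck CPU and report how long the handler took to run after
 * the request was issued.
 */
#include <linux/smp.h>
#include <linux/ktime.h>
#include <linux/printk.h>

static void probe_ipi(void *info)
{
        u64 *sent_ns = info;

        pr_alert("probe IPI ran %llu us after being sent\n",
                 (unsigned long long)(ktime_get_mono_fast_ns() - *sent_ns) / 1000);
}

static void probe_stuck_cpu(int cpu)
{
        u64 sent_ns = ktime_get_mono_fast_ns();

        /* wait=1: return only after the target CPU has run probe_ipi() */
        smp_call_function_single(cpu, probe_ipi, &sent_ns, 1);
}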

This suggests that the stuck CPUs might be suffering from an interrupt storm, so that the IPI had to wait for its turn among a great many other interrupts. Further testing therefore sent an NMI backtrace at 100 seconds into multi_cpu_stop() execution. The resulting stack traces showed that the stuck CPU was always executing within sysvec_apic_timer_interrupt() or some function that it calls. Further checking showed that the stuck CPU was in fact suffering from an interrupt storm, namely an interrupt storm of scheduling-clock interrupts. This spurred another code-inspection session.

Subsequent testing showed that the interrupt duration was about 3.5 microseconds, which corresponded to about one third of the stuck CPU's time. It appears that the other two-thirds is consumed repeatedly entering and exiting the interrupt.
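Taking those numbers at face value, along with the roughly 100,000 interrupts per second from item 4 of the earlier list (my arithmetic, not part of the original analysis):

3.5\,\mu\mathrm{s} \times 100{,}000\,\mathrm{s}^{-1} = 0.35 \approx 1/3

so the handler bodies alone account for about a third of the stuck CPU's time, consistent with the remaining two-thirds being spent entering and exiting the interrupt.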

The retriggering of the scheduling-clock interrupt does have some potential error conditions, including setting times in the past and various overflow possibilities. Unfortunately, further diagnostics showed that none of this was happening. However, they also showed that the code was trying to schedule the next interrupt at time KTIME_MAX, so that an immediate relative-time-zero interrupt is a rather surprising result.

So maybe this confusion occurs only when multi_cpu_stop() preempts some timekeeping activity. Now TREE04 builds its kernels with CONFIG_PREEMPT=n, but maybe there is an unfortunately placed call to schedule() or some such. Except that further code inspection found no such possibility. Furthermore, another test run that dumped the previous task running on each CPU showed nothing suspicious (aside from rcutorture, which some might argue is always suspicious).

And further debugging showed that tick_program_event() thought that it was asking for the scheduling-clock interrupt to be turned off completely. This seemed like a good time to check with the experts, and Frederic Weisbecker, noting that all of the action was happening within multi_cpu_stop() and its called functions, ran the following command to enlist ftrace, while also limiting its output to something that the console might reasonably keep up with:

./kvm.sh --configs "18*TREE04" --allcpus --bootargs "ftrace=function_graph ftrace_graph_filter=multi_cpu_stop" --kconfig "CONFIG_FUNCTION_TRACER=y CONFIG_FUNCTION_GRAPH_TRACER=y"

This showed that there was no hrtimer pending (consistent with KTIME_MAX), and that the timer was nevertheless being set to fire immediately. Frederic then proposed the following small patch:

--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -595,7 +595,8 @@ void irq_enter_rcu(void)
 {
        __irq_enter_raw();
 
-       if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
+       if (tick_nohz_full_cpu(smp_processor_id()) ||
+           (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
                tick_irq_enter();
 
        account_hardirq_enter(current);

This forces the jiffies counter to be recomputed upon interrupt from nohz_full CPUs in addition to idle CPUs, which avoids the timekeeping confusion that caused KTIME_MAX to be interpreted as zero.

And a 20-hour run for each of 200 instances of TREE04 was free of RCU CPU stall warnings! (This represents 4,000 hours of testing consuming 32,000 CPU-hours.)

This was an example of that rare form of deadlock, a temporary deadlock. The stuck CPU was stuck because timekeeping wasn't happening. Timekeeping wasn't happening because all the timekeeping CPUs were spinning in multi_cpu_stop() with interrupts disabled. The other CPUs could not exit their spinloops (and thus could not update timekeeping information) because the stuck CPU did not advance through the multi_cpu_stop() state machine.

So what caused this situation to be temporary? I must confess that I have not dug into it (nor do I intend to), but my guess is that integer overflow resulted in KTIME_MAX once again taking on its proper role, thus ending the stuck CPU's interrupt storm and in turn allowing the multi_cpu_stop() state machine to advance.

Nevertheless, this completely explains the mystery. Assuming integer overflow, the extremely repeatable stall durations make perfect sense. The RCU CPU stall warning did not happen at the expected 21 seconds because all the CPUs were either spinning with interrupts disabled on the one hand or being interrupt stormed on the other. The interrupt-stormed CPU did not report the RCU CPU stall because the jiffies counter wasn't incrementing. A random CPU would report the stall, depending on which took the first scheduling-clock tick after time jumped backwards (again, presumably due to integer overflow) and back forwards. In the relatively rare case where this CPU was the stuck CPU, it reported an amazing number of scheduling clock ticks, otherwise very few. Since everything was stuck, it is only a little surprising that the kernel continued blithely on after the stall ended. TREE04 reproduced the problem best because it had the largest proportion of nohz_full CPUs.

All in all, this experience was a powerful (if sometimes a bit painful) demonstration of the ability of controlled memory latencies to flush out rare race conditions!

September 14, 2021 04:43 AM

September 06, 2021

Brendan Gregg: ZFS Is Mysteriously Eating My CPU

A microservice team asked me for help with a mysterious issue. They claimed that the ZFS file system was consuming 30% of CPU capacity. I summarized this case study at [Kernel Recipes] in 2017; it is an old story that's worth resharing here.

## 1. Problem Statement

The microservice was for metrics ingestion, and the team had recently updated their base OS image (BaseAMI). After doing so, they claimed that ZFS was now eating over 30% of CPU capacity. My first thought was that they were somehow mistaken: I worked on ZFS internals at Sun Microsystems, and unless it is badly misconfigured there's no way it can consume that much CPU. I have been surprised many times by unexpected performance issues, so I thought I should check their instances anyway. At the very least, I could show that I took it seriously enough to check it myself. I should also be able to identify the real CPU consumer.

## 2. Monitoring

I started with the cloud-wide monitoring tool, [Atlas], to check high-level CPU metrics. These included a breakdown of CPU time into percentages for "usr" (user: applications) and "sys" (system: the kernel). I was surprised to find a whopping 38% of CPU time was in sys, which is highly unusual for cloud workloads at Netflix. This supported the claim that ZFS was eating CPU, but how? Surely this is some other kernel activity, and not ZFS.

## 3. Next Steps

I'd usually SSH to instances for deeper analysis, where I could use mpstat(1) to confirm the usr/sys breakdown and perf(1) to begin profiling on-CPU kernel code paths. But since Netflix has tools (previously [Vector], now FlameCommander) that allow us to easily fetch flame graphs from our cloud deployment UI, I thought I'd cut to the chase. Just for illustration, this shows the Vector UI and a typical cloud flame graph:

Note that this sample flame graph is dominated by Java, shown by the green frames.

## 4. Flame Graph

Here's the CPU flame graph from one of the problem instances:

The kernel CPU time is pretty obvious, shown as two orange towers on the left and right. (The other colors are yellow for C++, and red for other user-level code.) Zooming in to the left kernel tower:

This is arc_reclaim_thread! I worked on this code back at Sun. So this really is ZFS; they were right!

The ZFS Adaptive Replacement Cache (ARC) is the main memory cache for the file system. The arc_reclaim_thread runs arc_adjust() to evict memory from the cache to keep it from growing too large, and to maintain a threshold of free memory that applications can quickly use. It does this periodically or when woken up by low-memory conditions. In the past I've seen the arc_reclaim_thread eat too much CPU when a tiny file system record size was used (e.g., 512 bytes), creating millions of tiny buffers. But that was basically a misconfiguration. The default size is 128 Kbytes, and people shouldn't be tuning below 8 Kbytes.

The right kernel tower enters spl_kmem_cache_reap_now(), another ZFS memory-freeing function. I imagine this is related to the left tower (e.g., contending for the same locks). But first: Why is ZFS in use?

## 5. Configuration

There was only one use of ZFS so far at Netflix that I knew of: a new infrastructure project was using ZFS for containers. This led me to a theory: if they were quickly churning through containers, they would also be churning through ZFS file systems, and this might mean that many old pages needed to be cleaned out of the cache. Ahh, it makes sense now. I told them my theory, confident I was on the right path.

But they replied: "We aren't using containers." Ok, then how _are_ you using ZFS? I did not expect their answer:
"We aren't using ZFS."
What!? Yes you are, I can see the arc_reclaim_thread in the flame graph. It doesn't run for fun! It's only on CPU because it's evicting pages from the ZFS ARC. If you aren't using ZFS, there are no pages in the ARC, so it shouldn't run. They were confident that they weren't using ZFS at all. The flame graph defied logic. I needed to prove to them that they were indeed using ZFS somehow, and figure out why.

## 6. cd & ls

I should be able to debug this using nothing more than the cd and ls(1) commands. cd to the file system and ls(1) to see what's there. The file names should be a big clue as to its use. First, finding out where the ZFS file systems are mounted:
df -h
mount
zfs list
This showed nothing! No ZFS file systems were currently mounted. I tried another instance and saw the same thing. Huh? Ah, but containers may have been created previously and since destroyed, hence no remaining file systems now. How can I tell if ZFS has ever been used?

## 7. arcstats

I know, arcstats! The kernel counters that track ZFS statistics, including ARC hits and misses. Viewing them:
# cat /proc/spl/kstat/zfs/arcstats
name                            type data
hits                            4    0
misses                          4    0
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    0
demand_metadata_misses          4    0
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    0
prefetch_metadata_misses        4    0
mru_hits                        4    0
mru_ghost_hits                  4    0
mfu_hits                        4    0
mfu_ghost_hits                  4    0
deleted                         4    0
mutex_miss                      4    0
evict_skip                      4    0
evict_not_enough                4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    0
[...]
Unbelievable! All the counters were zero! ZFS really wasn't in use, ever! But at the same time, it was eating over 30% of CPU capacity! Whaaat??

The customer had been right all along. ZFS was straight up eating CPU, and for no reason. How can a file system _that's not in use at all_ consume 38% CPU? I'd never seen this before. This was a mystery.

## 8. Code Analysis

I took a closer look at the flame graph and the paths involved, and saw that the CPU code paths led to get_random_bytes() and extract_entropy(). These were new to me. Browsing the [source code] and change history I found the culprit.

The ARC contains lists of cached buffers for different memory types. A performance feature ("multilist") had been added that split the ARC lists into one per CPU, to reduce lock contention on multi-CPU systems. Sounds good, as that should improve performance. But what happens when you want to evict some memory? You need to pick one of those CPU lists. Which one? You could go through them in a round-robin fashion, but the developer thought it better to pick one at random. _Cryptographically-secure random._

The kicker was that ZFS wasn't even in use. The ARC was detecting low system memory and then adjusting its size accordingly, at which point it'd discover it was zero size already and didn't need to do anything. But this was after randomly selecting a zero-sized list, using a CPU-expensive random number generator.

I filed this as ZFS [issue #6531]. I believe the first fix was to have the arc_reclaim_thread bail out earlier if ZFS wasn't in use, and not enter list selection. The ARC has since had many changes, and I haven't heard of this issue again.

[Kernel Recipes]: https://youtu.be/UVM3WX8Lq2k?t=133
[Kernel Recipes2]: https://www.slideshare.net/brendangregg/kernel-recipes-2017-using-linux-perf-at-netflix/2
[talks]: /index.html#Videos
[issue #6531]: https://github.com/openzfs/zfs/issues/6531
[source code]: https://github.com/openzfs/zfs/blob/4e33ba4c389f59b74138bf7130e924a4230d64e9/module/zfs/arc.c
[Atlas]: https://netflixtechblog.com/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a
[Vector]: https://netflixtechblog.com/introducing-vector-netflixs-on-host-performance-monitoring-tool-c0d3058c3f6f
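The code-analysis section above turns on how the per-CPU ("multilist") eviction sublist gets chosen. Here is a small sketch of the contrast in kernel-style C; this is illustrative only, not the actual OpenZFS code, and the names are made up:

/*
 * Illustration only -- not the actual OpenZFS code.  Contrast choosing an
 * eviction sublist with the cryptographically secure RNG versus a cheap
 * round-robin counter.
 */
#include <linux/random.h>
#include <linux/atomic.h>

static unsigned int pick_sublist_random(unsigned int num_sublists)
{
        unsigned int idx;

        get_random_bytes(&idx, sizeof(idx));    /* draws from the CSPRNG: expensive */
        return idx % num_sublists;
}

static unsigned int pick_sublist_round_robin(unsigned int num_sublists)
{
        static atomic_t next = ATOMIC_INIT(0);  /* one atomic increment: cheap */

        return (unsigned int)atomic_inc_return(&next) % num_sublists;
}

On these hosts ZFS wasn't in use at all, so the expensive pick was pure overhead: whichever sublist was selected was empty anyway.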

September 06, 2021 12:00 AM

August 30, 2021

Brendan Gregg: Analyzing a High Rate of Paging

These are rough notes. A service team was debugging a performance issue and noticed it coincided with a high rate of paging. I was asked to help, and used a variety of Linux performance tools to solve this, including those that use eBPF and Ftrace. This is a rough post to share this old but good case study of using these tools, and to help justify their further development. No editing, spell checking, or comments. Mostly screenshots.

## 1. Problem Statement

The microservice managed and processed large files, including encrypting them and then storing them on S3. The problem was that large files, such as 100 Gbytes, seemed to take forever to upload. Hours. Smaller files, as large as 40 Gbytes, were relatively quick, only taking minutes. A cloud-wide monitoring tool, Atlas, showed a high rate of paging for the larger file uploads:

The blue is pageins (page ins). Pageins are a type of disk I/O where a page of memory is read in from disk, and are normal for many workloads. You might be able to guess the issue from the problem statement alone.

## 2. iostat

Starting with my [60-second performance checklist], the iostat(1) tool showed a high rate of disk I/O during a large file upload:
# iostat -xz 1
Linux 4.4.0-1072-aws (xxx) 	12/18/2018 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.03    0.00    0.83    1.94    0.02   92.18

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.29    0.21    0.17     6.29     3.09    49.32     0.00   12.74    6.96   19.87   3.96   0.15
xvdb              0.00     0.08   44.39    9.98  5507.39  1110.55   243.43     2.28   41.96   41.75   42.88   1.52   8.25

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.81    0.00    1.08   29.94    0.06   54.11

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00  745.00    0.00 91656.00     0.00   246.06    25.32   33.84   33.84    0.00   1.35 100.40

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.86    0.00    0.89   24.76    0.06   59.43

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00  739.00    0.00 92152.00     0.00   249.40    24.75   33.49   33.49    0.00   1.35 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.95    0.00    0.89   28.75    0.06   55.35

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00  734.00    0.00 91704.00     0.00   249.87    24.93   34.04   34.04    0.00   1.36 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.54    0.00    1.14   29.40    0.06   54.86

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00  750.00    0.00 92104.00     0.00   245.61    25.14   33.37   33.37    0.00   1.33 100.00

^C
I'm looking at the r_await column in particular: the average wait time for reads in milliseconds. Reads usually have apps waiting on them; writes may not (write-back caching). An r_await of 33 ms is kinda high, and likely due to the queueing (avgqu-sz). They are largeish I/O, about 128 Kbytes (divide rkB/s by r/s).

## 3. biolatency

From [bcc], this eBPF tool shows a latency histogram of disk I/O. I'm running it in case the averages are hiding outliers, which could be a device issue:
# /usr/share/bcc/tools/biolatency -m
Tracing block device I/O... Hit Ctrl-C to end.
^C
     msecs               : count     distribution
         0 -> 1          : 83       |                                        |
         2 -> 3          : 20       |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 41       |                                        |
        16 -> 31         : 1620     |*******                                 |
        32 -> 63         : 8139     |****************************************|
        64 -> 127        : 176      |                                        |
       128 -> 255        : 95       |                                        |
       256 -> 511        : 61       |                                        |
       512 -> 1023       : 93       |                                        |
This doesn't look too bad. Most I/O are between 16 and 127 ms. Some outliers reach the 0.5 to 1.0 second range, but again, there's quite a bit of queueing here, as seen in the earlier iostat(1) output. I don't think this is a device issue. I think it's the workload.

## 4. bitesize

As I think it's a workload issue, I want a better look at the I/O sizes in case there's something odd:
# /usr/share/bcc/tools/bitesize 
Tracing... Hit Ctrl-C to end.
^C
Process Name = java
     Kbytes              : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 31       |                                        |
        16 -> 31         : 15       |                                        |
        32 -> 63         : 15       |                                        |
        64 -> 127        : 15       |                                        |
       128 -> 255        : 1682     |****************************************|
The I/O is mostly in the 128 to 255 Kbyte bucket, as expected from the iostat(1) output. Nothing odd here.

## 5. free

Also from the 60-second checklist:
# free -m
              total        used        free      shared  buff/cache   available
Mem:          64414       15421         349           5       48643       48409
Swap:             0           0           0
There's not much memory left, 349 Mbytes, but more interesting is the amount in the buffer/page cache: 48,643 Mbytes (48 Gbytes). This is a 64-Gbyte memory system, and 48 Gbytes is in the page cache (the file system cache). Along with the numbers from the problem statement, this gives me a theory: Do the 100-Gbyte files bust the page cache, whereas 40-Gbyte files fit?

## 6. cachestat

[cachestat] is an experimental tool I developed that uses Ftrace and has since been ported to bcc/eBPF. It shows statistics for the page cache:
# /apps/perf-tools/bin/cachestat
Counting cache functions... Output every 1 seconds.
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
    1811      632        2    74.1%           17      48009
    1630    15132       92     9.7%           17      48033
    1634    23341       63     6.5%           17      48029
    1851    13599       17    12.0%           17      48019
    1941     3689       33    34.5%           17      48007
    1733    23007      154     7.0%           17      48034
    1195     9566       31    11.1%           17      48011
[...]
This shows many cache misses, with a hit ratio varying between 6.5% and 74%. I usually like to see that in the upper 90's. This is "cache busting." The 100 Gbyte file doesn't fit in the 48 Gbytes of page cache, so we have many page cache misses that will cause disk I/O and relatively poor performance. The quickest fix is to move to a larger-memory instance that does fit 100 Gbyte files. The developers can also rework the code with the memory constraint in mind to improve performance (e.g., processing parts of the file, instead of making multiple passes over the entire file).

## 7. Smaller File Test

For further confirmation, I collected the same outputs for a 32 Gbyte file upload. cachestat shows a ~100% cache hit ratio:
# /apps/perf-tools/bin/cachestat
Counting cache functions... Output every 1 seconds.
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
   61831        0      126   100.0%           41      33680
   53408        0       78   100.0%           41      33680
   65056        0      173   100.0%           41      33680
   65158        0       79   100.0%           41      33680
   55052        0      107   100.0%           41      33680
   61227        0      149   100.0%           41      33681
   58669        0       71   100.0%           41      33681
   33424        0       73   100.0%           41      33681
^C
This smaller size allows the service to process the file (however many passes it takes) entirely from memory, without re-reading it from disk. free(1) shows it fitting in the page cache:
# free -m
              total        used        free      shared  buff/cache   available
Mem:          64414       18421       11218           5       34773       45407
Swap:             0           0           0
And iostat(1) shows little disk I/O, as expected:
# iostat -xz 1
Linux 4.4.0-1072-aws (xxx) 	12/19/2018 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.25    0.00    1.24    0.19    0.03   86.29

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.32    0.31    0.19     7.09     4.85    47.59     0.01   12.58    5.44   23.90   3.09   0.15
xvdb              0.00     0.07    0.01   11.13     0.10  1264.35   227.09     0.91   82.16    3.49   82.20   2.80   3.11

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          57.43    0.00    2.95    0.00    0.00   39.62

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          53.50    0.00    2.32    0.00    0.00   44.18

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00    0.00    2.00     0.00    19.50    19.50     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          39.02    0.00    2.14    0.00    0.00   58.84

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
[...]
## 8. Final Notes

cachestat was the killer tool here, but I should stress that it's still experimental. I wrote it for Ftrace with the constraints that it must be low-overhead and use the Ftrace function profiler only. As I mentioned in my LSFMMBPF 2019 keynote in Puerto Rico, where the Linux mm kernel engineers were present, I think the cachestat statistics are so commonly needed that they should be in /proc and not need my experimental tool. They pointed out challenges with providing them properly, and I think any robust solution is going to need their help and expertise. I hope this case study helps show why it is useful and worth the effort. Until the kernel does support page cache statistics (which may be never: they are hot-path, so adding counters isn't free) we can use my cachestat tool, even if it does require frequent maintenance to keep working.

[cachestat]: /blog/2014-12-31/linux-page-cache-hit-ratio.html
[60-second performance checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf
[bcc]: https://github.com/iovisor/bcc

August 30, 2021 12:00 AM

August 27, 2021

Michael Kerrisk (manpages): man-pages-5.13 released

Alex Colomar and I have released man-pages-5.13. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from 40 contributors. The release includes around 200 commits that changed around 120 manual pages.

The most notable of the changes in man-pages-5.13 are the following:

Special thanks again to Alex, who kept track of a lot of patches while I was unavailable.

August 27, 2021 09:07 PM